
UNIT-5
PIG: Hadoop Programming Made Easier
Syllabus
Pig: Hadoop Programming Made Easier
 Admiring the Pig Architecture
 Going with the Pig Latin Application Flow
 Working through the ABCs of Pig Latin
 Evaluating Local and Distributed Modes of Running Pig Scripts
 Checking out the Pig Script Interfaces
 Scripting with Pig Latin

Objectives
 To admire the Pig Architecture
 To go with the Pig Latin Application Flow
 To work through the ABCs of Pig Latin
 To evaluate Local and Distributed Modes of Running Pig Scripts
 To check out the Pig Script Interfaces
 To script with Pig Latin


Outcomes
At the end of the course the student will be able to
 Admire the Pig Architecture
 Go with the Pig Latin Application Flow
 Work through the ABCs of Pig Latin
 Evaluate Local and Distributed Modes of Running Pig Scripts
 Check out the Pig Script Interfaces
 Script with Pig Latin

Hadoop Ecosystem
What is Pig?
 Pig is a scripting platform (or tool) that runs on Hadoop clusters, designed to process and analyze large datasets.
(or)
 Pig is a high-level platform for creating MapReduce programs used with Hadoop.
 It enables you to write complex data transformations without knowing Java.
 It provides a high level of abstraction over MapReduce processing.
Contd..
• Pig was initially developed at Yahoo! to allow
people using Apache Hadoop to focus more on
analyzing large data sets and spend less time
having to write mapper and reducer programs.
• Like actual pigs, who eat almost anything, the Pig
programming language is designed to handle any
kind of data—That’s why the name, Pig!
Pig Components
 The two main components of the Apache Pig tool are:
1. Pig Latin – a language
2. Pig Engine – a runtime environment
 Pig provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.
 To process data stored in HDFS, programmers write scripts using the Pig Latin language.
 Pig Latin provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
Contd..
 A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output.
 These operations describe a data flow which is translated into an executable representation by the Pig execution environment.
 Underneath, the results of these transformations are a series of MapReduce jobs which the programmer is unaware of. So, in a way, Pig allows the programmer to focus on the data rather than on the nature of execution.
 Internally, the Pig Engine (a component of Apache Pig) accepts the Pig Latin scripts and converts them into a series of map and reduce tasks.
Contd..
 Pig operates on various types of data: structured, semi-structured, and unstructured.
 The results of Pig are always stored back in HDFS.
Need of Pig
• While performing MapReduce tasks, programmers who are not well versed in Java often struggle to work with Hadoop. Pig is therefore a boon for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily, without having to type complex Java code.
 One limitation of MapReduce is that its development cycle is very long: writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task.
Contd..
 Apache Pig reduces development time by using the multi-query approach.
 It also helps reduce the length of code.
 Let's understand this with an example: an operation that would require about 200 lines of code (LoC) in Java can often be done in as few as 10 LoC in Apache Pig, cutting code size by a factor of up to 20.
 Programmers with SQL knowledge need less effort to learn Pig, because Pig Latin is an SQL-like language.
 It offers many built-in operators to support data operations such as joins, filters, ordering, and many more.
Evolution of Pig:
 Apache Pig was developed by Yahoo!'s researchers in 2006.
 At that time, the main idea behind developing Pig was to execute MapReduce jobs on extremely large datasets.
 In 2007, it moved to the Apache Software Foundation (ASF), which made it an open-source project.
 The first version (0.1) of Pig came out in 2008.
 The latest version of Apache Pig is 0.17, released in 2017.
Features of Apache Pig:
Pig is an open-source project developed at Yahoo!, with the following features:
 Rich set of operators: For performing several operations, Apache Pig provides a rich set of operators such as filter, join, sort, etc.
 Handles heterogeneity of data: Pig can handle the analysis of both structured and unstructured data.
 Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL. Especially for SQL programmers, Apache Pig is a boon.
 User-defined functions: Apache Pig is extensible, so you can create your own user-defined functions for processing.
Features of Apache Pig: (Contd.)
 Short development time, as the code is simpler.
 Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data.
 Multi-query approach − Apache Pig uses a multi-query approach, which reduces the length of the code to a great extent.
 No need for compilation − No separate compilation step is required, since every Apache Pig operator is converted internally into a MapReduce job on execution.
Features of Apache Pig: (Contd.)
 Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.
 Optional schema − The schema is optional in Apache Pig, so data can be stored without designing a schema. In that case, fields are addressed positionally as $0, $1, and so on.
Difference between Pig and
MapReduce
Difference between SQL & PIG
Admiring Pig Architecture
 The language used to analyze data in Hadoop using Pig is known as Pig Latin.
 It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
 To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).
 After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
 Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy.
 The architecture of Apache Pig is shown below:
Apache Pig Components
As shown in the figure, there are various components
in the Apache Pig framework. Let us take a look at
the major components.
Parser
 Initially the Pig Scripts are handled by the Parser. It
checks the syntax of the script, does type checking,
and other miscellaneous checks.
 The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin
statements and logical operators.
 In the DAG, the logical operators of the script are
represented as the nodes and the data flows are
represented as edges
Optimizer
 After the output from the parser is retrieved, a logical plan for the DAG is passed to a logical optimizer. The optimizer carries out logical optimizations such as projection and pushdown.
Compiler
 The compiler compiles the logical plan sent by the optimizer into a series of MapReduce jobs.
Execution engine
 Finally, the MapReduce jobs are submitted to Hadoop in a properly sorted order and executed on the Hadoop cluster, producing the desired results.
Grunt shell
• Grunt is the interactive shell of Apache Pig, mainly used to write and execute Pig Latin scripts. It is the native shell provided by Apache Pig for executing Pig queries.
• We can also invoke shell commands from it using sh and fs.

Shell Commands:
sh Command
• We can invoke any shell command from the Grunt shell using the sh command. Note, however, that commands that are part of the shell environment itself (e.g., cd) cannot be executed this way.
Cont..
Syntax
The syntax of the sh command is:
grunt> sh shell_command parameters

fs Command
• Similarly, we can invoke any FsShell command from the Grunt shell by using the fs command.
• The syntax of the fs command is:
grunt> fs File_System_command parameters
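For example (a sketch; the paths shown are assumptions):
grunt> sh ls
grunt> fs -ls /user/hadoop
grunt> fs -mkdir /user/hadoop/pig_data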
Cont..
File System Commands:
Command Description
cat Prints the contents of one or more files
cd Changes the current directory
copyFromLocal Copies a local file to the Hadoop file system
copyToLocal Copies a file or directory from HDFS to the local file system
cp Copies a file or directory to another directory
fs Accesses Hadoop's file system shell
ls Lists files
mkdir Creates a new directory
mv Moves a file/directory to another directory
pwd Prints the path of the current working directory
rm Deletes a file or a directory
Utility Commands:
• clear : Clears the screen.
– grunt> clear
• help : Provides help about the commands.
• history : Displays a list of statements executed since the Grunt shell was invoked.
• set : Used to show/assign values to keys used in Pig.
• quit : Quits the Grunt shell.
• exec/run : Executes Pig scripts.
– grunt> exec [–param param_name = param_value] [–param_file file_name] script
• kill : Kills a job from the Grunt shell.
– grunt> kill JobId
Pig Latin Data Model
• Pig’s data types make up the data model for how
Pig thinks of the structure of the data it is
processing.
• With Pig, the data model gets defined when the
data is loaded. Any data you load into Pig from
disk is going to have a particular schema and
structure.
• Pig needs to understand that structure, so when
you do the loading, the data automatically goes
through a mapping.
• The Pig data model is rich enough to handle almost anything thrown its way, including table-like structures and nested hierarchical data structures.
Pig Latin Data Model (Cont..)
It consists of four types of data models, as follows. The data model of Pig Latin is fully nested, and it allows complex (non-atomic) data types such as tuple, bag, and map alongside atomic values.

Atom:
Any single value in Pig Latin, irrespective of its data type, is known as an atom. It is stored as a string and can be used as a string or a number. Pig's atomic values are the scalar types found in most programming languages: int, long, float, double, chararray, and bytearray.
• A piece of data or a simple atomic value is known as a field.

Example − 'raja' or '30'

Tuple:
A tuple is a record that consists of an ordered sequence of fields, where the fields can be of any type. It is similar to a row in an RDBMS table.

E.g.: (raju, 30)


Bag
 A bag is an unordered collection of tuples (non-unique). Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS but, unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

 A bag can be a field in a relation; in that context, it is known as an inner bag.

Example − {Raja, 30, {9848022338, [email protected],}}

Map
 A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. It is represented by '[]'.
 Example − [name#Raja, age#30]

Relation
 Pig Latin statements work with relations. A relation is the outermost structure of the data model and is defined as a bag of tuples.
 The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
• A relation is a bag
• A bag is a collection of tuples
• A tuple is an ordered set of fields
Working through the ABCs of Pig Latin
• Pig Latin is the language for Pig programs.
• Pig translates the Pig Latin script into MapReduce jobs that can be executed within a Hadoop cluster.
• The Pig Latin development team followed three key design principles:
Keep it simple
Make it smart
Don't limit development
Keep it Simple
• Pig Latin is an abstraction for MapReduce that simplifies the creation of parallel programs on the Hadoop cluster for data flows and analysis.
• Complex tasks may require a series of interrelated data transformations; such series are encoded as data flow sequences.
• Writing Pig Latin scripts instead of Java MapReduce programs makes these programs easier to write, understand, and maintain because
a) you don't have to write the job in Java,
b) you don't have to think in terms of MapReduce, and
c) you don't need to come up with custom code to support rich data types.
Make it smart
• The Pig Latin compiler transforms a Pig Latin program into a series of Java MapReduce jobs.
• The compiler can optimize the execution of these Java MapReduce jobs automatically, allowing the user to focus on semantics rather than on how to optimize and access the data.
• For example, SQL is set up as a declarative query that you use to access structured data stored in an RDBMS. The RDBMS engine first translates the query to a data access method, then looks at the statistics and generates a series of data access approaches. The cost-based optimizer chooses the most efficient approach for execution.
Don’t limit development
• Make Pig extensible so that developers can add
functions to address their particular business
problems.
• Traditional RDBMS data warehouses make use of
the ETL data processing pattern, where you extract
data from outside sources, transform it to fit your
operational needs, and then load it into the end
target, whether it’s an operational data store, a
data warehouse, or another variant of database.
• With big data, the language for Pig data flows goes
with ELT instead: Extract the data from your
various sources, load it into HDFS, and then
transform it as necessary to prepare the data for
further analysis
Going with the Pig Latin Application Flow
 Pig Latin is a dataflow language, where you define a data stream and a series of transformations that are applied to the data as it flows through your application.
 This is in contrast to a control flow language (like C or Java), where you write a series of instructions.
 In control flow languages, we use constructs like loops and conditional logic (such as an if statement). You won't find loops and if statements in Pig Latin.
 As a result, working with Pig is significantly easier than writing MapReduce code directly.
Working of Pig
The basic flow of a Pig program is:

 Load: First load (LOAD) the data you want to manipulate; the data is stored in HDFS or the local file system.
 For a Pig program to access the data, you first tell Pig what file or files to use.
 For that task, you use the LOAD 'data_file' command.
 Here, 'data_file' can specify either an HDFS file or a directory.
 If a directory is specified, all files in that directory are loaded into the program.
 If the data is stored in a file format that isn't natively accessible to Pig, you can optionally add the USING clause to the LOAD statement to specify a function that can read the data.

 Transform: You run the data through a set of transformations that are translated into a set of Map and Reduce tasks.
 The transformation logic is where all the data manipulation happens. Here, you can FILTER out rows that aren't of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and do much, much more.

 Dump: Finally, you dump (DUMP) the results to the screen or store (STORE) the results in a file somewhere.
 You would typically use the DUMP command to send the output to the screen when you debug your programs. When your program goes into production, you simply change the DUMP call to a STORE call so that any results from running your programs are stored in a file for further processing or analysis.
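Putting the three steps together, here is a minimal sketch of the Load-Transform-Dump flow (the file name 'logs.txt' and the field names are assumptions):

raw = LOAD 'logs.txt' USING PigStorage(',') AS (user:chararray, bytes:int);
big = FILTER raw BY bytes > 1000;  -- keep only the rows of interest
DUMP big;                          -- for debugging; in production, use:
-- STORE big INTO 'output_dir' USING PigStorage(',');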


Pig Latin Statements
• Pig Latin is a data flow language used by Apache Pig to analyze data in Hadoop.
• While processing data using Pig Latin, statements are the basic constructs.
• A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
• This definition applies to all Pig Latin operators except the LOAD and STORE commands, which read data from and write data to the file system.
• These statements work with relations. They include expressions and schemas. Every statement ends with a semicolon (;).
• We perform various operations using the operators provided by Pig Latin, through statements.
 Except LOAD and STORE, while performing all other
operations, Pig Latin statements take a relation as input
and produce another relation as output.
 Pig Latin statements are generally organized in the following manner:
• A LOAD statement reads data from the file system
• A series of transformation statements process the data
• A STORE statement writes output to the file system
OR
• A DUMP statement displays output to the screen
Preparing Data (student_data.txt)
Pig Latin statement to load data to Apache Pig
 In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS. Therefore, let us start HDFS and create the following sample data in HDFS.

grunt> Relation_name = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

PigStorage() function:
 It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, it takes '\t' as the delimiter.
Pig Data Types
• Pig data types define the data model of how Pig thinks about the structure of the data it is processing.
• In Pig, the data model gets defined when the data is loaded, and the data has a particular schema and structure.
• The Pig model is rich enough to handle most structures, such as hierarchical data and table-like structures.
• Pig data types are broken into two categories:

1. Scalar types (simple types)
2. Complex types
• Scalar types contain a single value, whereas complex types contain other types, such as tuples, bags, and maps.
Scalar Types
 int − Represents a signed 32-bit integer. Example: 8
 long − Represents a signed 64-bit integer. Example: 5L
 float − Represents a signed 32-bit floating point. Example: 5.5F
 double − Represents a 64-bit floating point. Example: 10.5
 chararray − Represents a character array (string) in Unicode UTF-8 format. Example: 'apachepig'
 bytearray − Represents a byte array (blob).
 boolean − Represents a Boolean value. Example: true/false
 datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
 biginteger − Represents a Java BigInteger. Example: 60708090709
 bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883
Null Values
 Values for all the above data types can be NULL.
 Apache Pig treats null values in a similar way as SQL does.
 A null can be an unknown value or a non-existent value.
 It is used as a placeholder for optional values.
 These nulls can occur naturally or can be the result of an operation.
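For example, nulls can be tested and filtered explicitly (a sketch; relation A and the age field are assumptions):
B = FILTER A BY age IS NOT NULL;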
Complex Types
Tuple −
• A tuple is an ordered set of fields.
• A tuple has a fixed length and contains an ordered collection of fields, which can be of any type.
Example: (raja, 30)
Syntax:
( field [, field …] )
Terms:
( ) − A tuple is enclosed in parentheses ( ).
field − A piece of data. A field can be of any data type (including tuple and bag).

Bag −
A bag is a collection of tuples, which are unordered. Bag constants are constructed using braces, with the tuples in the bag separated by commas.
Syntax (inner bag):
{ tuple [, tuple …] }
Terms:
{ } − An inner bag is enclosed in curly brackets { }.
tuple − A tuple.
Example:
• {(raju,30),(Mohammad,45)}
Key Points about Bag:
• A bag can have duplicate tuples.
• A bag can have tuples with differing numbers of fields. However, if Pig tries to access a field that does not exist, a null value is substituted.
• A bag can have tuples with fields that have different data types. However, for Pig to effectively process bags, the schemas of the tuples within those bags should be the same. For example, if half of the tuples include chararray fields while the other half include float fields, only half of the tuples will participate in any kind of computation, because the chararray fields will be converted to null.
• Bags have two forms: outer bag (or relation) and inner bag.
Example: Outer Bag
In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.
A = LOAD 'data' as (f1:int, f2:int, f3:int);
DUMP A;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)

Example: Inner Bag
Now, suppose we group relation A by the first field to form relation X.
In this example X is a relation or bag of tuples. The tuples in relation X have two fields. The first field is type int. The second field is type bag; you can think of this bag as an inner bag.
X = GROUP A BY f1;
DUMP X;
Output:
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
Map −
• A map is a set of key-value pairs.
• Key values within a relation must be unique.
Syntax (<> denotes optional):
• [ key#value <, key#value …> ]

Example: This map includes two key-value pairs:
[ 'name'#'Raju', 'age'#30 ]
Pig Operators
(or)
Data Transformations in Pig
Pig Latin – Arithmetic Operators
 + Addition − Adds values on either side of the operator
 − Subtraction − Subtracts right hand operand
from left hand operand
 * Multiplication − Multiplies values on either side of the
operator
 / Division − Divides left hand operand by right hand
operand
 % Modulus − Divides left hand operand by right hand
operand and returns remainder
 ? : Bincond − Evaluates a Boolean expression. It has three operands: variable x = (expression) ? value1_if_true : value2_if_false.
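A small sketch of the bincond operator in use (relation A and field f1 are assumptions):
X = FOREACH A GENERATE f1, (f1 % 2 == 0 ? 'even' : 'odd');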
Pig Latin – Comparison Operators
 == Equal − Checks if the values of two operands are equal; if yes, then the condition becomes true.
Eg. (a == b) is not true.
 != Not Equal − Checks if the values of two operands are not equal. If the values are not equal, then the condition becomes true.
Eg. (a != b) is true.
 > Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true.
Eg. (a > b) is not true.
 < Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true.
Eg. (a < b) is true.
 >= Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true.
Eg. (a >= b) is not true.
 <= Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true.
Eg. (a <= b) is true.
 matches Pattern matching − Checks whether the string on the left-hand side matches the regular-expression constant on the right-hand side.
Eg. f1 matches '.*tutorial.*'
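For example, matches can drive a FILTER (a sketch; relation A and the city field are assumptions):
B = FILTER A BY city matches 'Chen.*';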


Pig Latin – Type Construction Operators
 () Tuple constructor operator − This operator is used to construct a tuple.
Eg. (Raju, 30)
 {} Bag constructor operator − This operator is used to construct a bag.
Eg. {(Raju, 30), (Mohammad, 45)}
 [] Map constructor operator − This operator is used to construct a map.
Eg. [name#Raja, age#30]
Pig Latin – Relational Operations

Loading and Storing

 LOAD − To load data from the file system (local/HDFS) into a relation.
 The load statement consists of two parts divided by the "=" operator.
 On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how we load the data.
 Given below is the syntax of the LOAD operator:

Relation_name = LOAD 'Input file path' [USING function] [AS schema];
Cont..
Component Description
Relation_name The relation in which we want to store the data
Input file path The HDFS directory where the file is stored
function A function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader)
schema Defines the schema of the data

 We can define the required schema as follows:
(column1 : data type, column2 : data type, column3 : data type);
 Note: We can also load the data without specifying the schema. In that case, the columns are addressed positionally as $0, $1, etc.
Cont..

• grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

The PigStorage() function:
• It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, it takes '\t' as the delimiter.
Pig Latin – Relational Operations (Cont.)
 STORE − To save a relation to the file system (local/HDFS).
Syntax: STORE relation-name INTO 'file-path' [USING function];
 Example:

grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

Load and Store functions:
Used to determine how the data goes into and comes out of Pig.
Filtering Operators
 FILTER
 DISTINCT
 FOREACH, GENERATE
 STREAM
FILTER
 It is used to select required rows from a relation based on a condition.
Syntax: relation-name1 = FILTER relation-name2 BY condition;
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = FILTER A BY city == 'chennai';
DUMP B;
DISTINCT
• To remove duplicate rows from a relation.
Syntax:
relation-name1 = DISTINCT relation-name2;
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = DISTINCT A;
DUMP B;
FOREACH, GENERATE
 To generate data transformations based on columns of data.
Syntax:
relation-name2 = FOREACH relation-name1 GENERATE (required data);
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = FOREACH A GENERATE id, firstname, city;
DUMP B;
STREAM
• The STREAM operator allows transforming data in a relation using an external program or script.
• This is possible because Hadoop MapReduce supports "streaming".
Example:
C = STREAM A THROUGH 'cut -f 2';
This uses the Unix cut command to extract the second field of each tuple in A.
Grouping and Joining
 GROUP:
The GROUP operator is used to group the data in one relation. It collects the data having the same key.
Syntax: relation-name2 = GROUP relation-name1 BY key;

 GROUP ALL − This form is used to aggregate all tuples into a single group.
Syntax: relation-name2 = GROUP relation-name1 ALL;

 COGROUP − To group the data in two or more relations.
Syntax: relation-name3 = COGROUP relation-name1 BY key, relation-name2 BY key, …;

 CROSS − To create the cross product of two or more relations.
Syntax: relation-name1 = CROSS relation-name2, relation-name3, …;
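A small sketch of these grouping operators (the relation and field names are assumptions):
grouped = GROUP students BY city;
all_students = GROUP students ALL;
cogrouped = COGROUP students BY id, orders BY customer_id;
crossed = CROSS students, orders;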
JOIN
It is used to combine records from two or more relations. It is of the following types:
1. Inner Join
2. Outer Join
Inner Join:
An inner join returns those rows of the relations whose join predicate is matched.
Syntax:
relation-name3 = JOIN relation-name1 BY column_name, relation-name2 BY column_name;
Outer Join:
It returns all the rows from at least one of the relations. It can be carried out in three ways:
 Left Outer Join
 Right Outer Join
 Full Outer Join
Left Outer Join − returns all the rows from the left relation, even if there are no matches in the right relation.
Syntax:
relation-name1 = JOIN relation-name2 BY key LEFT OUTER, relation-name3 BY key;
Right Outer Join − returns all the rows from the right relation, even if there are no matches in the left relation.
Syntax:
relation-name1 = JOIN relation-name2 BY key RIGHT OUTER, relation-name3 BY key;
Full Outer Join − returns rows from both relations, even when there is no match.
Syntax:
relation-name1 = JOIN relation-name2 BY key FULL OUTER, relation-name3 BY key;
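For example, a left outer join that keeps all customers, matched or not (a sketch; the relations and keys are assumptions):
C = JOIN customers BY id LEFT OUTER, orders BY customer_id;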
Sorting
 ORDER BY − To arrange a relation in sorted order based on one or more fields (ascending or descending).
Syntax: relation-name2 = ORDER relation-name1 BY key (ASC/DESC);
 LIMIT − To get a limited number of tuples from a relation.
Syntax: relation-name2 = LIMIT relation-name1 required-no-of-tuples;
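For example, the five oldest students (a sketch; the student relation and age field are assumptions):
B = ORDER student BY age DESC;
C = LIMIT B 5;
DUMP C;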
Combining and Splitting
 UNION − To combine two or more relations into a single relation.
Syntax:
relation-name3 = UNION relation-name1, relation-name2;
 SPLIT − To split a single relation into two or more relations.
Syntax:
SPLIT relation-name1 INTO relation-name2 IF (condition1), relation-name3 IF (condition2);
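For example, splitting a relation on an age condition (a sketch; the relation and field are assumptions):
SPLIT student INTO minors IF age < 18, adults IF age >= 18;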
Diagnostic Operators
 DUMP − The DUMP operator is used to run Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.
Syntax: grunt> DUMP relation-name;
Example: DUMP student;
 Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS.
 DESCRIBE − Used to view the schema of a relation or alias.
Syntax: DESCRIBE relation-name;
Example: grunt> DESCRIBE student;
Output: student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }
Diagnostic Operators (Cont..)
 EXPLAIN − To view the logical, physical, and MapReduce execution plans used to compute a relation.
 This is helpful for understanding how Pig compiles each command into MapReduce jobs.
Syntax: EXPLAIN relation-name;
Example: EXPLAIN student;
 Logical plan − contains the pipeline of operators that needs to be executed.
Diagnostic Operators (Cont..)
• Physical plan − specifies how the logical operators are converted into backend-specific physical operators.
• MapReduce execution plan − shows how the physical operators are grouped together to form the MapReduce jobs.
 ILLUSTRATE −
 To view the step-by-step execution of a Pig script.
 It is used to test a script with a small sample of data.
Syntax: ILLUSTRATE relation-name;
Example: ILLUSTRATE student;
Boolean Operators
 and − AND operation
 or − OR operation
 not − NOT operation
 Pig does not support a boolean data type. However, the result of a boolean expression (an expression that includes boolean and comparison operators) is always of type boolean (true or false).

Cast Operators
 Cast operators enable you to cast or convert data from one type to another, as long as the conversion is supported.
 For example, suppose you have an integer value that you want to treat as a string. In the example below, the result of COUNT is cast to chararray:
 A = LOAD 'data' AS (f1:int, f2:int, f3:int);
 DUMP A;
 O/P: (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)

 B = GROUP A BY f1;
 DUMP B;
 (1,{(1,2,3)}) (4,{(4,2,1),(4,3,3)}) (7,{(7,2,5)}) (8,{(8,3,4),(8,4,3)})

 DESCRIBE B;
 O/P: B: {group: int, A: {f1: int, f2: int, f3: int}}

 X = FOREACH B GENERATE group, (chararray)COUNT(A) AS total;
 DUMP X;
 O/P: (1,1) (4,2) (7,1) (8,2)


Disambiguate Operator
 Use the disambiguate operator ( :: ) to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators.
 In this example, to disambiguate y, use A::y or B::y. In cases where there is no ambiguity, such as z, the :: is not necessary but is still supported.

A = load 'data1' as (x, y);
B = load 'data2' as (x, y, z);
C = join A by x, B by x;
D = foreach C generate y; -- which y?
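To resolve the ambiguity, qualify the field with the relation name:
D = foreach C generate A::y; -- the y that came from relation A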
Flatten Operator
 The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the same, but the operation and result are different for each type of structure.
 For tuples, flatten substitutes the fields of a tuple in place of the tuple. For example, consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).
grunt> cat empty.bag
grunt> A = LOAD 'empty.bag' AS (b : bag{}, i : int);
grunt> B = FOREACH A GENERATE FLATTEN(b), i;
grunt> DUMP B;
grunt>
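A sketch of FLATTEN applied to a tuple field (the file name and schema are assumptions):
A = LOAD 'data' AS (a:chararray, t:tuple(b:chararray, c:chararray));
B = FOREACH A GENERATE a, FLATTEN(t); -- (a, (b, c)) becomes (a, b, c)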
Built-in Functions
• AVG − Computes the average of the numeric values in a single-column bag.
Syntax: AVG(expression)
• COUNT − Computes the number of elements in a bag.
Syntax: COUNT(expression)
• DIFF − Compares two fields in a tuple.
Syntax: DIFF(expression, expression)
• CONCAT − Concatenates two expressions of identical type.
Syntax: CONCAT(expression, expression)
Example: FOREACH A GENERATE CONCAT(first_name, second_name);
• MAX − Computes the maximum of the numeric values or chararrays in a single-column bag. It requires a preceding GROUP statement for group maximums.
Syntax: MAX(expression)
• MIN − Computes the minimum of the numeric values or chararrays in a single-column bag.
Syntax: MIN(expression)
• SIZE − Computes the number of elements based on the data type. It includes null values also.
Syntax: SIZE(expression)
Example: FOREACH A GENERATE SIZE(name);
• SUM − Computes the sum of the numeric values in a single-column bag.
Syntax: SUM(expression)
• TOKENIZE − Splits a string in a single tuple (which contains a group of words) and outputs a bag of words.
Syntax: TOKENIZE(expression)
• ISEMPTY − Checks whether a bag or map is empty.
Syntax: ISEMPTY(expression)
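For example, AVG and MAX over a grouped relation (a sketch; the file name and fields are assumptions):
grades = LOAD 'grades.txt' USING PigStorage(',') AS (name:chararray, score:int);
grp = GROUP grades ALL;
stats = FOREACH grp GENERATE AVG(grades.score), MAX(grades.score);
DUMP stats;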


Comments
• Pig Latin has two types of comments: SQL-style single-line comments (--) and Java-style multiline comments (/* */).
For example:
--this is a single-line comment
A = load 'foo';
/* This is a multiline comment. */
B = load /* a comment in the middle */'bar';
WordCount Program using Pig Latin
Steps involved to find the number of occurrences of the words in a file using a Pig script:
Cont..
• The loaded line (here, "This is a hadoop class hadoop is a bigdata technology") has to be converted into multiple rows, one word per row, like below:

(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)
Complete Pig Script for WordCount Program

Note:
You can see that with just 5 lines of Pig program, we have solved the word count problem very easily.
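A minimal sketch of the classic five-line word-count script (assuming the input file is 'input.txt'):

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;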
Calculate maximum recorded temperature by year for a weather dataset in Pig Latin:

A = load 'weather' using PigStorage(',') as (year:chararray, temp:int);
B = group A by year;
C = foreach B generate group, MAX(A.temp);
store C into 'wout.txt';
Using Pig Latin, order the movies based on rating and display the results.

A = load 'movie' using PigStorage(',') as (id:int, name:chararray, year:int, rating:double, duration:int);
B = distinct A;
C = order B by rating;
DUMP C;
Pig Execution Modes (or) Evaluating Local and Distributed Modes of Running Pig Scripts
Apache Pig can be run in two modes:

1. Local Mode (local execution environment)
2. Hadoop Mode (distributed execution environment)

Local Mode:
 In this mode, all the files are installed and run from your local host and local file system.
 Pig executes in a single JVM.
 There is no need for Hadoop or HDFS.
 This mode is generally used for developing and testing Pig logic.
 If you're using a small set of data or testing your code, local mode can be faster than going through the MapReduce infrastructure.
 To start the local mode of execution, the following command is used:

$ pig -x local
MapReduce Mode (or) Distributed Mode
• In this mode, Apache Pig takes its input from HDFS paths only, and after processing the data it puts the output files on HDFS.
• In MapReduce mode of execution, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
• In this mode, whenever we execute Pig Latin statements to process data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in HDFS.
• MapReduce mode with a fully distributed cluster is useful for running Pig on large data sets.
Syntax:
To start the MapReduce mode of execution, the following command is used:
• pig -x mapreduce (or) pig
Apache Pig Execution Mechanisms (Pig Script Interfaces)
Apache Pig scripts can be executed in three ways, namely interactive mode, batch mode, and embedded mode.

Interactive Mode (Grunt shell) − Grunt acts as a command interpreter.
 It is Pig's interactive shell, which is used to execute Pig scripts.
 Simply put, interactive means coding and executing the script line by line.
 You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using the DUMP operator). This method is useful for initial development.
Batch Mode (Script)
 In batch mode, all statements are coded in a single file with the extension .pig and the file is executed as a whole.
 This mode uses a single file containing Pig Latin commands, identified by the .pig suffix (FlightData.pig, for example).
 Ending your Pig program with the .pig extension is a convention but not required.
 The commands are interpreted by the Pig Latin compiler and executed in the order determined by the Pig optimizer.
Cont..

• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (user-defined functions) in programming languages such as Java, and using them in our scripts.
• It is useful for executing Pig programs from a Java program.
Applications of Apache Pig:
 Pig scripting is used for exploring large datasets.
 It provides support for ad-hoc queries across large datasets.
 It is used in prototyping algorithms for processing large datasets.
 It is used to process time-sensitive data loads.
 It is used for collecting large amounts of data in the form of search logs and web logs (e.g., error logs).
 It is used where analytical insights are needed via sampling.
Pig User Defined Functions (UDF)
• To support custom processing, Pig provides user-defined functions (UDFs). Thus, Pig allows us to create our own functions. Currently, Pig UDFs can be implemented in the following programming languages:
– Java
– Python
– Jython
– JavaScript
– Ruby
– Groovy
• Among all these languages, Pig provides the most extensive support for Java functions.
Example of Pig UDF
• In Pig:
– All UDFs must extend "org.apache.pig.EvalFunc".
– All functions must override the "exec" method.
• In Apache Pig, we also have a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own UDFs.
Create a simple EVAL function to convert the provided string to uppercase:

package com.hadoop;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TestUpper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            // take the first field of the tuple and upper-case it
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
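To use the function, package the class into a jar, register it, and call it by its fully qualified name (a sketch; the jar name is an assumption):

REGISTER testupper.jar;
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray);
B = FOREACH A GENERATE com.hadoop.TestUpper(firstname);
DUMP B;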
Tutorial Questions
1. a) Consider the Departmental Stores data file (stores.txt) in the following
format customerName, deptName, purchaseAmount.
i) Write a Pig script to list total sales per departmental store.
ii) Write a Pig script to list total sales per
customer.
b) Explain the following operators in Pig Latin.
i) flatten operator ii) Relational operators
2. a) Explain the architecture of Apache Pig with neat sketch.
b) Explain about the complex data types in Pig Latin.
3. a) Explain about Pig Latin data model and its data types.
b) Write about the three key design principles of Pig Latin
c) Write about Apache Pig execution modes and mechanism.
4. a) Write the major differences between Apache Pig and SQL
b) List and Explain various operators of Pig Latin.
5. a) Explain the principles to be considered while writing the Pig Scripts
b) Describe two modes for running scripts in Pig
6. a) How can you run the Pig scripts in Local and Distributed mode
b) Write the syntax of a Pig program with suitable example.
7. a)Discuss in brief about the operators supported by PIG with respect to data
access and debugging operations.
b) Explain in brief about the scripting in PIG with suitable example.
8. a) Draw and explain architecture of APACHE PIG in detail.
b) Discuss how Pig data model will help in effective data flow
9. a) List any five commands of pig script.
b) Discuss Pig Latin Application Flow
10 a) Discuss the various data types in Pig.
b) Write a word count program in Pig to count the occurrence of similar words
in a file.
11. a) How the pig programs can be packaged and explain the modes of running a
pig script with a neat sketch.
b) List and explain the relational operators in Pig.
12. a) Write the general Pig Latin program/flow organization.
b) Consider the student data file (st.txt), with data in the following format: Name, District, Age, Gender.
i) Write a Pig script to display the names of all female students.
ii) Write a Pig script to find the number of students from Prakasham district.
iii) Write a Pig script to display the district-wise count of all male students.

13. a) List the relational operators in Pig?


b) What are the components of Pig Execution Environment?
Example: Pig Script_1
Consider the Departmental Stores data file
(stores.txt) in the following format:
customerName, deptName, purchaseAmount.
i) Write a Pig script to list total sales per
departmental store.
ii) Write a Pig script to list total sales per
customer.
Example: stores.txt
• customerName deptName PurchaseAmount
A S 3.3
A S 4.7
B S 1.2
B T 3.4
C Z 1.1
C T 5.5
D R 1.1
List total sales per department store:
• grunt>
data = LOAD 'Documents/stores.txt' USING PigStorage(',') AS (customerName:chararray, deptName:chararray, purchaseAmount:float);
grp = GROUP data BY deptName;
result = FOREACH grp GENERATE group, COUNT(data.deptName), (float)SUM(data.purchaseAmount);
DUMP result;
Output
• The output has dept. store, customer count, total sales.

(R,1,1.1)
(S,3,9.2)
(T,2,8.9)
(Z,1,1.1)
List total sales per customer
• grunt>
data = LOAD 'Documents/stores.txt' USING PigStorage(',') AS (customerName:chararray, deptName:chararray, purchaseAmount:float);
grp = GROUP data BY customerName;
result = FOREACH grp GENERATE group, COUNT(data.customerName), (float)SUM(data.purchaseAmount);
DUMP result;
Output
• The output has customer id, total transactions per customer, total sales.

(A,2,8.0)
(B,2,4.6000004)
(C,2,6.6)
(D,1,1.1)
Example Pig Script-2
Consider The student data File (st.txt), Data in the
following format Name, District, age, gender
i) Write a PIG script to Display Names of all
female students
ii) Write a PIG script to find the number of
Students from Prakasham District
iii) Write a PIG script to Display District wise
count of all male students.
st.txt
SID Name District Age Gender
1 Raju Guntur 28 Male
2 Prudhvi West Godavari 29 Male
3 Indra Prakasham 28 Male
4 Ramana Prakasham 27 Male
5 Nagarjuna Nellore 29 Male
6 Ravindra Krishna 30 Male
7 Jyothi Guntur 27 Female
8 Lahari West Godavari 26 Female
9 Hema Prakasham 27 Female
Write a PIG script to display the names of all female students

• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
fdata = FILTER data BY gender == 'Female';
result = FOREACH fdata GENERATE name;
DUMP result;
Write a PIG script to find the number of students from Prakasham district

• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
fdata = FILTER data BY district == 'Prakasham';
grp = GROUP fdata ALL;
std_count = FOREACH grp GENERATE COUNT(fdata);
DUMP std_count;
Write a PIG script to display the district-wise count of all male students.

• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
fdata = FILTER data BY gender == 'Male';
stdgrp = GROUP fdata BY district;
std_count = FOREACH stdgrp GENERATE group, COUNT(fdata);
DUMP std_count;
Using Join Operation
customers = LOAD 'customer' USING PigStorage(',') AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'orders' USING PigStorage(',') AS (oid:int, date:chararray, customer_id:int, amount:int);
customer_orders = JOIN customers BY id, orders BY customer_id;
dump customer_orders;
To practice more problems on Pig scripts, go to the link below:
• http://howt2talkt2apachepig.blogspot.com/
