Basic
Table of contents
1 Conventions
2 Reserved Keywords
3 Case Sensitivity
4 Data Types and More
5 Arithmetic Operators and More
6 Relational Operators
7 UDF Statements
1 Conventions
Conventions for the syntax and code examples in the Pig Latin Reference Manual are
described here.
Convention Description Example
2 Reserved Keywords
Pig reserved keywords are listed here.
-- A assert, and, any, all, arrange, as, asc, AVG
Copyright © 2007 The Apache Software Foundation. All rights reserved. Page 2
Pig Latin Basics
-- G generate, group
-- H help
-- J join
-- K kill
-- N not, null
-- Q quit
-- U union, using
-- V, W, X, Y, Z void
3 Case Sensitivity
The names (aliases) of relations and fields are case sensitive. The names of Pig Latin
functions are case sensitive. The names of parameters (see Parameter Substitution) and all
other Pig Latin keywords (see Reserved Keywords) are case insensitive.
4.1 Identifiers
Identifiers include the names of relations (aliases), fields, variables, and so on. In Pig,
identifiers start with a letter and can be followed by any number of letters, digits, or
underscores.
Valid identifiers:
A
A123
abc_123_BeX_
Invalid identifiers:
_A123
abc_$
A!B
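The identifier rule above can be checked mechanically. The following Python sketch is illustrative only: the regular expression is derived from the sentence above and assumes ASCII letters, and `is_valid_identifier` is a hypothetical helper, not part of Pig.

```python
import re

# Assumed pattern: one letter, then any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_identifier(name: str) -> bool:
    """Return True when the whole name matches the identifier rule."""
    return IDENTIFIER.fullmatch(name) is not None


# The valid and invalid samples listed above.
valid_samples = ["A", "A123", "abc_123_BeX_"]
invalid_samples = ["_A123", "abc_$", "A!B"]
```

Each name in `valid_samples` passes the check and each name in `invalid_samples` fails it, either because it starts with an underscore or because it contains a character outside the allowed set.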
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database,
where the tuples in the bag correspond to the rows in a table. Unlike a relational table,
however, Pig relations don't require that every tuple contain the same number of fields or that
the fields in the same position (column) have the same type.
Also note that relations are unordered which means there is no guarantee that tuples are
processed in any particular order. Furthermore, processing may be parallelized in which case
tuples are not processed according to any total ordering.
Relations are referred to by name (or alias). Names are assigned by you as part of the Pig
Latin statement. In this example the name (alias) of the relation is A.
You can assign an alias to another alias. The new alias can be used in place of the original
alias to refer to the original relation.
Positional notation $0 $1 $2
(generated by system)
As shown in this example when you assign names to fields (using the AS schema clause) you
can still refer to the fields using positional notation. However, for debugging purposes and
ease of comprehension, it is better to use field names.
In this example an error is generated because the requested column ($3) is outside of the
declared schema (positional notation begins with $0). Note that the error is caught before the
statements are executed.
As noted, the fields in a tuple can be any data type, including the complex data types: bags,
tuples, and maps.
• Use the schemas for complex data types to name fields that are complex data types.
• Use the dereference operators to reference and work with fields that are complex data
types.
In this example the data file contains tuples. A schema for complex data types (in this case,
tuples) is used to load the data. Then, dereference operators (the dot in t1.t1a and t2.$0) are
used to access the fields in the tuples. Note that when you assign names to fields you can still
refer to these fields using positional notation.
cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
DUMP X;
(3,4)
(1,3)
(2,9)
Complex Types
• If a schema is defined as part of a load statement, the load function will attempt to
enforce the schema. If the data does not conform to the schema, the loader will generate a
null value or an error.
• If an explicit cast is not supported, an error will occur. For example, you cannot cast a
chararray to int.
• If Pig cannot resolve incompatible types through implicit casts, an error will occur.
For example, you cannot add chararray and float (see the Types Table for addition and
subtraction).
4.3.2 Tuple
4.3.2.1 Syntax
( field [, field …] )
4.3.2.2 Terms
4.3.2.3 Usage
You can think of a tuple as a row with one or more fields, where each field can be any
data type and any field may or may not have data. If a field has no data, then the following
happens:
• In a load statement, the loader will inject null into the tuple. The actual value that is
substituted for null is loader specific; for example, PigStorage substitutes an empty field
for null.
• In a non-load statement, if a requested field is missing from a tuple, Pig will inject null.
4.3.2.4 Example
4.3.3 Bag
{ tuple [, tuple …] }
4.3.3.2 Terms
tuple A tuple.
4.3.3.3 Usage
In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.
X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
4.3.4 Map
4.3.4.2 Terms
4.3.4.3 Usage
4.3.4.4 Example
Pig Latin operators and functions interact with nulls as shown in this table.
Operator Interaction
Comparison operator (matches): If either the string being matched against or the string
defining the match is null, the result is null.
% modulo
? : bincond
CASE : case
Null operator (is not null): If the tested value is not null, returns true; otherwise,
returns false (see Null Operators).
Cast operator: Casting a null from one type to another type results in a null.
For Boolean subexpressions, note the results when nulls are used with these operators:
• FILTER operator – If a filter expression results in a null value, the filter does not pass
the record through (if X is null, !X is also null, and the filter will reject both).
• Bincond operator – If a Boolean subexpression results in a null value, the resulting
expression is null (see the interactions above for Arithmetic operators).
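The FILTER rule above is a form of three-valued logic. The Python sketch below models it with None standing in for null; `pig_not` and `filter_keeps` are hypothetical names used for illustration, not Pig functions.

```python
from typing import Optional


def pig_not(value: Optional[bool]) -> Optional[bool]:
    """Negation where null stays null."""
    return None if value is None else not value


def filter_keeps(condition: Optional[bool]) -> bool:
    """A record passes only when the condition is true (not false, not null)."""
    return condition is True


# A null condition is rejected, and so is its negation.
x = None
kept_by_x = filter_keeps(x)
kept_by_not_x = filter_keeps(pig_not(x))
```

Both `kept_by_x` and `kept_by_not_x` are False, which is why filtering on X and on !X does not partition the input when X can be null.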
In this example of an outer join, if the join key is missing from a table it is replaced by null.
Like any other expression, null constants can be implicitly or explicitly cast.
In this example both a and null will be implicitly cast to double.
In this example both a and null will be cast to int, a implicitly, and null explicitly.
As noted, nulls can be the result of an operation. These operations can produce null values:
• Division by zero
• Returns from user defined functions (UDFs)
• Dereferencing a field that does not exist.
• Dereferencing a key that does not exist in a map. For example, given a map, info,
containing [name#john, phone#5551212] if a user tries to use info#address a null is
returned.
• Accessing a field that does not exist in a tuple.
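Several of the null-producing operations listed above can be modeled in a few lines. This Python sketch is an illustration under stated assumptions: None stands in for null, a dict stands in for a Pig map, and `pig_divide`, `map_lookup`, and `field_at` are hypothetical helpers.

```python
def pig_divide(a, b):
    """Division that yields null (None) on a zero divisor instead of raising."""
    if b == 0:
        return None
    return a / b


def map_lookup(mapping, key):
    """Dereferencing a key that is absent from the map yields null."""
    return mapping.get(key)


def field_at(tup, position):
    """Accessing a position that does not exist in the tuple yields null."""
    return tup[position] if 0 <= position < len(tup) else None


# The map from the text above: [name#john, phone#5551212].
info = {"name": "john", "phone": "5551212"}
missing_address = map_lookup(info, "address")
```

Here `missing_address` is None, mirroring the info#address case described in the list.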
cat data;
2 3
4
7 8 9
DUMP A;
(,2,3)
(4,,)
(7,8,9)
DUMP B;
(,2)
(4,)
(7,8)
As noted, nulls can occur naturally in the data. If nulls are part of the data, it is the
responsibility of the load function to handle them correctly. Keep in mind that what is
considered a null value is loader-specific; however, the load function should always
communicate null values to Pig by producing Java nulls.
The Pig Latin load functions (for example, PigStorage and TextLoader) produce null values
wherever data is missing. For example, empty strings (chararrays) are not loaded; instead,
they are replaced by nulls.
PigStorage is the default load function for the LOAD operator. In this example the is not null
operator is used to filter names with null values.
When using the GROUP operator with a single relation, records with a null group key are
grouped together.
X = group A by age;
dump X;
(18,{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)})
When using the GROUP (COGROUP) operator with multiple relations, records with a null
group key from different relations are considered different and are grouped separately. In the
example below note that there are two tuples in the output corresponding to the null group
key: one that contains tuples from relation A (but not relation B) and one that contains tuples
from relation B (but not relation A).
The JOIN operator - when performing inner joins - adheres to the SQL standard and
disregards (filters out) null values. (See also Drop Nulls Before a Join.)
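The inner-join rule can be sketched as follows. This Python model is only an illustration: relations are lists of tuples, None represents null, the sample rows are invented for the example, and `inner_join` with its key-position parameters is a hypothetical helper rather than Pig's implementation.

```python
def inner_join(left, right, left_key, right_key):
    """Join two lists of tuples on the given positions, skipping null keys."""
    joined = []
    for l_row in left:
        if l_row[left_key] is None:
            continue  # a null key never matches anything
        for r_row in right:
            if r_row[right_key] is not None and l_row[left_key] == r_row[right_key]:
                joined.append(l_row + r_row)
    return joined


# Hypothetical sample data: (name, age) joined with (age, label).
people = [("joe", 18), ("sam", None)]
labels = [(18, "adult"), (None, "unknown")]
result = inner_join(people, labels, left_key=1, right_key=0)
```

Only the row for joe survives; the rows with null keys on either side are dropped rather than matched to each other.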
4.5 Constants
Pig provides constant representations for all data types except bytearrays.
Constant Example Notes
int 19
long 19L
biginteger 19211921192119211921BI
bigdecimal 192119211921.192119211921BD
The data type definitions for tuples, bags, and maps apply to constants:
• A tuple can contain fields of any data type
• A bag is a collection of tuples
• A map key must be a chararray; a map value can be any data type
Complex constants (either with or without values) can be used in the same places scalar
constants can be used; that is, in FILTER and GENERATE statements.
4.6 Expressions
In Pig Latin, expressions are language constructs used with the FILTER, FOREACH,
GROUP, and SPLIT operators as well as the eval functions.
Expressions are written in conventional mathematical infix notation and are adapted to the
UTF-8 character set. Depending on the context, expressions can include:
• Any Pig data type (simple data types, complex data types)
• Any Pig operator (arithmetic, comparison, null, boolean, dereference, sign, and cast)
• Any Pig built in function.
• Any user defined function (UDF) written in Java.
In Pig Latin,
• An arithmetic expression could look like this:
X = GROUP A BY f2*f3;
• A string expression could look like this, where a and b are both chararrays:
Star expressions ( * ) can be used to represent all the fields of a tuple. A star expression is
equivalent to writing out the fields explicitly. In the following example the definitions of B
and C are exactly the same, and MyUDF will be invoked with exactly the same arguments in both
cases.
A common error when using the star expression is shown below. In this example, the
programmer really wants to count the number of elements in the bag in the second field:
COUNT($1).
G = GROUP A BY $0;
C = FOREACH G GENERATE COUNT(*)
There are some restrictions on use of the star expression when the input schema is unknown
(null):
• For GROUP/COGROUP, you can't include a star expression in a GROUP BY column.
• For ORDER BY, if you have project-star as ORDER BY column, you can’t have any
other ORDER BY column in that statement.
Project-range ( .. ) expressions can be used to project a range of columns from input. For
example:
• .. $x : projects columns $0 through $x, inclusive
• $x .. : projects columns $x through end, inclusive
• $x .. $y : projects columns $x through $y, inclusive
If the input relation has a schema, you can refer to columns by alias rather than by column
position. You can also combine aliases and column positions in an expression; for example,
"col1 .. $5" is valid.
Project-range can be used in all cases where the star expression ( * ) is allowed.
Project-range can be used in the following statements: FOREACH, JOIN, GROUP,
COGROUP, and ORDER BY (also when ORDER BY is used within a nested FOREACH
block).
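The three project-range forms can be modeled as inclusive slices over a tuple. The Python sketch below is illustrative only; `project_range` is a hypothetical helper that takes column positions, and it does not cover the alias-based forms.

```python
def project_range(row, start=None, end=None):
    """Return the columns from start through end, both inclusive.

    start=None models the ".. $x" form and end=None models the "$x .." form.
    """
    first = 0 if start is None else start
    last = len(row) - 1 if end is None else end
    return row[first:last + 1]


row = ("c0", "c1", "c2", "c3", "c4")
up_to_2 = project_range(row, end=2)            # .. $2
from_3 = project_range(row, start=3)           # $3 ..
one_to_3 = project_range(row, start=1, end=3)  # $1 .. $3
```

Note that both ends are inclusive, unlike a plain Python slice, which is why the helper adds one to the upper bound.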
A few examples are shown here:
.....
grunt> F = foreach IN generate (int)col0, col1 .. col3;
grunt> describe F;
F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
.....
.....
grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
.....
.....
J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
.....
.....
g = group l1 by b .. c;
.....
There are some restrictions on the use of the project-to-end form of project-range (e.g., "x .. ")
when the input schema is unknown (null):
• For GROUP/COGROUP, the project-to-end form of project-range is not allowed.
• For ORDER BY, the project-to-end form of project-range is supported only as the last
sort column.
.....
grunt> describe IN;
Schema for IN unknown.
Boolean expressions can be made up of UDFs that return a boolean value or boolean
operators (see Boolean Operators).
Tuple expressions form subexpressions into tuples. The tuple expression has the form
(expression [, expression …]), where expression is a general expression. The simplest tuple
expression is the star expression, which represents all fields.
General expressions can be made up of UDFs and almost any operator. Since Pig does not
consider boolean a base type, the result of a general expression cannot be a boolean. Field
expressions are the simplest general expressions.
4.7 Schemas
Schemas enable you to assign names to fields and declare types for fields. Schemas are
optional but we encourage you to use them whenever possible; type declarations result in
better parse-time error checking and more efficient code execution.
Schemas for simple types and complex types can be used anywhere a schema definition is
appropriate.
Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS
clause. If you define a schema using the LOAD operator, then it is the load function that
enforces the schema (see LOAD and User Defined Functions for more information).
Known Schema Handling
Note the following:
• You can define a schema that includes both the field name and field type.
• You can define a schema that includes the field name only; in this case, the field type
defaults to bytearray.
• You can choose not to define a schema; in this case, the field is un-named and the field
type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional
notation. If you don't assign a name to a field (the field is un-named) you can only refer to
the field using positional notation.
If you assign a type to a field, you can subsequently change the type using the cast operators.
If you don't assign a type to a field, the field defaults to bytearray; you can change the default
type using the cast operators.
Unknown Schema Handling
Note the following:
• When you JOIN/COGROUP/CROSS multiple relations, if any relation has an unknown
schema (or no defined schema, also referred to as a null schema), the schema for the
resulting relation is null.
• If you FLATTEN a bag with empty inner schema, the schema for the resulting relation is
null.
• If you UNION two relations with incompatible schemas, the schema for the resulting relation
is null.
• If the schema is null, Pig treats all fields as bytearray (in the backend, Pig will determine
the real type for the fields dynamically)
See the examples below. If a field's data type is not specified, Pig will use bytearray to
denote an unknown type. If the number of fields is not known, Pig will derive an unknown
schema.
If you do DESCRIBE on B, you will see a single column of type double. This is because Pig
makes the safest choice and uses the largest numeric type when the schema is not known. In
practice, the input data could contain integer values; however, Pig will cast the data to double
and make sure that a double result is returned.
If the schema of a relation can’t be inferred, Pig will just use the runtime data as is and
propagate it through the pipeline.
With LOAD and STREAM operators, the schema following the AS keyword must be
enclosed in parentheses.
In this example the LOAD statement includes a schema definition for simple data types.
With FOREACH operators, the schema following the AS keyword must be enclosed in
parentheses when the FLATTEN operator is used. Otherwise, the schema should not be
enclosed in parentheses.
In this example the FOREACH statement includes FLATTEN and a schema for simple data
types.
In this example the FOREACH statement includes a schema for a simple expression.
In this example the FOREACH statement includes schemas for multiple fields.
Simple data types include int, long, float, double, chararray, bytearray, boolean, datetime,
biginteger and bigdecimal.
4.7.3.1 Syntax
( alias[:type] [, alias[:type] …] )
4.7.3.2 Terms
4.7.3.3 Examples
cat student;
John 18 4.0
Mary 19 3.8
Bill 20 3.9
Joe 18 3.8
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
In this example field "gpa" will default to bytearray because no type is declared.
cat student;
John 18 4.0
Mary 19 3.8
Bill 20 3.9
Joe 18 3.8
DESCRIBE A;
A: {name: chararray,age: int,gpa: bytearray}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
4.7.5.1 Syntax
4.7.5.2 Terms
4.7.5.3 Examples
In this example the schema defines one tuple. The load statements are equivalent.
cat data;
(3,8,9)
(1,4,7)
(2,5,8)
DESCRIBE A;
A: {T: (f1: int,f2: int,f3: int)}
DUMP A;
((3,8,9))
((1,4,7))
((2,5,8))
cat data;
(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
DUMP A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
4.7.6.1 Syntax
alias[:bag] {tuple}
4.7.6.2 Terms
4.7.6.3 Examples
In this example the schema defines a bag. The two load statements are equivalent.
cat data;
{(3,8,9)}
{(1,4,7)}
{(2,5,8)}
DESCRIBE A;
A: {B: {T: (t1: int,t2: int,t3: int)}}
DUMP A;
({(3,8,9)})
({(1,4,7)})
({(2,5,8)})
alias<:map> [ <type> ]
4.7.7.2 Terms
4.7.7.3 Examples
In this example the schema defines an untyped map (the map values default to bytearray).
The load statements are equivalent.
cat data;
[open#apache]
[apache#hadoop]
DESCRIBE A;
a: {M: map[ ]}
DUMP A;
([open#apache])
([apache#hadoop])
/* The MapLookup of a typed map will result in a datatype of the map value */
a = load '1.txt' as(map[int]);
b = foreach a generate $0#'key';
/* Schema for b */
b: {int}
You can define schemas for data that includes multiple types.
4.7.8.1 Example
There is a shortcut form to reference the relation on the previous line of a pig script or grunt
session:
5.1.1 Description
addition +
subtraction -
multiplication *
division /
case CASE WHEN THEN ELSE END
    CASE expression [ WHEN value THEN value ]+ [ ELSE value ]? END
    CASE [ WHEN condition THEN value ]+ [ ELSE value ]? END
    Case operator is equivalent to nested bincond operators.
    The schemas for all the outputs of the when/else branches should match.
    Use expressions only (relational operators are not allowed).
5.1.1.1 Examples
DUMP A;
(10,1,{(2,3),(4,6)})
(10,3,{(2,3),(4,6)})
(10,6,{(2,3),(4,6),(5,7)})
In this example the modulo operator is used with fields f1 and f2.
DUMP X;
(10,1,0)
(10,3,1)
(10,6,4)
In this example the bincond operator is used with fields f2 and B. The condition is "f2 equals
1"; if the condition is true, return 1; if the condition is false, return the count of the number of
tuples in B.
DUMP X;
(1,1L)
(3,2L)
(6,3L)
In this example the case operator is used with field f2. The expression is "f2 % 2"; if the
expression is equal to 0, return 'even'; if the expression is equal to 1, return 'odd'.
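The three examples above can be reproduced outside Pig to check the arithmetic. This Python sketch is a model, not Pig Latin: relation A is copied from the DUMP A output above, bags are represented as lists, and the output shape of the case example (f2 followed by the label) is an assumption because the original statement is not shown here.

```python
# Relation A as shown in DUMP A above: (f1, f2, B), where B is a bag of tuples.
A = [
    (10, 1, [(2, 3), (4, 6)]),
    (10, 3, [(2, 3), (4, 6)]),
    (10, 6, [(2, 3), (4, 6), (5, 7)]),
]

# Modulo: f1 % f2 appended as a third field.
modulo_result = [(f1, f2, f1 % f2) for f1, f2, _ in A]

# Bincond: if f2 equals 1 return 1, otherwise the number of tuples in B.
bincond_result = [(f2, 1 if f2 == 1 else len(bag)) for _, f2, bag in A]


def parity(f2):
    """Case on f2 % 2, written as nested conditionals (nested bincond style)."""
    remainder = f2 % 2
    return "even" if remainder == 0 else ("odd" if remainder == 1 else None)


case_result = [(f2, parity(f2)) for _, f2, _ in A]
```

The `parity` function shows the equivalence noted in the operator table: a case expression with two branches is the same as two nested bincond expressions, with null as the result when no branch matches.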
bag error error error error error error error error error
tuple not yet error error error error error error error
bytearray cast as
double
bag error error error not yet not yet not yet not yet error error
tuple error error not yet not yet not yet not yet error error
bytearray cast as
double
bytearray error
5.2.1 Description
AND and
OR or
NOT not
The result of a boolean expression (an expression that includes boolean and comparison
operators) is always of type boolean (true or false).
5.2.1.1 Example
X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1)) OR (f1 IN (9, 10, 11));
5.3.1 Description
from \ to  bag    tuple  map    int    long   float  double  chararray  bytearray  boolean
bag               error  error  error  error  error  error   error      error      error
tuple      error         error  error  error  error  error   error      error      error
map        error  error         error  error  error  error   error      error      error
int        error  error  error         yes    yes    yes     yes        error      error
long       error  error  error  yes           yes    yes     yes        error      error
float      error  error  error  yes    yes           yes     yes        error      error
double     error  error  error  yes    yes    yes            yes        error      error
chararray  error  error  error  yes    yes    yes    yes                error      yes
bytearray  yes    yes    yes    yes    yes    yes    yes     yes                   yes
boolean    error  error  error  error  error  error  error   yes        error
5.3.1.1 Syntax
5.3.1.2 Terms
5.3.1.3 Usage
Cast operators enable you to cast or convert data from one type to another, as long as
conversion is supported (see the table above). For example, suppose you have an integer
field, myint, which you want to convert to a string. You can cast this field from int to
chararray using (chararray)myint.
Please note the following:
• A field can be explicitly cast. Once cast, the field remains that type (it is not
automatically cast back). In this example $0 is explicitly cast to int.
• Where possible, Pig performs implicit casts. In this example $0 is cast to int (regardless
of underlying data) and $1 is cast to double.
• When two bytearrays are used in arithmetic expressions or a bytearray expression is used
with built in aggregate functions (such as SUM) they are implicitly cast to double. If the
underlying data is really int or long, you’ll get better performance by declaring the type
or explicitly casting the data.
• Downcasts may cause loss of data. For example, casting from long to int may drop bits.
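To see what "dropping bits" means in practice, the sketch below emulates a 64-bit to 32-bit narrowing in Python. It assumes the cast behaves like Java's narrowing primitive conversion, which keeps only the low 32 bits; that is an assumption about Pig's behavior, and `java_long_to_int` is a hypothetical helper.

```python
def java_long_to_int(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low


# 2**32 + 1 does not fit in an int; only the low bits survive.
wrapped = java_long_to_int(4294967297)
# Values inside the int range are unchanged.
unchanged = java_long_to_int(42)
```

Under this assumption a value just above the int range silently becomes a small or negative number, so it is worth checking the range of the data before downcasting.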
5.3.2 Examples
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY f1;
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
DESCRIBE B;
B: {group: int,A: {f1: int,f2: int,f3: int}}
(8,2)
DESCRIBE X;
X: {group: int,total: chararray}
cat data;
(1,2,3)
(4,2,1)
(8,3,4)
DESCRIBE A;
a: {fld: bytearray}
DUMP A;
((1,2,3))
((4,2,1))
((8,3,4))
DESCRIBE B;
b: {(int,int,float)}
DUMP B;
((1,2,3))
((4,2,1))
((8,3,4))
cat data;
{(4829090493980522200L)}
{(4893298569862837493L)}
{(1297789302897398783L)}
DESCRIBE A;
A: {fld: bytearray}
DUMP A;
({(4829090493980522200L)})
({(4893298569862837493L)})
({(1297789302897398783L)})
DESCRIBE B;
B: {{(long)}}
DUMP B;
({(4829090493980522200L)})
({(4893298569862837493L)})
({(1297789302897398783L)})
cat data;
[open#apache]
[apache#hadoop]
[hadoop#pig]
[pig#grunt]
DESCRIBE A;
A: {fld: bytearray}
DUMP A;
([open#apache])
([apache#hadoop])
([hadoop#pig])
([pig#grunt])
DESCRIBE B;
B: {map[ ]}
DUMP B;
([open#apache])
([apache#hadoop])
([hadoop#pig])
([pig#grunt])
Pig allows you to cast the elements of a single-tuple relation into a scalar value. The tuple
can be a single-field or multi-field tuple. If the relation contains more than one tuple,
however, a runtime error is generated: "Scalar has more than one row in the output".
The cast relation can be used in any place where an expression of the type would make
sense, including FOREACH, FILTER, and SPLIT. Note that if an explicit cast is not used, an
implicit cast will be inserted according to Pig rules. Also, when the schema can't be inferred,
bytearray is used.
The primary use case for casting relations to scalars is the ability to use the values of global
aggregates in follow up computations.
In this example the percentage of clicks belonging to a particular user is computed. For the
FOREACH statement, an explicit cast is used. If the SUM is not given a name, a position can
be used as well (userid, clicks/(double)C.$0).
B = group A all;
C = foreach B generate SUM(A.clicks) as total;
D = foreach A generate userid, clicks/(double)C.total;
dump D;
In this example a multi-field tuple is used. For the FILTER statement, Pig performs an
implicit cast. For the FOREACH statement, an explicit cast is used.
5.4.1 Description
equal ==
not equal !=
5.4.2 Examples
Numeric Example
String Example
Matches Example
bag error error error error error error error error error error error error error
tuple boolean (see Note 1) error error error error error error error error error error error
map boolean (see Note 2) error error error error error error error error error error
biginteger boolean error
bigdecimal boolean
Note 1: boolean (Tuple A is equal to tuple B if they have the same size s, and for all 0 <= i <
s, A[i] == B[i])
Note 2: boolean (Map A is equal to map B if A and B have the same number of entries, and
for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that
k1 == k2 and v1 == v2)
bag error error error error error error error error error error error error error
tuple error error error error error error error error error error error error
map error error error error error error error error error error error
cast
as
chararray)
biginteger boolean error
bigdecimal boolean
5.5.1 Description
• For bags, every element is put in the bag; if the element is not a tuple Pig will create a
tuple for it:
  • Given {$1, $2}, Pig creates {($1), ($2)}, a bag with two tuples.
    Neither $1 nor $2 is a tuple, so Pig creates a tuple around each item.
  • Given {($1), $2}, Pig creates {($1), ($2)}, a bag with two tuples.
    Since ($1) is treated as $1 (one cannot create a single-element tuple using this
    syntax), {($1), $2} becomes {$1, $2} and Pig creates a tuple around each item.
  • Given {($1, $2)}, Pig creates {($1, $2)}, a bag with a single tuple.
    Pig creates a tuple ($1, $2) and then puts this tuple into the bag.
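The wrapping rule for bag construction can be modeled in a few lines of Python. This is a sketch only: a list stands in for a bag, and `make_bag` is a hypothetical helper. Conveniently, Python also treats a parenthesized single value such as ("joe") as the bare value, which parallels the ($1) case above.

```python
def make_bag(*items):
    """Build a bag (list) in which every element is a tuple."""
    return [item if isinstance(item, tuple) else (item,) for item in items]


# {$1, $2}: neither item is a tuple, so each one is wrapped.
two_wrapped = make_bag("joe smith", 20)

# {($1), $2}: ("joe smith") is just "joe smith", so the result is the same.
same_as_above = make_bag(("joe smith"), 20)

# {($1, $2)}: the item is already a tuple and goes into the bag as is.
single_tuple = make_bag(("joe smith", 20))
```

The first two calls produce a bag with two one-field tuples, and the third produces a bag with a single two-field tuple.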
5.5.2 Examples
Tuple Construction
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
(joe smith,20)
(amy chen,22)
(leo allen,18)
Bag Construction
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
{(joe smith,20)} {(joe smith),(20)}
{(amy chen,22)} {(amy chen),(22)}
{(leo allen,18)} {(leo allen),(18)}
Map Construction
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
[joe smith#3.5]
[amy chen#3.2]
[leo allen#2.1]
5.6.1 Description
5.6.2 Examples
Tuple Example
Suppose we have relation A.
DUMP A;
(1,(1,2,3))
(2,(4,5,6))
(3,(7,8,9))
(4,(1,4,7))
(5,(2,5,8))
In this example dereferencing is used to retrieve two fields from tuple f2.
DUMP X;
(1,3)
(4,6)
(7,9)
(1,7)
(2,8)
Bag Example
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for
information about the field names in relation B).
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY f1;
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
ILLUSTRATE B;
etc …
----------------------------------------------------------
| b | group: int | a: bag({f1: int,f2: int,f3: int}) |
----------------------------------------------------------
In this example dereferencing is used with relation X to project the first field (f1) of each
tuple in the bag (a).
DUMP X;
({(1)})
({(4),(4)})
({(7)})
({(8),(8)})
Tuple/Bag Example
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for
information about the field names in relation B).
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY (f1,f2);
DUMP B;
((1,2),{(1,2,3)})
((4,2),{(4,2,1)})
((4,3),{(4,3,3)})
((7,2),{(7,2,5)})
((8,3),{(8,3,4)})
((8,4),{(8,4,3)})
ILLUSTRATE B;
etc …
-------------------------------------------------------------------------------
| b | group: tuple({f1: int,f2: int}) | a: bag({f1: int,f2: int,f3: int}) |
-------------------------------------------------------------------------------
| | (8, 3) | {(8, 3, 4), (8, 3, 4)} |
-------------------------------------------------------------------------------
In this example dereferencing is used to project a field (f1) from a tuple (group) and a field
(f1) from a bag (a).
DUMP X;
(1,{(1)})
(4,{(4)})
(4,{(4)})
(7,{(7)})
(8,{(8)})
(8,{(8)})
Map Example
Suppose we have relation A.
DUMP A;
(1,[open#apache])
(2,[apache#hadoop])
(3,[hadoop#pig])
(4,[pig#grunt])
DUMP X;
(apache)
()
()
()
In cases where the schema is stored as part of the StoreFunc like PigStorage, JsonStorage,
AvroStorage or OrcStorage, users generally have to use an extra FOREACH before STORE
to rename the field names and remove the disambiguate operator from the names. To
automatically remove the disambiguate operator from the schema for the STORE operation,
the pig.store.schema.disambiguate Pig property can be set to "false". It is the responsibility of
the user to make sure that there is no conflict in the field names when using this setting.
5.9.1 Description
is null is null
5.9.2 Examples
The null operators can be applied to all data types (see Nulls and Pig Latin).
5.10.1 Description
5.10.2 Examples
bag error
tuple error
map error
int int
long long
float float
double double
chararray error
datetime error
biginteger biginteger
bigdecimal bigdecimal
6 Relational Operators
6.1 ASSERT
Assert a condition on the data.
6.1.1 Syntax
6.1.2 Terms
BY Required keyword.
6.1.3 Usage
Use assert to ensure a condition is true on your data. Processing fails if any of the records
violate the condition.
6.1.4 Examples
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Now you can assert that the a0 column in your data is greater than 0; processing fails otherwise.
6.2 COGROUP
See the GROUP operator.
6.3 CROSS
Computes the cross product of two or more relations.
6.3.1 Syntax
6.3.2 Terms
6.3.3 Usage
Use the CROSS operator to compute the cross product (Cartesian product) of two or more
relations.
CROSS is an expensive operation and should be used sparingly.
6.3.4 Example
DUMP A;
(1,2,3)
(4,2,1)
DUMP B;
(2,4)
(8,9)
(1,3)
X = CROSS A, B;
DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)
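The cross product above can be modeled outside Pig. The following is a minimal Python sketch, not Pig's implementation; the helper name cross is hypothetical, and relations are represented as plain lists of tuples.

```python
from itertools import product

def cross(*relations):
    """Model of CROSS: for every combination of one tuple per relation,
    concatenate the tuples into a single output tuple."""
    return [sum(combo, ()) for combo in product(*relations)]

A = [(1, 2, 3), (4, 2, 1)]
B = [(2, 4), (8, 9), (1, 3)]

X = cross(A, B)
for row in X:
    print(row)
```

The result has len(A) * len(B) tuples, which is why CROSS becomes expensive quickly as the inputs grow.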
6.4 CUBE
Performs cube/rollup operations.
The cube operation computes aggregates for all possible combinations of the specified group-by
dimensions. The number of group-by combinations generated by cube for n dimensions is
2^n.
6.4.3 Syntax
alias = CUBE alias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP
expression ] [PARALLEL n];
6.4.4 Terms
CUBE Keyword
BY Keyword
ROLLUP Keyword
6.4.5 Example
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with
cube operation will output
(car,2012,4000)
(car,,4000)
(,2012,4000)
(,,4000)
Note the second column, the ‘cube’ field, which is a bag of all tuples that belong to ‘group’.
Also note that the measure attribute ‘sales’, along with the other unused dimensions in the load
statement, is pushed down so that it can be referenced later while computing aggregates on
the measure, as in this case with SUM(cube.sales).
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with
rollup operation will output
(midwest,ohio,columbus,4000)
(midwest,ohio,,4000)
(midwest,,,4000)
(,,,4000)
If CUBE and ROLLUP operations are used together, the output groups will be the cross
product of all groups generated by the cube and rollup operations. If there are m dimensions in
the cube operation and n dimensions in the rollup operation, then the overall number of combinations
will be (2^m) * (n+1).
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with
cube and rollup operation will output
(car,2012,midwest,ohio,columbus,4000)
(car,2012,midwest,ohio,,4000)
(car,2012,midwest,,,4000)
(car,2012,,,,4000)
(car,,midwest,ohio,columbus,4000)
(car,,midwest,ohio,,4000)
(car,,midwest,,,4000)
(car,,,,,4000)
(,2012,midwest,ohio,columbus,4000)
(,2012,midwest,ohio,,4000)
(,2012,midwest,,,4000)
(,2012,,,,4000)
(,,midwest,ohio,columbus,4000)
(,,midwest,ohio,,4000)
(,,midwest,,,4000)
(,,,,,4000)
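The group combinations listed above can be enumerated with a short Python sketch. This only models which dimension combinations are produced (with None standing in for a subtotal null); it is not how Pig computes them, and the function names are hypothetical.

```python
from itertools import product

def cube_groups(dims):
    """All 2^n combinations: each dimension is either kept or replaced by None."""
    choices = [(d, None) for d in dims]
    return [tuple(combo) for combo in product(*choices)]

def rollup_groups(dims):
    """The n+1 hierarchical prefixes: drop dimensions from the right, one at a time."""
    n = len(dims)
    return [tuple(dims[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]

def cube_with_rollup(cube_dims, rollup_dims):
    """Cross product of the cube groups and the rollup groups."""
    return [c + r
            for c in cube_groups(cube_dims)
            for r in rollup_groups(rollup_dims)]

groups = cube_with_rollup(("car", 2012), ("midwest", "ohio", "columbus"))
print(len(groups))  # (2^2) * (3+1) combinations
for g in groups:
    print(g)
```

With m = 2 cube dimensions and n = 3 rollup dimensions, the sketch yields (2^2) * (3+1) = 16 groups, matching the count of output tuples shown above.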
Since null values are used to represent subtotals in cube and rollup operations, the CUBE
operator converts any null values in the dimensions to the value "unknown" before performing
the cube or rollup operation. This distinguishes subtotals from legitimate null values that
already exist as dimension values. For example, for CUBE(product,location) with a sample tuple
(car,) the output will be
(car,unknown)
(car,)
(,unknown)
(,)
6.5 DEFINE
See:
• DEFINE (UDFs, streaming)
• DEFINE (macros)
6.6 DISTINCT
Removes duplicate tuples in a relation.
6.6.1 Syntax
6.6.2 Terms
6.6.3 Usage
Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not
preserve the original order of the contents (to eliminate duplicates, Pig must first sort the
data). You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and
a nested block to first select the fields and then apply DISTINCT (see Example: Nested
Block).
6.6.4 Example
DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
6.7 FILTER
Selects tuples from a relation based on some condition.
6.7.1 Syntax
6.7.2 Terms
BY Required keyword.
6.7.3 Usage
Use the FILTER operator to work with tuples or rows of data (if you want to work with
columns of data, use the FOREACH...GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out
(remove) the data you don’t want.
6.7.4 Examples
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example the condition states that if the third field equals 3, then include the tuple in
relation X.
X = FILTER A BY f3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
In this example the condition states that if the first field equals 8, or if the sum of fields f2 and
f3 is not greater than the first field, then include the tuple in relation X.
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
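The condition described in the second example can be checked against the sample data with a small Python sketch. The predicate is written from the prose description above (f1 equals 8, or f2 + f3 is not greater than f1); the function name keep is hypothetical.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

def keep(t):
    """True if the first field is 8, or if f2 + f3 is not greater than f1."""
    f1, f2, f3 = t
    return f1 == 8 or not (f2 + f3 > f1)

# Model of FILTER: keep only the tuples for which the condition holds.
X = [t for t in A if keep(t)]
for row in X:
    print(row)
```

Note that (7,2,5) is kept because 2 + 5 equals 7, which is not greater than 7.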
6.8 FOREACH
Generates data transformations based on columns of data.
6.8.1 Syntax
6.8.2 Terms
Where:
The nested block is enclosed in opening and closing
brackets { … }.
The GENERATE keyword must be the last statement
within the nested block.
See Schemas
expression An expression.
AS Keyword
6.8.3 Usage
Use the FOREACH…GENERATE operation to work with columns of data (if you want to
work with tuples or rows of data, use the FILTER operation).
FOREACH...GENERATE works with relations (outer bags) as well as inner bags:
• If A is a relation (outer bag), a FOREACH statement could look like this.
X = FOREACH A GENERATE f1;
• If A is an inner bag, a FOREACH statement could look like this.
X = FOREACH B {
S = FILTER A BY 'xyz';
GENERATE COUNT (S.$0);
}
In this example the asterisk (*) is used to project all fields from relation A to relation X.
Relations A and X are identical.
X = FOREACH A GENERATE *;
DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example two fields from relation A are projected to form relation X.
DUMP X;
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)
In this example, if one of the fields in the input relation is a tuple, bag or map, we can perform
a projection on that field (using a dereference operator).
DUMP X;
(1,{(3)})
(4,{(6),(9)})
(8,{(9)})
DUMP X;
(1,{(1,2)})
(4,{(4,2),(4,3)})
(8,{(8,3),(8,4)})
In this example two fields in relation A are summed to form relation X. A schema is defined
for the projected field.
DESCRIBE X;
x: {f1: int}
DUMP X;
(3)
(6)
(11)
(7)
(9)
(12)
DUMP Y;
(11)
(12)
In this example the built-in function SUM() is used to sum a set of numbers in a bag.
DUMP X;
(1,1)
(4,8)
(8,16)
DUMP X;
(1,1,2,3)
(4,4,2,1)
(4,4,3,3)
(8,8,3,4)
(8,8,4,3)
DUMP X;
(1,3)
(4,1)
(4,3)
(8,4)
(8,3)
Another FLATTEN example. Note that for the group '4' in C, there are two tuples in each
bag. Thus, when both bags are flattened, the cross product of these tuples is returned; that is,
tuples (4, 2, 6), (4, 3, 6), (4, 2, 9), and (4, 3, 9).
DUMP X;
(1,2,3)
(4,2,6)
(4,2,9)
(4,3,6)
(4,3,9)
(8,3,9)
(8,4,9)
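The cross-product behavior of flattening two bags at once can be modeled as follows. This is a minimal Python sketch under the assumption that each bag is a list of one-field tuples; flatten_bags is a hypothetical helper, not part of Pig.

```python
from itertools import product

def flatten_bags(group, *bags):
    """Model of GENERATE group, FLATTEN(bag1), FLATTEN(bag2), ...:
    emit one row per combination of one tuple from each bag."""
    return [(group,) + sum(combo, ()) for combo in product(*bags)]

# Group 4 has two tuples in each bag, so four rows are produced.
rows = flatten_bags(4, [(2,), (3,)], [(6,), (9,)])
for row in rows:
    print(row)
```

In this model an empty bag produces no rows at all for its group, because there is no combination to emit.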
Another FLATTEN example. Here, relations A and B both have a column x. When forming
relation E, you need to use the :: operator to identify which column x to use - either relation
A column x (A::x) or relation B column x (B::x). This example uses relation A column x
(A::x).
A FLATTEN example on a map type. Here we load an integer and map (of integer values)
into A. Then m gets flattened, and finally we are filtering the result to only include tuples
where the value among the un-nested map entries was 5.
This example shows a CROSS and FOREACH nested to the second level.
Suppose we have relations A and B. Note that relation B contains an inner bag.
DUMP A;
(www.ccc.com,www.hjk.com)
(www.ddd.com,www.xyz.org)
(www.aaa.com,www.cvn.org)
(www.www.com,www.kpt.net)
(www.www.com,www.xyz.org)
(www.ddd.com,www.xyz.org)
B = GROUP A BY url;
DUMP B;
(www.aaa.com,{(www.aaa.com,www.cvn.org)})
(www.ccc.com,{(www.ccc.com,www.hjk.com)})
(www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
(www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
In this example we perform two of the operations allowed in a nested block, FILTER and
DISTINCT. Note that the last statement in the nested block must be GENERATE. Also, note
the use of projection (PA = FA.outlink;) to retrieve a field. DISTINCT can be applied to a
subset of fields (as opposed to a relation) only within a nested block.
X = FOREACH B {
FA = FILTER A BY outlink == 'www.xyz.org';
PA = FA.outlink;
DA = DISTINCT PA;
GENERATE group, COUNT(DA);
}
DUMP X;
(www.aaa.com,0)
(www.ccc.com,0)
(www.ddd.com,1)
(www.www.com,1)
6.9 GROUP
Groups the data in one or more relations.
Note: The GROUP and COGROUP operators are identical. Both operators work with one
or more relations. For readability GROUP is used in statements involving one relation and
COGROUP is used in statements involving two or more relations. You can COGROUP up to
but no more than 127 relations at a time.
6.9.1 Syntax
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge']
[PARTITION BY partitioner] [PARALLEL n];
6.9.2 Terms
USING Keyword
6.9.3 Usage
The GROUP operator groups together tuples that have the same group key (key field).
The key field will be a tuple if the group key has more than one field, otherwise it will be
the same type as that of the group key. The result of a GROUP operation is a relation that
includes one tuple per group. This tuple contains two fields:
• The first field is named "group" (do not confuse this with the GROUP operator) and is
the same type as the group key.
• The second field takes the name of the original relation and is type bag.
• The names of both fields are generated by the system as shown in the example below.
6.9.4 Example
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Now, suppose we group relation A on field "age" to form relation B. We can use the
DESCRIBE and ILLUSTRATE operators to examine the structure of relation B. Relation
B has two fields. The first field is named "group" and is type int, the same as field "age" in
relation A. The second field is named "A" after relation A and is type bag.
B = GROUP A BY age;
DESCRIBE B;
B: {group: int, A: {name: chararray,age: int,gpa: float}}
ILLUSTRATE B;
etc ...
----------------------------------------------------------------------
| B | group: int | A: bag({name: chararray,age: int,gpa: float}) |
----------------------------------------------------------------------
DUMP B;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Continuing on, as shown in these FOREACH statements, we can refer to the fields in relation
B by names "group" and "A" or by positional notation.
DUMP C;
(18,2L)
(19,1L)
(20,1L)
DUMP C;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})
6.9.5 Example
DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)
X = GROUP A BY f2*f3;
DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
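Grouping on an expression can be modeled with a dictionary keyed by the computed value. This is a minimal Python sketch; group_by is a hypothetical helper, and the output is sorted by key only to make the result deterministic (the sketch makes no claim about the order Pig returns groups in).

```python
def group_by(relation, key):
    """Model of GROUP ... BY expression: one (group, bag) pair per distinct key."""
    groups = {}
    for t in relation:
        groups.setdefault(key(t), []).append(t)
    return sorted(groups.items())

A = [("r1", 1, 2), ("r2", 2, 1), ("r3", 2, 8), ("r4", 4, 4)]

# The group key is the product of the second and third fields (f2*f3).
X = group_by(A, lambda t: t[1] * t[2])
for group, bag in X:
    print(group, bag)
```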
6.9.6 Example
DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
In this example tuples are co-grouped using field “owner” from relation A and field “friend2”
from relation B as the key fields. The DESCRIBE operator shows the schema for relation X,
which has three fields, "group", "A" and "B" (see the GROUP operator for information about
the field names).
DESCRIBE X;
X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2:
chararray}}
Relation X looks like this. A tuple is created for each unique key field. The tuple includes the
key field and two bags. The first bag is the tuples from the first relation with the matching
key field. The second bag is the tuples from the second relation with the matching key field.
If no tuples match the key field, the bag is empty.
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
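The co-grouped result above can be reproduced with a short Python sketch. It is only a model of the semantics; cogroup is a hypothetical helper that takes a key function per relation, and keys are sorted for a deterministic result.

```python
def cogroup(rel_a, key_a, rel_b, key_b):
    """Model of COGROUP: one output tuple per distinct key, holding the key
    and one bag per input relation (empty if that relation has no match)."""
    keys = sorted({key_a(t) for t in rel_a} | {key_b(t) for t in rel_b})
    return [
        (k,
         [t for t in rel_a if key_a(t) == k],
         [t for t in rel_b if key_b(t) == k])
        for k in keys
    ]

A = [("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
     ("Bob", "dog"), ("Bob", "cat")]
B = [("Cindy", "Alice"), ("Mark", "Alice"), ("Paul", "Bob"), ("Paul", "Jane")]

# Key on "owner" (first field of A) and "friend2" (second field of B).
X = cogroup(A, lambda t: t[0], B, lambda t: t[1])
for row in X:
    print(row)
```

The key Jane appears only in B, so its first bag is empty, as in the output above.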
6.9.7 Example
To use the Hadoop Partitioner, add a PARTITION BY clause to the appropriate operator:
A = LOAD 'input_data';
B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;
6.10 IMPORT
See IMPORT (macros)
6.11.1 Syntax
6.11.2 Terms
BY Keyword
USING Keyword
6.11.3 Usage
Use the JOIN operator to perform an inner equijoin of two or more relations based on
common field values. Inner joins ignore null keys, so it makes sense to filter them out before
the join.
Note the following about the GROUP/COGROUP and JOIN operators:
• The GROUP and JOIN operators perform similar functions. GROUP creates a nested set
of output tuples while JOIN creates a flat set of output tuples.
• The GROUP/COGROUP and JOIN operators handle null values differently (see Nulls
and JOIN Operator).
Self Joins
To perform self joins in Pig, load the same data multiple times, under different aliases, to
avoid naming conflicts.
In this example the same data is loaded twice using aliases A and B.
6.11.4 Example
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
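The joined output above is consistent with an inner equijoin on the first field of each relation. The following Python sketch models that, assuming the first fields are the join keys (the JOIN statement itself is not shown here); inner_join is a hypothetical helper.

```python
def inner_join(rel_a, key_a, rel_b, key_b):
    """Model of an inner equijoin: concatenate every pair of tuples whose
    keys are equal. Tuples with a null (None) key never match."""
    return [
        a + b
        for a in rel_a
        for b in rel_b
        if key_a(a) is not None and key_a(a) == key_b(b)
    ]

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

X = inner_join(A, lambda t: t[0], B, lambda t: t[0])
for row in sorted(X):
    print(row)
```

The tuple (7,2,5) has no partner in B and is dropped, which is the flat, inner-join counterpart of the nested result a COGROUP would produce.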
6.12.1 Syntax
6.12.2 Terms
BY Keyword
USING Keyword
6.12.3 Usage
Use the JOIN operator with the corresponding keywords to perform left, right, or full outer
joins. The keyword OUTER is optional for outer joins; the keywords LEFT, RIGHT and
FULL will imply left outer, right outer and full outer joins respectively when OUTER is
omitted. The Pig Latin syntax closely adheres to the SQL standard.
Please note the following:
• Outer joins will only work provided the relations which need to produce nulls (in the case
of non-matching keys) have schemas.
• Outer joins will only work for two-way joins; to perform a multi-way outer join, you will
need to perform multiple two-way outer join statements.
6.12.4 Examples
A = LOAD 'large';
B = LOAD 'tiny';
C= JOIN A BY $0 LEFT, B BY $0 USING 'replicated';
A = LOAD 'large';
B = LOAD 'small';
C= JOIN A BY $0 RIGHT, B BY $0 USING 'bloom';
6.13 LIMIT
Limits the number of output tuples.
6.13.1 Syntax
6.13.2 Terms
6.13.3 Usage
6.13.4 Examples
a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;
d = order a by $0;
e = limit d c.sum/100;
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example output is limited to 3 tuples. Note that there is no guarantee which three
tuples will be output.
X = LIMIT A 3;
DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
In this example the ORDER operator is used to order the tuples and the LIMIT operator is
used to output the first three tuples.
DUMP B;
(8,3,4)
(8,4,3)
(7,2,5)
(4,2,1)
(4,3,3)
(1,2,3)
X = LIMIT B 3;
DUMP X;
(8,3,4)
(8,4,3)
(7,2,5)
6.14 LOAD
Loads data from the file system.
6.14.1 Syntax
6.14.2 Terms
USING Keyword.
If the USING clause is omitted, the default load
function PigStorage is used.
AS Keyword.
6.14.3 Usage
Use the LOAD operator to load data from the file system.
6.14.4 Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are
newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form
relation A. The two LOAD statements are equivalent. Note that, because no schema is
specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
In this example a schema is specified using the AS keyword. The two LOAD statements are
equivalent. You can use the DESCRIBE and ILLUSTRATE operators to view the schema.
DESCRIBE A;
a: {f1: int,f2: int,f3: int}
ILLUSTRATE A;
---------------------------------------------------------
| a | f1: bytearray | f2: bytearray | f3: bytearray |
---------------------------------------------------------
| | 4 | 2 | 1 |
---------------------------------------------------------
---------------------------------------
| a | f1: int | f2: int | f3: int |
---------------------------------------
| | 4 | 2 | 1 |
---------------------------------------
For examples of how to specify more complex schemas for use with the LOAD operator, see
Schemas for Complex Data Types and Schemas for Multiple Types.
6.15 NATIVE
Executes native MapReduce/Tez jobs inside a Pig script.
6.15.1 Syntax
alias1 = NATIVE 'native.jar' STORE alias2 INTO 'inputLocation' USING storeFunc LOAD 'outputLocation'
USING loadFunc AS schema [`params, ... `];
6.15.2 Terms
6.15.3 Usage
Use the NATIVE operator to run native MapReduce/Tez jobs from inside a Pig script.
The input and output locations for the MapReduce/Tez program are conveyed to Pig using
the STORE/LOAD clauses. Pig, however, does not pass this information (nor require that
this information be passed) to the MapReduce/Tez program. If you want to pass the input and
output locations to the MapReduce/Tez program you can use the params clause or you can
hardcode the locations in the MapReduce/Tez program.
6.15.4 Example
This example demonstrates how to run the wordcount MapReduce program from Pig. Note
that the files specified as input and output locations in the NATIVE statement will NOT be
deleted by Pig automatically. You will need to delete them manually.
A = LOAD 'WordcountInput.txt';
B = NATIVE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
6.16 ORDER BY
Sorts a relation based on one or more fields.
6.16.1 Syntax
6.16.2 Terms
6.16.3 Usage
Note: ORDER BY is NOT stable; if multiple records have the same ORDER BY key, the
order in which these records are returned is not defined and is not guaranteed to be the same
from one run to the next.
In Pig, relations are unordered (see Relations, Bags, Tuples, Fields):
• If you order relation A to produce relation X (X = ORDER A BY * DESC;) relations A
and X still contain the same data.
• If you retrieve relation X (DUMP X;) the data is guaranteed to be in the order you
specified (descending).
• However, if you further process relation X (Y = FILTER X BY $0 > 1;) there is
no guarantee that the data will be processed in the order you originally specified
(descending).
Pig currently supports ordering on fields with simple types or by tuple designator (*). You
cannot order on fields with complex types or by expressions.
6.16.4 Examples
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field (a3) in descending order. Note that the
order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
6.17 RANK
Returns each tuple with the rank within a relation.
6.17.1 Syntax
6.17.2 Terms
6.17.3 Usage
When specifying no field to sort on, the RANK operator simply prepends a sequential value
to each tuple.
Otherwise, the RANK operator uses each field (or set of fields) to sort the relation. The rank
of a tuple is one plus the number of different rank values preceding it. If two or more tuples
tie on the sorting field values, they will receive the same rank.
NOTE: When using the option DENSE, ties do not cause gaps in ranking values.
6.17.4 Examples
DUMP A;
(David,1,N)
(Tete,2,N)
(Ranjit,3,M)
(Ranjit,3,P)
(David,4,Q)
(David,4,Q)
(Jillian,8,Q)
(JaePak,7,Q)
(Michael,8,T)
(Jillian,8,Q)
(Jose,10,V)
In this example, the RANK operator does not change the order of the relation and simply
prepends to each tuple a sequential value.
B = rank A;
dump B;
(1,David,1,N)
(2,Tete,2,N)
(3,Ranjit,3,M)
(4,Ranjit,3,P)
(5,David,4,Q)
(6,David,4,Q)
(7,Jillian,8,Q)
(8,JaePak,7,Q)
(9,Michael,8,T)
(10,Jillian,8,Q)
(11,Jose,10,V)
In this example, the RANK operator works with the f1 and f2 fields, each with a different
sorting order. RANK sorts the relation on these fields and prepends the rank value to each
tuple. Tuples that tie on the sorting field values receive the same rank.
dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)
(2,Ranjit,3,P)
(4,Michael,8,T)
(5,Jose,10,V)
(6,Jillian,8,Q)
(6,Jillian,8,Q)
(8,JaePak,7,Q)
(9,David,1,N)
(10,David,4,Q)
(10,David,4,Q)
Same example as previous, but DENSE. In this case there are no gaps in ranking values.
dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)
(2,Ranjit,3,P)
(3,Michael,8,T)
(4,Jose,10,V)
(5,Jillian,8,Q)
(5,Jillian,8,Q)
(6,JaePak,7,Q)
(7,David,1,N)
(8,David,4,Q)
(8,David,4,Q)
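The difference between the two rankings above can be modeled in a few lines of Python. The sketch assumes the sort is on the first field descending and the second field ascending, which is what the outputs above suggest; the rank function is a hypothetical helper, not Pig's implementation.

```python
A = [("David", 1, "N"), ("Tete", 2, "N"), ("Ranjit", 3, "M"), ("Ranjit", 3, "P"),
     ("David", 4, "Q"), ("David", 4, "Q"), ("Jillian", 8, "Q"), ("JaePak", 7, "Q"),
     ("Michael", 8, "T"), ("Jillian", 8, "Q"), ("Jose", 10, "V")]

def rank(sorted_rows, key, dense=False):
    """Prepend a rank to each row of an already sorted relation.
    Ties share a rank; without dense, the next rank skips ahead by the tie size."""
    ranked = []
    prev_key = object()  # sentinel that never equals a real key
    current = 0
    for position, row in enumerate(sorted_rows, start=1):
        k = key(row)
        if k != prev_key:
            current = current + 1 if dense else position
            prev_key = k
        ranked.append((current,) + row)
    return ranked

# Two stable sorts: secondary key (ascending) first, then primary key (descending).
rows = sorted(A, key=lambda r: r[1])
rows.sort(key=lambda r: r[0], reverse=True)

sort_key = lambda r: (r[0], r[1])
C = rank(rows, sort_key)
C_dense = rank(rows, sort_key, dense=True)
for row in C_dense:
    print(row)
```

The two Ranjit tuples differ only in the third field, which is not part of the sort key, so they tie in both rankings.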
6.18 SAMPLE
Selects a random sample of data based on the specified sample size.
6.18.1 Syntax
6.18.2 Terms
6.18.3 Usage
Use the SAMPLE operator to select a random data sample with the stated sample size.
SAMPLE is a probabilistic operator; there is no guarantee that the exact same number of
tuples will be returned for a particular sample size each time the operator is used.
6.18.4 Example
X = SAMPLE A 0.01;
In this example, a scalar expression is used (it will sample approximately 1000 records from
the input).
a = LOAD 'a.txt';
b = GROUP a ALL;
6.19 SPLIT
Partitions a relation into two or more relations.
6.19.1 Syntax
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];
6.19.2 Terms
IF Required keyword.
expression An expression.
6.19.3 Usage
Use the SPLIT operator to partition the contents of a relation into two or more relations based
on some expression. Depending on the conditions stated in the expression:
• A tuple may be assigned to more than one relation.
• A tuple may not be assigned to any relation.
6.19.4 Example
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
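The way one tuple can land in several output relations can be modeled with independent filters. In this Python sketch the three conditions are assumptions chosen so that the results match X, Y and Z above (first field less than 7, second field equal to 5, third field not equal to 6); split is a hypothetical helper.

```python
def split(relation, **conditions):
    """Model of SPLIT: apply every condition to every tuple independently,
    so a tuple may appear in several outputs or in none."""
    return {name: [t for t in relation if cond(t)]
            for name, cond in conditions.items()}

A = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

out = split(
    A,
    X=lambda t: t[0] < 7,
    Y=lambda t: t[1] == 5,
    Z=lambda t: t[2] < 6 or t[2] > 6,
)
for name, rows in out.items():
    print(name, rows)
```

Here (4,5,6) is assigned to both X and Y, while (7,8,9) reaches only Z; a tuple that satisfied none of the conditions would simply be dropped.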
6.19.5 Example
In this example, the SPLIT and FILTER statements are essentially equivalent. However,
because SPLIT is implemented as "split the data stream and then apply filters" the SPLIT
statement is more expensive than the FILTER statement because Pig needs to filter and store
two data streams.
SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null);
-- where ignored_var is not used elsewhere
6.20 STORE
Stores or saves results to the file system.
6.20.1 Syntax
6.20.2 Terms
6.20.3 Usage
Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to
the file system. Use STORE for production scripts and batch mode processing.
Note: To debug scripts during development, you can use DUMP to check intermediate
results.
6.20.4 Examples
In this example data is stored using PigStorage and the asterisk character (*) as the field
delimiter.
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
In this example, the CONCAT function is used to format the data before it is stored.
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP B;
(a:1,b:2,c:3)
(a:4,b:2,c:1)
(a:8,b:3,c:4)
(a:4,b:3,c:3)
(a:7,b:2,c:5)
(a:8,b:4,c:3)
CAT myoutput;
a:1,b:2,c:3
a:4,b:2,c:1
a:8,b:3,c:4
a:4,b:3,c:3
a:7,b:2,c:5
a:8,b:4,c:3
6.21 STREAM
Sends data to an external script or program.
6.21.1 Syntax
6.21.2 Terms
THROUGH Keyword.
AS Keyword.
6.21.3 Usage
Use the STREAM operator to send data through an external script or program. Multiple
stream operators can appear in the same Pig script. The stream operators can be adjacent to
each other or have other operations in between.
When used with a command, a stream statement could look like this:
A = LOAD 'data';
When used with a cmd_alias, a stream statement could look like this, where mycmd is the
defined alias.
A = LOAD 'data';
Data guarantees are determined based on the position of the streaming operator in the Pig
script.
• Unordered data – No guarantee for the order in which the data is delivered to the
streaming application.
• Grouped data – The data for the same grouped key is guaranteed to be provided to the
streaming application contiguously.
• Grouped and ordered data – The data for the same grouped key is guaranteed to be
provided to the streaming application contiguously. Additionally, the data within the
group is guaranteed to be sorted by the provided secondary key.
In addition to position, data grouping and ordering can be determined by the data itself.
However, you need to know the property of the data to be able to take advantage of its
structure.
A = LOAD 'data';
A = LOAD 'data';
B = GROUP A BY $1;
C = FOREACH B GENERATE FLATTEN(A);
A = LOAD 'data';
B = GROUP A BY $1;
C = FOREACH B {
D = ORDER A BY ($3, $4);
GENERATE D;
}
6.22 UNION
Computes the union of two or more relations.
6.22.1 Syntax
6.22.2 Terms
6.22.3 Usage
Use the UNION operator to merge the contents of two or more relations. The UNION
operator:
• Does not preserve the order of tuples. Both the input and output relations are interpreted
as unordered bags of tuples.
• Does not ensure (as databases do) that all tuples adhere to the same schema or that they
have the same number of fields. In a typical scenario, however, this should be the case;
therefore, it is the user's responsibility to either (1) ensure that the tuples in the input
relations have the same schema or (2) be able to process varying tuples in the output
relation.
• Does not eliminate duplicate tuples.
Schema Behavior
The behavior of schemas for UNION (positional notation / data types) and UNION
ONSCHEMA (named fields / data types) is the same, except where noted.
Union on relations with two different sizes results in a null schema (union only):
A: (a1:long, a2:long)
B: (b1:long, b2:long, b3:long)
A union B: null
Union columns with incompatible types results in a failure. (See Types Table for addition
and subtraction for incompatible types.)
A: (a1:long)
B: (a1:chararray)
A union B: ERROR: Cannot cast from long to bytearray
Union columns of compatible type will produce an "escalate" type. The priority is:
• double > float > long > int > bytearray
• tuple|bag|map|chararray > bytearray
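The priority rules above can be expressed as a small lookup. This Python sketch is only an illustration of the two priority chains as stated; the escalate helper is hypothetical, and treating every other pairing of different types as an error is an assumption based on the incompatible-types rule above.

```python
# Numeric types in increasing priority, as listed above.
NUMERIC = ["int", "long", "float", "double"]

def escalate(left, right):
    """Return the column type of a union of two columns, or raise TypeError
    if the types are incompatible (assumed for any pairing not listed above)."""
    if left == right:
        return left
    # bytearray has the lowest priority in both chains.
    if left == "bytearray":
        return right
    if right == "bytearray":
        return left
    if left in NUMERIC and right in NUMERIC:
        return max(left, right, key=NUMERIC.index)
    raise TypeError(f"incompatible types: {left} and {right}")

print(escalate("int", "double"))
print(escalate("bytearray", "chararray"))
```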
The alias of the first relation is always taken as the alias of the unioned relation field.
6.22.4 Example
DUMP A;
(1,2,3)
(4,2,1)
DUMP B;
(2,4)
(8,9)
(1,3)
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)
6.22.5 Example
DUMP U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)
7 UDF Statements
7.1.2 Terms
7.1.3 Usage
Use the DEFINE statement to assign a name (alias) to a UDF function or to a streaming
command.
Use DEFINE to specify a UDF function when:
• The function has a long package name that you don't want to include in a script,
especially if you call the function several times in that script.
• The constructor for the function takes string parameters. If you need to use different
constructor parameters for different calls to the function you will need to create multiple
defines – one for each parameter set.
Use DEFINE to specify a streaming command when:
• The streaming command specification is complex.
• The streaming command specification requires additional parameters (input, output, and
so on).
Serialization is needed to convert data from tuples to a format that can be processed by the
streaming application. Deserialization is needed to convert the output from the streaming
application back into tuples. PigStreaming is the default serialization/deserialization function.
Streaming uses the same default format as PigStorage to serialize/deserialize the data. If you
want to explicitly specify a format, you can do it as shown below (see more examples in the
Examples: Input/Output section).
If you need an alternative format, you will need to create a custom serializer/deserializer by
implementing the following interfaces.
interface PigToStream {
    /**
     * Given a tuple, produce an array of bytes to be passed to the streaming
     * executable.
     */
    public byte[] serialize(Tuple t) throws IOException;
}
interface StreamToPig {
    /**
     * Given a byte array from a streaming executable, produce a tuple.
     */
    public Tuple deserialize(byte[] bytes) throws IOException;
    /**
     * This will be called on both the front end and the back
     * end during execution.
     *
     * @return the {@link LoadCaster} associated with this object.
     * @throws IOException if there is an exception during LoadCaster
     */
    public LoadCaster getLoadCaster() throws IOException;
}
Use the ship option to send the streaming binary and supporting files, if any, from the client
node to the compute nodes. Pig does not automatically ship dependencies; it is your
responsibility to explicitly specify all the dependencies and to make sure that the software the
processing relies on (for instance, perl or python) is installed on the cluster. Supporting files
are shipped to the task's current working directory, and only relative paths should be
specified. Any pre-installed binaries should be specified in the PATH.
Only files, not directories, can be specified with the ship option. One way to work around
this limitation is to tar all the dependencies into a tar file that accurately reflects the
structure needed on the compute nodes, then have a wrapper for your script that un-tars the
dependencies prior to execution.
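The tar workaround described above could be implemented with a small wrapper along the lines of the following sketch. The archive name deps.tar, the lib/helper.txt member, and the function name are hypothetical. The demonstration builds a throwaway archive in a temporary directory rather than using real shipped files.

```python
import os
import tarfile
import tempfile


def unpack_dependencies(archive_path, dest_dir="."):
    """Extract the shipped tar file into dest_dir and return its member names."""
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Demonstration: build deps.tar containing lib/helper.txt, then unpack it
# into a directory that plays the role of the task's working directory.
with tempfile.TemporaryDirectory() as scratch:
    lib_dir = os.path.join(scratch, "build", "lib")
    os.makedirs(lib_dir)
    with open(os.path.join(lib_dir, "helper.txt"), "w") as handle:
        handle.write("helper data\n")

    archive_path = os.path.join(scratch, "deps.tar")
    with tarfile.open(archive_path, "w") as archive:
        archive.add(lib_dir, arcname="lib")

    task_dir = os.path.join(scratch, "task")
    os.makedirs(task_dir)
    extracted = unpack_dependencies(archive_path, task_dir)
    restored = os.path.exists(os.path.join(task_dir, "lib", "helper.txt"))

print(extracted, restored)
```

In a real wrapper, unpack_dependencies would be called on the shipped archive in the task's current working directory before control is handed to the actual script.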
Note that the ship option has two components: the source specification, provided in the
ship( ) clause, is the view of your machine; the command specification is the view of the
actual cluster. The only guarantee is that the shipped files are available in the current working
directory of the launched job and that your current working directory is also on the PATH
environment variable.
Shipping files to relative paths or absolute paths is not supported since you might not have
permission to read/write/execute from arbitrary paths on the clusters.
Note the following:
• It is safe only to ship files to be executed from the current working directory on the task
on the cluster.
• Shipping files to relative paths or absolute paths is undefined and will usually fail, since
you may not have permission to read/write/execute from arbitrary paths on the actual
clusters.
The ship option works with binaries, jars, and small datasets. However, loading larger
datasets at run time for every execution can severely impact performance. Instead, use the
cache option to access large files already moved to and available on the compute nodes. Only
files, not directories, can be specified with the cache option.
If the ship and cache options are not specified, Pig will attempt to auto-ship the binary in the
following way:
• If the first word on the streaming command is perl or python, Pig assumes that the binary
is the first non-quoted string it encounters that does not start with a dash.
• Otherwise, Pig will attempt to ship the first string from the command line as long as it
does not come from /bin, /usr/bin, or /usr/local/bin. Pig determines this
by scanning the path if an absolute path is provided, or by executing which otherwise. The
paths to skip can be configured with the set stream.skippath option (you can use multiple set
commands to specify more than one path to skip).
If you don't supply a DEFINE for a given streaming command, then auto-shipping is turned
off.
Note the following:
• If Pig determines that it needs to auto-ship an absolute path it will not ship it at all since
there is no way to ship files to the necessary location (lack of permissions and so on).
• Pig will not auto-ship files located in system directories such as /bin, /usr/bin, and
/usr/local/bin (this is determined by executing the 'which <file>' command).
• To auto-ship, the file in question should be present in the PATH. So if the file is in the
current working directory then the current working directory should be in the PATH.
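The auto-ship rules and notes above can be summarized as a small decision function. The sketch below is one reading of the text, not Pig's implementation: the function name, the tokenizing regular expression, and the fake lookup table used in the demonstration are all assumptions made for illustration.

```python
import os
import re
import shutil

# Directories named in the text as excluded from auto-shipping.
SKIP_DIRS = ("/bin", "/usr/bin", "/usr/local/bin")

# Split a command into single-quoted, double-quoted, or bare tokens.
TOKEN = re.compile(r"""'[^']*'|"[^"]*"|\S+""")


def auto_ship_candidate(command, which=shutil.which, skip_dirs=SKIP_DIRS):
    """Return the token that would be auto-shipped, or None if nothing is."""
    tokens = TOKEN.findall(command)
    if not tokens:
        return None
    if tokens[0] in ("perl", "python"):
        # First non-quoted string that does not start with a dash.
        for token in tokens[1:]:
            if token[0] not in "'\"-":
                return token
        return None
    first = tokens[0]
    if first.startswith("/"):
        return None  # absolute paths are not shipped at all
    location = which(first)
    if location is None:
        return None  # the file must be found on the PATH
    if os.path.dirname(location) in skip_dirs:
        return None  # system binaries are assumed to be on the cluster
    return first


# Hypothetical lookup table standing in for the real `which` command.
fake_paths = {"cat": "/bin/cat", "stream_cmd": "/home/me/tools/stream_cmd"}
fake_which = fake_paths.get

print(auto_ship_candidate("perl -w stream.pl data", which=fake_which))
print(auto_ship_candidate("cat data.txt", which=fake_which))
print(auto_ship_candidate("stream_cmd input", which=fake_which))
```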
X = STREAM A THROUGH Y;
In this example user defined serialization/deserialization functions are used with the script.
X = STREAM A THROUGH Y;
In this example ship is used to send the script to the cluster compute nodes.
X = STREAM A THROUGH Y;
In this example cache is used to specify a file located on the cluster compute nodes.
X = STREAM A THROUGH Y;
In this example a command is defined for use with the STREAM operator.
A = LOAD 'data';
In this example the streaming stderr is stored in the _logs/<dir> directory of the job's output
directory. Because the job can have multiple streaming applications associated with it, you
need to ensure that different directory names are used to avoid conflicts. Pig stores up to 100
tasks per streaming job.
X = STREAM A THROUGH Y;
In this example a function is defined for use with the FOREACH …GENERATE operator.
REGISTER /src/myfunc.jar
A = LOAD 'students';
7.2.1 Syntax
REGISTER path;
7.2.2 Terms
path The path to the JAR file (the full location URI is
required). Do not place the name in quotes.
7.2.3 Usage
Pig Scripts
Use the REGISTER statement inside a Pig script to specify a JAR file or a Python/JavaScript
module. Pig supports JAR files and modules stored in local file systems as well as remote,
distributed file systems such as HDFS and Amazon S3 (see Pig Scripts).
Additionally, JAR files stored in local file systems can be specified as a glob pattern using
“*”. Pig will search for matching jars in the local file system, either the relative path (relative
to your working directory) or the absolute path. Pig will pick up all JARs that match the glob.
Command Line
You can register additional files (to use with your Pig script) via the PIG_OPTS environment
variable using the -Dpig.additional.jars.uris option. For more information see User Defined
Functions.
7.2.4 Examples
In this example REGISTER states that the JavaScript module, myfunc.js, is located in the
/src directory.
REGISTER /src/myfunc.js;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
In this example additional JAR files are registered via the PIG_OPTS environment variable.
export PIG_OPTS="-Dpig.additional.jars.uris=my.jar,your.jar"
In this example a JAR file stored in HDFS and a local JAR file are registered.
export PIG_OPTS="-Dpig.additional.jars.uris=hdfs://nn.mydomain.com:9020/myjars/my.jar,file:///home/root/pig/your.jar"
Note that the legacy property pig.additional.jars, which uses a colon as the separator, is still
supported. However, we recommend pig.additional.jars.uris: because the colon is also part of the
URL scheme, a colon-separated list cannot contain full URIs. We will deprecate
pig.additional.jars in future releases.
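The separator problem is easy to see with plain string splitting. The sketch below is only an illustration of the ambiguity, not Pig's actual parsing code. It reuses the two URIs from the example above.

```python
# The same two JAR locations, joined with each separator.
legacy_value = "hdfs://nn.mydomain.com:9020/myjars/my.jar:file:///home/root/pig/your.jar"
uris_value = "hdfs://nn.mydomain.com:9020/myjars/my.jar,file:///home/root/pig/your.jar"

# Splitting on a colon cuts through the scheme and the port number.
colon_parts = legacy_value.split(":")

# Splitting on a comma recovers the two URIs intact.
comma_parts = uris_value.split(",")

print(colon_parts)
print(comma_parts)
```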
This example shows how to specify a glob pattern using either a relative path or an absolute
path.
register /homes/user/pig/myfunc*.jar
register count*.jar
register jars/*.jar
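To get a feel for which files a pattern such as myfunc*.jar would pick up, the following sketch uses Python's glob module on a temporary directory. This only approximates the behavior described above, since Pig's own matching may differ in details. The file names are hypothetical.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as jar_dir:
    # Create a few empty files to match against.
    for name in ("myfunc1.jar", "myfunc-util.jar", "other.jar", "myfunc.txt"):
        open(os.path.join(jar_dir, name), "w").close()

    # Every file whose name matches the pattern is picked up.
    matches = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(jar_dir, "myfunc*.jar"))
    )

print(matches)
```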
7.3.1 Syntax
To download an Artifact (and its dependencies), you need to specify the artifact's group,
module and version following the syntax shown below. This command will download the Jar
specified and all its dependencies and load it into the classpath.
REGISTER ivy://group:module:version?querystring
7.3.2 Terms
7.3.3 Usage
The REGISTER artifact command is an extension of the REGISTER command above, which
registers a JAR. In addition to registering a JAR from a local file system or from HDFS, you can
specify the coordinates of the artifact, and Pig will download the artifact (and its dependencies,
if needed) from the configured repository.
• Transitive
The transitive key specifies whether the dependencies are registered along with the JAR.
Setting transitive to false in the query string tells Pig to register only the specified artifact;
its dependencies are not downloaded. The default value of transitive is true.
Syntax
REGISTER ivy://org:module:version?transitive=false
• Exclude
When registering an artifact, you can exclude some of its dependencies using the exclude key.
For example, if you want to use a specific version of a dependent JAR that differs from the
version fetched automatically, exclude that dependency by specifying a comma-separated list
of dependencies, and register the desired version of the dependent JAR separately.
Syntax
REGISTER ivy://org:module:version?exclude=org:mod,org:mod,...
• Classifier
Some Maven dependencies need a classifier in order to be resolved. You can specify
it using the classifier key.
Syntax
REGISTER ivy://org:module:version?classifier=value
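The structure of an ivy:// coordinate and its query string keys can be illustrated by taking one apart. The parser below is a hypothetical helper written for this illustration, not part of Pig. It assumes the group:module:version layout and the transitive, exclude, and classifier keys described above, with transitive defaulting to true.

```python
from urllib.parse import parse_qs


def parse_ivy_coordinate(uri):
    """Split an ivy:// coordinate into its parts (illustrative helper only)."""
    prefix = "ivy://"
    if not uri.startswith(prefix):
        raise ValueError("not an ivy:// coordinate: " + uri)
    coordinate, _, query = uri[len(prefix):].partition("?")
    parts = coordinate.split(":")
    if len(parts) != 3:
        raise ValueError("expected group:module:version, got: " + coordinate)
    group, module, version = parts
    params = parse_qs(query)
    return {
        "group": group,
        "module": module,
        "version": version,
        # Dependencies are fetched unless transitive=false is given.
        "transitive": params.get("transitive", ["true"])[0].lower() != "false",
        # exclude holds a comma-separated list of group:module pairs.
        "exclude": [e for e in params.get("exclude", [""])[0].split(",") if e],
        "classifier": params.get("classifier", [None])[0],
    }


info = parse_ivy_coordinate("ivy://net.sf.json-lib:json-lib:2.4?classifier=jdk15")
print(info)
```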
7.3.4 Examples
REGISTER ivy://org.apache.avro:avro:1.5.1
REGISTER ivy://org.apache.avro:avro:1.5.1?transitive=true
REGISTER ivy://org.apache.avro:avro:1.5.1?transitive=false
REGISTER ivy://org.apache.avro:avro:+
REGISTER ivy://org.apache.avro:avro:*
REGISTER ivy://org.apache.pig:pig:0.10.0?exclude=commons-cli:commons-cli,commons-codec:commons-codec
• Specifying a classifier
REGISTER ivy://net.sf.json-lib:json-lib:2.4?classifier=jdk15
REGISTER ivy://:module: