PIG A Big Data Processor
What is Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin.
• This language provides various operators, and programmers can also develop their own functions for reading, writing, and processing data.
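To make the "data flow" idea concrete, here is a minimal Pig Latin sketch of a word count. The input file name input.txt and all alias names are assumptions chosen for illustration; they do not come from the slides.

```pig
-- input.txt is a hypothetical text file with one line of text per record
lines   = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE splits each line into a bag of words; FLATTEN turns that bag into one row per word
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together, then count the rows in each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

DUMP counts;
```

Each statement names an intermediate relation, so the script reads as a pipeline of steps rather than as hand-written map and reduce functions.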
Apache Pig
• Atom
– Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
– It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig.
– A piece of data or a simple atomic value is known as a field.
– Example: ‘raja’ or ‘30’
Apache Pig – Elements
• Tuple
– A record formed by an ordered set of fields is known as a tuple. The fields can be of any type. A tuple is similar to a row in an RDBMS table.
– Example: (Raja, 30)
• Bag
– A bag is an unordered collection of tuples, and the tuples need not be unique. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
– Example: {(Raja, 30), (Mohammad, 45)}
– A bag can be a field in a relation; in that context, it is known as an inner bag.
– Example: {Raja, 30, {9848022338, [email protected]}}
• Relation
– A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
• Map
– A map (or data map) is a set of key-value pairs. The key must be of type chararray and should be unique. The value may be of any type. A map is represented by ‘[]’.
– Example: [name#Raja, age#30]
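The elements above (atoms, tuples, bags, maps) can all appear in the schema of a single relation. The following is a sketch only: the file people.txt, its layout, and every field name are hypothetical.

```pig
-- people.txt is a hypothetical tab-delimited file; all names here are assumptions
people = LOAD 'people.txt' AS (
    name:chararray,                                          -- atom
    age:int,                                                 -- atom
    contacts:bag{t:tuple(phone:chararray, email:chararray)}, -- inner bag of tuples
    attrs:map[]                                              -- map with chararray keys
);

-- Print the schema Pig has recorded for the relation
DESCRIBE people;

-- Look up a map value by key with the # operator
names_and_city = FOREACH people GENERATE name, attrs#'city';
```

Declaring types in the AS clause is optional; fields without a declared type default to bytearray.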
Installation of PIG
Download
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
source ~/.bashrc
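Start Pig
pig -x local (local mode: runs against the local file system)
pig -x mapreduce (MapReduce mode, the default: runs on a Hadoop cluster)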
Grunt shell
Data Processing with PIG
Example: movies_data.csv
1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
5,Ayatya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956
Load data
• $ pig -x local
• grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id,name,year,rating,duration);
• grunt> movies_greater_than_35 = FILTER movies BY (float)rating > 3.5;
cat my_movies/part-m-00000
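The cat command reads my_movies/part-m-00000, which suggests the filtered relation was written to a directory named my_movies. That STORE step does not appear in the text of the slides, so the sketch below is an assumption about how it might look. It also declares field types in the schema (another assumption), which removes the need for the (float) cast.

```pig
-- Typed schema; the slides load the fields without types
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Keep only movies rated above 3.5
movies_greater_than_35 = FILTER movies BY rating > 3.5;

-- Write the result as comma-separated part files under the my_movies directory
STORE movies_greater_than_35 INTO 'my_movies' USING PigStorage(',');
```

With the sample file shown earlier, the rows with ids 2, 3, 4, and 6 satisfy the filter. STORE fails if the output directory already exists, so remove my_movies before re-running.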
Load command
Movies starting with 'D'
Movies longer than 2 hours
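The two slide titles here appear to describe filters for movies whose names start with 'D' and movies that run longer than two hours. A possible Pig Latin version is sketched below. It assumes the duration column is in seconds (so two hours is 7200) and uses a typed schema; neither is stated in the slides.

```pig
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- MATCHES takes a Java regular expression that must match the whole string
movies_starting_with_d = FILTER movies BY name MATCHES 'D.*';

-- Assumption: duration is in seconds, so 2 hours = 7200
movies_longer_than_2_hours = FILTER movies BY duration > 7200;

DUMP movies_starting_with_d;
DUMP movies_longer_than_2_hours;
```

On the sample data, the first filter keeps ids 1, 2, and 7, and the second keeps ids 1, 3, and 8.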
Describe
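DESCRIBE prints the schema of a relation, which is a quick way to check field names and types before writing further statements. A minimal sketch, assuming the same movies file and a typed schema that the slides do not declare:

```pig
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Prints the alias name followed by its field names and types
DESCRIBE movies;
```

DESCRIBE does not run a job or read the data; it only reports what Pig knows about the schema.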
From 1985 to 2004
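A filter on a range of release years could be written as follows. Whether the slide intended the bounds to be inclusive is not recoverable from the text, so the inclusive comparison here is an assumption.

```pig
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Assumption: both bounds are inclusive
movies_1985_to_2004 = FILTER movies BY year >= 1985 AND year <= 2004;

DUMP movies_1985_to_2004;
```

Because every year in the sample file falls between 1985 and 2004, this particular filter keeps all ten rows.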
Limit
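LIMIT keeps at most a fixed number of tuples from a relation. Since relations are unordered, it is usually paired with ORDER when specific rows are wanted. The sketch below assumes a typed schema and picks the three highest-rated movies as an illustrative goal.

```pig
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Without ORDER, which three tuples are returned is not guaranteed
any_three = LIMIT movies 3;

-- Sort by rating, highest first, then keep the first three
by_rating = ORDER movies BY rating DESC;
top_three = LIMIT by_rating 3;

DUMP top_three;
```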
$ pig -x local scriptfile.pig
Grunt mode
• You can also run Pig scripts from Grunt using the run and exec commands.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
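The contents of scriptfile.pig are not shown in the slides. A hypothetical version that reuses the earlier movie statements might look like this:

```pig
-- scriptfile.pig (hypothetical contents, for illustration only)
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

movies_greater_than_35 = FILTER movies BY rating > 3.5;

DUMP movies_greater_than_35;
```

The two commands differ in how they interact with the shell. run executes the script in the current Grunt session, so aliases such as movies remain available afterwards. exec runs the script in a separate context, and its aliases are not visible in the shell once it finishes.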
Embedded mode
forts.pig
Output snapshot