Pig_2
PIG Latin
• Pig Latin is a data flow language used for exploring large data sets.
• Rapid development
• No Java is required.
• It is a high-level platform for creating MapReduce programs used
with Hadoop.
• Pig was originally developed at Yahoo! Research around 2006 to give
researchers an ad-hoc way of creating and executing MapReduce
jobs on very large data sets. In 2007, it was moved into the
Apache Software Foundation.
• Like actual pigs, who eat almost anything, the Pig programming
language is designed to handle any kind of data—hence the name!
What Pig Does
• Pig takes scripts written in Pig Latin and compiles them into a series of
MapReduce jobs that run on a Hadoop cluster.
Installing Pig
• To install Pig:
• untar the .gz file using tar -xvzf pig-0.13.0-bin.tar.gz
• To initialize the environment variables, export the following:
• export PIG_HADOOP_VERSION=20
(Specifies the version of hadoop that is running)
• export HADOOP_HOME=/home/(user-name)/hadoop-0.20.2
(Specifies the installation directory of hadoop to the environment
variable HADOOP_HOME. Typically defined as /home/user-
name/hadoop-version)
• export PIG_CLASSPATH=$HADOOP_HOME/conf
(Specifies the class path for pig)
• export PATH=$PATH:/home/user-name/pig-0.13.0/bin
(for setting the PATH variable)
• export JAVA_HOME=/usr
(Specifies the java home to the environment variable.)
PIG Modes
• Pig runs in two execution modes: local mode (pig -x local), which runs in
a single JVM against the local filesystem, and MapReduce mode
(pig -x mapreduce, the default), which runs jobs on a Hadoop cluster.
Data Types
• Data type is a data storage format that can contain a specific type or
range of values.
– Scalar types
• Sample: int, long, double, chararray, bytearray
– Complex types
• Sample: Tuple, Bag, Map (an Atom, by contrast, is a single scalar value)
• Users can declare data types at load time, as below.
– A = LOAD 'test.data' USING PigStorage(',') AS (sno:chararray,
name:chararray, marks:long);
• If a data type is not declared but the script treats a value as a certain
type, Pig will assume it is of that type and cast it.
– A = LOAD 'test.data' USING PigStorage(',') AS (sno, name,
marks);
– B = FOREACH A GENERATE marks * 100; -- marks cast to long
Data types continued…
• Pig Latin statements can span multiple lines and must end
with a semicolon ( ; ), as in the sketch below.
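A minimal sketch of one statement spanning several lines; the file and
schema follow the student.txt example used later in this deck (the field
names name, age, gpa are assumptions):
A = LOAD 'student.txt'
    USING PigStorage(',')
    AS (name:chararray, age:int, gpa:double);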
Pig: The Programming Language
• In interactive mode, STORE acts like DUMP and will always trigger
execution (this includes the run command), but in batch mode it will not
(this includes the exec command).
• The reason for this is efficiency. In batch mode, Pig will parse the
whole script to see whether there are any optimizations that could be
made to limit the amount of data to be written to or read from disk.
Consider the following simple example:
• A = LOAD 'input/pig/multiquery/A';
• B = FILTER A BY $1 == 'banana';
• C = FILTER A BY $1 != 'banana';
• STORE B INTO 'output/b';
• STORE C INTO 'output/c';
When the Pig Latin interpreter sees the first line containing the LOAD
statement, it confirms that it is syntactically and semantically correct
and adds it to the logical plan, but it does not load the data from the file
(or even check whether the file exists).
The point is that it makes no sense to start any processing until the
whole flow is defined. Similarly, Pig validates the subsequent FILTER
and STORE statements and adds them to the logical plan without
executing them. The trigger for Pig to start execution is a DUMP
statement (or, in interactive mode, a STORE). At that point, the logical
plan is compiled into a physical plan and executed, as in the sketch
below.
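A minimal sketch of this behavior, reusing the relations from the example
above: nothing runs until the DUMP.
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
-- up to this point Pig has only built the logical plan; no data has been read
DUMP B; -- execution is triggered here: the plan is compiled and run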
Practice Session
Create a sample file
John,18,4.0
Mary,19,3.8
Bill,20,3.9
Joe,18,3.8
Save it as “student.txt”
FileA.txt (tab-delimited):
1	2	3
4	2	1
8	3	4
4	3	3
7	2	5
8	4	3
Move it to HDFS by using the command below.
hadoop fs -put <localpath> <hdfspath>
Create another Sample File
FileB.txt (tab-delimited):
2	4
8	9
1	3
2	7
2	9
4	6
4	9
Move it to HDFS by using the command below. Both files can then be
loaded in Pig, as sketched after this step.
hadoop fs -put <localpath> <hdfspath>
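A minimal sketch of loading both files in the Grunt shell. The paths and
field names (a1..a3, b1, b2) are illustrative; PigStorage defaults to tab as
the delimiter, matching the files above.
A = LOAD 'FileA.txt' AS (a1:int, a2:int, a3:int);
B = LOAD 'FileB.txt' AS (b1:int, b2:int);
DUMP A; -- (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)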
FLATTEN and CROSS
• For tuples, FLATTEN substitutes the fields of a tuple in place of the
tuple (see the sketch below).
• For example, consider a relation with tuples of the form (a, (b, c)).
• GENERATE $0, FLATTEN($1) yields
– (a, b, c).
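A minimal sketch, assuming a hypothetical file 'nested.txt' whose rows
contain a field followed by a tuple:
A = LOAD 'nested.txt' AS (a:chararray, t:tuple(b:chararray, c:chararray));
X = FOREACH A GENERATE $0, FLATTEN($1); -- each (a,(b,c)) becomes (a,b,c)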
Example (with A and B loaded from FileA.txt and FileB.txt above):
X = CROSS A, B;
(1, 2, 3, 2, 4)
(1, 2, 3, 8, 9)
(1, 2, 3, 1, 3)
(1, 2, 3, 2, 7)
(1, 2, 3, 2, 9)
(1, 2, 3, 4, 6)
(1, 2, 3, 4, 9)
(4, 2, 1, 2, 4)
(4, 2, 1, 8, 9)
SPLIT
• SPLIT partitions a relation into two or more relations based on boolean
conditions, as in the sketch below.
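A minimal sketch of SPLIT, reusing the student.txt file from the practice
session (the field names are assumptions):
S = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:double);
SPLIT S INTO adults IF age >= 19, minors IF age < 19;
DUMP adults; -- (Mary,19,3.8) and (Bill,20,3.9)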
A separate exercise using GROUP and MAX. Input: CricketScore.txt
a = load '/user/cloudera/SampleDataFile/CricketScore.txt' using PigStorage('\t');
b = foreach a generate $0, $1;
c = group b by $0;
d = foreach c generate group, MAX(b.$1); -- built-in function names such as MAX are case-sensitive
dump d;
Sorting Data
Often you are not interested in the entire output, but rather a
sample or the top results. In such cases, LIMIT can yield much
better performance, because Pig pushes the limit as early in the
plan as possible to minimize the amount of data travelling
through the pipeline.
• Use the LIMIT Operator
– A = load 'myfile' as (t, u, v);
– B = order A by t;
– C = limit B 500;
Performance Tuning
If types are not specified in the load statement, Pig assumes the
type of double for numeric computations. Much of the time your
data would fit a smaller type, such as int or long, and specifying
the real type helps the speed of arithmetic computation.
• Use Types
– --Query 1
• A = load 'myfile' as (t, u, v);
• B = foreach A generate t + u;
– --Query 2
• A = load 'myfile' as (t: int, u: int, v);
• B = foreach A generate t + u;
• The second query will run more efficiently than the first. In
some queries this gives a 2x speedup.
Performance Tuning
• Drop null join keys before the join (see the sketch below); nulls
are then filtered out before the join runs. Since all null keys go
to a single reducer, if your key is null even a small percentage
of the time the gain can be significant.
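A minimal sketch of the technique (the file names and fields are
illustrative, following the style of the earlier examples):
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
-- filter out null join keys before the join so they never reach the
-- single reducer that would otherwise receive every null key
A1 = filter A by t is not null;
B1 = filter B by x is not null;
C = join A1 by t, B1 by x;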
Performance Tuning
Regular join
• sales = join transactions by (state, country), geography by (state,
country);
• When one of the keys is much more common than the others and the
data for it is too large to fit in memory, a skewed join (using 'skewed')
handles the imbalance that would overload a single reducer.
Merge join
• A merge join can be used when the two datasets are both sorted in
ascending order by the join key.
• Datasets may already be sorted by the join key if that is the order in
which the data was entered, or if they underwent sorting before the
join for other needs.
• When a merge join receives the pre-sorted datasets, they are read
and compared on the map side, and as a result they run faster. Both
inner and outer joins are available.
• transactions = load 'customer_transactions' as (fname, lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district, manager);
• sales = join transactions by (state, country), geography by (state, country) using 'merge';
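If the inputs are not already sorted, one approach is to sort and store
them once, then merge-join the sorted copies on later runs (the
*_sorted paths below are illustrative, not from the original deck):
tx_sorted = order transactions by state, country;
store tx_sorted into 'customer_transactions_sorted';
geo_sorted = order geography by state, country;
store geo_sorted into 'geo_data_sorted';
-- later runs can load the sorted files and join using 'merge'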
Thank You
• Questions?
• Feedback?