Pig_2
PIG Latin
• Pig Latin is a data flow language used for exploring large data sets.
• Rapid development
• No Java is required.
• It is a high-level platform for creating MapReduce programs used
with Hadoop.
• Pig was originally developed at Yahoo! Research around 2006 to give
researchers an ad-hoc way of creating and executing MapReduce
jobs on very large data sets. In 2007, it was moved into the
Apache Software Foundation.
• Like actual pigs, who eat almost anything, the Pig programming
language is designed to handle any kind of data—hence the name!
What Pig Does
• Pig takes scripts written in Pig Latin and compiles them into a series of
MapReduce jobs that run on a Hadoop cluster.
Installing Pig
• To install Pig:
• untar the .gz file using tar -xvzf pig-0.13.0-bin.tar.gz
• To initialize the environment variables, export the following:
• export PIG_HADOOP_VERSION=20
(Specifies the version of hadoop that is running)
• export HADOOP_HOME=/home/(user-name)/hadoop-0.20.2
(Specifies the installation directory of hadoop to the environment
variable HADOOP_HOME. Typically defined as /home/user-
name/hadoop-version)
• export PIG_CLASSPATH=$HADOOP_HOME/conf
(Specifies the class path for pig)
• export PATH=$PATH:/home/user-name/pig-0.13.0/bin
(for setting the PATH variable)
• export JAVA_HOME=/usr
(Specifies the java home to the environment variable.)
PIG Modes
• Pig runs in two execution modes: local mode (pig -x local), which runs in
a single JVM against the local filesystem, and MapReduce mode
(pig -x mapreduce, the default), which runs jobs on a Hadoop cluster.
Data Types
• Data type is a data storage format that can contain a specific type or
range of values.
– Scalar types
• Sample: int, long, double, chararray, bytearray
– Complex types
• Sample: Tuple, Bag, Map (an Atom, by contrast, is a single scalar value)
• Users can declare data types at load time, as below.
– A = LOAD 'test.data' USING PigStorage(',') AS (sno:chararray,
name:chararray, marks:long);
• If a data type is not declared but the script treats a value as a certain
type, Pig will assume it is of that type and cast it.
– A = LOAD 'test.data' USING PigStorage(',') AS (sno, name,
marks);
– B = FOREACH A GENERATE marks * 100; -- marks cast to long
Data types continued…
• Pig Latin statements can span multiple lines and must end
with a semicolon ( ; ), as in the sketch below.
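A minimal sketch of one statement spanning several lines; the file and
schema follow the student.txt example used later in this deck (the field
names name, age, gpa are assumptions):
A = LOAD 'student.txt'
    USING PigStorage(',')
    AS (name:chararray, age:int, gpa:double);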
Pig: The Programming Language
• In interactive mode, STORE acts like DUMP and will always trigger
execution (this includes the run command), but in batch mode it will not
(this includes the exec command).
• The reason for this is efficiency. In batch mode, Pig will parse the
whole script to see whether there are any optimizations that could be
made to limit the amount of data to be written to or read from disk.
Consider the following simple example:
• A = LOAD 'input/pig/multiquery/A';
• B = FILTER A BY $1 == 'banana';
• C = FILTER A BY $1 != 'banana';
• STORE B INTO 'output/b';
• STORE C INTO 'output/c';
When the Pig Latin interpreter sees the first line containing the LOAD
statement, it confirms that it is syntactically and semantically correct
and adds it to the logical plan, but it does not load the data from the file
(or even check whether the file exists).
The point is that it makes no sense to start any processing until the
whole flow is defined. Similarly, Pig validates the subsequent FILTER
and STORE statements and adds them to the logical plan without
executing them. The trigger for Pig to start execution is a DUMP
statement (or, in interactive mode, a STORE). At that point, the logical
plan is compiled into a physical plan and executed, as in the sketch
below.
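A minimal sketch of this behavior, reusing the relations from the example
above: nothing runs until the DUMP.
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
-- up to this point Pig has only built the logical plan; no data has been read
DUMP B; -- execution is triggered here: the plan is compiled and run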
Practice Session
Create a sample file
John,18,4.0
Mary,19,3.8
Bill,20,3.9
Joe,18,3.8
Save it as “student.txt”
FileA.txt (tab-delimited):
1	2	3
4	2	1
8	3	4
4	3	3
7	2	5
8	4	3
Move it to HDFS by using the command below.
hadoop fs -put <localpath> <hdfspath>
Create another Sample File
FileB.txt (tab-delimited):
2	4
8	9
1	3
2	7
2	9
4	6
4	9
Move it to HDFS by using the command below. Both files can then be
loaded in Pig, as sketched after this step.
hadoop fs -put <localpath> <hdfspath>
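A minimal sketch of loading both files in the Grunt shell. The paths and
field names (a1..a3, b1, b2) are illustrative; PigStorage defaults to tab as
the delimiter, matching the files above.
A = LOAD 'FileA.txt' AS (a1:int, a2:int, a3:int);
B = LOAD 'FileB.txt' AS (b1:int, b2:int);
DUMP A; -- (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)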
FLATTEN and CROSS
• For tuples, FLATTEN substitutes the fields of a tuple in place of the
tuple (see the sketch below).
• For example, consider a relation with tuples of the form (a, (b, c)).
• GENERATE $0, FLATTEN($1) yields
– (a, b, c).
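A minimal sketch, assuming a hypothetical file 'nested.txt' whose rows
contain a field followed by a tuple:
A = LOAD 'nested.txt' AS (a:chararray, t:tuple(b:chararray, c:chararray));
X = FOREACH A GENERATE $0, FLATTEN($1); -- each (a,(b,c)) becomes (a,b,c)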
Example (with A and B loaded from FileA.txt and FileB.txt above):
X = CROSS A, B;
(1, 2, 3, 2, 4)
(1, 2, 3, 8, 9)
(1, 2, 3, 1, 3)
(1, 2, 3, 2, 7)
(1, 2, 3, 2, 9)
(1, 2, 3, 4, 6)
(1, 2, 3, 4, 9)
(4, 2, 1, 2, 4)
(4, 2, 1, 8, 9)
SPLIT
• SPLIT partitions a relation into two or more relations based on boolean
conditions, as in the sketch below.
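A minimal sketch of SPLIT, reusing the student.txt file from the practice
session (the field names are assumptions):
S = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:double);
SPLIT S INTO adults IF age >= 19, minors IF age < 19;
DUMP adults; -- (Mary,19,3.8) and (Bill,20,3.9)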
A separate exercise using GROUP and MAX. Input: CricketScore.txt
a = load '/user/cloudera/SampleDataFile/CricketScore.txt' using PigStorage('\t');
b = foreach a generate $0, $1;
c = group b by $0;
d = foreach c generate group, MAX(b.$1); -- built-in function names such as MAX are case-sensitive
dump d;
Sorting Data
Often you are not interested in the entire output, but rather a
sample or the top results. In such cases, LIMIT can yield much
better performance, because Pig pushes the limit as early in the
plan as possible to minimize the amount of data travelling
through the pipeline.
• Use the LIMIT Operator
– A = load 'myfile' as (t, u, v);
– B = order A by t;
– C = limit B 500;
Performance Tuning
If types are not specified in the load statement, Pig assumes the
type of double for numeric computations. Much of the time your
data would fit a smaller type, such as int or long, and specifying
the real type helps the speed of arithmetic computation.
• Use Types
– --Query 1
• A = load 'myfile' as (t, u, v);
• B = foreach A generate t + u;
– --Query 2
• A = load 'myfile' as (t: int, u: int, v);
• B = foreach A generate t + u;
• The second query will run more efficiently than the first. In
some queries this gives a 2x speedup.
Performance Tuning
• Drop null join keys before the join (see the sketch below); nulls
are then filtered out before the join runs. Since all null keys go
to a single reducer, if your key is null even a small percentage
of the time the gain can be significant.
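A minimal sketch of the technique (the file names and fields are
illustrative, following the style of the earlier examples):
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
-- filter out null join keys before the join so they never reach the
-- single reducer that would otherwise receive every null key
A1 = filter A by t is not null;
B1 = filter B by x is not null;
C = join A1 by t, B1 by x;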
Performance Tuning
Regular join
• sales = join transactions by (state, country), geography by (state,
country);
• When one of the keys is much more common than the others and the
data for it is too large to fit in memory, a skewed join (using 'skewed')
handles the imbalance that would overload a single reducer.
Merge join
• A merge join can be used when the two datasets are both sorted in
ascending order by the join key.
• Datasets may already be sorted by the join key if that is the order in
which the data was entered, or if they underwent sorting before the
join for other needs.
• When a merge join receives the pre-sorted datasets, they are read
and compared on the map side, and as a result they run faster. Both
inner and outer joins are available.
• transactions = load 'customer_transactions' as (fname, lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district, manager);
• sales = join transactions by (state, country), geography by (state, country) using 'merge';
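If the inputs are not already sorted, one approach is to sort and store
them once, then merge-join the sorted copies on later runs (the
*_sorted paths below are illustrative, not from the original deck):
tx_sorted = order transactions by state, country;
store tx_sorted into 'customer_transactions_sorted';
geo_sorted = order geography by state, country;
store geo_sorted into 'geo_data_sorted';
-- later runs can load the sorted files and join using 'merge'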
Thank You
• Questions?
• Feedback?