
BIG DATA ANALYTICS
Part – 4: "PIG"
By: Dr. Rashmi L Malghan & Ms. Shavantrevva S Bilakeri
Agenda:
Apache PIG
Pig was introduced by Yahoo! and was later made fully open source under the Apache Software Foundation.

Like Hive, it provides a bridge to query data over Hadoop clusters, but unlike Hive it takes a scripting approach, making Hadoop data accessible to developers and business users.

Apache Pig provides a high-level programming platform for developers to process and analyze Big Data using user-defined functions, with modest programming effort.

In January 2013, Apache released Pig 0.10.1 for use with compatible Hadoop releases.
MapReduce Way vs. Apache PIG:
• Focus on the data transformations rather than the underlying MapReduce implementation.

• Apache Pig's high-level dataflow engine simplifies the development of large-scale data processing tasks on Hadoop clusters by providing an abstraction layer and leveraging the power of MapReduce without requiring users to write complex Java code.
Key Features and example of Pig Latin code

Key features:
 Declarative Language (Pig Latin)
 Abstraction from MapReduce
 Data Flow Model
 Schema Flexibility
 Optimization Opportunities

Example:
-- Load data
data = LOAD 'input_data' USING PigStorage(',');
-- Filter data
filtered_data = FILTER data BY $1 > 50;
-- Group and aggregate
grouped_data = GROUP filtered_data BY $0;
result = FOREACH grouped_data GENERATE group, AVG(filtered_data.$1);
-- Store the result
STORE result INTO 'output';
Apache PIG:
Why Opt for Pig instead of MapReduce:
PIG VS MapReduce
Apache Pig: Advantages
PIG Anatomy
1. Data flow language: Pig Latin
2. Interactive shell: Grunt
3. Pig interpreter and execution engine

PIG supports:
1. HDFS commands
2. UNIX shell operators
3. Relational operators
4. Positional operators
5. Mathematical functions
6. User defined functions
7. Complex data structures

PIG Philosophy
Apache Pig – Architecture / Pig MapReduce Engine

• Pig Latin Scripts: Execute queries over big data.

• Grunt Shell: Native shell provided by Apache Pig to execute Pig queries; Pig scripts are submitted through a Java client to the Pig server and executed over Apache Pig.

• Compiler: Converts Pig Latin scripts to Apache MapReduce code, which is executed by the execution engine over the Hadoop cluster.
Operators in Apache Pig
Pig Latin operators are the basic constructs that allow data manipulation in Apache Pig. Some commonly used operators include:
• LOAD and STORE: These operators are used to read and write data.
• FILTER: The FILTER operator is used to remove unwanted data based on a condition.
• GROUP: The GROUP operator is used to group the data in one or more relations.
• JOIN: The JOIN operator merges two or more relations (a sketch follows this list).
• SPLIT: The SPLIT operator is used to split a single relation into two or more relations based on some condition.
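As a minimal sketch of JOIN (the file names and schemas here are assumptions for illustration, not from the slides):

-- Hypothetical inputs: customers.txt (id,name) and orders.txt (oid,cust_id,amount)
customers = LOAD 'customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (oid:int, cust_id:int, amount:double);
-- Inner join on the customer id
joined = JOIN customers BY id, orders BY cust_id;
DUMP joined;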
Pig Syntax used for Data Processing
X = LOAD 'filename.txt';   -- file or directory name
...
Y = GROUP ...;
...
Z = FILTER ...;
...
DUMP Y;   -- view result on screen
...
STORE Z INTO 'temp';
Example Pig Latin script: find the total distance travelled by a flight

-- Load the flight data
flight_data = LOAD 'path/to/flight_data' USING PigStorage(',') AS (date:chararray, distance:int);

-- Filter out empty or invalid distance values
filtered_data = FILTER flight_data BY distance IS NOT NULL AND distance >= 0;

-- Calculate the total distance covered
total_distance = FOREACH (GROUP filtered_data ALL) GENERATE SUM(filtered_data.distance) AS total_distance;

-- Display the result
DUMP total_distance;
Question: Load the student data (assuming the data contains rollno, name, gpa), remove the students whose GPA is less than 5.0. From the filtered data, find the student name with the highest GPA. Display and store the result to an output file.

SOLUTION (sketch)

A = LOAD 'student' AS (rollno, name, gpa);   -- A is a relational table (a relation), not a variable
A = FILTER A BY gpa > 5.0;
A = FOREACH A GENERATE UPPER(name);
STORE A INTO 'myreport';

Note: when no schema is given with AS, columns are referenced positionally as $0, $1, $2.

The complete script:
-- Load student data
student_data = LOAD 'path/to/student_data' USING PigStorage(',') AS (rollno:chararray, name:chararray, gpa:float);

-- Filter out students with GPA less than 5.0
filtered_data = FILTER student_data BY gpa >= 5.0;

-- Find the student with the highest GPA
max_gpa_student = ORDER filtered_data BY gpa DESC;
top_student = LIMIT max_gpa_student 1;

-- Display the result
DUMP top_student;

-- Store the result in an output file
STORE top_student INTO 'path/to/output' USING PigStorage(',');
PIG LATIN IDENTIFIERS and COMMENTS
 Identifiers are the names assigned to fields or other data structures.
 They should begin with a letter and may be followed only by letters, digits, and underscores.
 Valid examples: Y, A1, A1_2014, Sample
 Invalid examples: 5, sales$, sales%, _sales
 Single-line comments begin with --
 Multi-line comments begin with /* and end with */
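A minimal sketch combining both comment styles with a valid identifier (the file name and schema are assumptions for illustration):

/* multi-line comment:
   load a sales file */
A1_2014 = LOAD 'sales' USING PigStorage(',') AS (region:chararray, amount:double);
DUMP A1_2014;   -- single-line comment: display the relation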
Case Study: Twitter

• Objective: to increase their user base and enhance their offerings.
• Procedure: extract insights monthly/weekly/daily.
• Result: scale up their infrastructure so that they can handle the larger user base they are targeting.
Case Study: Twitter
High Level implementation – HDFS & Pig

• The Twitter database had many tables in which archive data was stored.
• The insights they wanted to extract related to the Tweet and User tables.
Implementation Flow – Detail

 Import: Sqoop transfers data between an RDBMS and HDFS, in either direction.

 Twitter used Pig instead of MapReduce, saving time and effort.
Case Study: Twitter
What happens underneath the covers when you run/submit a Sqoop
import job
• Sqoop will connect to the database.
• Sqoop uses JDBC to examine the table by retrieving a list of all the columns and their SQL
data types. These SQL types (varchar, integer and more) can then be mapped to Java data types
(String, Integer etc.)
• Sqoop’s code generator will use this information to create a table-specific class to hold records
extracted from the table.
• Sqoop will connect to the cluster and submit a MapReduce job.
• The dataset being transferred is sliced up into different partitions, and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.
• For databases, Sqoop will read the table row by row into HDFS.
• For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS.
• The output of this import process is a set of files containing a copy of the imported table or datasets.
• The import process is performed in parallel; for this reason, the output is in multiple files.
• These files may be delimited text files (CSV, TSV) or binary Avro or SequenceFiles containing serialized record data. By default the output is CSV.
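A hedged sketch of such an import command (the host, database, credentials, and table name are assumptions for illustration):

sqoop import \
  --connect jdbc:mysql://dbhost/twitter \
  --username dbuser -P \
  --table tweets \
  --target-dir /pig_data/tweets \
  --num-mappers 4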
Pig Latin: Case Sensitivity

 Keywords/operators are not case sensitive. Ex: LOAD, STORE, GROUP, FOREACH, DUMP
 Relations and paths are case sensitive.
 Function names are case sensitive. Ex: PigStorage, COUNT
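A minimal sketch illustrating these rules (the file name is an assumption):

-- keywords are not case sensitive: 'load' and 'LOAD' are the same keyword
a = load 'student' using PigStorage(',') as (name:chararray);
-- relation names ARE case sensitive: 'a' and 'A' are two different relations
A = FILTER a BY name IS NOT NULL;
-- function names are case sensitive: COUNT works, 'count' would fail
g = GROUP A ALL;
n = FOREACH g GENERATE COUNT(A);
DUMP n;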
Pig Latin – Arithmetic Operators
Pig Latin – Comparison
Operators
Pig Latin – Relational Operations
Pig Latin – Relational Operations
Pig Latin – Type Construction Operators
Apache Pig - Diagnostic Operators
To verify the execution of the LOAD statement, you have to use the diagnostic operators:

1. Dump operator
2. Describe operator
3. Explain operator (see the note below)
4. Illustrate operator
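Dump, Describe, and Illustrate are covered on the following slides. For completeness, a brief note on Explain: it displays the logical, physical, and MapReduce execution plans of a relation.

grunt> explain Relation_name;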
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.

Syntax of the Dump operator:
grunt> Dump Relation_Name;
Describe operator

• The describe operator is used to view the schema of a relation.

Syntax
grunt> Describe Relation_name;

Example
grunt> describe student;
student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }
ILLUSTRATE OPERATOR
The illustrate operator gives you the step-by-step execution
of a sequence of statements.

Syntax
grunt> illustrate Relation_name;
GROUP operator
The GROUP operator is used to group the data in one or more
relations. It collects the data having the same key.

Syntax
grunt> Group_data = GROUP Relation_name BY age;

Example
grunt> group_data = GROUP student_details BY age;

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
• The COGROUP operator works more or less in the same way as the GROUP operator.

• The only difference between the two operators is that the group operator is normally used
with one relation, while the cogroup operator is used in statements involving two or more
relations.
Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the
HDFS directory /pig_data/ .
grunt> cogroup_data = COGROUP student_details by age, employee_details by
age;
The UNION operator of Pig Latin is used to merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.

Syntax: grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

Example
grunt> student = UNION student1, student2;
The SPLIT operator is used to split a relation into two or more relations.
Syntax: grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

Example
SPLIT student_details INTO student_details1 IF age < 23, student_details2 IF (22 < age AND age < 25);
The FILTER operator is used to select the required tuples from a relation based on a condition.

Syntax
grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example
filter_data = FILTER student_details BY city == 'Chennai';

Output
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.

Syntax: grunt> Relation_name2 = DISTINCT Relation_name1;

The FOREACH operator is used to generate specified data transformations based on the column data.

Syntax: grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example
grunt> foreach_data = FOREACH student_details GENERATE id, age, city;
The ORDER BY operator is used to display the contents of a relation in sorted order based on one or more fields.

Syntax
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

Example
grunt> order_by_data = ORDER student_details BY age DESC;
The LIMIT operator is used to get a limited number of tuples from a relation.

Syntax
grunt> Result = LIMIT Relation_name number_of_tuples;

Example:
grunt> limit_data = LIMIT student_details 4;
Apache Pig - Eval Functions
Apache Pig provides various built-in functions, namely eval, load, store, math, string, bag and tuple functions.
Apache Pig - Eval Functions
Apache Pig - Load & Store Functions
Apache Pig - Bag & Tuple
Functions
Apache Pig - String Functions
Apache Pig - Date-time Functions
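The function tables for these categories appeared as slide images. As a small hedged illustration of a few widely used built-ins (the file name and schema are assumptions):

-- Eval functions (COUNT, AVG, MAX) and string functions (UPPER, SUBSTRING)
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, gpa:double);
all_grp = GROUP students ALL;
stats = FOREACH all_grp GENERATE COUNT(students), AVG(students.gpa), MAX(students.gpa);
names = FOREACH students GENERATE UPPER(name), SUBSTRING(name, 0, 3);
DUMP stats;
DUMP names;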
Apache Pig - Components

• Various ways to execute Pig scripts.
• Embedded: execute over PigServer.
• Pig Latin: a very simple data flow language provided by Apache Pig.
• Loading, transformation, and analysis can be performed over the input data set.
Pig – Execution Modes
Running PIG
1. Interactive mode: run Pig in interactive mode by invoking the Grunt shell.
2. Batch mode: create a Pig script to run in batch mode; write Pig Latin statements in a file and save it with a .pig extension (invocations sketched below).
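A hedged sketch of the corresponding invocations (the script name is an assumption):

pig -x local                       # interactive Grunt shell, local mode
pig -x mapreduce                   # interactive Grunt shell, MapReduce mode (the default)
pig -x mapreduce wordcount.pig     # batch mode: run a .pig script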
Executing Pig Script in Batch mode
Executing a Pig Script from HDFS
We can also execute a Pig script that resides in HDFS. Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory /pig_data/. We can execute it as shown below.

$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig


Data Model : Pig
Pig Data Model: Bag & Tuple
Pig Data Model: Map & Atom
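The data model slides above were presented as images. A minimal hedged sketch of the four data types expressed in a LOAD schema (field names invented for illustration):

rec = LOAD 'sample' AS (
    id:int,                        -- atom: a simple scalar value
    point:tuple(x:int, y:int),     -- tuple: an ordered set of fields
    tags:bag{t:(tag:chararray)},   -- bag: a collection of tuples
    props:map[]                    -- map: a set of key#value pairs
);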
Pig: Operators
Review quiz:
 Can you match?
 Fill in the blanks – word bank: scripting, Pig Latin, Pig engine, local, mapreduce, grunt, relation, path, bag, tuple, map, ETL
 True / False
Piggy Bank

• Apache Pig provides extensive support for User Defined Functions (UDFs).

• UDF support is provided in six programming languages, namely Java, Jython, Python, JavaScript, Ruby and Groovy.

• For writing UDFs, complete support is provided in Java; limited support is provided in all the remaining languages.

• Since Apache Pig is written in Java, UDFs written in Java work more efficiently than those written in other languages.

• In Apache Pig, we also have a Java repository for UDFs named Piggybank.

• Users can use Piggybank functions in Pig Latin scripts and can share their own functions via Piggybank, as sketched below.
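A hedged sketch of registering and using a Piggybank function (the jar path varies by installation; the class name follows the convention used in the Pig documentation):

REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
DEFINE MyUpper org.apache.pig.piggybank.evaluation.string.UPPER();
A = LOAD 'student' USING PigStorage(',') AS (name:chararray);
B = FOREACH A GENERATE MyUpper(name);
DUMP B;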
PIG EXECUTION : Load and Store data locally and on Hadoop
Step 1: Create input.txt file

Step 2: Transfer it to HDFS
hdfs dfs -put /home/hdoop/input.txt /bda1/

Step 3: Create the Pig script file
sudo gedit pigscript.pig
OR
vi pigscript.pig

Code to be typed in pigscript.pig:
record = LOAD '/bda1/input.txt';
STORE record INTO '/bda1/out';

Step 4: Run the script in MapReduce mode
pig -x mapreduce pigscript.pig

Step 5: Check the status of execution
Output: Found 2 items
-rw-r--r-- 2 hdoop supergroup 0 2024-01-12 15:05 /bda1/out/_SUCCESS
-rw-r--r-- 2 hdoop supergroup 112 2024-01-12 15:05 /bda1/out/part-m-00000

Step 6: View the output file
hdfs dfs -cat /bda1/out/part-m-00000
WORD COUNT PROGRAM

Step 1: Create a text file and add some content to it.

Step 2: Open a .pig file (e.g. wordcount.pig) and edit the following script into it:

-- LOAD THE DATA
records = LOAD '/pig1/input.txt';
-- SPLIT EACH LINE OF TEXT AND ELIMINATE NESTING
terms = FOREACH records GENERATE FLATTEN(TOKENIZE((chararray) $0)) AS word;
-- GROUP SIMILAR TERMS
grouped_terms = GROUP terms BY word;
-- COUNT THE NUMBER OF TUPLES IN EACH GROUP
word_counts = FOREACH grouped_terms GENERATE COUNT(terms), group;
-- STORE THE RESULT
STORE word_counts INTO '/pig1/output';

Step 3: Type pig at the command prompt to start the Grunt shell.

Step 4: grunt> run wordcount.pig

Step 5: grunt> pwd

Step 6: grunt> cd /pig1/output

Step 7: grunt> ls
Output:
hdfs://192.168.159.101:9000/pig1/output/_SUCCESS<r 2> 0
hdfs://192.168.159.101:9000/pig1/output/part-r-00000<r 2> 127

Step 8: grunt> cat part-r-00000
Output:
2 i
2 am
1 hi
2 in
1 are
2 how
1 studying
1 too
2 you
1 Data
2 good
1 hope
1 you.
1 Btech
1 about
1 doing
1 manipal
1 science.
