
PIG: A Big Data Processor

What is Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin.
• This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
Apache Pig

• To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language.
• All these scripts are internally converted to Map and Reduce tasks.
• Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
Why do we need Apache Pig?

• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require 200 lines of code (LoC) in Java can be done in as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, and ordering. In addition, it provides nested data types like tuples, bags, and maps that are missing from MapReduce (see the sketch below).
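
A minimal sketch of this multi-query style (the file name and schema here are hypothetical): a load, a filter, a grouping, and a count take four lines of Pig Latin, where the equivalent hand-written Java MapReduce code would run to hundreds of lines.

users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
by_age = GROUP adults BY age;
counts = FOREACH by_age GENERATE group, COUNT(adults);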
Features of Pig

• Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.
• Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs: Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
Pig vs. MapReduce

Pig vs. SQL

Pig vs. Hive
Applications of Apache Pig

• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
Apache Pig – History

• In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on large datasets.
• In 2007, Apache Pig was open sourced via the Apache Incubator.
• In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.
Pig Architecture
Apache Pig – Components

• Parser: Initially, Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) representing the Pig Latin statements and logical operators.
• Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
• Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
• Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and are executed on Hadoop, producing the desired results (see the EXPLAIN example below).
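
You can inspect the plans these components produce with the EXPLAIN operator, which prints the logical, physical, and MapReduce plans for a relation (assuming a relation such as movies has already been defined):

grunt> EXPLAIN movies;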
Apache Pig – Data Model
Apache Pig – Elements

• Atom
– Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
– It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig.
– A piece of data or a simple atomic value is known as a field.
– Example: ‘raja’ or ‘30’
Apache Pig – Elements

• Tuple
– A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
– Example: (Raja, 30)
Apache Pig – Elements

• Bag
– A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
– Example: {(Raja, 30), (Mohammad, 45)}
– A bag can be a field in a relation; in that context, it is known as an inner bag.
– Example: {Raja, 30, {9848022338, [email protected]}}
Apache Pig – Elements

• Relation
– A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
• Map
– A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. It is represented by ‘[]’.
– Example: [name#Raja, age#30]
Installation of PIG
Download

• Download the tar.gz file of Apache Pig from here:
http://mirror.fibergrid.in/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz
Extract and copy

• Extract this file using the right-click -> 'Extract here' option or with the tar -xzvf command.
• Rename the created folder 'pig-0.15.0' to 'pig'.
• Now move this folder to /usr/lib using the following command:

$ sudo mv pig/ /usr/lib
Edit the bashrc file

• Open the bashrc file:

sudo gedit ~/.bashrc

• Go to the end of the file and add the following lines:

export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin

• Type the following command to bring the changes into effect:

source ~/.bashrc
Start Pig

• Start Pig in local mode:

pig -x local

• Start Pig in MapReduce mode (needs the Hadoop daemons started):

pig -x mapreduce
Grunt shell
Data Processing with PIG
Example: movies_data.csv

1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
5,Ayatya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956
Load data

• $ pig -x local
• grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
• grunt> dump movies;
• It displays the contents of the relation.
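
For the sample movies_data.csv above, the dump should print one tuple per input line:

(1,Dhadakebaz,1986,3.2,7560)
(2,Dhumdhadaka,1985,3.8,6300)
(3,Ashi hi banva banvi,1988,4.1,7802)
...and so on through (10,Pachhadlela,2004,3.1,6956).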


Filter data

• grunt> movies_greater_than_35 = FILTER movies BY (float)rating > 3.5;
• grunt> dump movies_greater_than_35;
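
With the sample data, this filter keeps the four movies whose rating exceeds 3.5:

(2,Dhumdhadaka,1985,3.8,6300)
(3,Ashi hi banva banvi,1988,4.1,7802)
(4,Zapatlela,1993,3.7,6022)
(6,Navra Maza Navsacha,2004,3.9,4904)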


Store the results data

• grunt> store movies_greater_than_35 into 'my_movies';
• It stores the result in a local file system directory named 'my_movies'.
Display the result

• Now display the result from the local file system:

cat my_movies/part-m-00000
Load command

• The load command above specified only the column names. We can modify the statement as follows to include the data types of the columns:

• grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:double, duration:int);
Check the filters

• List the movies that were released between 1990 and 1995:
grunt> movies_between_90_95 = FILTER movies by year > 1990 and year < 1995;
• List the movies that start with the alphabet D:
grunt> movies_starting_with_D = FILTER movies by name matches 'D.*';
• List the movies that have a duration greater than 2 hours:
grunt> movies_duration_2_hrs = FILTER movies by duration > 7200;
Output

(Screenshots in the original slide show three result sets: movies between 1990 and 1995, movies starting with 'D', and movies longer than 2 hours.)
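
Working from the sample data, the three filters should return, respectively:

(4,Zapatlela,1993,3.7,6022) (5,Ayatya Gharat Gharoba,1991,3.4,5420)

(1,Dhadakebaz,1986,3.2,7560) (2,Dhumdhadaka,1985,3.8,6300) (7,De danadan,1987,3.4,5623)

(1,Dhadakebaz,1986,3.2,7560) (3,Ashi hi banva banvi,1988,4.1,7802) (8,Gammat Jammat,1987,3.4,7563)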
Describe

• The schema of a relation/alias can be viewed using the DESCRIBE command:

grunt> DESCRIBE movies;

movies: {id: int, name: chararray, year: int, rating: double, duration: int}
Foreach

• FOREACH gives a simple way to apply transformations based on columns. Let’s understand this with an example.
• List the movie names and their durations in minutes:
grunt> movie_duration = FOREACH movies GENERATE name, (double)duration/60;
• The above statement generates a new alias that has the list of movies and their durations in minutes. (Note that the cast is applied before dividing; (double)(duration/60) would perform integer division first and lose the fractional minutes.)
• You can check the results using the DUMP command.
Output
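
With the cast applied before the division, the first few tuples for the sample data should look like:

(Dhadakebaz,126.0)
(Dhumdhadaka,105.0)
(Ashi hi banva banvi,130.03333333333333)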
Group

• The GROUP keyword is used to group fields in a relation.
• List the years and the number of movies released each year:

grunt> grouped_by_year = group movies by year;
grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
Output
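
For the sample data, the per-year counts work out to (ordering may vary):

(1985,1)
(1986,1)
(1987,2)
(1988,1)
(1990,1)
(1991,1)
(1993,1)
(2004,2)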
Order by

• Let us query the data to illustrate the ORDER BY operation.
• List all the movies in the ascending order of year:
grunt> asc_movies_by_year = ORDER movies BY year ASC;
grunt> DUMP asc_movies_by_year;
• List all the movies in the descending order of year:
grunt> desc_movies_by_year = ORDER movies BY year DESC;
grunt> DUMP desc_movies_by_year;
Output – Ascending by year

(Screenshot shows the movies ordered by year, from 1985 to 2004.)
Limit

• Use the LIMIT keyword to get only a limited number of results from a relation:

grunt> top_5_movies = LIMIT movies 5;
grunt> DUMP top_5_movies;
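
Note that LIMIT by itself does not guarantee which tuples are returned; to get a true top five, order the relation first. A minimal sketch (the aliases here are hypothetical):

grunt> by_rating = ORDER movies BY rating DESC;
grunt> top_5_by_rating = LIMIT by_rating 5;
grunt> DUMP top_5_by_rating;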
Pig: Modes of Execution

• Pig programs can be run in three ways, all of which work in both local and MapReduce mode. They are:
– Script Mode
– Grunt Mode
– Embedded Mode
Script mode

• Script Mode or Batch Mode: In script mode, Pig runs the commands specified in a script file. The following example shows how to run a Pig program from a script file:

$ vim scriptfile.pig
A = LOAD 'script_file';
DUMP A;

$ pig -x local scriptfile.pig
Grunt mode

• Grunt Mode or Interactive Mode: The grunt mode can also be called interactive mode. Grunt is Pig's interactive shell. It is started when no file is specified for Pig to run.

$ pig -x local
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;

• You can also run Pig scripts from grunt using the run and exec commands; run executes the script in the current Grunt shell context (its aliases remain accessible afterwards), while exec runs it in a separate context.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
Embedded mode

• You can embed Pig programs in Java, Python, and Ruby and run them from those languages (in Java, for example, via the org.apache.pig.PigServer class).
Example: Wordcount program

• Q) How do we find the number of occurrences of the words in a file using a Pig script?
• You can find the famous word count example written as MapReduce programs on the Apache website. Here we will write a simple Pig script for the word count problem.
• The Pig script given on the next slide finds the number of times each word is repeated in a file.
Example: text file – shivneri.txt
Example: Wordcount program

lines = LOAD 'shivneri.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group, COUNT(words);
DUMP w_count;

(saved as forts.pig)
Output snapshot

$ pig -x local forts.pig
References

• “Programming Pig” by Alan Gates, O'Reilly Media.
• “Pig Design Patterns” by Pradeep Pasupuleti, Packt Publishing.
• Tutorials Point
• http://github.com/rohitdens
• http://pig.apache.org
