
ADVANCED DATA ANALYTICS

LAB RECORD
4/4 B.TECH (Computer Science and Engineering)

I Semester

Certificate

This is to certify that the experiments recorded in this book are the bonafide work
of …………………………………. student of …………..………………. carried
out in the subject ……………………. at Bapatla Engineering College, Bapatla
during the year 2020 – 2021. No. of experiments recorded is …………….

Date :
Lecturer-in-charge                                        Head of the Department
Department of Computer Science and Engineering

Bapatla Engineering College


INDEX
S.NO NAME OF THE EXPERIMENT
1) Hadoop Installation
2) Hadoop Commands
3) MapReduce - WordCount
4) PIG Installation & PIG Operators
5) PIG Scripts: WordCount, CardCount, MaxTemp
6) PIG User Defined Functions - UDF

Hadoop Installation steps:


Step-1:-
1) To update the packages from the repositories
sudo apt-get update
2) Java installation
sudo add-apt-repository ppa:linuxuprising/java
sudo apt-get install oracle-java15-installer
Step-2:-
1) After installing Java we need to set the Java path. For that:
We need to set the Java path in the .bashrc file.
To open the .bashrc file the command is :- gedit ~/.bashrc
2) Go to the bottom of the .bashrc file and add the line below.
For setting the Java path the command is :- export JAVA_HOME=our Java path (Ex: /usr/lib/jvm/java-15-oracle)
To verify the Java path the command is :- echo $JAVA_HOME (type this on the terminal)
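For example, assuming Java was installed at /usr/lib/jvm/java-15-oracle, the lines appended to the end of ~/.bashrc would look roughly like this (the second line, adding Java's bin directory to PATH, is a common convention rather than a strict requirement):
export JAVA_HOME=/usr/lib/jvm/java-15-oracle
export PATH=$PATH:$JAVA_HOME/bin
After saving the file, run source ~/.bashrc and then echo $JAVA_HOME to confirm the path is picked up.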
Step-3:-
Install SSH using following command :
sudo apt-get install ssh
First, we have to generate an RSA SSH key for the user:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
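Passwordless SSH can then be verified with a quick check (a minimal sanity test; localhost here is just the current machine):
ssh localhost
exit
If no password prompt appears, the key setup is working.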
Step-4:-
1) Download the latest Hadoop version from the Apache software downloads page and download the hadoop-3.3.0.tar.gz
binary tar file.
2)Now extract the tar file in the downloads location by using below command:
sudo tar -xzvf /home/Ubuntu/Downloads/hadoop-3.3.0.tar.gz
3) Now, rename the extracted folder to hadoop and move it to /usr/local (the location used by the
configuration below).
sudo mv /home/Ubuntu/Downloads/hadoop /usr/local/hadoop
4) We need to add these commands to the .bashrc file:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL


export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
Step-5:-
Copy these Commands on terminal
sudo mkdir -p /usr/local/hadoopdata/hdfs/namenode
sudo mkdir -p /usr/local/hadoopdata/hdfs/datanode
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp (instead of hadoop1:hadoop1 give your own username and
group, e.g. username:groupname)
sudo chmod 750 /app/hadoop/tmp
Step-6:-
Modify these files in /usr/local/hadoop/etc/hadoop
1) core-site.xml
2) hadoop-env.sh
3) mapred-site.xml
4) hdfs-site.xml

core-site.xml :- (copy below code in between <configuration>....</configuration>)


<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
hadoop-env.sh:-
export JAVA_HOME=/usr/lib/jvm/java-15-oracle (use your own Java path)
mapred-site.xml:- (if the file is named mapred-site.xml.template, remove the .template extension)
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>


hdfs-site.xml:-
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>
Step-7:-
➢ source ~/.bashrc
➢ hadoop namenode -format
➢ start-all.sh (This will start all the nodes)
➢ jps (to check all the daemons, total - 6)
NOTE: If the datanode is not started:
First stop all the services (stop-all.sh), then:
STEP 1: Go to the /app/hadoop/tmp/ location and delete all the folders there.
STEP 2: In the terminal run:
sudo chmod -R 755 /app/hadoop
Then start all the services again using the commands mentioned above.
If jps lists all 6 daemons, your Hadoop setup is ready.
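For reference, on a typical single-node (pseudo-distributed) setup the six processes reported by jps are expected to be roughly the following (the process IDs printed alongside will differ on every machine):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps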


Hadoop Commands:
1) Print the Hadoop version
Syntax: hadoop version
ex: hadoop version
2) To create a directory
Syntax: hadoop fs -mkdir [-p] <path>
-p: create parent directories along the path
Ex: hadoop fs -mkdir /hadoop/bec
3) List the contents in human readable format
Syntax: hadoop fs -ls [-R] [-h] <path>
Ex: hadoop fs -ls /hadoop
4) Upload a file
Syntax: hadoop fs -put <local src><dest>
Ex: hadoop fs -put /home/Ubuntu/Desktop/hadoop.txt /hadoop/bec/hadop.txt
5) Download a file
Syntax: hadoop fs -get <src> <local dest>
Ex: hadoop fs -get /hadoop/bec/jes.txt /home/ubuntu/doc.txt
6) View the content of the file
Syntax: hadoop fs -cat <path (filename)>
Ex: hadoop fs -cat /hadoop/bec/hadop.txt
7) Copy the file from src to destination within hdfs
Syntax: hadoop fs -cp [-f] <hdfssrc> <hdfsdest>
-f: to overwrite the file if already exists
Ex: hadoop fs -cp /hadoop/fds.txt /hadoop/bec/
8) Move the file from src to destination within the Hdfs
Syntax: hadoop fs -mv <hdfssrc> <hdfsdest>
Ex: hadoop fs -mv /hadoop/bec/hadoop.txt /hadoop/
9) Copy from local to hdfs:
Syntax: hadoop fs -copyFromLocal [-f] <local src> <dest>
-f: to overwrite the file if already exists
Ex: hadoop fs -copyFromLocal /home/Ubuntu/Desktop/hadoop.txt /hadoop/bec/hadop.txt
10) Copy to local from hdfs
Syntax: hadoop fs -copyToLocal [-f] <hdfssrc> <dest>
-f: to overwrite the file if already exists


Ex: hadoop fs -copyToLocal /hadoop/bec/jes.txt /home/ubuntu/doc.txt


11) To remove a directory or folder from the hdfs
Syntax: hadoop fs -rm [-R] [-skipTrash] <path>
-skipTrash: to delete permanently
To remove a text file
Ex: hadoop fs -rm /hadoop/bec/hadop.txt
To remove a folder
Ex: hadoop fs -rm -R /hadoop/bec
12) To remove an empty directory
Syntax: hadoop fs -rmdir [--ignore-fail-on-non-empty] <path>
--ignore-fail-on-non-empty: do not report a failure if the directory still contains files
Ex: hadoop fs -rmdir /hadoop/bec/cse
13) Display last few lines of a file in hdfs
Syntax: hadoop fs -tail [-f] <path>
-f: will output the data as the file grows
Ex: hadoop fs -tail /hadoop/bec/hadop.txt
14) Display disk usage of files and directories
Syntax: hadoop fs -du [-s] [-h] <path>
-s: give aggregate summary of file length
-h: human readable format
Ex: hadoop fs -du /hadoop/bec/hadop.txt
15) To empty the trash of hdfs
Syntax: hadoop fs -expunge
Ex: hadoop fs -expunge

16) Create empty file in hdfs


Syntax: hadoop fs -touchz <path>
Ex: hadoop fs -touchz /hadoop/bec/empty.txt
17) Validate or perform various tests on the files
Syntax: hadoop fs -test -[defsz] <path>
-d: check if path is a directory or not
-e: check if the path exists or not
-f: check if the path is a file or not
-s: check if the path is not empty


-z: check if the file is zero length or not


Ex: hadoop fs -test -d /hadoop/bec
18) Collect specific information about the file
Syntax: hadoop fs -stat [format] <path>
%b: size of the file in bytes
%F: will return file or directory
%g: group name
%n: file name
%o: HDFS block size in bytes
%r: replication factor
%u: username of owner
%y: UTC date as "yyyy-MM-dd HH:mm:ss"
Ex: hadoop fs -stat “%b %F %g %n %o %r %u %y %Y” /hadoop/bec
19) Changes the replication factor of a file in hadoop
Syntax: hadoop fs -setrep [-w] <number> <path>
-w: requests that the command wait for the replication to complete
Ex: hadoop fs -setrep 2 /hadoop/bec
20) Count no of directories, files and bytes in specific path
Syntax: hadoop fs -count [-h] <path>
-h: human readable format
Ex: hadoop fs -count /hadoop/bec
21) Merging multiple files into one
Syntax: hadoop fs -getmerge [-nl] <src> <localdest>
-nl: can be set to enable adding a newline at the end of each file
Ex: hadoop fs -getmerge -nl /hadoop/bec /home/ubuntu/merged.txt
22) Change group association of files
Syntax: hadoop fs -chgrp [-R] <group> <path>
Ex: hadoop fs -chgrp hadoop /hadoop/bec/hadop.txt
23) Move local files into hdfs filesystem
Syntax: hadoop fs -moveFromLocal <local src> <dest>
Ex: hadoop fs -moveFromLocal /home/Ubuntu/details.txt /hadoop/bec/details.txt
24) Information about particular command in hdfs
Syntax: hadoop fs -usage <command>
Ex: hadoop fs -usage rm


25) Change the user and group ownership of the folders


Syntax: hadoop fs -chown [-R] [owner]:[group] <path of the folder>
Ex: hadoop fs -chown -R hadoop:hadoop /hadoop/bec
26) Check the file system Disk space usage (displays freespace)
Syntax: hadoop fs -df [-h] <path>
Ex: hadoop fs -df /hadoop/bec/
27) Displays the content of the file
Syntax: hadoop fs -text <src>
Ex: hadoop fs -text /hadoop/bec/hadop.txt
28) Merge the content of local file to another file in hdfs
Syntax: hadoop fs -appendToFile <local src> <dest>
EX: hadoop fs -appendToFile /home/lavanya/details.txt /hadoop/bec/hadop.txt
29) Change the permissions of files
Syntax: hadoop fs -chmod [-R] <mode> <path>
Ex: hadoop fs -chmod -R 777 /hadoop
30) List all hadoop file system commands
Syntax: hadoop fs
Ex: hadoop fs
31) Commands information (help information about the commands)
Syntax: hadoop fs -help
Ex: hadoop fs -help
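As a quick sanity check, several of the commands above can be combined into one short terminal session (a minimal sketch; the file and directory names are only placeholders):
hadoop fs -mkdir -p /hadoop/bec
hadoop fs -put /home/ubuntu/Desktop/sample.txt /hadoop/bec/sample.txt
hadoop fs -ls /hadoop/bec
hadoop fs -cat /hadoop/bec/sample.txt
hadoop fs -test -e /hadoop/bec/sample.txt
echo $? (prints 0 if the path exists)
hadoop fs -rm /hadoop/bec/sample.txt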


Map-reduce word count program:


Step-1:-
1) Install eclipse using the below commands:
sudo apt-get update
sudo apt-get install eclipse
(or)
We can also install it using the snap:
sudo snap install --classic eclipse
2) Open eclipse from the applications after installing it.

3) After opening the workspace, click on “File” in the menu tab, click on “New” and then select “Java
Project”.


4)Provide a name for the project (say, wordcount) and leave other settings as default and then click on
“Finish”.

Step-2:-
1) Creating a package under the project.
➢ Right click on project name (say, wordcount), click on “New” and then select “Package”.


2) Provide a name to the package (say, Demo) and click “Finish”.

3) Right click on project name, click on “New” and then select “Class”

4) Provide a name for the class (say, WCount), leave other settings as default and click on “Finish”.


Step-3:-
WCount.java:
package Demo;
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WCount {

// Mapper: emits (word, 1) for every token in the input line
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
value.set(tokenizer.nextToken());
output.collect(value, new IntWritable(1));
}
}
}

// Reducer: sums the counts emitted for each word
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {

@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)


throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}

output.collect(key, new IntWritable(sum));


}
}

// Configure and run the MapReduce job using the old mapred API
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WCount.class);
conf.setJobName("wcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));


FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);

}
}

Note: We may get errors at the import statements as we have not yet included the required jars in the project’s
configuration. Follow the steps below to do so.


Step-4:-
1) We need to include some external jar files to our project.
➢ There are two jar files which can be used for this program. They can be downloaded from the
following links given below:
➢ https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/3.3.0
➢ https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core/3.3.0
➢ These can be included in our project by configuring the build and adding them as external libraries to
the classpath.
2) Right click on the project and then click on “Build Path”

3) In java build path, click on “Libraries” tab and select “Classpath”.


Note: See that the options on the right side are highlighted.
4) Click on “Add External JARs…”


5) Select the downloaded jar files and then click on “Open”.

6) See that the jars are added to the classpath and click on “Apply and Close”

Note: See that the errors are solved in the java class after adding the jars in classpath.


Step-5:-
1) Now, we need to convert our project into a java jar file. Right click on the project name and then select
“Export”.

2) In Export window, select “Java” and then select “JAR File” from dropdown menu.


3) In file specification window, select the path and provide a name (say, wcount.jar) for the jar file which is
to be exported.

Note: See that the jar file is saved in the given path.
Step-6:-
1) Create a text document with some text which will be considered as an input for our program.

2) Start the hadoop daemons and move the text file to any hdfs directory and check if it is moved or not
using the below commands.
➢ start-all.sh
➢ jps
➢ hadoop fs -put /home/hayath/Desktop/hayath.txt /hadoop/
➢ hadoop fs -cat /hadoop/hayath.txt


3) Now we can execute the jar file with the below command by specifying the input_text_file and an
output_dir to save the output.
Syntax: hadoop jar <JAR_FILE_PATH(In local file system)> Package_Name.Class_Name
<Input_Text_File_Path(In HDFS)> <Output_Dir_Path(In HDFS)>
Ex:
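(The exact command depends on where the jar was exported and which HDFS paths were used; assuming the jar was saved to the Desktop and the input file uploaded in the previous step, it would look roughly like this:)
hadoop jar /home/hayath/Desktop/wcount.jar Demo.WCount /hadoop/hayath.txt /hadoop/wcount_output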


4) Check that the specified output folder is created in hdfs. If found, open the directory and look out for the
“part-00000” file and download it. This file gives the output of the program which specifies the count of
each word in the given input file’s text.
Type the commands below to view the output file, or simply browse the HDFS file system and download the
“part-00000” file to view it on your system.
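(Assuming the output directory name used in the previous step, the commands would look roughly like this:)
hadoop fs -ls /hadoop/wcount_output
hadoop fs -cat /hadoop/wcount_output/part-00000
hadoop fs -get /hadoop/wcount_output/part-00000 /home/hayath/Desktop/wordcount_result.txt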


Pig Installation Steps:


➢ Download the pig-0.16.0.tar.gz from the link
➢ http://www-us.apache.org/dist/pig/pig-0.16.0/
➢ Extract the tar file and copy it to the /usr/local/ directory (a sketch of the commands is shown after this list).
➢ Now open the .bashrc file, copy the following content into it and save it.
o #pig_instalation_steps
o export PIG_HOME="/usr/local/pig"
o export PIG_CONF_DIR="$PIG_HOME/conf"
o export PIG_CLASSPATH="$PIG_CONF_DIR"
o export PATH="$PIG_HOME/bin:$PATH"

➢ pig -version : To know the details of the installed Pig version
➢ Pig Modes:
1. Local Mode: pig -x local;
2. Mapreduce Mode: pig -x mapreduce (or) pig
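A minimal sketch of the extraction step mentioned above, assuming the tarball was downloaded to the Downloads folder:
sudo tar -xzvf /home/ubuntu/Downloads/pig-0.16.0.tar.gz -C /usr/local/
sudo mv /usr/local/pig-0.16.0 /usr/local/pig
source ~/.bashrc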

Pig Operators:
Processing Operators:
Loading and Storing Data:
For example, let us consider stud.txt, which contains the following content:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002, Siddhartha, Battacharya, 22, 9848022338, Kolkata
003, Rajesh, Khanna, 22, 9848022339, Delhi
004, Preethi, Agarwal, 21, 9848022330, Pune
005, Trupthi, Mohanthy, 23, 9848022336, Bhubaneswar
006, Archana, Mishra, 23, 9848022335, Chennai
007, Komal, Nayak, 24, 9848022334, Trivandrum
008, Bharathi, Nambiayar, 24, 9848022333, Chennai
Load operator:
Syntax:
Relation_name = LOAD 'Input file path' USING function as schema;
Example:
student = load '/home/ubuntu/Desktop/stud' using PigStorage (',') as (id: int, fname:chararray,
lname:chararray, age, contact, city);
Output:


Store operator:
Syntax:
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example:
store student into '/home/ubuntu/Desktop/det' using PigStorage(',');
Output:

Diagnostic Operators:
Dump operator:
Syntax:
Dump Relation_Name;
Example:
dump student;
Output:

Describe Operator:
Syntax:
describe Relation_Name;
Example:
describe student;
Output:
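With the load statement used above, the describe output is expected to look roughly like this (fields declared without a type default to bytearray):
student: {id: int,fname: chararray,lname: chararray,age: bytearray,contact: bytearray,city: bytearray}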


Filtering Data:
Syntax:
Relation2_name = FILTER Relation1_name BY (condition);
Example1:
filter_data = FILTER student BY city == 'Chennai';
dump filter_data;
Output:

Example2:
filter_age = FILTER student BY age == '21';
dump filter_age;
Output:


Foreach Operator:
Syntax:
Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example:
grunt> foreach_data = foreach student generate id, fname, city;
grunt> dump foreach_data;
Output:

Grouping and Joining Data


Group operator:
Syntax:
Group_data = GROUP Relation_name BY column_name;
Example:
grunt> Group_data = GROUP student BY age;
grunt> dump Group_data;
Grouping by Multiple Columns:
Syntax:
relation_2 = GROUP Relation_name by (age, city);
Example:
group_multiple = GROUP student by (age, city);
dump group_multiple;
Output:


Group All:
Syntax:
group_all = GROUP Relation_name ALL;
Example:
group_all = GROUP student ALL;
dump group_all;
Output:

Join Operator:
Let us consider the two files Customers.txt and Orders.txt
customers.txt:
1, Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt:
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we load these two files as follows:
customers = LOAD '/home/ubuntu/Desktop/customers' USING PigStorage (',') as (id:int, name:chararray,
age:int, address:chararray, salary:int);
orders = LOAD '/home/ubuntu/Desktop/orders' USING PigStorage(',')as (oid:int, date:chararray,
customer_id:int, amount:int);
InnerJoin:
Syntax:
Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example:


customer_orders = JOIN customers BY id, orders BY customer_id;
dump customer_orders;
Self Join:
Syntax:
Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example: Load the customers.txt into two relations customers1 and customers2 as follows:
customers1 = LOAD '/home/ubuntu/Desktop/customers' USING PigStorage(',') as (id:int, name:chararray,
age:int, address:chararray,salary:int);
customers2 = LOAD '/home/ubuntu/Desktop/customers' USING PigStorage(',') as (id:int, name:chararray,
age:int, address:chararray, salary:int);
customers3 = JOIN customers1 BY id, customers2 BY id; dump customers3;
Output:

Outer Join:
Left Outer Join:
Syntax:
Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example:
outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id; dump outer_left;
Output:


Right Outer Join:


Syntax:
Relation3_name = JOIN Relation1_name BY id RIGHT OUTER, Relation2_name BY customer_id;
Example:
outer_right= JOIN customers BY id RIGHT OUTER, orders BY customer_id;
dump outer_right;
Output:

Full Outer Join:


Syntax:
Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;
Example:
outer_full= JOIN customers BY id FULL OUTER, orders BY customer_id; dump outer_full;
Output:

Union Operator:
stud.txt:
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi


004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
stud1.txt:
007,Komal,Nayak,9848022334,trivendram.
008,Bharathi,Nambiayar,9848022333,Chennai.
Syntax:
Relation_name3 = UNION Relation_name1, Relation_name2;
Example:
student1 = LOAD '/home/ubuntu/Desktop/stud' USING PigStorage (',') as (id:int,firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
student2 = LOAD '/home/ubuntu/Desktop/stud1' USING PigStorage (',') as (id:int,firstname:chararray,
lastname:chararray,phone:chararray, city:chararray);
student3 = UNION student1,student2;
dump student3;
Output:

Split Operator:
Syntax:
SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example:
student_details = LOAD '/home/ubuntu/Desktop/stu' USING PigStorage (',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
SPLIT student_details into student_details1 if age<23, student_details2 if (22<age and age<25);
dump student_details1;


dump student_details2;

Execution of script file in grunt shell:


Let us consider the hii.pig script file which contains:
student= load '/home/ubuntu/Desktop/stud' using PigStorage (',') as (id: int, fname:chararray,
lname:chararray, age,contact, city);
describe student;
dump student;
1. Using exec:
Syntax: exec file_path;
Example: exec /home/ubuntu/Desktop/hii.pig;
Output:


2. Using run:
Syntax: run file_path;
Example: run /home/ubuntu/Desktop/hii.pig;
Output:


PIG Scripts:
WordCount:
wordcount.pig
lines = load '/home/ubuntu/Desktop/wordcount.txt' as (line: chararray);
words = FOREACH lines GENERATE FLATTEN (TOKENIZE (line)) as word;
grouped = group words by word;
wordcount = FOREACH grouped GENERATE group, COUNT (words);
dump wordcount;
Go to the grunt shell
grunt> exec /home/ubuntu/Desktop/wordcount.pig
Output:

Maxtemp:
maxtemp.pig
maxtmp = load '/home/ubuntu/Desktop/Maxtemp.txt' using PigStorage (',') as (year:int, tmp:int,
city:chararray);
maxtmp_year = group maxtmp by year;
max_tmp_yr = FOREACH maxtmp_year GENERATE group, MAX (maxtmp.tmp);
dump max_tmp_yr;
Go to the grunt shell
grunt> exec /home/ubuntu/Desktop/maxtemp.pig
Input:


Maxtemp.txt
1992,23,HYDERABAD
1996, 28,GOA
1992,53,KOLKATTA
1996, 53,MUMBAI
2013,25,BAPATLA
2018,45,GUNTUR
2013,42,ONGOLE
Output:
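With the sample input above, the result is expected to contain one maximum temperature per year, roughly: (1992,53), (1996,53), (2013,42), (2018,45); the order of the rows may vary.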

Card Count:
cardcount.pig
cards = load '/home/ubuntu/Desktop/Cardcount.txt' USING PigStorage (',') as (color: chararray, symbol:
chararray, num: int);
colors = group cards by color;
cardcount = foreach colors generate group, COUNT(cards.num);
dump cardcount;
Go to the grunt shell
grunt> exec /home/ubuntu/Desktop/cardcount.pig
Input:
Cardcount.txt
red,club,1
red,diamond,5


red,sprade,6
blue,sprade,7
blue,diamond,6
black,Sprade,9
black,Sprade,4
black,diamond,3

sym = group cards by symbol;


cou = foreach sym generate group, COUNT (cards.symbol);
dump cou;


Pig UDFs:
Registering UDFs
--register_java_udf.pig
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
Registering Python UDFs: (The Python script must be in your current directory)
--register_python_udf.pig
register 'production.py' using jython as bballudfs;
players = load 'baseball' as (name:chararray, team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);

Writing UDFs
Java UDFs:
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new IOException("Caught exception processing input row ", e);
}
}
}
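To use this UDF, the class would be compiled and packaged into a jar (say, myudfs.jar, a name assumed here only for illustration), registered in the grunt shell with register myudfs.jar; and then invoked inside a statement such as upper_names = FOREACH student GENERATE myudfs.UPPER(fname); where the function is referenced by its package and class name.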
