BDA Lab

Hadoop installation

Standalone Mode:

Step 1
Download Java (JDK <latest version> - X64.tar.gz) from www.oracle.com.
The file jdk-7u71-linux-x64.tar.gz will be downloaded into your system.
Step 2
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.tar.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz

$ tar zxf jdk-7u71-linux-x64.tar.gz

$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz
Step 3
To make Java available to all users, move it to the location /usr/local/. Switch to the root user
and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step 4
For setting up PATH and JAVA_HOME variables, add the following commands
to ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes to the current running system.

$ source ~/.bashrc
Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following
commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three
supported modes −
● Local/Standalone Mode − After downloading Hadoop, it is configured in standalone mode by
default and runs as a single Java process.
● Pseudo-Distributed Mode − A distributed simulation on a single machine. Each Hadoop
daemon (HDFS, YARN, MapReduce, etc.) runs as a separate Java process. This mode is useful
for development.
● Fully Distributed Mode − Fully distributed, with a minimum of two machines forming a
cluster. This mode is not covered in this lab.
Installing Hadoop in Standalone Mode
Here we will discuss the installation of Hadoop 2.4.1 in standalone mode.
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable
for running MapReduce programs during development, since it is easy to test and debug them.
Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands
to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the
following command −

$ hadoop version
If everything is fine with your setup, then you should see the following result −

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
It means your Hadoop's standalone mode setup is working fine. By default, Hadoop is configured
to run in a non-distributed mode on a single machine.
Example
Let's check a simple example of Hadoop. The Hadoop installation delivers the following example
MapReduce jar file, which provides basic MapReduce functionality and can be used for
calculations such as the value of Pi, word counts in a given list of files, and so on.

$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar

Let's have an input directory where we will push a few files; our requirement is to count the
total number of words in those files. To calculate the total number of words, we do not need to
write our own MapReduce program, since the .jar file contains the implementation for word count.
You can try other examples using the same .jar file; just issue the following command to list the
MapReduce programs supported by the hadoop-mapreduce-examples-2.4.1.jar file.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar


Step 1:
Create temporary content files in the input directory. You can create this input directory anywhere
you would like to work.

$ mkdir input
$ cp $HADOOP_HOME/*.txt input
$ ls -l input
It will give the following files in your input directory −

total 24
-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
-rw-r--r-- 1 root root 101 Feb 21 10:14 NOTICE.txt
-rw-r--r-- 1 root root 1366 Feb 21 10:14 README.txt
These files have been copied from the Hadoop installation home directory. For your experiment,
you can have different and large sets of files.
Step 2:
Let's start the Hadoop process to count the total number of words in all the files available in the
input directory, as follows −

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount input output
Step 3:
Step 2 will do the required processing and save the output in the output/part-r-00000 file, which you
can check by using −

$ cat output/*
It will list down all the words along with their total counts available in all the files available in the
input directory.

"AS 4
"Contribution" 1
"Contributor" 1
"Derivative 1
"Legal 1
"License" 1
"License"); 1
"Licensor" 1
"NOTICE” 1
"Not 1
"Object" 1
"Source” 1
"Work” 1
"You" 1
"Your") 1
"[]" 1
"control" 1
"printed 1
"submitted" 1
(50%) 1
(BIS), 1
(C) 1
(Don't) 1
(ECCN) 1
(INCLUDING 2
(INCLUDING, 2
.............
Hadoop installation: Pseudo-distributed Mode

Edit the following configuration files (found in $HADOOP_HOME/etc/hadoop) with this content

core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

In hadoop-env.sh, set: export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

1. Make the hadoop user the owner of /usr/local/hadoop:

sudo chown -R hadoop /usr/local/hadoop


Set up passwordless SSH to localhost
In a typical Hadoop production environment you'll be setting up this passwordless SSH access
between the different servers. Since we are simulating a distributed environment on a single server,
we need to set up passwordless SSH access to localhost itself.
Use ssh-keygen to generate the private and public key value pair.
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
02:5a:19:ab:1e:g2:1a:11:bb:22:30:6d:12:38:a9:b1 hadoop@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|oo |
|o + . . |
|++ oo |
|o .o = . |
| . += S |
|. o.o+. |
|. ..o. |
| . E .. |
| . .. |
+-----------------+

Add the public key to the authorized_keys. Just use the ssh-copy-id command, which will take care
of this step automatically and assign appropriate permissions to these files.
$ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
hadoop@localhost's password:
Now try logging into the machine, with "ssh 'localhost'", and check in:

.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
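To confirm that passwordless SSH works, log in to localhost; it should not prompt for a password:

$ ssh localhost
$ exit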

Jps command

jps stands for Java Virtual Machine Process Status Tool


In our practice, when we start the single-node cluster, the following processes must be up and running:
NameNode
DataNode
ResourceManager
NodeManager
jps is a tool to check whether the expected Hadoop processes are up and running.
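For example, after the daemons are started (as described below), running jps as the hadoop user should list something like the following; the process IDs shown here are only illustrative and will differ on your machine:

$ jps
4322 NameNode
4486 DataNode
4855 ResourceManager
5012 NodeManager
5270 Jps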
How to run a mapreduce program
Do this as the root user:
Create a folder called Mapreduce in the home directory.
Create the Java files in this directory.
Compile the Java files to generate the class files.
Use the javac -d . filename.java command.
Before compiling, set the classpath appropriately as shown below:
export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-
core-2.9.2.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-
2.9.2.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.9.2.jar:~/Mapreduce/
WordCount/*:$HADOOP_HOME/lib/*"
Create the manifest file Manifest.txt.
The content of the manifest file names the appropriate driver class, for example:
Main-Class: SalesCountry.SalesCountryDriver
Create the jar file using the command shown below (a consolidated example is sketched after these steps):
jar cfm jarfilename.jar Manifest.txt Packagename/*.class

After the jar file is created, copy it into the Hadoop installation directory.

Create a text file for the processing to be copied later to HDFS.
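Putting the root-user steps above together for the WordCount program listed later in this manual, the sequence might look like the following; the jar name wc.jar is only an example, while the package WordCountEx and the manifest entry WordCountEx.WordCount come from the listings below:

$ cd ~/Mapreduce
$ javac -d . WordCount.java
$ jar cfm wc.jar Manifest.txt WordCountEx/*.class
$ sudo cp wc.jar /usr/local/hadoop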

Do this as the hadoop user

Format the namenode using bin/hadoop namenode -format
Start the Hadoop daemons using the command sbin/start-all.sh
Check the daemon status using the jps command.
Copy the text file from the local file system to the Hadoop file system, like this:
bin/hdfs dfs -copyFromLocal ~/inputMapReduce /

Run the MapReduce job using the following command

bin/hadoop jar generatedjarfile.jar /textfile /outputfile

The result can be seen using the following command (the output directory is the one passed above; for the sales example it is /mapreduce_output_sales)

bin/hdfs dfs -cat /mapreduce_output_sales/part-00000


wcfile.txt
bus,train,bus,car,plane,
boat,bus,car,bus,
train,car,plane,bus,
bus,car,car,train

Program WordCount.java

package WordCountEx;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}

Manifest.txt

Main-Class: WordCountEx.WordCount
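Assuming the program was compiled and packaged as sketched earlier (the jar name wc.jar and the local path to wcfile.txt are only examples), running it from the Hadoop directory as the hadoop user might look like this. The expected counts shown after the cat follow directly from wcfile.txt above and the mapper's upper-casing and comma splitting:

bin/hdfs dfs -copyFromLocal ~/Mapreduce/wcfile.txt /
bin/hadoop jar wc.jar /wcfile.txt /wc_output
bin/hdfs dfs -cat /wc_output/part-r-00000

BOAT 1
BUS 6
CAR 5
PLANE 2
TRAIN 3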

SalesMapper

package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String valueString = value.toString();
        String[] SingleCountryData = valueString.split(",");
        output.collect(new Text(SingleCountryData[7]), one);
    }
}
SalesCountryReducer

package SalesCountry;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        while (values.hasNext()) {
            // replace type of value with the actual type of our value
            IntWritable value = (IntWritable) values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}

SalesCountryDriver
package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {

    public static void main(String[] args) {
        JobClient my_client = new JobClient();
        // Create a configuration object for the job
        JobConf job_conf = new JobConf(SalesCountryDriver.class);

        // Set a name of the Job
        job_conf.setJobName("SalePerCountry");

        // Specify data type of output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);

        // Specify names of Mapper and Reducer Class
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);

        // Specify formats of the data type of Input and output
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);

        // Set input and output directories using command line arguments,
        // args[0] = name of input directory on HDFS, and
        // args[1] = name of output directory to be created to store the output file.
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Implementing Hadoop Commands

Reading from a file in Hadoop

How to read a file from an HDFS file system, like the cat command

URLCat.java

package HDFSIO;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URL;
import java.io.*;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Compile the Java file as javac -d . URLCat.java

Edit the manifest file as follows

Main-Class: HDFSIO.URLCat
Create the jar file using the command

jar cfm URLRead.jar Manifest.txt HDFSIO/*.class

Copy the jar file to the Hadoop folder as follows


sudo cp URLRead.jar /usr/local/hadoop

bin/hadoop jar URLRead.jar hdfs://localhost:9000/filename

The above file with the name filename has to be present in HDFS.

The output of the above command will be the content of the file, just like the cat command.
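If the file is not already in HDFS, it can be copied there first from the Hadoop directory; the local path here is only an example:

bin/hdfs dfs -put /home/stanley/wcfile.txt /filename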
Writing Files to Hadoop

package HDFSWrite;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URL;
import java.net.URI;
import java.io.*;
public class FileWriteToHDFS {

    public static void main(String[] args) throws Exception {

        //Source file in the local file system
        String localSrc = args[0];
        //Destination file in HDFS
        String dst = args[1];

        //Input stream for the file in local file system to be written to HDFS
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        //Get configuration of Hadoop system
        Configuration conf = new Configuration();
        System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));

        //Destination file in HDFS
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));

        //Copy file from local to HDFS
        IOUtils.copyBytes(in, out, 4096, true);

        System.out.println(dst + " copied to HDFS");
    }
}

Before compiling the Java program, set the classpath:

export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.4.1.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.4.1.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.4.1.jar:~/hadoopwrite/HDFSWrite/*:$HADOOP_HOME/lib/*"

The directory to be created is hadoopwrite, in the home directory of the Ubuntu user.

The package name is HDFSWrite, as mentioned in the Java program.

As the root user, after compiling the Java program, generate the jar file and copy it to the Hadoop
location (a sketch of these steps is given below).
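A minimal sketch of those root-user steps, assuming the source file is saved as FileWriteToHDFS.java in ~/hadoopwrite and the jar is named HWrite.jar to match the run command below:

$ cd ~/hadoopwrite
$ javac -d . FileWriteToHDFS.java
$ jar cfm HWrite.jar Manifest.txt HDFSWrite/*.class
$ sudo cp HWrite.jar /usr/local/hadoop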

Then do the following as the hadoop user:

Start Hadoop.
Then enter the following commands after Hadoop starts successfully.

bin/hadoop jar HWrite.jar /home/stanley/mapreduce/WCFile.txt /dir123

bin/hdfs dfs -cat /dir123

Manifest.txt contains the following


Main-Class: HDFSWrite.FileWriteToHDFS
Pig Installation

Steps for installing and running Apache Pig

Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is
nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop
filesystems) from your workstation.

Download a stable release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable place on your workstation:

tar xzf pig-x.y.z.tar.gz

Copy the extracted pig folder to a folder /usr/local/pig

sudo mv pig-0.16.0 /usr/local/pig

Add the path in the terminal before running pig

export PIG_INSTALL=/usr/local/pig

export PATH=$PATH:$PIG_INSTALL/bin

After setting the path, navigate to the Pig installation location and type pig.
Then the grunt shell (the Pig interactive shell) will open, where we can run Pig commands.
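Pig has two execution modes: local mode, which works on the local file system and does not need Hadoop, and MapReduce mode (the default), which requires Hadoop to be running. The mode can be chosen explicitly on the command line:

$ pig -x local
$ pig -x mapreduce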

Create a file Employee.txt:

001,Mehul,Hyderabad
002,Ankur,Kolkata
003,Shubham,Delhi

and copy it to HDFS.

Create another file, Sample_script.pig, and copy it to HDFS:

Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt' USING PigStorage(',')
as (id:int,name:chararray,city:chararray);
Dump Employee;
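Assuming both files were created in the home directory, copying them into HDFS (to the paths used by the script and by the exec command below) might look like this, run from the Hadoop directory:

bin/hdfs dfs -mkdir /pig_data
bin/hdfs dfs -put ~/Employee.txt /pig_data/
bin/hdfs dfs -put ~/Sample_script.pig /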

In the grunt shell, exec hdfs://localhost:9000/Sample_script.pig executes the Pig script and displays
the contents of Employee.txt.
Pig Programs
Loading and displaying data

FOREACH .. GENERATE

The FOREACH .. GENERATE operator is used to act on every row in a relation. It can be used to
remove fields, or to generate new ones. In this example, we do both.

Content of a.txt file

Joe,cherry,2

Ali,apple,3

Joe,banana,2

Eve,apple,7

Copy the file to HDFS

In grunt shell do the following:

A= LOAD 'hdfs://localhost:9000/a.txt' using PigStorage(',');

Dump A; will produce the output

(Joe,cherry,2)

(Ali,apple,3)

(Joe,banana,2)

(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';

grunt> DUMP B;

(Joe,3,Constant)

(Ali,4,Constant)

(Joe,3,Constant)

(Eve,8,Constant)

Here we have created a new relation, B, with three fields. Its first field is a projection of the first field
($0) of A. B's second field is the third field of A ($2) with one added to it.

B's third field is a constant field (every row in B has the same third field) with the chararray value
Constant.

STREAM

The STREAM operator allows you to transform data in a relation using an external program or
script.

STREAM can use built-in commands with arguments. Here is an example that uses the Unix cut
command to extract the second field of each tuple in A. Note that the command and its arguments
are enclosed in backticks:

grunt> C = STREAM A THROUGH `cut -f 2`;

grunt> DUMP C;

(cherry)

(apple)

(banana)

(apple)

Grouping and Joining Data

Joining datasets in MapReduce takes some work on the part of the programmer, whereas Pig has
very good built-in support for join operations, making it much more approachable. Since the large
datasets that are suitable for analysis by Pig (and MapReduce in general) are not normalized, joins
are used more infrequently in Pig than they are in SQL.

JOIN

Let’s look at an example of an inner join.

Consider the relations A and B:

grunt> DUMP A;

(2,Tie)

(4,Coat)

(3,Hat)

(1,Scarf)

grunt> DUMP B;

(Joe,2)

(Hank,4)

(Ali,0)

(Eve,3)

(Hank,2)

We can join the two relations on the numerical (identity) field in each:

grunt> C = JOIN A BY $0, B BY $1;

grunt> DUMP C;

(2,Tie,Joe,2)

(2,Tie,Hank,2)

(3,Hat,Eve,3)

(4,Coat,Hank,4)

This is a classic inner join, where each match between the two relations corresponds to a row in the
result. (It's actually an equijoin since the join predicate is equality.) The result's fields are made up
of all the fields of all the input relations.

COGROUP

JOIN always gives a flat structure: a set of tuples. The COGROUP statement is similar to JOIN, but
creates a nested set of output tuples. This can be useful if you want to exploit the structure in
subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;

grunt> DUMP D;

(0,{},{(Ali,0)})

(1,{(1,Scarf)},{})

(2,{(2,Tie)},{(Joe,2),(Hank,2)})

(3,{(3,Hat)},{(Eve,3)})

(4,{(4,Coat)},{(Hank,4)})

COGROUP generates a tuple for each unique grouping key. The first field of each tuple is the key,
and the remaining fields are bags of tuples from the relations with a matching key. The first bag
contains the matching tuples from relation A with the same key. Similarly, the second bag contains
the matching tuples from relation B with the same key. If for a particular key a relation has no
matching key, then the bag for that relation is empty. For example, since no one has bought a scarf
(with ID 1), the second bag in the tuple for that row is empty.

GROUP

Although COGROUP groups the data in two or more relations, the GROUP statement groups the
data in a single relation. GROUP supports grouping by more than equality of keys: you can use an
expression or user-defined function as the group key. For example, consider the following relation
A:

grunt> DUMP A;

(Joe,cherry)

(Ali,apple)

(Joe,banana)

(Eve,apple)

Let’s group by the number of characters in the second field:

grunt> B = GROUP A BY SIZE($1);

grunt> DUMP B;

(5L,{(Ali,apple),(Eve,apple)})

(6L,{(Joe,cherry),(Joe,banana)})
GROUP creates a relation whose first field is the grouping field, which is given the alias group. The
second field is a bag containing the grouped fields with the same schema as the original relation (in
this case, A).

Sorting Data

Relations are unordered in Pig. Consider a relation A:

grunt> DUMP A;

(2,3)

(1,2)

(2,4)

The following example sorts A by the first field in ascending order, and by the second field in
descending order:

grunt> B = ORDER A BY $0, $1 DESC;

grunt> DUMP B;

(1,2)

(2,4)

(2,3)
Filtering in Pig
How to Filter Records - Pig Tutorial Examples
Pig allows you to remove unwanted records based on a condition. The filter functionality is
similar to the WHERE clause in SQL. The FILTER operator in Pig is used to remove
unwanted records from the data file. The syntax of the FILTER operator is shown below

<new relation> = FILTER <relation> BY <condition>

Here relation is the data set on which the filter is applied, condition is the filter condition and
new relation is the relation created after filtering the rows.

Pig Filter Examples:

Let's consider the sales data set below as an example

year,product,quantity

---------------------

2000,iphone,1000

2001,iphone,1500

2002,iphone,2000

2000,nokia,1200

2001,nokia,1500

2002,nokia,900

1. select products whose quantity is greater than or equal to 1000.


grunt> A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS
(year:int,product:chararray,quantity:int);

grunt> B = FILTER A BY quantity >= 1000;

grunt> DUMP B;

(2000,iphone,1000)

(2001,iphone,1500)

(2002,iphone,2000)

(2000,nokia,1200)

(2001,nokia,1500)

2. select products whose quantity is greater than 1000 and year is 2001

grunt> C = FILTER A BY quantity > 1000 AND year == 2001;

grunt> DUMP C;

(2001,iphone,1500)

(2001,nokia,1500)

3. select products with year not in 2000

grunt> D = FILTER A BY year != 2000;

grunt> DUMP D;

(2001,iphone,1500)

(2002,iphone,2000)
(2001,nokia,1500)

(2002,nokia,900)

You can use all the logical operators (NOT, AND, OR) and relational operators (<, >, ==, !=,
>=, <=) in the filter conditions.

Pig Joins
Running Pig

Note: The following should be performed as the ***root user*** (not the hadoop user).
The hadoop user is only for starting Hadoop.

Download Pig 0.15.0 and extract it.

Move it to /usr/local/pig.
Navigate to the above folder in the terminal.
Set the path in the terminal before running Pig:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

Then type pig in the terminal, which opens the grunt shell


(Make sure hadoop is running before pig is run)

Pig Join Example


The join operator is used to combine records from two or more relations. While performing a
join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When
these keys match, the two particular tuples are matched, else the records are dropped. Joins
can be of the following types:
1) Inner-join
2) Self-join
3) Outer-join : left join, right join, and full join
Create a customers.txt file.

1,Ramesh,32,Ahmedabad,2000.00

2,Khilan,25,Delhi,1500.00

3,kaushik,23,Kota,2000.00

4,Chaitali,25,Mumbai,6500.00

5,Hardik,27,Bhopal,8500.00

6,Komal,22,MP,4500.00

7,Muffy,24,Indore,10000.00

Create a orders.txt file

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500

101,2009-11-20 00:00:00,2,1560

103,2008-05-20 00:00:00,4,2060

Copy these files to HDFS

bin/hadoop fs -put /home/stanley/customers.txt /

bin/hadoop fs -put /home/stanley/orders.txt /

In the Pig Grunt Shell do the following

c = LOAD 'hdfs://localhost:9000/customers.txt' using PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);

o = LOAD 'hdfs://localhost:9000/orders.txt' using PigStorage(',') as
(oid:int,date:chararray,cust_id:int,amount:int);

Inner join:

c_o = JOIN c BY id, o BY cust_id;

Dump c_o; produces the following output

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)

(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)

(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)

(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Self Join:

c1 = LOAD 'hdfs://localhost:9000/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);

c2 = LOAD 'hdfs://localhost:9000/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);

c3 = JOIN c1 BY id, c2 BY id;

Dump c3;

produces the output

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)

(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)

(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)

(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)

(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)

(6,Komal,22,MP,4500,6,Komal,22,MP,4500)

(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

3) Outer Join
Unlike inner join, outer join returns all the rows from at least one of the relations. An outer
join operation is carried out in three ways -
a) Left outer join
b) Right outer join
c) Full outer join
a) Left outer join
The left outer join operation returns all rows from the left relation, even if there are no matches
in the right relation.

c = LOAD 'hdfs://localhost:9000/customers.txt' using PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);

o = LOAD 'hdfs://localhost:9000/orders.txt' using PigStorage(',') as
(oid:int,date:chararray,cust_id:int,amount:int);

outer_left = JOIN c BY id LEFT OUTER, o BY cust_id;

Dump outer_left;

(1,Ramesh,32,Ahmedabad,2000,,,,)

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)

(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)

(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)

(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

(5,Hardik,27,Bhopal,8500,,,,)

(6,Komal,22,MP,4500,,,,)

(7,Muffy,24,Indore,10000,,,,)

b) Right outer join

outer_right = JOIN c BY id RIGHT, o BY cust_id;

Dump outer_right;

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)

(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)

(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)

(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
c) Full outer join

outer_full = JOIN c BY id FULL OUTER, o BY cust_id;

Dump outer_full;
(1,Ramesh,32,Ahmedabad,2000,,,,)

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)

(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)

(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)

(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

(5,Hardik,27,Bhopal,8500,,,,)

(6,Komal,22,MP,4500,,,,)

(7,Muffy,24,Indore,10000,,,,)
PIG UDF

How to create a UDF in Pig

1. Set the classpath as follows

export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-
client-core-2.9.2.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-
common-2.9.2.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.9.2.jar:~/
pigudf/*:$HADOOP_HOME/lib/*:/usr/local/pig/pig-0.16.0-core-h1.jar:/usr/local/pig/pig-
0.16.0-core-h2.jar"

File Name: Sample_Eval.java

package pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Sample_Eval extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}
Compile the program and generate the jar file.

Copy the jar file to HDFS.
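A sketch of these two steps, assuming the source lives in ~/pigudf (as in the CLASSPATH above) and the jar is named MyUDF.jar to match the REGISTER statement below:

$ cd ~/pigudf
$ javac -d . Sample_Eval.java
$ jar cf MyUDF.jar pig/*.class
$ /usr/local/hadoop/bin/hdfs dfs -put MyUDF.jar /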

Create a file called Employ.txt and copy it to HDFS:

1,John,2007-01-24,250

2,Ram,2007-05-27,220

3,Jack,2007-05-06,170

3,Jack,2007-04-06,100

4,Jill,2007-04-06,220

5,Zara,2007-06-06,300

5,Zara,2007-02-06,35

In Pig grunt shell do the following:

Register the jar file from HDFS:

Register 'hdfs://localhost:9000/MyUDF.jar';

employee_data = LOAD 'hdfs://localhost:9000/Employ.txt' USING
PigStorage(',') as (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

Let us now convert the names of the employees into upper case using the UDF Sample_Eval.
Upper_case = FOREACH employee_data GENERATE pig.Sample_Eval(name);
Dump Upper_case;
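With Employ.txt above as the input, the dump would produce the upper-cased names, something like:

(JOHN)
(RAM)
(JACK)
(JACK)
(JILL)
(ZARA)
(ZARA)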
Retrieving user login credentials from /etc/passwd using Pig Latin

First copy the passwd file from /etc to the working directory. Assume the working directory
is /usr/local/pig.

In terminal perform the following command

sudo cp /etc/passwd /usr/local/pig

Run Pig in local mode

(You need not run Hadoop in pseudo-distributed mode)

export PIG_HOME=/usr/local/pig

sandeep@sandeep-PC:/usr/local/pig$ export PATH=$PATH:$PIG_HOME/bin

sandeep@sandeep-PC:/usr/local/pig$ pig -x local

In grunt shell

grunt>A = load 'passwd' using PigStorage(':');

B = foreach A generate $0 as id;

store B into 'id.out';

Check id.out file in Pig directory


HIVE INSTALLATION

HIVE installation steps

Prerequisites are

1. Java 1.8 (not OpenJDK)

2. MySQL

3. Fully configured Hadoop cluster

Procedure

1. Install MySQL Server using the following command

$ sudo apt-get install mysql-server

2. Install the MySQL Java Connector

$ sudo apt-get install libmysql-java

3. Create soft link for connector in Hive lib directory or copy connector jar to lib folder –

sudo ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar

Note: HIVE_HOME points to the installed Hive folder.
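If HIVE_HOME is not already set, it can be exported along with the PATH; the install location /usr/local/hive is only an assumption, following the same pattern used for Hadoop and Pig above:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin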


4. Create the Initial database schema using the hive-schema-0.14.0.mysql.sql file ( or the file
corresponding to your installed version of Hive) located in the
$HIVE_HOME/scripts/metastore/upgrade/mysql directory.
sudo mysql -u root -p

mysql> CREATE DATABASE metastore;

mysql> USE metastore;

mysql> SOURCE $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql;

(Inside the mysql client, replace $HIVE_HOME with the actual path to your Hive installation, since shell variables are not expanded there.)

mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword';


mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';

mysql> flush privileges;

Note: hiveuser is the ConnectionUserName in hive-site.xml (as explained next).


6. Create hive-site.xml (if not already present) in the $HIVE_HOME/conf folder with the
configuration below –

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepassword</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
7. We are all set now. Start the hive console.
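With the metastore configured, the console is started from the terminal:

$ hive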
HIVE EXPERIMENTS

Table Creation and loading data from text file


Create a table in hive like below

create table emp(id int,ename string) row format delimited fields terminated by ',';

Create a file called Emp.txt

1,Sandeep

2,Ramesh

3,Pradeep

Load data from the above file into the table emp as shown below:

load data local inpath '/home/sandeep/Emp.txt' into table emp;

select * from emp; displays the table contents.


Creating Managed and external tables

Hive External and Managed Tables

Managed Table:

When we load data into a managed table, Hive moves the data into its warehouse
directory.

CREATE TABLE managed_table (dummy STRING);

LOAD DATA INPATH 'hdfs://localhost:9000/user/b.txt' into table managed_table;

This moves the file b.txt into Hive's warehouse directory for the managed_table table, which
is /user/hive/warehouse/managed_table on HDFS.

Further, if we drop the table using:

DROP TABLE managed_table;

then this will delete the table metadata including its data. The data no longer exists
anywhere. This is what it means for Hive to manage the data.

External Tables – An external table behaves differently. With it, we can control the creation and
deletion of the data. The location of the external data is specified at table creation time using the
LOCATION clause.

With the EXTERNAL keyword, Apache Hive knows that it is not managing the data,
so it doesn't move the data to its warehouse directory. It does not even check whether the
external location exists at the time it is defined. This is a very useful feature because it means we
can create the data lazily after creating the table.
The important thing to notice is that when we drop an external table, Hive will leave the data
untouched and only delete the metadata.
Security
● Managed Tables – Hive solely controls managed-table security. Within Hive, security
needs to be managed, probably at the schema level (depending on the organization).
● External Tables – These tables' files are accessible to anyone who has access to the
HDFS file structure, so security needs to be managed at the HDFS file/folder level.
When to use Managed and External tables
Use a Managed table when –
● We want Hive to completely manage the lifecycle of the data and table.
● Data is temporary.
Use an External table when –
● Data is used outside of Hive. For example, the data files are read and processed by an
existing program that does not lock the files.
● We are not creating a table based on an existing table.
● We need data to remain in the underlying location even after a DROP TABLE. This
may apply if we are pointing multiple schemas at a single data set.
● Hive shouldn't own the data and control settings, directories, etc.; we may have
another program or process that will do these things.
Creating Partitions and buckets
Hive Partitions and buckets

Partitioning – Apache Hive organizes tables into partitions to group the same type of data
together based on a column or partition key. Each table in Hive can have one or more
partition keys to identify a particular partition. Using partitions, we can make it faster to run
queries on slices of the data.
Bucketing – In Hive, tables or partitions are subdivided into buckets based on the hash
function of a column in the table, to give extra structure to the data that may be used for more
efficient queries.

Table creation

hive> create table student(sid int,sname string,smarks int) row format delimited fields
terminated by ',';

Contents of file s.txt

1,abc,20

2,xyz,30

3,def,40

Loading data from a local file s.txt

hive> load data local inpath '/home/stanley/s.txt' into table student;

Partition Creation

create table student_part(sname string,smarks int) partitioned by (sid int);

set hive.exec.dynamic.partition.mode=nonstrict;

Loading data into the partitioned table (in a dynamic-partition insert, the partition column must be the last column of the SELECT):

INSERT OVERWRITE TABLE student_part PARTITION(sid)
SELECT sname,smarks,sid FROM student;

show partitions student_part;

shows the partitions based on sid

sid=1

sid=2

sid=3

Buckets:
Buckets in Hive are used to segregate Hive table data into multiple files or directories; they are
used for efficient querying.
1. The data present in partitions can be divided further into buckets.
2. The division is performed based on the hash of a particular column that we select in the
table.
3. Buckets use a hashing algorithm at the back end to read each record and place it into a
bucket.
4. In Hive, we have to enable bucketing by using set hive.enforce.bucketing=true;
Enable bucketing

hive>set hive.enforce.bucketing=true;
Create bucketed table

create table student_bucket(sid int,sname string,smarks int) clustered by (sid) into 3 buckets
row format delimited fields terminated by ',';
Inserting data
insert overwrite table student_bucket select * from student;

The buckets created can be viewed in HDFS, as shown below.
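Assuming the default warehouse location, the three bucket files (one per bucket) can be listed from the Hadoop directory like this:

bin/hdfs dfs -ls /user/hive/warehouse/student_bucket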
