UNIT 5 Notes by ARUN JHAPATE
Downloading Apache Pig
Step 1
Open the homepage of the Apache Pig website. Under the News section, click on the release page link.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this
page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig
0.7 and before. Click on the link Pig 0.8 and later, then you will be redirected to the page
having a set of mirrors.
Step 3
Choose and click any one of these mirrors.
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of
Apache Pig. Click the latest version among them.
Step 5
Within these folders, you will find the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary distributions of Apache Pig 0.15, namely pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Installing Apache Pig
Step 1
Create a directory with the name Pig in the same directory where the installation directories
of Hadoop, Java, and other software were installed. (In this tutorial, the Pig directory is created
in the home directory of the user named Hadoop.)
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
.bashrc file
In the .bashrc file, set the following variables −
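A typical set of entries (assuming the /home/Hadoop/Pig directory created above and a HADOOP_HOME variable pointing at your Hadoop installation) is:
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf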
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can
set various parameters as given below.
pig -h properties
The following properties are supported −
Logging:
verbose = true|false; default is false. This property is the same as the -v switch.
brief = true|false; default is false. This property is the same as the -b switch.
debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.
aggregate.warning = true|false; default is true. If true, prints the count of warnings of each type rather than logging each warning.
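For example, a pig.properties file that states these defaults explicitly would contain the following entries (an illustrative sketch):
verbose=false
brief=false
debug=INFO
aggregate.warning=true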
You can run Apache Pig in two modes, namely, Local Mode and MapReduce Mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode
using the Grunt shell. In this shell, you can enter the Pig Latin statements and get
the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig allows us to define our own functions (User
Defined Functions) in programming languages such as Java and use them in our
scripts; a minimal sketch is shown below.
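As an illustration of embedded mode, a minimal Pig UDF written in Java might look like the following (the class name UpperCase and the jar name myudfs.jar are assumptions for the example):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
// Returns the upper-case form of its single chararray argument.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
After packaging the class into myudfs.jar, it can be used from a Pig script with REGISTER myudfs.jar; and then called like any built-in function, for example B = FOREACH A GENERATE UpperCase(name);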
Invoking the Grunt Shell
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.
Local mode:
$ ./pig -x local
MapReduce mode:
$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.
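For example, a short interactive session (the input file name and schema here are illustrative) might look like this:
grunt> student = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP student;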
You can write an entire Pig Latin script in a file and execute it using the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Now, you can execute the script in the above file as shown below.
Local mode:
$ pig -x local sample_script.pig
MapReduce mode:
$ pig -x mapreduce sample_script.pig
Verifying Java Installation
$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, then follow the steps given below to install it.
Installing Java
Step I:
Download the JDK archive (jdk-7u71-linux-x64.gz) from the Oracle Java downloads page into the Downloads folder.
Step II:
Generally you will find the downloaded java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:
To make Java available to all users, you have to move it to the location "/usr/local/". Switch to
the root user and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc
file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step V:
Now verify the Java installation using the java -version command from the terminal as explained above.
Verifying Hadoop Installation
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
Hadoop 2.4.1 Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:
Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following
commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
You can set Hadoop environment variables by appending the following commands
to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration
files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using Java, you have to reset the Java environment
variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java
on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance,
the memory allocated for the file system, the memory limit for storing the data, and the size of
the Read/Write buffers.
Open the core-site.xml and add the following properties in between the <configuration> and
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode
path, and the datanode path of your local file systems, that is, the place where you want to
store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the
following properties in between the <configuration> and </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop
contains a template of mapred-site.xml. First of all, you need to copy the file
mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
The following command is used to start dfs. Executing this command will start your Hadoop
file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-
namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-
localhost.out
Starting secondary namenodes [0.0.0.0]
The following command is used to start the yarn script. Executing this command will start your
yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
nodemanager-localhost.out
The default port number to access Hadoop is 50070. Use the following url to get Hadoop
services on your browser.
http://localhost:50070/
The default port number to access all applications of the cluster is 8088. Use the following url to
visit this service.
http://localhost:8088/
Downloading Hive
We use hive-0.14.0 in this tutorial. You can download it by visiting the following
link: http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded into the
/Downloads directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz"
for this tutorial. The following commands are used to verify the download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz
The following commands are used to extract and verify the Hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
We need to copy the files as the super user (using "su -"). The following commands are used to copy
the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
The following command is used to execute ~/.bashrc file.
$ source ~/.bashrc
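Hive also needs to be configured to work with Hadoop through the hive-env.sh file in the $HIVE_HOME/conf directory. A minimal sketch of preparing it (assuming the template shipped with Hive) is:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh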
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to
configure Metastore. We use Apache Derby database.
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
We need to copy the files as the super user (using "su -"). The following commands are used to copy the files
from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:
$ source ~/.bashrc
Create a directory named data in the $DERBY_HOME directory to store Metastore data.
$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.
Configuring the Metastore means specifying to Hive where the database is stored. You can do this by creating a file named jpox.properties and adding the following lines to it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
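In addition, the javax.jdo.option.ConnectionURL property in hive-site.xml (in the $HIVE_HOME/conf directory) is usually pointed at the same Derby database; an illustrative entry (the host and database name should match the ConnectionURL above) is:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://hadoop1:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>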
Before verifying Hive, you need to create the /tmp folder and a separate Hive folder (/user/hive/warehouse) in HDFS and set group write permission (chmod g+w) on them.
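Typical commands for this (assuming the default warehouse location) are:
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse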
The Hive query language (HQL) provides basic SQL-like operations such as selecting, filtering, grouping and joining data. For example, queries like the following can be expressed directly in HQL:
SELECT * FROM sales;
SELECT category, count(1) FROM products GROUP BY category;
As you can see, these queries are very similar to ordinary SQL queries.
UDF
A UDF processes one or several columns of one row and outputs one value. For example:
SELECT lower(str) FROM table;
For each row in "table", the "lower" UDF takes one argument, the value of "str", and
outputs one value, the lowercase representation of "str".
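A query of the same shape using the datediff UDF (the column names match the description that follows; in Hive, datediff takes the end date as its first argument) would be:
SELECT datediff(date_end, date_begin) FROM table;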
For each row in "table," the "datediff" UDF takes two arguments, the value of
"date_begin" and "date_end", and outputs one value, the difference in time between these
two dates.
UDAF
A UDAF processes one or several columns of several input rows and outputs one value.
It is commonly used together with the GROUP BY operator. For example:
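An illustrative query of this kind (the sales table and its customer and price columns are assumptions for the example) is:
SELECT customer, sum(price) FROM sales GROUP BY customer;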
The Hive Query executor will group rows by customer, and for each group, call the
UDAF with all price values. The UDAF then outputs one value for the output record (one
output record per customer);
For each record of each group, the UDAF will receive the values of the selected columns
and output one value for the output record.
Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL, and extends the security of Oracle Database to
all your data. It also includes a unique Smart Scan service that minimizes data
movement and maximizes performance by parsing and intelligently filtering data
where it resides.
HIVE:
PIG