
UNIT-5

Installing and Running Apache Pig

Prerequisites


Hadoop and Java must be installed on your system before you install Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following link −
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Download Apache Pig


First of all, download the latest version of Apache Pig from the following website −
https://pig.apache.org/

Step 1
Open the homepage of the Apache Pig website. Under the News section, click on the release page link.

Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will find two links, namely Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later; you will then be redirected to a page with a set of mirrors.

Step 3
Choose and click any one of these mirrors.

Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of
Apache Pig. Click the latest version among them.

Step 5
Within these folders, you will find the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary distributions of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Install Apache Pig


After downloading the Apache Pig software, install it in your Linux environment by following
the steps given below.

Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software are located. (In this tutorial, we have created the Pig directory under the home directory of the user named Hadoop.)

$ mkdir Pig

Step 2
Extract the downloaded tar files as shown below.

$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz

Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/

Configure Apache Pig


After installing Apache Pig, we have to configure it. To do so, we need to edit two files − .bashrc and pig.properties.

.bashrc file
In the .bashrc file, set the following variables −

 PIG_HOME to the Apache Pig installation folder,
 PATH to include the bin folder of Pig, and
 PIG_CLASSPATH to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
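
After editing .bashrc, apply the changes to the current session, the same step used elsewhere in these notes:

$ source ~/.bashrc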

pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below. You can list all supported properties with the following command.

$ pig -h properties

The following properties are supported −

Logging:
verbose = true|false; default is false. This property is the same as the -v switch.
brief = true|false; default is false. This property is the same as the -b switch.
debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.
aggregate.warning = true|false; default is true. If true, prints the count of warnings of each type rather than logging each warning.

Performance tuning:
pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory). Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory). Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner = true|false; default is false. Only disable the combiner as a temporary workaround for problems.
opt.multiquery = true|false; multiquery is on by default. Only disable multiquery as a temporary workaround for problems.
opt.fetch = true|false; fetch is on by default. Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MapReduce jobs.
pig.tmpfilecompression = true|false; compression is off by default. Determines whether the output of intermediate jobs is compressed.
pig.tmpfilecompression.codec = lzo|gzip; default is gzip. Used in conjunction with pig.tmpfilecompression. Defines the compression type.
pig.noSplitCombination = true|false; split combination is on by default. Determines if multiple small files are combined into a single map.
pig.exec.mapPartAgg = true|false; default is false. Determines if partial aggregation is done within the map phase, before records are sent to the combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>; default is 10. If the in-map partial aggregation does not reduce the number of output records by this factor, it gets disabled.

Miscellaneous:
exectype = mapreduce|tez|local; default is mapreduce. This property is the same as the -x switch.
pig.additional.jars.uris=<comma separated list of jars>. Used in place of the register command.
udf.import.list=<comma separated list of imports>. Used to avoid package names in UDFs.
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>, e.g. +08:00; default is the default timezone of the host. Determines the timezone used to handle the datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
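
As a minimal sketch, a pig.properties file that turns on intermediate-file compression and verbose logging could look like this (the values are illustrative, not recommendations):

# illustrative settings only
verbose=true
debug=INFO
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip
exectype=mapreduce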

Verifying the Installation


Verify the installation of Apache Pig by typing the version command. If the installation is
successful, you will get the version of Apache Pig as shown below.
$ pig -version

Apache Pig version 0.15.0 (r1682971)


compiled Jun 01 2015, 11:44:35


Apache Pig Execution Modes

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode
MapReduce mode is where we load or process data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the Dump operator).

Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.

Embedded Mode (UDF) − Apache Pig lets you define your own functions (User Defined Functions) in programming languages such as Java, and use them in your scripts.
Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode −
$ ./pig -x local

MapReduce mode −
$ ./pig -x mapreduce


Either of these commands gives you the Grunt shell prompt as shown below.
grunt>

You can exit the Grunt shell using ‘ctrl + d’.

After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.

grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
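
As a small continuation sketch (assuming customers.txt exists in the working directory; the relation name customers_limited is our own), you can inspect a few records of the loaded relation with the LIMIT and Dump operators:

grunt> customers_limited = LIMIT customers 5;
grunt> Dump customers_limited;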

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it using the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.

sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
Dump student;

Now, you can execute the script in the above file as shown below.

Local mode −
$ pig -x local sample_script.pig

MapReduce mode −
$ pig -x mapreduce sample_script.pig


Installing and Running HIVE


All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux-flavored OS. The following simple steps are executed for Hive installation:

Step 1: Verifying JAVA Installation


Java must be installed on your system before installing Hive. Let us verify the Java installation using the following command:

$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, then follow the steps given below for installing Java.

Installing Java
Step I:

Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link −
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:

Generally you will find the downloaded java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:

To make Java available to all users, you have to move it to the location "/usr/local/". Switch to the root user, and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit


Step IV:

For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc
file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.

$ source ~/.bashrc

Step V:

Use the following commands to configure Java alternatives:

# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2

# alternatives --set java /usr/local/java/bin/java
# alternatives --set javac /usr/local/java/bin/javac
# alternatives --set jar /usr/local/java/bin/jar


Now verify the installation using the command java -version from the terminal as explained
above.

Step 2: Verifying Hadoop Installation


Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop
installation using the following command:

$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
Hadoop 2.4.1 Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:

Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following
commands.
$ su


password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit

Installing Hadoop in Pseudo Distributed Mode


The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step I: Setting up Hadoop

You can set Hadoop environment variables by appending the following commands
to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.

$ source ~/.bashrc

Step II: Hadoop Configuration

You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration
files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance,
memory allocated for the file system, memory limit for storing the data, and the size of
Read/Write buffers.
Open the core-site.xml and add the following properties in between the <configuration> and
</configuration> tags.


<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the data replication value, the namenode path, and the datanode path of your local file systems, that is, the place where you want to store the Hadoop infrastructure.
Let us assume the following data.

dfs.replication (data replication value) = 1

(In the following path, hadoop is the user name, and hadoopinfra/hdfs/namenode is the directory created for the HDFS file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created for the HDFS file system.)
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>

</configuration>
Note: In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
<configuration>


<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of this file. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration> tags in this file.

<configuration>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

</configuration>

Verifying Hadoop Installation


The following steps are used to verify the Hadoop installation.

Step I: Name Node Setup

Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0


10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:


/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step II: Verifying Hadoop dfs

The following command is used to start dfs. Executing this command will start your Hadoop
file system.

$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-
namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-
localhost.out
Starting secondary namenodes [0.0.0.0]

Step III: Verifying Yarn Script

The following command is used to start the yarn script. Executing this command will start your
yarn daemons.

$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
nodemanager-localhost.out

Step IV: Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services in your browser.

http://localhost:50070/


Step V: Verify all applications for cluster

The default port number to access all applications of the cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/

Step 3: Downloading Hive

We use hive-0.14.0 in this tutorial. You can download it by visiting the following link:
http://apache.petsads.us/hive/hive-0.14.0/
Let us assume it gets downloaded into the /Downloads directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz" for this tutorial. The following command is used to verify the download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:

apache-hive-0.14.0-bin.tar.gz

Step 4: Installing Hive


The following steps are required for installing Hive on your system. Let us assume the Hive
archive is downloaded onto the /Downloads directory.

Extracting and verifying Hive Archive


The following commands are used to extract and verify the Hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:

apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory

We need to copy the files as the superuser "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:

# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up environment for Hive

You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
The following command is used to execute ~/.bashrc file.

$ source ~/.bashrc

Step 5: Configuring Hive


To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in
the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder
and copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:

export HADOOP_HOME=/usr/local/hadoop
Hive installation is now complete. Before running Hive, you require an external database server to configure the Metastore. We use the Apache Derby database.

Step 6: Downloading and Installing Apache Derby


Follow the steps given below to download and install Apache Derby:


Downloading Apache Derby

The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:

$ ls
On successful download, you get to see the following response:

db-derby-10.4.2.0-bin.tar.gz

Extracting and verifying Derby archive

The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:

db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying files to /usr/local/derby directory

We need to copy the files as the superuser "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up environment for Derby

You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:

$ source ~/.bashrc

Create a directory to store Metastore


Create a directory named data in $DERBY_HOME directory to store Metastore data.

$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.

Step 7: Configuring Metastore of Hive


Configuring Metastore means specifying to Hive where the database is stored. You can do this
by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy
the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and
</configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Step 8: Verifying Hive Installation


Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission (chmod g+w) on these newly created folders in HDFS before verifying Hive. Use the following commands:


$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The following commands are used to verify Hive installation:
$ cd $HIVE_HOME
$ bin/hive
On successful installation of Hive, you get to see the following response:
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-
0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>
The following sample command is executed to display all the tables:
hive> show tables;
OK
Time taken: 2.798 seconds
hive>

 What is HiveQL (HQL)?

Hive query language provides basic SQL-like operations. Here are a few of the tasks which HQL can do easily:

 Create and manage tables and partitions
 Support various relational, arithmetic, and logical operators
 Evaluate functions
 Download the contents of a table to a local directory, or the result of queries to an HDFS directory

Here is an example of HQL queries:

SELECT upper(name), salesprice
FROM sales;

SELECT category, count(1)
FROM products
GROUP BY category;


When you look at the above queries, you can see that they are very similar to SQL queries.
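
To illustrate the table-management tasks listed above, here is a hedged HiveQL sketch; the partition column and output path are hypothetical, not from these notes:

-- create a partitioned table (sale_year is an assumed partition column)
CREATE TABLE sales_part (name STRING, salesprice DOUBLE)
PARTITIONED BY (sale_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- download a query result to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sales_out'
SELECT name, salesprice FROM sales;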

HIVE User Defined Functions

 Hive UDF versus UDAF

In Hive, you can define two main kinds of custom functions:

UDF
A UDF processes one or several columns of one row and outputs one value. For example:

o SELECT lower(str) from table

For each row in "table", the "lower" UDF takes one argument, the value of "str", and outputs one value, the lowercase representation of "str".

o SELECT datediff(date_begin, date_end) from table

For each row in "table", the "datediff" UDF takes two arguments, the values of "date_begin" and "date_end", and outputs one value, the difference in time between these two dates.

Each argument of a UDF can be:

 A column of the table
 A constant value
 The result of another UDF
 The result of an arithmetic computation

A combined example is sketched below.
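
As a hedged sketch covering all four argument kinds in one query (the table and column names are hypothetical):

-- column (name), constant ('-'), nested UDF (lower), arithmetic (price * 1.1)
SELECT concat(lower(name), '-'), round(price * 1.1)
FROM products;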

UDAF
A UDAF processes one or several columns of several input rows and outputs one value. It is commonly used together with the GROUP BY operator. For example:

o SELECT sum(price) from table GROUP BY customer;

The Hive query executor will group rows by customer and, for each group, call the UDAF with all price values. The UDAF then outputs one value for the output record (one output record per customer).

o SELECT total_customer_value(quantity, unit_price, day) from table GROUP BY customer;


For each record of each group, the UDAF will receive the three values of the three selected columns, and output one value for the output record.

 Oracle Big Data SQL

Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL, and extends the security of Oracle Database to all your data. It also includes a unique Smart Scan service that minimizes data movement and maximizes performance by parsing and intelligently filtering data where it resides.

What are the main differences between Hive, Pig, and SQL?

HIVE:

1. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and analysis of large datasets stored in Hadoop-compatible file systems.
2. Hive provides a mechanism to query the data using an SQL-like language called Hive QL (HQL).
3. Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop.

Hive limitations (Compared to SQL Languages)

1. No support for UPDATE or DELETE.
2. No support for inserting single rows.
3. Limited number of built-in functions.
4. Not all standard SQL is supported.

PIG

1. An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets.
2. Its compiler translates Pig Latin into sequences of MapReduce programs. A concrete contrast with HQL is sketched below.
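
To make the contrast concrete, here is the same aggregation expressed in both languages (a minimal sketch; the input file name and schema are assumptions, not from these notes):

HiveQL −
SELECT category, count(1) FROM products GROUP BY category;

Pig Latin −
products = LOAD 'products.txt' USING PigStorage(',') AS (name:chararray, category:chararray);
grouped = GROUP products BY category;
counts = FOREACH grouped GENERATE group, COUNT(products);
DUMP counts;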
