UNIT 5 Notes by ARUN JHAPATE
Downloading Apache Pig
Step 1
Open the homepage of the Apache Pig website. Under the News section, click on the release page link.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this
page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig
0.7 and before. Click on the link Pig 0.8 and later, then you will be redirected to the page
having a set of mirrors.
Step 3
Choose and click any one of these mirrors.
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of
Apache Pig. Click the latest version among them.
Step 5
Within these folders, you will find the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary distributions of Apache Pig 0.15, namely pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Installing Apache Pig
Step 1
Create a directory with the name Pig in the same directory where the installation directories
of Hadoop, Java, and other software were installed. (In this tutorial, the Pig directory is created
in the home directory of the user named Hadoop.)
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
.bashrc file
In the .bashrc file, set the following variables −
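A typical set of entries (assuming the /home/Hadoop/Pig directory created above and a HADOOP_HOME variable pointing at your Hadoop installation) is:
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf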
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can
set various parameters as given below.
pig -h properties
The following properties are supported −
Logging:
verbose = true|false; default is false. This property is the same as the -v switch.
brief = true|false; default is false. This property is the same as the -b switch.
debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.
aggregate.warning = true|false; default is true. If true, prints the count of warnings of each type rather than logging each warning.
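For example, a pig.properties file that states these defaults explicitly would contain the following entries (an illustrative sketch):
verbose=false
brief=false
debug=INFO
aggregate.warning=true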
You can run Apache Pig in two modes, namely, Local Mode and MapReduce Mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode
using the Grunt shell. In this shell, you can enter the Pig Latin statements and get
the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig allows us to define our own functions (User
Defined Functions) in programming languages such as Java and use them in our
scripts; a minimal sketch is shown below.
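As an illustration of embedded mode, a minimal Pig UDF written in Java might look like the following (the class name UpperCase and the jar name myudfs.jar are assumptions for the example):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
// Returns the upper-case form of its single chararray argument.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
After packaging the class into myudfs.jar, it can be used from a Pig script with REGISTER myudfs.jar; and then called like any built-in function, for example B = FOREACH A GENERATE UpperCase(name);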
Invoking the Grunt Shell
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.
Local mode:
$ ./pig -x local
MapReduce mode:
$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.
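For example, a short interactive session (the input file name and schema here are illustrative) might look like this:
grunt> student = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP student;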
You can write an entire Pig Latin script in a file and execute it using the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Now, you can execute the script in the above file as shown below.
Local mode:
$ pig -x local sample_script.pig
MapReduce mode:
$ pig -x mapreduce sample_script.pig
Verifying Java Installation
$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, then follow the steps given below to install it.
Installing Java
Step I:
Download the JDK archive (jdk-7u71-linux-x64.gz) from the Oracle Java downloads page into the Downloads folder.
Step II:
Generally you will find the downloaded java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:
To make Java available to all users, you have to move it to the location "/usr/local/". Switch to
the root user and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc
file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step V:
Now verify the Java installation using the java -version command from the terminal as explained above.
Verifying Hadoop Installation
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
Hadoop 2.4.1 Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:
Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following
commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
You can set Hadoop environment variables by appending the following commands
to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration
files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using Java, you have to reset the Java environment
variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java
on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance,
the memory allocated for the file system, the memory limit for storing the data, and the size of
the Read/Write buffers.
Open the core-site.xml and add the following properties in between the <configuration> and
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode
path, and the datanode path of your local file systems, that is, the place where you want to
store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the
following properties in between the <configuration> and </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop
contains a template of mapred-site.xml. First of all, you need to copy the file
mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
The following command is used to start dfs. Executing this command will start your Hadoop
file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-
namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-
localhost.out
Starting secondary namenodes [0.0.0.0]
The following command is used to start the yarn script. Executing this command will start your
yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
nodemanager-localhost.out
The default port number to access Hadoop is 50070. Use the following url to get Hadoop
services on your browser.
http://localhost:50070/
The default port number to access all applications of the cluster is 8088. Use the following url to
visit this service.
http://localhost:8088/
Downloading Hive
We use hive-0.14.0 in this tutorial. You can download it by visiting the following
link: http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded into the
/Downloads directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz"
for this tutorial. The following commands are used to verify the download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz
The following commands are used to extract and verify the Hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
We need to copy the files as the super user (using "su -"). The following commands are used to copy
the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
The following command is used to execute ~/.bashrc file.
$ source ~/.bashrc
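Hive also needs to be configured to work with Hadoop through the hive-env.sh file in the $HIVE_HOME/conf directory. A minimal sketch of preparing it (assuming the template shipped with Hive) is:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh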
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to
configure Metastore. We use Apache Derby database.
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
We need to copy the files as the super user (using "su -"). The following commands are used to copy the files
from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:
$ source ~/.bashrc
Create a directory named data in the $DERBY_HOME directory to store Metastore data.
$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.
Configuring the Metastore means specifying to Hive where the database is stored. You can do this by creating a file named jpox.properties and adding the following lines to it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
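In addition, the javax.jdo.option.ConnectionURL property in hive-site.xml (in the $HIVE_HOME/conf directory) is usually pointed at the same Derby database; an illustrative entry (the host and database name should match the ConnectionURL above) is:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://hadoop1:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>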
Before verifying Hive, you need to create the /tmp folder and a separate Hive folder (/user/hive/warehouse) in HDFS and set group write permission (chmod g+w) on them.
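Typical commands for this (assuming the default warehouse location) are:
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse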
The Hive query language (HQL) provides basic SQL-like operations such as selecting, filtering, grouping and joining data. For example, queries like the following can be expressed directly in HQL:
SELECT * FROM sales;
SELECT category, count(1) FROM products GROUP BY category;
As you can see, these queries are very similar to ordinary SQL queries.
UDF
A UDF processes one or several columns of one row and outputs one value. For example:
SELECT lower(str) FROM table;
For each row in "table", the "lower" UDF takes one argument, the value of "str", and
outputs one value, the lowercase representation of "str".
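A query of the same shape using the datediff UDF (the column names match the description that follows; in Hive, datediff takes the end date as its first argument) would be:
SELECT datediff(date_end, date_begin) FROM table;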
For each row in "table," the "datediff" UDF takes two arguments, the value of
"date_begin" and "date_end", and outputs one value, the difference in time between these
two dates.
UDAF
A UDAF processes one or several columns of several input rows and outputs one value.
It is commonly used together with the GROUP BY operator. For example:
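An illustrative query of this kind (the sales table and its customer and price columns are assumptions for the example) is:
SELECT customer, sum(price) FROM sales GROUP BY customer;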
The Hive Query executor will group rows by customer, and for each group, call the
UDAF with all price values. The UDAF then outputs one value for the output record (one
output record per customer);
For each record of each group, the UDAF will receive the values of the selected columns
and output one value for the output record.
Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL, and extends the security of Oracle Database to
all your data. It also includes a unique Smart Scan service that minimizes data
movement and maximizes performance by parsing and intelligently filtering data
where it resides.
HIVE:
PIG