Big Data Assignment

Experiment 3 : Installation of Hadoop framework in standalone mode

Step 1 — Installing Java


To get started, update your package list and install OpenJDK, the default Java Development Kit on Ubuntu 20.04:
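
For example, on Ubuntu you might run the following (a minimal sketch; default-jdk installs the distribution's default OpenJDK packages):

$ sudo apt update

$ sudo apt install default-jdk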

Once the installation is complete, let’s check the version.
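
For example:

$ java -version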

This output verifies that OpenJDK has been successfully installed.

Step 2 — Installing Hadoop


With Java in place, you’ll visit the Apache Hadoop Releases page to find the most recent stable release.

Navigate to the binary for the release you’d like to install. In this guide you’ll install Hadoop 3.3.1, but you can
substitute the version numbers in this guide with a release of your choice.

On the next page, right-click and copy the link to the release binary.
On the server, you’ll use wget to fetch it:
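
As an illustration only (substitute the mirror URL you actually copied and the version you chose):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz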

In order to make sure that the file you downloaded hasn’t been altered, you’ll do a quick check using SHA-512,
or the Secure Hash Algorithm 512. Return to the releases page, then right-click and copy the link to the
checksum file for the release binary you downloaded:

Again, you’ll use wget on your server to download the file:
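
For example, matching the illustrative archive URL above (use the checksum link you copied):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz.sha512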

Then run the verification:
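
One way to do this, assuming the shasum utility that ships with Ubuntu by default:

$ shasum -a 512 hadoop-3.3.1.tar.gz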

Output

2fd0bf74852c797dc864f373ec82ffaa1e98706b309b30d1effa91ac399b477e1accc1ee74d4ccbb1db7da1c5c541b
72e4a834f131a99f2814b030fbd043df66 hadoop-3.3.1.tar.gz

Compare this value with the SHA-512 value in the .sha512 file:

~/hadoop-3.3.1.tar.gz.sha512

...

SHA512 (hadoop-3.3.1.tar.gz) =
2fd0bf74852c797dc864f373ec82ffaa1e98706b309b30d1effa91ac399b477e1accc1ee74d4ccbb1db7da1c5c541b
72e4a834f131a99f2814b030fbd043df66

...

The output of the command you ran against the file you downloaded from the mirror should match the value in
the file you downloaded from apache.org.

Now that you’ve verified that the file wasn’t corrupted or changed, you can extract it:
Use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that
you’re extracting from a file.
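
For example:

$ tar -xzvf hadoop-3.3.1.tar.gz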

Finally, you’ll move the extracted files into /usr/local, the appropriate place for locally installed software:
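
A minimal sketch, assuming you want the installation to live under /usr/local/hadoop:

$ sudo mv hadoop-3.3.1/ /usr/local/hadoop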

With the software in place, you’re ready to configure its environment.

Step 3 — Configuring Hadoop’s Java Home


Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration
file.

The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java
binary. You will use readlink with the -f flag to follow every symlink in every part of the path, recursively.
Then, you’ll use sed to trim bin/java from the output to get the correct value for JAVA_HOME.
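
A sketch of the two commands combined (the path printed will vary with your installed JDK):

$ readlink -f /usr/bin/java | sed "s:bin/java::"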

You can copy this output to set Hadoop’s Java home to this specific version, which ensures that if the default
Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so
that Hadoop will automatically use whatever Java version is set as the system default.

To begin, open hadoop-env.sh:
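
Assuming Hadoop was moved to /usr/local/hadoop as above, and using nano as the editor:

$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh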

Then, modify the file by choosing one of the following options:

Option 1: Set a Static Value
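
A sketch, assuming the path printed by the readlink command above was /usr/lib/jvm/java-11-openjdk-amd64/ (yours may differ); in hadoop-env.sh set:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/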

Option 2: Use Readlink to Set the Value Dynamically
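
In this case the same hadoop-env.sh line evaluates readlink each time Hadoop starts:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")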


If you have trouble finding these lines, use CTRL+W to quickly search through the text. Once you’re done, exit
with CTRL+X and save your file.

Note: With respect to Hadoop, the value of JAVA_HOME in hadoop-env.sh overrides any values that are set in
the environment by /etc/profile or in a user’s profile.

Step 4 — Running Hadoop


Now you should be able to run Hadoop:
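
For example, invoking the hadoop binary with no arguments prints its usage help, and the bundled examples jar can run a simple MapReduce grep job (a sketch, assuming the /usr/local/hadoop path used above):

$ /usr/local/hadoop/bin/hadoop

$ mkdir ~/input

$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep ~/input ~/grep_example 'allowed[.]*'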

This output means you’ve successfully configured Hadoop to run in stand-alone mode.
Experiment 4 : Installation of Hadoop framework in pseudo-distributed mode
Step 1: Download Binary Package :
Download the latest Hadoop binary package from the following site.

http://hadoop.apache.org/releases.html

For reference, save the downloaded file to the following folder.

C:\BigData

Step 2: Unzip the binary package

Open Git Bash, change directory (cd) to the folder where you saved the binary package, and then unzip it as
follows.

$ cd /c/BigData

MINGW64: C:\BigData

$ tar -xvzf hadoop-3.1.2.tar.gz

In this case, the Hadoop binary is extracted to C:\BigData\hadoop-3.1.2.

Next, go to this GitHub repo and download the bin folder as a zip as shown below. Extract the zip and copy
all the files present under the bin folder to C:\BigData\hadoop-3.1.2\bin. Replace the existing files as well.

Step 3: Create folders for datanode and namenode :


 Go to C:/BigData/hadoop-3.1.2 and create a folder named ‘data’. Inside the ‘data’ folder
create two folders, ‘datanode’ and ‘namenode’. Your files on HDFS will reside under the
datanode folder.

 Set Hadoop Environment Variables

 Hadoop requires the following environment variables to be set.

HADOOP_HOME=C:\BigData\hadoop-3.1.2

HADOOP_BIN=C:\BigData\hadoop-3.1.2\bin

JAVA_HOME=<Root of your JDK installation>

 To set these variables, navigate to My Computer or This PC.

Right-click -> Properties -> Advanced System settings -> Environment variables.
 Click New to create a new environment variable.

 If you don’t have Java 1.8 installed, you’ll have to download and install it
first. If the JAVA_HOME environment variable is already set, check whether the path has
any spaces in it (ex: C:\Program Files\Java\… ). Spaces in the JAVA_HOME path will lead to issues.
There is a trick to get around it: replace ‘Program Files’ with ‘Progra~1’ in the variable value. Ensure
that the version of Java is 1.8 and that JAVA_HOME points to JDK 1.8.

Step 4: Make a Short Name for the Java Home path


 Set Hadoop Environment Variables

 Edit PATH Environment Variable


 Click on New and
add %JAVA_HOME%, %HADOOP_HOME%, %HADOOP_BIN%, and %HADOOP_HOME%\sbin to
your PATH one by one.
 Now that we have set the environment variables, we need to validate them. Open a new Windows Command
prompt and run an echo command on each variable to confirm they are assigned the desired values.

echo %HADOOP_HOME%

echo %HADOOP_BIN%

echo %PATH%

 If the variables are not initialized yet, it is likely because you are testing them in an
old session. Make sure you have opened a new command prompt to test them.

Step 5: Configure Hadoop

Once environment variables are set up, we need to configure Hadoop by editing the following configuration
files.

hadoop-env.cmd

core-site.xml
hdfs-site.xml

mapred-site.xml

yarn-site.xml

hadoop-env.cmd

First, let’s configure the Hadoop environment file. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hadoop-env.cmd


and add the below content at the bottom:

set HADOOP_PREFIX=%HADOOP_HOME%

set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%

set PATH=%PATH%;%HADOOP_PREFIX%\bin

Step 6: Edit hdfs-site.xml

After configuring the environment file, you need to set the replication factor and the locations of the namenode and
datanode directories. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hdfs-site.xml and add the below content within the
<configuration> </configuration> tags.

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>
<value>C:\BigData\hadoop-3.1.2\data\namenode</value>

</property>

<property>

<name>dfs.datanode.data.dir</name>

<value>C:\BigData\hadoop-3.1.2\data\datanode</value>

</property>

</configuration>

Step 7: Edit core-site.xml

Now, configure Hadoop Core’s settings. Open C:\BigData\hadoop-3.1.2\etc\hadoop\core-site.xml and add the below


content within the <configuration> </configuration> tags.

<configuration>

<property>

<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>

</property>

</configuration>

Step 8: YARN configurations

Edit file yarn-site.xml

Make sure the following entries exist.


<configuration>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>
</configuration>

Step 9: Edit mapred-site.xml

Finally, let’s configure properties for the MapReduce framework. Open C:\BigData\hadoop-
3.1.2\etc\hadoop\mapred-site.xml and add the below content inside the <configuration> </configuration> tags. If
you don’t see mapred-site.xml, open the mapred-site.xml.template file and rename it to
mapred-site.xml.

<configuration>

<property>
<name>mapreduce.job.user.name</name> <value>%USERNAME%</value>

</property>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

<property>

<name>yarn.apps.stagingDir</name> <value>/user/%USERNAME%/staging</value>

</property>

<property>

<name>mapreduce.jobtracker.address</name>

<value>local</value>

</property>
</configuration>

Check if the C:\BigData\hadoop-3.1.2\etc\hadoop\slaves file is present; if it is not, create one, add
localhost to it, and save it.

Step 10: Format Name Node :

To format the Name Node, open a new Windows Command Prompt and run the below command. It might give
you a few warnings; ignore them.

 hadoop namenode -format

Format Hadoop Name Node

Step 11: Launch Hadoop :

Open a new Windows Command Prompt; make sure to run it as an Administrator to avoid
permission errors. Once opened, execute the start-all.cmd command as shown below. Since we have
added %HADOOP_HOME%\sbin to the PATH variable, you can run this command from any folder. If
you haven’t done so, go to the %HADOOP_HOME%\sbin folder and run the command.
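
For example:

start-all.cmd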

You can check the screenshot given below for reference. 4 new windows will open with command prompt
terminals for the following 4 daemon processes:
 namenode

 datanode

 node manager

 resource manager
Don’t close these windows; minimize them. Closing the windows will terminate the daemons. You can run
them in the background if you don’t want to see these windows.

Step 12: Hadoop Web UI

Finally, let’s monitor how the Hadoop daemons are doing. You can also
use the Web UI for all kinds of administrative and monitoring purposes. Open your browser and get started.

Step 13: Resource Manager

Open localhost:8088 to open Resource Manager


Step 14: Node Manager

Open localhost:8042 to open Node Manager

Step 15: Name Node :

Open localhost:9870 to check out the health of Name Node


Step 16: Data Node :
Open localhost:9864 to check out Data Node
Experiment 5 : Installation of Hadoop framework in fully distributed mode

1. Download Hadoop from the official website and put it in an appropriate directory (e.g. /usr/local/). Later
on we shall refer to this folder as HADOOP_PARENT_DIR; the full path of the Hadoop home
directory shall instead be referred to as HADOOP_HOME.

2. Set the hadoop user as the owner of the hadoop home directory.
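
A minimal sketch, assuming Hadoop was extracted to /usr/local/hadoop-3.3.1 and a dedicated user and group named hadoop already exist:

$ sudo chown -R hadoop:hadoop /usr/local/hadoop-3.3.1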

3. Configure Hadoop: it is just a matter of editing a few configuration files, which you will find in
the $HADOOP_HOME/etc/hadoop/ directory. The files to be edited are core-site.xml, hdfs-
site.xml, mapred-site.xml, yarn-site.xml, workers, and hadoop-env.sh:
 core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>

We can see a property which specifies that the HDFS namenode process is hosted by the node whose hostname
is master and is running on port 9000.

 hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- namenode storage dir property -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/hdfs/namenode</value>
</property>
<!-- datanodes storage dir property -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/hdfs/datanode</value>
</property>
</configuration>

In this file, we specify the replication factor for HDFS; here it is set to 3. We also specify where to store the HDFS
data on each node: in the /home/hadoop/data/hdfs/namenode folder for the namenode, and in
the /home/hadoop/data/hdfs/datanode folder for the datanodes.

 mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property> <property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property> <property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
</configuration>

In this file, the amount of RAM to allocate to the MapReduce applications on each node is specified. We maintain
the same configuration for each node; the exact values you should use depend on the specs of the machine
and on its workload besides Hadoop. The above configuration is one adopted on an existing cluster with 4 GB RAM
datanodes.

 yarn-site.xml

<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration><!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property> <property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property> <property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property> <property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property> <property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property> <property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property> <property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>

For more information on how to choose the memory configuration values when editing the mapred-
site.xml and yarn-site.xml files, I suggest reading this post.

 workers
Probably the simplest configuration file: we just have to list the hostnames of the nodes hosting the
datanodes in our cluster (one hostname per line), as in the example below.
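
For example, for a cluster with three worker nodes (hostnames hypothetical), the workers file would contain:

worker1
worker2
worker3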

 hadoop-env.sh
We only have to set the JAVA_HOME environment variable at the end of the file.
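
For example (the JDK path is an assumption; use the one installed on your nodes):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64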

8. The last step is to update the hadoop user’s bash profile, that is, to edit the /home/hadoop/.bashrc file. We
have to:

 Set and export the JAVA_HOME environment variable

 Set the PDSH_RCMD_TYPE environment variable to ssh

 Set the HADOOP_HOME environment variable

 Add the $HADOOP_HOME/bin and $HADOOP_HOME/sbin to the PATH environment variable,


then export it.
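
A sketch of the lines to append to /home/hadoop/.bashrc, assuming the JDK path above and Hadoop installed under /usr/local/hadoop-3.3.1:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PDSH_RCMD_TYPE=ssh
export HADOOP_HOME=/usr/local/hadoop-3.3.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin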

9. Finished!
To check that everything went right, you can format the namenode and then launch HDFS and YARN by typing the
following commands on your master node.

hdfs namenode -format


start-dfs.sh
start-yarn.sh
Experiment 7 : Installation of Pig

Step 1

Open the homepage of the Apache Pig website. Under the News section, click on the link release page as shown in
the following snapshot.

Step 2

On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under
the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the
link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3

Choose and click any one of these mirrors as shown below.


Step 4

These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click
the latest version among them.

Step 5

Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download
the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Install Apache Pig


After downloading the Apache Pig software, install it in your Linux environment by following the steps given
below.

Step 1

Create a directory with the name Pig in the same directory where the installation directories of Hadoop,
Java, and other software were installed. (In our tutorial, we have created the Pig directory in the home directory of
the user named Hadoop.)

$ mkdir Pig

Step 2

Extract the downloaded tar files as shown below.

$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz

$ tar zxvf pig-0.15.0.tar.gz

Step 3

Move the content of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/

Configure Apache Pig


After installing Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc and
pig.properties.

.bashrc file
In the .bashrc file, set the following variables −

 PIG_HOME environment variable to the Apache Pig installation folder,

 PATH environment variable to the bin folder, and

 PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installations
(the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig

export PATH=$PATH:/home/Hadoop/Pig/bin

export PIG_CLASSPATH=$HADOOP_HOME/conf


pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various
parameters as given below.

pig -h properties

The following properties are supported −

Logging:

verbose = true|false; default is false. This property is the same as the -v switch.

brief = true|false; default is false. This property is the same as the -b switch.

debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.

aggregate.warning = true|false; default is true. If true, prints count of warnings of each type rather than logging each warning.

Performance tuning: pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).

Note that this memory is shared across all large bags used by the application.

pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory).

Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner = true|false; default is false.

Only disable combiner as a temporary workaround for problems.

opt.multiquery = true|false; multiquery is on by default.

Only disable multiquery as a temporary workaround for problems.

opt.fetch=true|false; fetch is on by default.

Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.

pig.tmpfilecompression = true|false; compression is off by default.

Determines whether output of intermediate jobs is compressed.


pig.tmpfilecompression.codec = lzo|gzip; default is gzip.

Used in conjunction with pig.tmpfilecompression. Defines compression type.

pig.noSplitCombination = true|false. Split combination is on by default.

Determines if multiple small files are combined into a single map.

pig.exec.mapPartAgg = true|false. Default is false.

Determines if partial aggregation is done within map phase, before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch

pig.additional.jars.uris=<comma separated list of jars>. Used in place of the register command.

udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.

stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.

Determines the timezone used to handle datetime datatype and UDFs.

Additionally, any Hadoop property can be specified.

Verifying the Installation


Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will
get the version of Apache Pig as shown below.

$ pig -version

Apache Pig version 0.15.0 (r1682971)

compiled Jun 01 2015, 11:44:35


Experiment 8 : Installation of Hive

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hive. Let us verify java installation using the following
command:

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If java is not installed in your system, then follow the steps given below for installing java.
Installing Java

Step I:

Download java (JDK <latest version> - X64.tar.gz) by visiting the following


link http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:

Generally you will find the downloaded java file in the Downloads folder. Verify it and extract the jdk-7u71-
linux-x64.gz file using the following commands.

$ cd Downloads/

$ ls

jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz

$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:

To make Java available to all users, you have to move it to the location “/usr/local/”. Switch to the root user and type the
following commands.

$ su

password:

# mv jdk1.7.0_71 /usr/local/

# exit

Step IV:
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71

export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.

$ source ~/.bashrc

Step V:

Use the following commands to configure java alternatives:


# alternatives --install /usr/bin/java java /usr/local/jdk1.7.0_71/bin/java 2

# alternatives --install /usr/bin/javac javac /usr/local/jdk1.7.0_71/bin/javac 2

# alternatives --install /usr/bin/jar jar /usr/local/jdk1.7.0_71/bin/jar 2

# alternatives --set java /usr/local/jdk1.7.0_71/bin/java

# alternatives --set javac /usr/local/jdk1.7.0_71/bin/javac

# alternatives --set jar /usr/local/jdk1.7.0_71/bin/jar

Now verify the installation using the command java -version from the terminal as explained above.

Step 2: Downloading Hive


We use hive-0.14.0 in this tutorial. You can download it by visiting the following
link http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the /Downloads
directory. Here, we download Hive archive named “apache-hive-0.14.0-bin.tar.gz” for this tutorial. The
following command is used to verify the download:

$ cd Downloads

$ ls

On successful download, you get to see the following response:

apache-hive-0.14.0-bin.tar.gz

Step 3: Installing Hive

The following steps are required for installing Hive on your system. Let us assume the Hive archive is
downloaded onto the /Downloads directory.

Extracting and verifying Hive Archive

The following command is used to verify the download and extract the hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz

$ ls

On successful download, you get to see the following response:

apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory

We need to copy the files as the super user “su -”. The following commands are used to copy the files from
the extracted directory to the /usr/local/hive directory.

$ su -
passwd:

# cd /home/user/Download

# mv apache-hive-0.14.0-bin /usr/local/hive

# exit

Setting up environment for Hive

You can set up the Hive environment by appending the following lines to ~/.bashrc file:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.

export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

The following command is used to execute ~/.bashrc file.

$ source ~/.bashrc

Step 4: Configuring Hive

To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in
the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder and copy the
template file:

$ cd $HIVE_HOME/conf

$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line:

export HADOOP_HOME=/usr/local/hadoop

Hive installation is now complete. You now require an external database server to configure the
Metastore. We use the Apache Derby database.

Step 5: Downloading and Installing Apache Derby

Follow the steps given below to download and install Apache Derby:

Downloading Apache Derby


The following command is used to download Apache Derby. It takes some time to download.

$ cd ~

$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

The following command is used to verify the download:

$ ls

On successful download, you get to see the following response:


db-derby-10.4.2.0-bin.tar.gz

Extracting and verifying Derby archive

The following commands are used for extracting and verifying the Derby archive:

$ tar zxvf db-derby-10.4.2.0-bin.tar.gz

$ ls

On successful download, you get to see the following response:

db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
Copying files to /usr/local/derby directory

We need to copy the files as the super user “su -”. The following commands are used to copy the files from the
extracted directory to the /usr/local/derby directory:
$ su -

passwd:

# cd /home/user

# mv db-derby-10.4.2.0-bin /usr/local/derby

# exit

Setting up environment for Derby

You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby

export PATH=$PATH:$DERBY_HOME/bin


export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

The following command is used to execute ~/.bashrc file:

$ source ~/.bashrc

Create a directory to store Metastore

Create a directory named data in $DERBY_HOME directory to store Metastore data.

$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.

Step 6: Configuring Metastore of Hive

Configuring Metastore means specifying to Hive where the database is stored. You can do this by editing the
hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the template file using the
following command:

$ cd $HIVE_HOME/conf

$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>

</property>

Create a file named jpox.properties and add the following lines into it:

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl

org.jpox.autoCreateSchema = false
org.jpox.validateTables = false

org.jpox.validateColumns = false

org.jpox.validateConstraints = false

org.jpox.storeManagerType = rdbms

org.jpox.autoCreateSchema = true

org.jpox.autoStartMechanismMode = checked

org.jpox.transactionIsolation = read_committed

javax.jdo.option.DetachAllOnCommit = true

javax.jdo.option.NontransactionalRead = true

javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver

javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true

javax.jdo.option.ConnectionUserName = APP

javax.jdo.option.ConnectionPassword = mine

Step 7: Verifying Hive Installation


Before running Hive, you need to create the /tmp folder and a separate Hive warehouse folder in HDFS. Here, we use
the /user/hive/warehouse folder. You need to set write permission (chmod g+w) for these newly created folders in
HDFS before verifying Hive. Use the following commands:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

The following commands are used to verify Hive installation:

$ cd $HIVE_HOME
$ bin/hive

On successful installation of Hive, you get to see the following response:

Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties

Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt

………………….

hive>

The following sample command is executed to display all the tables:

hive> show tables;

OK
Time taken: 2.798 seconds

hive>
