Apache Hive Installation and Basic Usage Guide

This guide provides step-by-step instructions for installing and using Apache Hive on a Hadoop cluster, including prerequisites, downloading Hive, configuring environment variables, and setting up the Hive metastore. It details the creation of necessary HDFS directories, initialization of the metastore schema, and basic operations such as creating databases and tables, loading data, and executing queries. The document emphasizes the importance of using an external database for production environments and provides commands for verifying a successful installation and data operations.

Installing and Using Apache Hive on Hadoop

This guide explains how to set up Apache Hive on top of an existing Hadoop
cluster (like the one configured in hadoop_distributed_setup_guide_1) and
perform basic data operations.

Prerequisites:
● A working Hadoop cluster (HDFS & YARN services running).
● The hduser account (or your Hadoop user) configured with environment variables (HADOOP_HOME, JAVA_HOME, PATH) as described in the Hadoop setup guide.
● Hadoop services (start-dfs.sh, start-yarn.sh) should be running; a quick sanity check is shown below.
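A quick sanity check before continuing (the exact daemon list depends on your cluster layout):
# On the master node, jps should list NameNode, ResourceManager, etc.
jps
# Confirm HDFS is reachable
hdfs dfsadmin -report
# Confirm the environment variables are set
echo $HADOOP_HOME $JAVA_HOME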
1. Download and Extract Hive (On Master Node as hduser):
● Go to the Apache Hive downloads page (https://hive.apache.org/downloads.html). Choose a version compatible with your Hadoop version (check the release notes). Hive 3.1.x often works well with Hadoop 3.x.
● Download the binary tarball.
# Example for Hive 3.1.3 (adjust version as needed)
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
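● (Optional) Verify the download against the checksum published alongside the tarball on the Apache site before extracting:
sha256sum apache-hive-3.1.3-bin.tar.gz
# Compare the output to the published .sha256 value for the release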

● Extract it (e.g., into /usr/local/ or hduser's home directory). Let's use /usr/local/.
# Log in as hduser: su - hduser
# Ensure you have permissions or use sudo temporarily for /usr/local
sudo tar -xzf apache-hive-3.1.3-bin.tar.gz -C /usr/local/
cd /usr/local/
# Rename for convenience
sudo mv apache-hive-3.1.3-bin hive
# Change ownership to hduser
sudo chown -R hduser:hadoop /usr/local/hive
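● A quick check that the extraction and ownership change worked:
ls -ld /usr/local/hive   # should be owned by hduser:hadoop
ls /usr/local/hive/bin   # should list hive, schematool, beeline, etc.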

2. Configure Hive Environment Variables (On Master Node as hduser):
● Edit the .bashrc file for hduser.
nano ~/.bashrc

● Add the following lines (ensure Hadoop variables are already present):
# Hive Environment Variables
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

# Optional: Specify Hadoop config directory if not default
# export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

● Source the file to apply changes:
source ~/.bashrc
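● Verify the variables took effect:
echo $HIVE_HOME   # should print /usr/local/hive
which hive        # should print /usr/local/hive/bin/hive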

3. Configure Hive (hive-site.xml - On Master Node as hduser):
● Navigate to the Hive configuration directory:
cd $HIVE_HOME/conf

● Create hive-site.xml from the template:
cp hive-default.xml.template hive-site.xml

● Edit hive-site.xml:
nano hive-site.xml

● Add/Modify the following properties within the <configuration> tags. Important: Remove or comment out properties you don't explicitly set, as defaults might conflict. This configuration uses embedded Derby for the metastore (simple for testing, not suitable for production or concurrent sessions).
<configuration>
<property>
<name>hive.execution.engine</name>
<value>mr</value> <description>Chooses execution engine.
Options are: mr (Map Reduce, default), tez, spark.</description>
</property>

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value> <description>Location of
default database for the warehouse.</description>
</property>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore. For
embedded Derby.</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC
metastore</description>
</property>

<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
<description>HDFS scratch directory for Hive jobs</description>
</property>

<property>
<name>hive.scratch.dir.permission</name>
<value>733</value>
<description>The permission for the user specific scratch
directory</description>
</property>

<property>
<name>hive.metastore.uris</name>
<value></value> <description>Thrift URI for the remote metastore. Leave blank for embedded mode.</description>
</property>

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value> <description>The hostname of
the YARN ResourceManager.</description>
</property>

<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value> <description>The
default file system URI.</description>
</property>

<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>

</configuration>

○ Note on Metastore: For production or multi-user environments, configure Hive to use an external database like MySQL or PostgreSQL instead of Derby. This involves setting different javax.jdo.option.* properties and potentially running a separate Hive Metastore service; a sketch follows.
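As a sketch only (the hostname, database name, and credentials below are placeholders; this guide's steps continue with Derby), the MySQL equivalents of the connection properties above would look like this, with the MySQL JDBC driver JAR copied into $HIVE_HOME/lib:

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop-master:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepassword</value>
</property>

The schema would then be initialized with schematool -dbType mysql -initSchema instead of the Derby command in step 5.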
4. Create Required HDFS Directories (On Master Node as hduser):
● Hive needs specific directories in HDFS. Use the hdfs dfs commands:
# Verify Hadoop services are running first (jps)
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse   # The -p flag creates parent dirs if needed
hdfs dfs -chmod g+w /tmp                  # Grant group write permissions
hdfs dfs -chmod g+w /user/hive/warehouse  # Grant group write permissions

(You might need to create /tmp/hive as well, depending on your exact configuration; see below.)
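If so, the commands follow the same pattern as above (matching the hive.exec.scratchdir value set in hive-site.xml):
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod g+w /tmp/hive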
5. Initialize Metastore Schema (Derby - Run ONCE on Master as
hduser):
● This step creates the necessary tables in the Derby database
(metastore_db) where Hive stores metadata (table definitions, schemas,
etc.).
cd $HIVE_HOME/bin
# Point to the correct Hadoop home if not already set system-wide
# export HADOOP_HOME=/usr/local/hadoop
schematool -dbType derby -initSchema
# Look for "Schema initialization finished" or similar success message.

○ If you encounter errors, check hive-site.xml settings and HDFS permissions. If it complains about the metastore already existing (e.g., from a previous attempt), you might need to delete the metastore_db directory (usually created in the directory where you run the command, e.g., /usr/local/hive/bin) and retry; an example follows.
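A reset-and-retry sequence might look like this. Caution: deleting metastore_db discards any existing Hive metadata, so only do this on a fresh install.
cd $HIVE_HOME/bin
rm -rf metastore_db                 # CAUTION: wipes all Hive metadata
schematool -dbType derby -initSchema
schematool -dbType derby -info      # prints the schema version on success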
6. Run Hive CLI and Perform Operations (On Master Node as
hduser):
● Start the Hive command-line interface:
hive

You should see log messages and eventually the hive> prompt.
● Create a Database:
CREATE DATABASE IF NOT EXISTS my_test_db;

● Use the Database:
USE my_test_db;

● Create a Table: Let's create a simple employees table.
CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' -- Specify comma as the delimiter
STORED AS TEXTFILE; -- Store data as plain text files
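You can confirm the definition from the hive> prompt:
SHOW TABLES;                   -- should list 'employees'
DESCRIBE employees;            -- column names and types
DESCRIBE FORMATTED employees;  -- includes the table's HDFS location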

● Prepare Sample Data (Locally):
○ Exit the Hive CLI temporarily (exit; or quit;).
○ Create a local CSV file (e.g., in hduser's home directory).
cd ~   # Go to home directory
nano employees.csv
cd ~ # Go to home directory
nano employees.csv

○ Add some data (no header row):
101,Alice,Engineering,80000
102,Bob,Sales,65000
103,Charlie,Engineering,85000
104,David,HR,60000

○ Save and close the file (Ctrl+X, then Y, then Enter in nano).
● Load Data into the Table:
○ Go back into the Hive CLI: hive
○ Make sure you are using the correct database: USE my_test_db;
○ Load the local file into the Hive table. Hive will move the data into the
HDFS directory corresponding to the table
(/user/hive/warehouse/my_test_db.db/employees/).
-- Use the full path to your local file
LOAD DATA LOCAL INPATH '/home/hduser/employees.csv' OVERWRITE INTO TABLE employees;

■ LOCAL: Indicates the source file is on the local filesystem where the Hive client is running.
■ OVERWRITE: Replaces existing data in the table. Use INTO TABLE (without OVERWRITE) to append, as in the example below.
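For example, appending a second file (second_batch.csv is a hypothetical file in the same CSV format) rather than replacing the data:
-- Appends to the existing table data instead of overwriting it
LOAD DATA LOCAL INPATH '/home/hduser/second_batch.csv' INTO TABLE employees;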
● Run a Basic Select Query:
SELECT * FROM employees;

SELECT department, AVG(salary)
FROM employees
GROUP BY department;

● Exit Hive:
exit;
-- or
quit;

7. Verification:
● The SELECT queries should display the data you loaded and the
aggregated results.
● You can also check HDFS to see the data file:
hdfs dfs -ls /user/hive/warehouse/my_test_db.db/employees/
# You should see a file named 'employees.csv' (or similar) inside
hdfs dfs -cat /user/hive/warehouse/my_test_db.db/employees/employees.csv

You have now successfully installed Hive (using the Derby metastore), created a database and table, loaded data from a local file, and executed basic SQL-like queries against data stored in HDFS, potentially processed via MapReduce on YARN.
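For scripted use, the same statements can also be run non-interactively via the Hive CLI (my_queries.hql is a hypothetical script file):
# Run a single statement
hive -e "SELECT department, AVG(salary) FROM my_test_db.employees GROUP BY department;"
# Run a file of HiveQL statements
hive -f my_queries.hql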

A shortened version of the guide follows.

Installing and Using Apache Hive on Hadoop (Concise Guide)

This guide explains how to set up and use Apache Hive on an existing Hadoop
cluster.

Prerequisites: Working Hadoop cluster (HDFS & YARN), hduser configured with Hadoop environment variables, Hadoop services running.

1. Download and Extract Hive (On Master Node as hduser):

● Download a compatible Hive binary tarball from the Apache Hive downloads page.
● Extract to /usr/local/.
● Rename: sudo mv apache-hive-3.1.3-bin hive.
● Change ownership: sudo chown -R hduser:hadoop /usr/local/hive.

2. Configure Hive Environment Variables (On Master Node as hduser):

● Edit ~/.bashrc to add HIVE_HOME and update PATH.
● Source the file: source ~/.bashrc.

3. Configure Hive (hive-site.xml - On Master Node as hduser):

● Navigate to $HIVE_HOME/conf.
● Create hive-site.xml from hive-default.xml.template.
● Edit hive-site.xml to include the following properties (for embedded Derby metastore):

XML
<configuration>
<property>
<name>hive.execution.engine</name>
<value>mr</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.scratch.dir.permission</name>
<value>733</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value></value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
</configuration>

● Note on Metastore: For production, use an external database.

4. Create Required HDFS Directories (On Master Node as hduser):


● hdfs dfs -mkdir /tmp
● hdfs dfs -mkdir -p /user/hive/warehouse
● hdfs dfs -chmod g+w /tmp
● hdfs dfs -chmod g+w /user/hive/warehouse

5. Initialize Metastore Schema (Derby - Run ONCE on Master as hduser):


● cd $HIVE_HOME/bin
● schematool -dbType derby -initSchema

6. Run Hive CLI and Perform Operations (On Master Node as hduser):

● Start CLI: hive
● Create Database: CREATE DATABASE IF NOT EXISTS my_test_db;
● Use Database: USE my_test_db;
● Create Table:

SQL
CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

● Prepare Sample Data (Locally): Create employees.csv with sample data.
● Load Data: LOAD DATA LOCAL INPATH '/home/hduser/employees.csv' OVERWRITE INTO TABLE employees;
● Run Queries:
○ SELECT * FROM employees;
○ SELECT department, AVG(salary) FROM employees GROUP BY department;
● Exit: exit;
7. Verification:

● Check query results in the Hive CLI.
● Verify data in HDFS: hdfs dfs -ls /user/hive/warehouse/my_test_db.db/employees/ and hdfs dfs -cat /user/hive/warehouse/my_test_db.db/employees/employees.csv.
