Apache Hive Installation and Basic Usage Guide
This guide explains how to set up Apache Hive on top of an existing Hadoop
cluster (like the one configured in hadoop_distributed_setup_guide_1) and
perform basic data operations.
Prerequisites:
● A working Hadoop cluster (HDFS & YARN services running).
● The hduser account (or your Hadoop user) configured with environment
variables (HADOOP_HOME, JAVA_HOME, PATH) as described in the
Hadoop setup guide.
● Hadoop services (start-dfs.sh, start-yarn.sh) should be running.
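Before installing Hive, it is worth confirming the cluster is actually up. Assuming the JDK's jps utility is on the PATH, a quick check might look like:

```shell
# On the master node, jps should list NameNode, SecondaryNameNode,
# and ResourceManager; worker nodes should show DataNode and NodeManager.
jps
# Optional: report the live DataNodes as seen by the NameNode
hdfs dfsadmin -report
```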
1. Download and Extract Hive (On Master Node as hduser):
● Go to the Apache Hive downloads page
(https://hive.apache.org/downloads.html). Choose a version compatible
with your Hadoop version (check the release notes). Hive 3.1.x often
works well with Hadoop 3.x.
● Download the binary tarball.
# Example for Hive 3.1.3 (adjust version as needed)
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
● Extract the tarball into /usr/local, rename the directory to hive, and give your Hadoop user ownership:
sudo tar -xzf apache-hive-3.1.3-bin.tar.gz -C /usr/local
sudo mv /usr/local/apache-hive-3.1.3-bin /usr/local/hive
sudo chown -R hduser:hadoop /usr/local/hive
● Edit ~/.bashrc and add the following lines (ensure the Hadoop variables are already present):
# Hive Environment Variables
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
Then reload the environment: source ~/.bashrc
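With the environment variables loaded, the installation can be sanity-checked from the shell (--version is a standard Hive CLI flag):

```shell
# Should print the Hive version banner if $HIVE_HOME/bin is on the PATH
hive --version
```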
● Navigate to $HIVE_HOME/conf and create hive-site.xml from hive-default.xml.template, then edit it:
cd /usr/local/hive/conf
cp hive-default.xml.template hive-site.xml
nano hive-site.xml
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value> <description>Location of
default database for the warehouse.</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore. For
embedded Derby.</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC
metastore</description>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
<description>HDFS scratch directory for Hive jobs</description>
</property>
<property>
<name>hive.scratch.dir.permission</name>
<value>733</value>
<description>The permission for the user specific scratch
directory</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value></value> <description>Thrift URI for the remote
metastore. Leave empty for embedded mode.</description>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value> <description>The hostname of
the YARN ResourceManager.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value> <description>The
default file system URI.</description>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
</configuration>
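Before Hive can run, the warehouse and scratch directories configured above must exist in HDFS, and the embedded Derby metastore schema has to be initialized once. A sketch of those steps, assuming the paths from this configuration:

```shell
# Create the HDFS directories referenced by hive-site.xml
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod 733 /tmp/hive
# Initialize the embedded Derby metastore schema; Derby creates metastore_db
# in the directory you run this (and later the hive CLI) from
schematool -dbType derby -initSchema
```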
○ Save and close the file (Ctrl+X, then Y, then Enter in nano).
● Start the Hive CLI:
hive
You should see log messages and eventually the hive> prompt.
● Create a Database:
CREATE DATABASE IF NOT EXISTS my_test_db;
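The load step below assumes an employees table already exists in my_test_db. Its definition, which also appears at the end of this guide, is:

```sql
USE my_test_db;

CREATE TABLE IF NOT EXISTS employees (
    id INT,
    name STRING,
    department STRING,
    salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```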
● Load Data into the Table:
○ Go back into the Hive CLI: hive
○ Make sure you are using the correct database: USE my_test_db;
○ Load the local file into the Hive table. Hive will move the data into the
HDFS directory corresponding to the table
(/user/hive/warehouse/my_test_db.db/employees/).
-- Use the full path to your local file
LOAD DATA LOCAL INPATH '/home/hduser/employees.csv'
OVERWRITE INTO TABLE employees;
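The contents of employees.csv are not shown in this guide; a hypothetical four-row sample matching the table's (id, name, department, salary) columns could be created like this (path assumed — adjust to your home directory):

```shell
# Write a small hypothetical dataset for the employees table
cat > "$HOME/employees.csv" <<'EOF'
1,Alice,Engineering,85000.0
2,Bob,Marketing,62000.0
3,Carol,Engineering,91000.0
4,Dave,Sales,58000.0
EOF
```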
● Exit Hive:
exit;
-- or
quit;
7. Verification:
● The SELECT queries should display the data you loaded and the
aggregated results.
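The query step itself is not included in this excerpt; against the employees table, queries of the kind described here might look like:

```sql
USE my_test_db;
-- The data you loaded
SELECT * FROM employees;
-- An aggregated result: average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```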
● You can also check HDFS to see the data file:
hdfs dfs -ls /user/hive/warehouse/my_test_db.db/employees/
# You should see a file named 'employees.csv' (or similar) inside
hdfs dfs -cat /user/hive/warehouse/my_test_db.db/employees/employees.csv
You have now successfully installed Hive (using the embedded Derby
metastore), created a database and table, loaded data from a local file,
and run basic SQL-like queries against data stored in HDFS, with the
processing potentially executed as MapReduce jobs via YARN.
Condensed Setup Summary:
● Download a compatible Hive binary tarball from the Apache Hive downloads page.
● Extract it to /usr/local, rename it (sudo mv apache-hive-3.1.3-bin hive), and change ownership (sudo chown -R hduser:hadoop /usr/local/hive).
● Navigate to $HIVE_HOME/conf and create hive-site.xml from hive-default.xml.template.
● Edit hive-site.xml to include the following properties (for embedded Derby metastore):
<configuration>
<property>
<name>hive.execution.engine</name>
<value>mr</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.scratch.dir.permission</name>
<value>733</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value></value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
</configuration>
Run the Hive CLI (hive) on the master node as hduser and create the employees table:
CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;