Apache Hive
Apache Hive is an open source data warehouse system for querying and analyzing large
data sets that are principally stored in Hadoop files. It is commonly deployed alongside
other tools in the software ecosystem built on the Hadoop framework for handling large
data sets in a distributed computing environment.
Like Hadoop, Hive has its roots in batch processing. It originated in 2007 with
developers at Facebook who sought to provide SQL access to Hadoop data for
analytics users. As with Hadoop, Hive was developed to address the need to handle
the petabytes of data accumulating through web activity. Release 1.0 became available in
February 2015.
Initially, Hadoop processing relied solely on the MapReduce framework, which
required users to master advanced Java programming in order to query data
successfully. The motivation behind Apache Hive was to simplify query
development and, in turn, open up Hadoop's unstructured data to a wider group of users
in organizations.
Hive has three main functions: data summarization, query and analysis. It supports
queries expressed in a language called HiveQL, or HQL, a declarative SQL-like language
that, in its first incarnation, automatically translated SQL-style queries into MapReduce
jobs executed on the Hadoop platform. In addition, HiveQL allowed custom
MapReduce scripts to be plugged into queries.
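For illustration, an early-style HiveQL query might look like the following minimal
sketch; the table and column names here are hypothetical:

    -- Count page views per page; in early Hive, a query like this was
    -- compiled into one or more MapReduce jobs behind the scenes.
    SELECT page, COUNT(*) AS views
    FROM page_visits
    GROUP BY page;

Custom scripts are plugged into such queries through HiveQL's TRANSFORM clause.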
When SQL queries are submitted to Hive through its Java Database Connectivity (JDBC)
or Open Database Connectivity (ODBC) interfaces, they are received by a driver
component that creates session handles and forwards requests to a compiler, which in
turn hands the compiled jobs off for execution. Hive handles data
serialization/deserialization and increases flexibility in schema design through a system
catalog called the Hive Metastore.
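As a sketch of how those serialization choices are declared and recorded in the
metastore, a table definition can spell out its row format; the table below is
hypothetical, and custom serializer/deserializer (SerDe) classes can be named with a
ROW FORMAT SERDE clause instead:

    -- The metastore records this schema and row format, so Hive knows
    -- how to deserialize the underlying files at query time.
    CREATE TABLE events (
      event_id INT,
      payload STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';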
How Hive has evolved
Like Hadoop, Hive has evolved to encompass more than just MapReduce. The inclusion
of the YARN resource manager in Hadoop 2.0 expanded developers' ability to use
Hive, as it did for other Hadoop ecosystem components. Over time, HiveQL has come to
be supported by the Apache Spark SQL engine as well as by Hive's own engine, and
Hive itself has added support for distributed query execution via Apache
Tez and Spark.
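Which engine executes a query can be chosen per session through a configuration
property; a minimal sketch, noting that the supported values (mr, tez, spark) depend on
the Hive version and build:

    -- Route subsequent queries in this session to the Tez engine.
    SET hive.execution.engine=tez;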
Early Hive file support comprised text files (also called flat files), SequenceFiles (flat
files consisting of binary key/value pairs) and Record Columnar Files (RCFiles), which
store a table's columns in a columnar layout. Hive's columnar storage support has since
grown to include Optimized Row Columnar (ORC) files and Parquet files.
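Choosing one of these storage formats is a matter of a clause at table creation time; a
brief sketch with hypothetical names:

    -- Store the table in the ORC columnar format; STORED AS PARQUET
    -- works the same way for Parquet.
    CREATE TABLE sales_orc (
      order_id BIGINT,
      amount DOUBLE
    )
    STORED AS ORC;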
Hive execution speed and interactivity drew attention almost from the project's
inception, because query performance lagged behind that of more familiar SQL engines.
In 2013, to boost performance, Apache Hive committers began work on the Stinger
project, which brought Apache Tez and directed acyclic graph processing to the
warehouse system.
Hadoop was built to organize and store massive amounts of data of all shapes, sizes and
formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a
perfect reservoir of heterogeneous data—structured and unstructured—from a multitude
of sources.
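Schema on read means a schema is projected onto files only when they are queried; in
Hive this is commonly done with an external table, as in this sketch over a hypothetical
HDFS path:

    -- No data is moved or validated at creation time; the schema is
    -- applied when the files are read.
    CREATE EXTERNAL TABLE raw_events (line STRING)
    LOCATION '/data/raw/events';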
Data analysts use Hive to query, summarize, explore and analyze that data, then turn it
into actionable business insight.
Hive on LLAP (Live Long and Process) makes use of persistent query servers with
intelligent in-memory caching to avoid Hadoop’s batch-oriented latency and deliver
sub-second query response times against smaller data volumes, while Hive on Tez
continues to provide excellent batch query performance against petabyte-scale data sets.
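On releases where LLAP is available, whether a query uses the persistent LLAP
daemons can also be controlled per session; a sketch, keeping in mind that the accepted
values vary by release:

    -- Push work to the persistent LLAP daemons for this session.
    SET hive.llap.execution.mode=all;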
The tables in Hive are similar to tables in a relational database, and data units are
organized in a taxonomy from larger to more granular units. Databases are composed of
tables, which are made up of partitions. Data can be accessed via a simple query
language, and Hive supports overwriting or appending data, as the sketch after the next
paragraph shows.
Within a particular database, data in the tables is serialized, and each table has a
corresponding Hadoop Distributed File System (HDFS) directory. Each table can be
subdivided into partitions that determine how data is distributed within sub-directories
of the table directory. Data within partitions can be further broken down into buckets.
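The sketch below illustrates this taxonomy with hypothetical names: each partition
becomes a sub-directory under the table's HDFS directory, and each bucket a file within
it. The INSERT statements show the overwrite and append behavior mentioned above:

    CREATE TABLE web_logs (
      user_id BIGINT,
      url STRING
    )
    PARTITIONED BY (log_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS;

    -- Replace the contents of one partition...
    -- (older releases may require SET hive.enforce.bucketing=true)
    INSERT OVERWRITE TABLE web_logs PARTITION (log_date = '2015-02-01')
    SELECT user_id, url FROM staging_logs;

    -- ...or append to it instead.
    INSERT INTO TABLE web_logs PARTITION (log_date = '2015-02-01')
    SELECT user_id, url FROM staging_logs;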
Hive supports all the common primitive data types, such as BIGINT, BINARY,
BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING,
TIMESTAMP and TINYINT. In addition, analysts can combine primitive data types to
form complex data types, such as structs, maps and arrays.
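A short sketch of such a combination, with hypothetical table and column names:

    -- Primitive types nested inside struct, map and array columns.
    CREATE TABLE user_profiles (
      user_id BIGINT,
      name STRUCT<first: STRING, last: STRING>,
      preferences MAP<STRING, STRING>,
      visited_pages ARRAY<STRING>
    );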