
Apache Hive

Apache Hive is an open source data warehouse system for querying and analyzing large
data sets that are principally stored in Hadoop files. It is commonly deployed alongside
compatible tools in the software ecosystem built on the Hadoop framework for handling
large data sets in a distributed computing environment.

Like Hadoop, Hive has roots in batch processing techniques. It originated in 2007 with
developers at Facebook who sought to provide SQL access to Hadoop data for analytics
users. Like Hadoop, Hive was developed to address the need to handle the petabytes of
data accumulating from web activity. Release 1.0 became available in February 2015.

How Apache Hive works

Initially, Hadoop processing relied solely on the MapReduce framework, which required
users to write relatively advanced Java programs in order to query data successfully.
The motivation behind Apache Hive was to simplify query development and, in turn, to
open up Hadoop's unstructured data to a wider group of users in organizations.

Hive has three main functions: data summarization, query and analysis. It supports
queries expressed in a language called HiveQL, or HQL, a declarative SQL-like language
that, in its first incarnation, automatically translated SQL-style queries into MapReduce
jobs executed on the Hadoop platform. In addition, HiveQL supported custom MapReduce
scripts that could be plugged into queries.

When SQL queries are submitted to Hive, via its command line or through Java Database
Connectivity/Open Database Connectivity interfaces, they are initially received by a
driver component that creates session handles and forwards requests to a compiler,
which in turn generates jobs for execution. Hive enables data serialization and
deserialization through pluggable SerDe (serializer/deserializer) libraries and
increases flexibility in schema design by including a system catalog called the Hive
metastore.
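
As a sketch of how these pieces fit together, the hypothetical DDL below registers a
table whose rows are parsed by Hive's built-in CSV SerDe; the table definition itself
is recorded in the metastore:

    -- Hypothetical table whose rows are deserialized by a built-in SerDe
    CREATE TABLE web_logs (
      ip  STRING,
      ts  STRING,
      url STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';

    -- Inspect the metadata the metastore holds for the table
    DESCRIBE FORMATTED web_logs;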
How Hive has evolved

Like Hadoop, Hive has evolved to encompass more than just MapReduce. The inclusion of
the YARN resource manager in Hadoop 2.0 expanded developers' ability to use Hive, as it
did for other Hadoop ecosystem components. Over time, HiveQL has gained support in the
Apache Spark SQL engine as well as in the Hive engine itself, and the Hive engine has
added support for distributed query execution via Apache Tez and Spark.

Early Hive file support comprised text files (also called flat files), SequenceFiles
(flat files consisting of binary key/value pairs) and Record Columnar Files (RCFiles,
which store the columns of a table in the manner of a columnar database). Hive's
columnar storage support has since grown to include Optimized Row Columnar (ORC) files
and Parquet files.
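
As a sketch, the storage format is typically chosen with a STORED AS clause in the
table DDL; the table and column names here are hypothetical:

    -- Plain text (flat file) storage
    CREATE TABLE events_text (id BIGINT, payload STRING)
    STORED AS TEXTFILE;

    -- Columnar storage with ORC
    CREATE TABLE events_orc (id BIGINT, payload STRING)
    STORED AS ORC;

    -- Columnar storage with Parquet
    CREATE TABLE events_parquet (id BIGINT, payload STRING)
    STORED AS PARQUET;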

Hive execution speed and interactivity were a topic of attention nearly from its
inception, because query performance lagged behind that of more familiar SQL engines.
In 2013, to boost performance, Apache Hive committers began work on the Stinger
project, which brought Apache Tez and directed acyclic graph processing to the
warehouse system.

More on Apache Hive

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and
formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a
perfect reservoir of heterogeneous data—structured and unstructured—from a multitude
of sources.

Data analysts use Hive to query, summarize, explore and analyze that data, then turn it
into actionable business insight.

Advantages of using Hive for enterprise SQL in Hadoop


Feature                  Description

Familiar                 Query data with a SQL-based language

Fast                     Interactive response times, even over huge data sets

Scalable and extensible  As data variety and volume grow, more commodity machines
                         can be added without a corresponding reduction in
                         performance

Compatible               Works with traditional data integration and data
                         analytics tools

How Hive works

Hive on LLAP (Live Long and Process) makes use of persistent query servers with
intelligent in-memory caching to avoid Hadoop’s batch-oriented latency and provide
sub-second query response times against smaller data volumes, while Hive on Tez
continues to provide excellent batch query performance against petabyte-scale data sets.
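
As a sketch, the execution engine can be chosen per session through the
hive.execution.engine property; which values are usable depends on what the cluster
has installed:

    -- Run subsequent queries on Tez
    SET hive.execution.engine=tez;

    -- Or fall back to classic MapReduce
    SET hive.execution.engine=mr;

    -- Hive on Spark, where configured
    SET hive.execution.engine=spark;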

The tables in Hive are similar to tables in a relational database, and data units are
organized in a taxonomy from larger to more granular units. Databases are composed of
tables, which are made up of partitions. Data can be accessed via a simple query
language, and Hive supports overwriting or appending data.

Within a particular database, data in the tables is serialized, and each table has a
corresponding Hadoop Distributed File System (HDFS) directory. Each table can be
subdivided into partitions that determine how data is distributed within
subdirectories of the table directory. Data within partitions can be further broken
down into buckets.
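
A minimal sketch of this taxonomy in HiveQL, with hypothetical names: the PARTITIONED
BY clause maps each partition to an HDFS subdirectory, CLUSTERED BY splits each
partition's data into a fixed number of bucket files, and the INSERT statements show
appending versus overwriting:

    -- Partitioned, bucketed table (hypothetical names)
    CREATE TABLE page_views (
      user_id BIGINT,
      url     STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS;

    -- Append data to one partition
    INSERT INTO TABLE page_views PARTITION (dt='2015-02-01')
    SELECT user_id, url FROM staging_views;

    -- Replace the contents of that partition
    INSERT OVERWRITE TABLE page_views PARTITION (dt='2015-02-01')
    SELECT user_id, url FROM staging_views;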

Hive supports all the common primitive data types, such as BIGINT, BINARY, BOOLEAN,
CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP and TINYINT. In
addition, analysts can combine primitive data types to form complex data types, such
as structs, maps and arrays.
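
As a sketch, primitives can be composed into complex types directly in a table
definition (the table and column names here are hypothetical):

    -- Combining primitive types into structs, maps and arrays
    CREATE TABLE user_profiles (
      user_id    BIGINT,
      name       STRUCT<first: STRING, last: STRING>,
      prefs      MAP<STRING, STRING>,
      device_ids ARRAY<STRING>
    );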
