Hive Slides-2

This document provides an overview of Hive, a data warehousing solution built on Hadoop. It discusses how Hive addresses challenges faced by data analysts working with large datasets in Hadoop by providing a SQL-like interface called HiveQL. It describes Hive's key components, including its data model with tables, partitions, and buckets; its architecture with a metastore, driver, and Thrift server; and query execution via compilation into MapReduce jobs. The document also covers the pros and cons of Hive and how it compares to the Pig framework.

Hive - A Warehousing Solution

Over a Map-Reduce Framework


Agenda

• Why Hive?
• What is Hive?
• Hive Data Model
• Hive Architecture
• HiveQL
• Hive SerDes
• Pros and Cons
• Hive vs. Pig
• Graphs
Data Analysts with Hadoop

Challenges that Data Analysts Faced

• Data Explosion
  - TBs of data generated every day
  - Solution: HDFS to store the data, and the Hadoop Map-Reduce framework to parallelize its processing

• What is the catch?
  - Hadoop Map-Reduce is Java intensive
  - Thinking in the Map-Reduce paradigm can get tricky

… Enter Hive!
Hive Key Principles
HiveQL to MapReduce

The data analyst writes a HiveQL query, and the Hive framework compiles it into a MapReduce job. For example, SELECT COUNT(1) FROM Sales; launches an MR job instance over the Sales Hive table: mappers emit (rowcount, 1) pairs, and the reducer sums them into (rowcount, N).
Hive Data Model

Data in Hive is organized into:

• Tables

• Partitions

• Buckets
Hive Data Model Contd.

• Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data serialized and stored as files within that directory
- Hive has default serialization built in, which supports compression and lazy deserialization
- Users can specify custom serialization/deserialization schemes (SerDes)
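As a sketch of how a table's storage format is declared, the following uses ORC, one of Hive's built-in formats with its own SerDe (the table and column names are illustrative):

```sql
-- Illustrative table; ORC is a built-in columnar format with its own SerDe
CREATE TABLE sales_orc (
  sale_id INT,
  amount  FLOAT
)
STORED AS ORC;
```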
Hive Data Model Contd.

• Partitions
- Each table can be broken into partitions

- Partitions determine distribution of data within subdirectories

Example -

CREATE TABLE Sales (sale_id INT, amount FLOAT)

PARTITIONED BY (country STRING, year INT, month INT)

Each partition is stored under its own subdirectory, e.g.

Sales/country=US/year=2012/month=12
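Partitioning pays off at query time: a query that filters on the partition columns only reads the matching subdirectories. An illustrative query against the Sales table above:

```sql
-- Only the Sales/country=US/year=2012/month=12 directory is scanned
SELECT SUM(amount)
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;
```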
Hierarchy of Hive Partitions

/hivebase/Sales
  /country=US
    /year=2012
      /month=11
      /month=12
  /country=CANADA
    /year=2012
    /year=2014
    /year=2015
      /month=11
Hive Data Model Contd.

• Buckets
- Data in each partition is divided into buckets
- Bucket assignment is based on a hash function of a column:
  H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory
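Bucketing is declared with a CLUSTERED BY clause; a sketch (the bucket column and count are illustrative), where each row lands in bucket hash(sale_id) mod 32:

```sql
CREATE TABLE sales_bucketed (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;
```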


Architecture
External Interfaces – CLI, Web UI, and JDBC/ODBC programming interfaces

Thrift Server – cross-language service framework

Metastore – metadata about the Hive tables and partitions

Driver – the brain of Hive: compiler, optimizer, and execution engine
Hive Thrift Server

• Framework for cross-language services
• Server written in Java
• Support for clients written in different languages:
  JDBC (Java), ODBC (C++), PHP, Perl, Python scripts
Metastore

• System catalog containing metadata about the Hive tables
• Stored in an RDBMS or on the local FS; HDFS is too slow (not optimized for random access)
• Objects of the Metastore:
 Database – namespace of tables
 Table – list of columns, types, owner, storage, SerDes
 Partition – partition-specific columns, SerDes, and storage
Hive Driver

• Driver – maintains the lifecycle of a HiveQL statement
• Query Compiler – compiles HiveQL into a DAG of map-reduce tasks
• Executor – executes the task plan generated by the compiler in proper dependency order; interacts with the underlying Hadoop instance
Compiler
• Converts HiveQL into a plan for execution; plans can include:
  - Metadata operations for DDL statements, e.g. CREATE
  - HDFS operations, e.g. LOAD
• Semantic Analyzer – checks schema information, type checking, implicit type conversion, column verification
• Optimizer – finds the best logical plan, e.g. combines multiple joins to reduce the number of map-reduce jobs, prunes columns early to minimize data transfer
• Physical plan generator – creates the DAG of map-reduce jobs
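The plan the compiler produces can be inspected with EXPLAIN, which prints the stage DAG instead of running the query (illustrative):

```sql
EXPLAIN SELECT COUNT(1) FROM Sales;
-- Output lists the stage dependencies and the map/reduce operator tree
```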


HiveQL
DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE

DML:
LOAD DATA
INSERT

QUERY:
SELECT
GROUP BY
JOIN
MULTI-TABLE INSERT
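A multi-table insert scans the source table once and fans the rows out to several targets. A sketch, assuming us_sales and ca_sales tables already exist with a compatible schema:

```sql
FROM Sales
INSERT OVERWRITE TABLE us_sales SELECT * WHERE country = 'US'
INSERT OVERWRITE TABLE ca_sales SELECT * WHERE country = 'CANADA';
```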
Hive SerDe

• On a SELECT query, the Record Reader reads records from the Hive table, the SerDe deserializes each record into a Row Object, and the Object Inspector maps its fields for the end user.
• Hive built-in SerDes: Avro, ORC, Regex, etc.
• Custom SerDes can be used, e.g. for unstructured data (audio/video) or semi-structured XML data.
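A non-default SerDe is attached with a ROW FORMAT SERDE clause. This sketch uses Hive's built-in RegexSerDe; the table name and the regular expression are illustrative:

```sql
CREATE TABLE apache_logs (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- Illustrative pattern: capture groups map to the columns in order
  "input.regex" = "(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"([^\"]*)\".*"
)
STORED AS TEXTFILE;
```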
Good Things

• Boon for data analysts

• Easy learning curve

• Completely transparent to the underlying Map-Reduce

• Partitions (speed!)

• Flexibility to load data from the local FS/HDFS into Hive tables
Cons and Possible
Improvements
• Extending SQL query support (updates, deletes)

• Parallelizing the firing of independent jobs from the work DAG

• Table statistics in the Metastore

• Exploring methods for multi-query optimization

• Performing N-way generic joins in a single map-reduce job

• Better debug support in the shell


Hive v/s Pig
Similarities:
 Both are high-level languages that work on top of the map-reduce framework
 Can coexist, since both use the underlying HDFS and map-reduce

Differences:
 Language
 Pig is procedural (A = load 'mydata'; dump A)
 Hive is declarative (select * from A)

 Work Type
 Pig is more suited for ad hoc analysis (on-demand analysis of clickstream search logs)
 Hive is a reporting tool (e.g. weekly BI reporting)
Hive v/s Pig
Differences:

 Users
 Pig – researchers, programmers (building complex data pipelines, machine learning)
 Hive – business analysts

 Integration
 Pig – doesn't have a Thrift server (i.e. no/limited cross-language support)
 Hive – Thrift server

 Users' needs
 Pig – better dev environments and debuggers expected
 Hive – better integration with technologies expected (e.g. JDBC, ODBC)
Head-to-Head
(the bee, the pig, the elephant)

Versions: Hadoop 0.18.x; Pig: 786346; Hive: 786346


REFERENCES

• https://hive.apache.org/

• https://cwiki.apache.org/confluence/display/Hive/Presentations

• https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html

• http://www.qubole.com/blog/big-data/hive-best-practices/

• Hortonworks tutorials (YouTube)

• Graph: https://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf
