LectureNotes Hive Final
LectureNotes Hive Final
CIS 612
SUNNIE CHUNG
APACHE HIVE IS
Data warehouse infrastructure built on top of
Hadoop enabling data summarization and ad-hoc
3
HIVE
Tables
Tables
Analogous to tables in relational database
Each table has a corresponding HDFS directory
Hive provides built-in serialization formats which exploit compression
and lazy-serialization
Partitions
Each table can have one or more partitions (Horizontal Partitions)
Example:
Table T in the directory : /wh/T.
If Tis partitioned on columns ds = ‘20090101’, and ctry = ‘US’, will be
stored /wh/T/ds=20090101/ctry=US.
Buckets
Data in each partition may in turn be divided into buckets based on the
hash of a column in the table
Each bucket is stored as a file in the partition directory
TABLE SCHEMA EXAMPLE
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
6
HIVE QUERY LANGUAGE
SQL like language: HiveQL
DDL : to create tables with specific serialization
Compute daily
statistics on the
frequency of status
updates based on
gender and school
ADVANTAGES OF HIVE
Familiar: hundreds of unique users can
simultaneously query the data using a language
familiar to SQL users.
Bottom
Top
Figure 2: Query plan with 3 map-reduce
jobs for multi-table insert query
HIVE ARCHITECTURE
Compile
The compiler converts the string(DDL/DML/query statement)
to a plan.
The parser transforms a query string to a parse tree
representation
The semantic analyzer transforms the parse tree to a block-based
exchange,stock_symbol,date,stock_price_open,stock_price_hig
h,stock_price_low,stock_price_close,stock_volume,stock_price_
adj_close
NASDAQ,BBND,2010-02-08,2.92,2.98,2.86,2.96,483800,2.96
NASDAQ,BBND,2010-02-05,2.85,2.94,2.79,2.93,884000,2.93
NASDAQ,BBND,2010-02-04,2.83,2.88,2.78,2.83,1333300,2.83 21
….
CREATE TABLE TO HOLD THE DATA:
Describe table:
hive> DESCRIBE DATABASE financials;
OK
Financials
hdfs://localhost:54310/user/hive/warehouse/financials.db
Use database:
hive> USE financials;
Drop database: 23
hive> DROP DATABASE IF EXISTS financials;
HOW TO LOAD DATA INTO HIVE TABLE
Use LOAD DATA to import data into a Hive
table
25
HIVE SUPPORTS THE FOLLOWINGS:
WHERE Clause
UNION All and DISTINCT
LIMIT Clause
26
OUTPUT DATA
27
LOAD FILE INTO TABLE:
29
DEFINITION: ACID
Atomicity
Atomicity requires that each transaction be "all or nothing": if one part of the transaction
fails, the entire transaction fails, and the database state is left unchanged. An atomic
system must guarantee atomicity in each and every situation, including power failures,
errors, and crashes. To the outside world, a committed transaction appears (by its effects on
32
PROJECTS & TOOLS ON HADOOP
HBase
Hive
Jaql
ZooKeeper
AVRO
UIMA
Sqoop
33
HIVE TUTORIAL
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-HiveTutorial
[7] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors,
Adam Manzanares, Xiao Qin, " Improving MapReduce Performance
through Data Placement in Heterogeneous Hadoop Clusters", 19th
International Heterogeneity in Computing Workshop, Atlanta, Georgia,
April 2010
35
REFERENCES
[8]Dhruba Borthakur, The Hadoop Distributed File System:
Architecture and Design, The Apache Software Foundation 2007.
[9] "Apache Hadoop",
http://en.wikipedia.org/wiki/Apache_Hadoop
36