
7 Hive Notes

Hive is a data warehouse infrastructure built on top of Hadoop. It provides data summarization, query, and analysis capabilities. Hive comprises a metastore to store metadata, a query language called HiveQL, a compiler to parse and optimize queries into a physical execution plan of MapReduce jobs, and an execution engine to run the jobs. Hive allows users to query data using SQL-like queries and manages the execution via MapReduce for scalable data processing.


Background

• Yahoo worked on Pig to facilitate application deployment on Hadoop.
– Their need was mainly focused on unstructured data.
• Simultaneously, Facebook started working on deploying warehouse solutions on Hadoop, which resulted in Hive.
– The size of data being collected and analyzed in industry for business intelligence (BI) is growing rapidly, making traditional warehousing solutions prohibitively expensive.
Hive Architecture
HIVE: Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS,
metadata)
• Metastore: schema, location in HDFS, owner,
creation time, access attributes, SerDe
Compilation Process
• Parser
– Converts the query string to a parse tree
• Semantic Analyzer
– Converts the parse tree to a block-based internal query representation
– Verification, expansion of SELECT *, and type checking
• Logical Plan Generator
– Converts the internal query representation to a logical plan
• Optimizer
– Combines multiple joins
– Adds repartition operators
– Minimizes data transfers and applies pruning
• Physical Plan Generator
– Converts the logical plan into a physical plan (a DAG of map-reduce jobs)
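The plan produced by this pipeline can be inspected from the Hive shell with the EXPLAIN statement (a minimal sketch, using the pv_users table introduced later in these notes):
hive> EXPLAIN SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;
-- prints the abstract syntax tree, the stage dependencies, and the map-reduce stages of the physical plan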
Hive QL
Commands to work with scripts
Bash Shell

Hive Shell
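A rough sketch of the typical commands (the script name sample.hql is only an example):
From the bash shell:
$ hive -e 'SELECT * FROM managed_table;'    # run a single query and exit
$ hive -f sample.hql                        # run the statements in a script file
From the Hive shell:
hive> source sample.hql;                    -- run a script file from within an interactive session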
Type System
Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT.
– Boolean: BOOLEAN.
– Floating point numbers: FLOAT, DOUBLE.
– String: STRING.
Complex types
– Structs: {a INT; b INT}.
– Maps: M['group'].
– Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
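A minimal sketch of how these types appear in a table definition (the table name complex_demo and its columns are hypothetical):
hive> CREATE TABLE complex_demo (
    >   s STRUCT<a:INT, b:INT>,
    >   m MAP<STRING, INT>,
    >   a ARRAY<STRING>
    > );
hive> SELECT s.a, m['group'], a[1] FROM complex_demo;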
Data Model Types - Tables
– Analogous to tables in relational database
– Actual data is stored in a Hadoop Filesystem and each
table has a corresponding HDFS dir
– Metadata is always stored in a meta store
– Managed Tables
• Hive physically moves data into its warehouse
– External Tables
• Hive refers data from existing location in HDFS
Example

– Managed Tables
• hive> CREATE TABLE managed_table (dummy STRING);

– External Tables
• hive> CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
• To list the schema of a table:
hive> DESCRIBE <tablename>;
Importing Data into HIVE
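A minimal sketch of the usual import statements (file paths and the pv_users table are illustrative):
hive> LOAD DATA LOCAL INPATH '/home/me/pv_users.txt' OVERWRITE INTO TABLE pv_users;   -- copies a local file into the table's warehouse directory
hive> LOAD DATA INPATH '/user/me/pv_users.txt' INTO TABLE pv_users;                   -- moves a file already in HDFS into the table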
DDL (Contd.)
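Typical DDL statements in this vein (a minimal sketch; table and column names are hypothetical):
hive> ALTER TABLE managed_table RENAME TO renamed_table;
hive> ALTER TABLE renamed_table ADD COLUMNS (extra STRING);
hive> DROP TABLE renamed_table;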
Query Operations
Hive QL – Group By
pv_users:
pageid  age
1       25
2       25
1       32
2       25

Result:
pageid  age  count
1       25   1
2       25   2
1       32   1

SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By with Distinct
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

Result:
pageid  count_distinct_userid
1       2
2       1

SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;
Hive QL – Inner Join
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

Result:
pageid  age
1       25
2       25
1       32

• SQL:
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Outer Joins
• Left Outer Join
SELECT pv.pageid, u.age FROM page_view pv LEFT OUTER
JOIN user u ON (pv.userid = u.userid);

• Right Outer Join
SELECT pv.pageid, u.age FROM page_view pv RIGHT OUTER
JOIN user u ON (pv.userid = u.userid);
• Full Outer Join
SELECT pv.pageid, u.age FROM page_view pv FULL OUTER
JOIN user u ON (pv.userid = u.userid);
Semi Join
• Semi Joins
– Hive does not support IN subqueries, so the following is not allowed:
hive> SELECT * FROM things WHERE things.id IN (SELECT id FROM sales);
– The solution is a left semi join:
hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
Exporting Data From HIVE
Exporting data from HIVE into HDFS
Files, Tables and Local Files
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
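The exported output can then be checked from the bash shell (paths as in the query above):
$ hadoop fs -ls /user/facebook/tmp/pv_age_sum.dir    # directory written to HDFS
$ cat /home/me/pv_age_sum.dir/*                      # directory written to the local filesystem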
