
WELCOME

BIG DATA ANALYTICS

BY

SAIYED FAIAYAZ WARIS


ASSISTANT PROFESSOR
DEPARTMENT OF CSE
VFSTR DEEMED TO BE UNIVERSITY
UNIT-IV (HIVE)

• Introduction to Hive, Hive Architecture, Hive Data Types, Hive File
Format, Hive Query Language (HQL), User-Defined Functions (UDF) in Hive.
LEARNING OBJECTIVES AND LEARNING OUTCOMES

Introduction to Hive

Learning Objectives                      Learning Outcomes
1. To study the Hive architecture        a) To understand the Hive architecture.
2. To study the Hive file formats        b) To create databases and tables, and to execute
                                            data manipulation language statements on them.
3. To study the Hive Query Language      c) To differentiate between static and dynamic partitions.
                                         d) To differentiate between managed and external tables.
INTRODUCTION TO HIVE
WHY HIVE?
• Built on HiveQL (HQL), a SQL-like query language.

• Interprets HiveQL and generates MapReduce jobs that run on the cluster.

• Enables easy data summarization, ad-hoc reporting and querying, and
analysis of large volumes of data.
WHAT IS HIVE?

Hive is a data warehousing tool built on top of Hadoop and used to query
structured data. Facebook created the Hive component to manage their
ever-growing volumes of data.
Hive makes use of the following:
1. HDFS for storage
2. MapReduce for execution
3. An RDBMS for storing metadata
MAPREDUCE VERSUS HIVE
• All queries are converted to MapReduce jobs for execution, so why
can't we write the MapReduce jobs ourselves?
• Understanding the internals of the Hadoop framework is a must for
writing MapReduce jobs.
• An engineer with SQL knowledge can quickly write Hive scripts and
get the result.
HIVE ADVANTAGES
The advantages of using Hive are:

 It can be used as an ETL tool.

 It provides querying and analysis capabilities.

 It can handle large datasets.

 It offers SQL (filter, join, group by) on top of Map and Reduce, as in the sketch below.
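
For instance, a minimal HiveQL sketch of filter, join, and group by, using the
customer and txnrecords tables that are created later in this unit (the amount
predicate is illustrative); each clause is compiled into MapReduce stages:

hive> select c.profession, sum(t.amount) as total_spent
    > from customer c join txnrecords t on c.custno = t.custno  -- join
    > where t.amount > 50                                       -- filter
    > group by c.profession;                                    -- group by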


WHERE NOT TO USE HIVE

• Hive should not be used:

 If the data volume does not cross a few gigabytes.

 If the data has no schema.

 If we need a response in seconds, i.e. for low-latency applications.

 If an RDBMS solves the problem, don't invest time in Hive.
HIVE FEATURES
Similarity with SQL:
 Hive queries are very similar to RDBMS SQL queries.
 The query language is called HQL or HiveQL and is based on the SQL-92 framework.
Difference from SQL:
Hive queries execute on Hadoop infrastructure rather than a traditional
database.
This allows Hive to handle huge datasets.
Internally, a Hive query executes via a series of automatically
generated MapReduce jobs.
HIVE ARCHITECTURE
[Slide: Hive architecture diagram showing the Hive interfaces, Driver,
Compiler, Metastore, and execution engine on top of Hadoop.]
JOB EXECUTION INSIDE HIVE
STEPS:
• Execute Query - A Hive interface such as the Command Line or Web UI sends the
query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
• Get Plan - The driver takes the help of the query compiler, which parses the
query to check the syntax and build the query plan, i.e. the requirements of the query.
• Get Metadata - The compiler sends a metadata request to the Metastore (any
database).
• Send Metadata - The Metastore sends metadata as a response to the compiler.
• Send Plan - The compiler checks the requirements and resends the plan to the
driver. Up to here, the parsing and compiling of the query is complete.
• Execute Plan - The driver sends the execution plan to the execution engine.
• Execute Job - Internally, the execution process is a MapReduce job.
The execution engine sends the job to the JobTracker, which is in the Name node,
and it assigns this job to the TaskTracker, which is in the Data node. Here, the
query executes as a MapReduce job.

• Metadata Ops - Meanwhile, during execution, the execution engine can execute
metadata operations with the Metastore.

• Fetch Result - The execution engine receives the results from the Data nodes.
HIVEQL TO MAPREDUCE
[Diagram: a data analyst submits SELECT COUNT(1) FROM Sales; to the Hive
framework, which turns it into an MR job instance over the Sales Hive table;
mappers emit (rowcount, 1) pairs that are aggregated into (rowcount, N).]
HIVE DATA TYPES
Numeric Data Types
TINYINT    1-byte signed integer
SMALLINT   2-byte signed integer
INT        4-byte signed integer
BIGINT     8-byte signed integer
FLOAT      4-byte single-precision floating-point number
DOUBLE     8-byte double-precision floating-point number

String Types
STRING
VARCHAR    Only available starting with Hive 0.12.0
CHAR       Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (")

Miscellaneous Types
BOOLEAN
BINARY     Only available starting with Hive 0.8.0
HIVE DATA TYPES
Collection Data Types

STRUCT   Similar to a 'C' struct. Fields are accessed using dot notation.
         E.g.: struct('John', 'Doe')

MAP      A collection of key-value pairs. Fields are accessed using [] notation.
         E.g.: map('first', 'John', 'last', 'Doe')

ARRAY    An ordered sequence of elements of the same type. Fields are accessed
         using an array index.
         E.g.: array('John', 'Doe')
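
A minimal sketch of how these collection types can be declared and accessed
(the employees table and its columns are illustrative, not from the slides):

hive> create table employees (
    >   name    STRUCT<first:STRING, last:STRING>,
    >   phones  MAP<STRING, STRING>,
    >   skills  ARRAY<STRING>
    > );
hive> select name.first,       -- dot notation for STRUCT
    >        phones['home'],   -- [] notation with a key for MAP
    >        skills[0]         -- [] notation with an index for ARRAY
    > from employees;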
HIVE FILE FORMAT

Why is a file format required?

• File format: the way in which information is stored or encoded in a
computer file.
• In HDFS it takes a considerable amount of time to read data and write it
back to a location in HDFS.
• File formats are designed to speed up the read and write process, and
they also provide compression support.
TYPES OF HIVE FILE FORMATS
• Text File:
The default file format is the text file.

• Sequential File:
Sequential files are flat files that store binary key-value pairs.

• RCFile (Record Columnar File):
 RCFile stores the data in a column-oriented manner, which ensures
that aggregation is not an expensive operation.
 It is very useful for data analytics.
• ORC File (Optimized Row Columnar file):
Optimizes data storage and increases performance.
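
A short sketch of how a storage format is chosen at table-creation time
(the table names and columns are illustrative):

hive> create table sales_text (id int, amount double)
    > row format delimited fields terminated by ','
    > stored as textfile;   -- the default format

hive> create table sales_orc (id int, amount double)
    > stored as orc;        -- columnar, compressed storage

hive> insert overwrite table sales_orc select * from sales_text;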
TEXT FILE FORMAT
[Slide: example of data stored in the text file format.]
SEQUENTIAL FILE FORMAT
[Slide: example of data stored in the sequential file format.]
HIVE QUERY LANGUAGE (HQL)

The Hive Query Language (HiveQL) is a query language for Hive to
process and analyze structured data in a Metastore. It can:

1. Create and manage tables and partitions.

2. Support various relational, arithmetic, and logical operators.

3. Evaluate functions.

4. Download the contents of a table to a local directory, or the results of
queries to an HDFS directory (see the sketch below).
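A minimal sketch of point 4, exporting query results, using the txnrecords
table created later in this unit (the export paths are illustrative; the ROW
FORMAT clause on this statement is supported from Hive 0.11.0):

hive> insert overwrite local directory '/tmp/txn_export'
    > row format delimited fields terminated by ','
    > select * from txnrecords;            -- writes to the local file system

hive> insert overwrite directory '/user/cloudera/txn_export'
    > select * from txnrecords;            -- writes to an HDFS directory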


CREATE DATABASE STATEMENT
• Create Database is a statement used to create a database in Hive. A
database in Hive is a namespace or a collection of tables.
hive> CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
or
hive> CREATE SCHEMA userdb;

Other database statements:

hive> SHOW DATABASES;

hive> DROP DATABASE journaldev;

hive> DESCRIBE DATABASE journaldev;
HQL: CREATING TABLES

• With an active database present, we can create some tables in it. To do this,
first switch to the DB you want to use:

• hive> use journaldev;

Hive deals with two types of table structures, internal and external tables,
depending on the loading and the design of the schema in Hive.

Note: if the data to be processed is available in the local file system, then we use an internal table.
• There are two types of tables:
Managed (internal) table
External table
Internal or Managed table:
By default, any table you create is a managed table.
hive> show tables;
trnxrecords
customer
hive> describe formatted trnxrecords;
 It is going to say that it is a MANAGED_TABLE.
A managed table means the table is managed by Hive.
Location of the table: /user/hive/warehouse/vignan.db/trnxrecords.
• To create an internal (managed) table:
• hive> CREATE TABLE guruhive_internaltable (id INT, Name STRING)
• ROW FORMAT DELIMITED
• FIELDS TERMINATED BY '\t';
• Load the data into the internal table:
hive> LOAD DATA INPATH '/user/guru99hive/data.txt' INTO TABLE
guruhive_internaltable;
• Display the content of the table:
hive> SELECT * FROM guruhive_internaltable;
To drop the internal table:
hive> DROP TABLE guruhive_internaltable;
EXTERNAL TABLE

• The data is already available in HDFS; the table is created over that HDFS data.

When to choose an external table:

• If the data to be processed is available in HDFS.

• Useful when the files are also being used outside of Hive.

• To create an external table:
hive> CREATE EXTERNAL TABLE guruhive_external (id INT, Name
STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/cloudera/guruhive_external';
• To load data:
hive> LOAD DATA INPATH '/user/cloudera/data.txt' INTO TABLE
guruhive_external;
To display the content of the table:
hive> SELECT * FROM guruhive_external;
To drop the external table:
hive> DROP TABLE guruhive_external;
DIFFERENCE BETWEEN MANAGED TABLE
AND EXTERNAL TABLE

1. There is no need to specify the location for a managed table, and if you drop
the managed table, both the data and the table (metadata) are deleted.
hive> drop table guruhive_internaltable; // internal table
2. We need to specify the location for an external table, and if you drop the
table, only the table (metadata) is dropped; the data remains available
(a quick verification is sketched below).
hive> drop table guruhive_external; // external table
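
A quick way to verify this difference, sketched with hdfs commands against
the paths used in the examples above:

$ hdfs dfs -ls /user/hive/warehouse/guruhive_internaltable
# after DROP TABLE guruhive_internaltable: the path is gone (data deleted)

$ hdfs dfs -ls /user/cloudera/guruhive_external
# after DROP TABLE guruhive_external: the data files are still listed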
ANALYZING DATA USING HIVE
• Let us take some transactional data which is stored on the Desktop of a Linux
system.
• $ cd Desktop
• Desktop$ vi txnsl.txt
Data inside txnsl.txt:
• 00000004,20-09-2020,40002613,98.81,Teamsport,Fieldhockey,Guntur,Andhrapradesh,Credit

Fields: Transaction Id, Date of Transaction, Customer Id, Amount Spent in
dollars, Category of Sport, Product name, City, State, Credit

Just imagine that we have 50,000 records in the file txnsl.txt.
• Let us also take some customer data which is placed on the Desktop of the Linux system.
• $ cd Desktop
• Desktop$ vi custs.txt
• Data inside custs.txt:
4000001,pavan,kumar,55,Pilot

Fields: Customer ID, First name, Last name, Age, Profession

• 50002341,Siva,Prasad,32,Lecturer
• 6234567,Nish,kala,22,Software engineer
Imagine that we have around 55,000 records like this in
the file custs.txt.
CREATE AND LOAD DATA INTO THE TXNRECORDS TABLE

• Just type hive and hit Enter:

• hive> ....... Hive shell (the Command Line Interface of Hive)
• hive> create database vignan;
• hive> show databases;
vignan
Create a table:
hive> use vignan;
hive> create table txnrecords(txno int, txndate string, custno int,
amount double, category string, product string, city string,
state string, spendby string)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INTO THE TXNRECORDS TABLE
• hive> load data local inpath '/home/cloudera/Desktop/txnsl.txt' into
table txnrecords;
• To display records from the table:
• hive> select * from txnrecords limit 20;
Note:
Whenever we install Hive on any platform, it creates a folder
called warehouse in HDFS.
We can see this using the command:
$ hdfs dfs -ls /user/hive ..... hit this command
/user/hive/warehouse ......... the warehouse folder will be listed.
Inside the warehouse we can find all the databases, so we can say that
the warehouse is the collection of all databases.
• Whenever we create a database, the database is created at the
following path:
• /user/hive/warehouse/vignan.db

warehouse        (folder/directory)
  vignan.db      (folder/directory)
    txnrecords   (folder/directory)
      txnsl.txt  (file)
• We can see this information using the commands:

• $ hdfs dfs -ls /user/hive/warehouse/vignan.db

• /user/hive/warehouse/vignan.db/txnrecords ---------- table

• $ hdfs dfs -ls /user/hive/warehouse/vignan.db/txnrecords

• /txnsl.txt ............. file

Note: Hive is just giving a projection;

the actual data is stored in Hadoop (HDFS).
NOW CONTINUE THE PROBLEM ............
• Create and load data into the customer table:
Create table:
hive> create table customer (custno string, firstname string, lastname string, age int,
profession string)
row format delimited
fields terminated by ','
stored as textfile;
Load table:
hive> load data local inpath '/home/cloudera/Desktop/custs.txt' into table customer;
hive> select count(*) from txnrecords; ...... Here it will fire MapReduce jobs
and display 50,000 (the record count).
CREATE ONE MORE TABLE WHICH
CAN STORE THE RESULT OF THE EXISTING TABLES
AFTER A JOIN OPERATION
hive> create table out1 (custno int, firstname string, age int,
profession string, amount double, product string)
row format delimited
fields terminated by ','
stored as textfile;
PERFORM AN INNER JOIN OPERATION
(CUSTOMER JOIN TXNRECORDS) AND LOAD THE
RESULT INTO THE NEW TABLE OUT1

[Diagram: the customer table and the txnrecords table are joined, and the
result is inserted into the out1 table.]


• hive> insert overwrite table out1

select a.custno, a.firstname, a.age, a.profession, b.amount,

b.product

from customer a join txnrecords b on a.custno = b.custno;

hive> select * from out1 limit 20;

• 00000004,Siva,32,Lecturer,98.8,book ....... 20 records like

this will be displayed.
To create one more table for classification of customers:
hive> create table out2 (custno int, firstname string, age int,
profession string, amount double, product string, level string)
row format delimited
fields terminated by ','
stored as textfile;
Load data:
hive> insert overwrite table out2
select *, case
when age < 30 then 'low'
when age >= 30 and age < 50 then 'middle'
when age >= 50 then 'old'
else 'other' end
from out1;
• To display content from the table:
• hive> select * from out2 limit 20;
4007024 rajnikanth 59 actor 560.33 weightlifting old
20 records like this will be displayed.
Create another table to store the final result:
hive> create table out3 (level string, amount double)
row format delimited
fields terminated by ','
stored as textfile;
Load data:
Load data into out3 from out2 after performing an aggregation:
hive> insert overwrite table out3
select level, sum(amount) from out2
group by level;
FINAL ANALYSIS RESULT

• select * from out3;

• low      725222331.2
• middle   567890023.65
• old      34567.89
PARTITIONING IN HIVE
• Hive organizes tables into partitions. Partitioning is a way of dividing a table into
related parts based on the values of partitioning columns such as date, city, and
department. Using partitions, it is easy to query a portion of the data.

Why do we need partitioning?

Let us understand with an example.
Let us create a table called salesdata and load sales data month by month:
/user/hive/warehouse/vignan.db/salesdata/January.txt
/user/hive/warehouse/vignan.db/salesdata/Feb.txt
/user/hive/warehouse/vignan.db/salesdata/march.txt
/user/hive/warehouse/vignan.db/salesdata/Apr.txt
• Now if you issue this query:
select * from salesdata where month = 'apr';
the search time may increase because a particular file has to be found. To make
the query fast, we use partitioning to solve the above problem.
Partitioning the table based on the month column gives:
/user/hive/warehouse/vignan.db/salesdata/JAN/January.txt
/user/hive/warehouse/vignan.db/salesdata/FEB/Feb.txt
/user/hive/warehouse/vignan.db/salesdata/MAR/march.txt
/user/hive/warehouse/vignan.db/salesdata/APR/Apr.txt
select * from salesdata where month = 'apr'; -- here the query is fast,
as in the sketch below.
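
A minimal sketch of how the salesdata table above might be declared with a
partition column (the columns other than month are illustrative):

hive> create table salesdata (txnno int, amount double, product string)
    > partitioned by (month string)
    > row format delimited fields terminated by ',';

hive> load data local inpath '/home/cloudera/Desktop/Apr.txt'
    > into table salesdata partition (month='apr');

-- Only the month=apr sub-directory is scanned, so the query is fast:
hive> select * from salesdata where month = 'apr';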
• There are two types of partitioning:
• Static partitioning: we need to create the partitions manually and load the data.
• Dynamic partitioning: Hive automatically detects the data and creates the partitions.
• By default, Hive allows static partitioning.
Why do we need static partitioning?
Let us take an example:
Ram,12345,[email protected]
Srikanth,237789,[email protected]
Prasad,65467,[email protected]
Ramesh,3456,[email protected]
Here I create two partitions based on country:
INDIA ------ PARTITION {Ram, Srikanth}
USA -------- PARTITION {Prasad, Ramesh}
If we don't have a strict column for India and USA, in that case we can partition statically.
Static partitioning:
Create a table:

• hive> create table partition_user(firstname varchar(64), lastname
varchar(64), address string, city varchar(64), post string, phone1
varchar(64), phone2 string, email string, web string) partitioned by
(country varchar(64), state varchar(64)) row format delimited fields
terminated by ',';
Load data into the table:
hive> load data local inpath '/home/cloudera/Desktop/static.txt' into
table partition_user partition (country='India', state='Andhrapradesh');
The loaded partition can be listed as shown below.
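
A small follow-up sketch to confirm the partition was created:

hive> show partitions partition_user;
country=India/state=Andhrapradesh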
DYNAMIC PARTITIONING
Before doing dynamic partitioning we need to enable it:
• hive> set hive.exec.dynamic.partition=true;
• hive> set hive.exec.dynamic.partition.mode=nonstrict;
• hive> set hive.exec.max.dynamic.partitions.pernode=1000;
To create the table:
hive> create table dynamic_user(firstname varchar(64), lastname
varchar(64), address string, country varchar(64), city varchar(64), state
varchar(64), post string, phone1 varchar(64), phone2 string, email string, web
string) row format delimited fields terminated by ',' stored as textfile;
• Load the data into the table:

hive> load data local inpath
'/home/cloudera/Desktop/customer_data.txt' into table dynamic_user;

hive> select firstname, phone1, city from dynamic_user where
country='india' and state='andhrapradesh' order by city limit
5; ----------- this query is without partitioning.

This may take around 85.025 seconds to get the result.

Create a table for the dynamic partitions:
• hive> create table partition_user1(firstname varchar(64), lastname
varchar(64), address string, city varchar(64), post string, phone1
varchar(64), phone2 string, email string, web string) partitioned by
(country varchar(64), state varchar(64)) row format delimited fields
terminated by ',';
• hive> insert into table partition_user1 partition(country, state) select
firstname, lastname, address, city, post, phone1, phone2, email, web, country, state
from dynamic_user;
Re-running the query against the partitioned table is sketched below.
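
For comparison, a sketch of the same query against the dynamically
partitioned table; with partition pruning it should respond much faster than
the ~85 seconds above (the exact timing depends on the cluster):

hive> select firstname, phone1, city from partition_user1
    > where country='india' and state='andhrapradesh'
    > order by city limit 5;  -- scans only the india/andhrapradesh partition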
BUCKETING
• Bucketing in Hive is a data-organizing technique. It is similar to
partitioning in Hive, with the added functionality that it divides large
datasets into more manageable parts known as buckets. So, we can use
bucketing in Hive when the implementation of partitioning becomes
difficult. However, we can also divide partitions further into buckets.
WORKING OF BUCKETING IN HIVE:
EXAMPLE

• hive> create database bucketedb;

• hive> use bucketedb;

• hive> create table simple_table(id int, firstname string, lastname string)

> row format delimited fields terminated by ','

> stored as textfile;
• Let us take a file which contains some data, say the file name is Ivcse.txt:
1,NAFEEZ BASHA,MOHAMMAD
2,KATRAGUNTA,PAVAN KUMAR
3,AMULOTHU,MARUTHIKUMAR
4,CHERUKURI,KAVYA
5,CHOPPARAPU,SRAVANTHI
6,EMANI,CHANDRIKA
7,GOGINENI,JEEVANA
8,GORANTLA,NAVYA
9,GORIPARTHY,POOJITHA
10,JARABANI,HARI KRISHNA
11,KILARI,BHAVANA
12,KODALI,TEJASRI
13,KOLLIPARA,BHAVISHYA
14,KOLLIPARANAGA,RAJASRI KAVYA
15,KONKA,CHANDANA
16,MALLINA,RAMYASRI
17,MUNGARA,SUPRAJA
18,PABBISETTY MINISH,VENKATA ADINARAYANA
19,PAGOLU,KARTHIKEYA
20,PAIDI,PADMAJA
• Loading the data:
• hive> load data local inpath '/home/cloudera/Desktop/Ivcse.txt' into table
simple_table;
• hive> create table bucket_cse(id int, firstname string, lastname string)
> clustered by (id) into 5 buckets

> row format delimited fields terminated by ','

> stored as textfile;

hive> insert overwrite table bucket_cse

select * from simple_table;

Analyze this query and find which data is stored in which bucket.

• $ hdfs dfs -cat
/user/hive/warehouse/bucketedb.db/bucket_cse/000000_0
• We can see the data inside bucket 0.
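
How the rows are assigned: Hive hashes the clustering column and takes it
modulo the bucket count; for an int column the hash is effectively the value
itself, so here bucket = id % 5. A sketch to check this, using the built-in
hash() function:

hive> select id, hash(id) % 5 as bucket from simple_table limit 5;
-- id=1 lands in bucket 1, ..., id=5 lands in bucket 0 (file 000000_0)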
USER-DEFINED FUNCTIONS IN HIVE

• User-Defined Functions, also known as UDFs, allow you to create
custom functions to process records or groups of records. Hive
comes with a comprehensive library of functions.
• There are, however, some omissions, and some specific cases for
which UDFs are the solution.
BUILT-IN FUNCTIONS IN HIVE
• hive> SHOW FUNCTIONS; // show all built-in functions in Hive
• hive> DESCRIBE FUNCTION function_name; // display a particular
function
• hive> DESCRIBE FUNCTION concat;
For example:
select concat(fname, lname) as fullname from emp;
HOW TO CREATE A USER-DEFINED FUNCTION
IN HIVE
• Let us take a problem to understand this:
Suppose you have data in a table like this:
21 $$murali Krishna
22 $himsh kiran
Here we want to remove the leading $ characters from the name; for that we
need to create a user-defined function in Hive.
For that we write a Java program and add its jar file to the Hive shell for use.
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveChar extends UDF {
    private final Text colValue = new Text();

    public Text evaluate(Text str, String charToRemove) {
        if (str == null)
            return null;
        // Strip the given character from the ends of the string
        colValue.set(StringUtils.strip(str.toString(), charToRemove));
        return colValue;
    }
}
• hive> add jar /home/Hadoop/downloads/hiveUDF.jar;
• hive> create temporary function removecharacter
as 'RemoveChar';
• hive> select removecharacter(fn, '$') from Temporary_Table;
murali // $ is removed using the UDF
HIVE FUNCTION TYPES

• Function types:

1. Standard functions: reverse(), ucase(), round(), floor()

2. Aggregate functions: sum(), avg(), max(), min()

3. Table-generating functions: explode(), array()


• create table arrays(id1 int, id2 array<string>, id3 array<int>)
• row format delimited
• fields terminated by '\t'
• collection items terminated by ',';

Suppose you have data which looks like this:

1    1,2    1,2,3
2    3,4    4,5
3    5,6    6,7
4    7,8
• hive> select * from arrays;
• 1 ["1","2"] [1,2,3]
• 2 ["3","4"] [4,5]
• 3 ["5","6"] [6,7]
• 4 ["7","8"] NULL
• select explode(id2) as id from arrays;
• 1
• 2
• 3
• 4
• 5
• 6
• 7
• 8
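
explode() on its own cannot be mixed with other columns in the select list;
for that, Hive pairs it with LATERAL VIEW. A minimal sketch using the arrays
table above:

hive> select id1, id from arrays
    > lateral view explode(id2) exploded as id;
-- 1 1
-- 1 2
-- 2 3
-- 2 4
-- ...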
