module 4 HIVE1ppt
Hive shell commands and their descriptions:
set
    Prints a list of configuration variables that are overridden by the user or Hive.
add FILE[S] <filepath> <filepath>*
add JAR[S] <filepath> <filepath>*
add ARCHIVE[S] <filepath> <filepath>*
    Adds one or more files, jars, or archives to the list of resources in the distributed cache.
add FILE[S] <ivyurl> <ivyurl>*
add JAR[S] <ivyurl> <ivyurl>*
add ARCHIVE[S] <ivyurl> <ivyurl>*
    As of Hive 1.2.0, adds one or more files, jars, or archives to the list of resources in the distributed cache using an Ivy URL of the form ivy://group:module:version?query_string.
list FILE[S]
list JAR[S]
list ARCHIVE[S]
    Lists the resources already added to the distributed cache.
list FILE[S] <filepath>*
list JAR[S] <filepath>*
list ARCHIVE[S] <filepath>*
    Checks whether the given resources are already added to the distributed cache or not.
delete FILE[S] <filepath>*
delete JAR[S] <filepath>*
delete ARCHIVE[S] <filepath>*
    Removes the resource(s) from the distributed cache.
delete FILE[S] <ivyurl> <ivyurl>*
delete JAR[S] <ivyurl> <ivyurl>*
delete ARCHIVE[S] <ivyurl> <ivyurl>*
    As of Hive 1.2.0, removes the resource(s) which were added using the <ivyurl> from the distributed cache.
dfs <dfs command>
    Executes a dfs command from the Hive shell.
<query string>
    Executes a Hive query and prints results to standard output.
--hiveconf property=value
    Use value for the given configuration property. Properties that are
    listed in hive.conf.restricted.list cannot be reset with hiveconf.
    Usage: beeline --hiveconf prop1=value1
--hivevar name=value
    Hive variable name and value. This is a Hive-specific setting in
    which variables can be set at the session level and referenced in
    Hive commands or queries. Usage: beeline --hivevar var1=value1
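A session variable set with hivevar is referenced inside a query as ${hivevar:name}, and the value is substituted into the query text before it runs. The Python sketch below only models that textual substitution; the function name substitute_hivevars and the sample query are hypothetical and are not part of Hive or Beeline.

```python
import re

def substitute_hivevars(query, hivevars):
    """Replace ${hivevar:name} references with session values (illustrative model only)."""
    def repl(match):
        name = match.group(1)
        # In this sketch, unknown references are left untouched.
        return hivevars.get(name, match.group(0))
    return re.sub(r"\$\{hivevar:([A-Za-z_][A-Za-z0-9_]*)\}", repl, query)

# Models a session started with: beeline --hivevar var1=value1
query = "SELECT * FROM some_table WHERE some_col = '${hivevar:var1}'"
resolved = substitute_hivevars(query, {"var1": "value1"})
print(resolved)
```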
• Comparison with Traditional Databases
• Hive is intended to manage large-scale data analytics and querying on top
of the Hadoop environment, while RDBMS is generally used to manage
structured databases.
• What is RDBMS?
• RDBMS stands for Relational Database Management System. An RDBMS is
a type of database management system (DBMS) that is specifically designed
for relational databases, so it is a subset of DBMS. A relational database
stores data in a structured format using rows and columns, and that
structured form is known as a table. An RDBMS follows a set of rules
known as Codd's rules.
• Characteristics of RDBMS
• Structured Storage: Data is stored in a tabular format with rows and
columns.
• Fixed Schema: The structure of each table is defined before data is
loaded and is enforced when data is written; it does not change
dynamically at runtime.
• Data Normalization: To lessen dependencies and redundancies, data
must be kept in a normalized format.
• SQL-Based: Data is defined and altered using Structured Query
Language (SQL).
• What is Hive?
• Hive is a data warehouse software system that provides data query and
analysis. Hive gives a SQL-like interface to query data stored in various
databases and file systems that integrate with Hadoop. Hive helps with
querying and managing large datasets. It is commonly used as an ETL
tool in the Hadoop ecosystem.
• Characteristics of Hive
• Data Warehouse Tool: Designed to manage and analyze large datasets
quickly.
• Schema Flexibility: The schema is applied when data is read (schema on
read), so it can be defined or changed after the data has been stored.
• Can handle a combination of structured, semi-structured, and unstructured
data. Supports both normalized and denormalized data.
• HQL-Based: Makes use of the Hive Query Language (HQL), a query
language comparable to SQL that runs on top of Hadoop's distributed storage.
• The Hive Query Language (HiveQL) is the query language Hive uses to
process and analyze structured data; the table definitions are kept in the
Metastore.
• How to use the SELECT statement with a WHERE clause
• The SELECT statement is used to retrieve data from a table. The WHERE
clause works like a condition: it filters the data and returns only the
rows that satisfy it. The condition is an expression built from the
built-in operators and functions.
• Syntax
• Given below is the syntax of the SELECT query:
• SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM
table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING
having_condition] [CLUSTER BY col_list | [DISTRIBUTE BY col_list]
[SORT BY col_list]] [LIMIT number];
• Example
• Let us take an example for SELECT…WHERE clause. Assume we have
the employee table as given below, with fields named Id, Name, Salary,
Designation, and Dept. Generate a query to retrieve the employee details
who earn a salary of more than Rs 30000.
| ID | Name | Salary | Designation | Dept |
| 1201 | Gopal | 45000 | Technical manager | TP |
| 1202 | Manisha | 45000 | Proofreader | PR |
| 1203 | Masthanvali | 40000 | Technical writer | TP |
| 1204 | Krian | 40000 | Hr Admin | HR |
| 1205 | Kranthi | 30000 | Op Admin | Admin |
• The following query retrieves the employee details using the above
scenario:
• hive> SELECT * FROM employee WHERE salary>30000;
• On successful execution of the query, you get to see the following response:
| ID | Name | Salary | Designation | Dept |
| 1201 | Gopal | 45000 | Technical manager | TP |
| 1202 | Manisha | 45000 | Proofreader | PR |
| 1203 | Masthanvali | 40000 | Technical writer | TP |
| 1204 | Krian | 40000 | Hr Admin | HR |
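The filtering done by the WHERE clause can be mirrored in a few lines of Python. This is only a sketch of the semantics using the employee rows listed above; Hive itself runs the query as a distributed job, and the lowercase dictionary keys are chosen here for illustration.

```python
# Rows of the employee table from the example above.
employee = [
    {"id": 1201, "name": "Gopal", "salary": 45000, "designation": "Technical manager", "dept": "TP"},
    {"id": 1202, "name": "Manisha", "salary": 45000, "designation": "Proofreader", "dept": "PR"},
    {"id": 1203, "name": "Masthanvali", "salary": 40000, "designation": "Technical writer", "dept": "TP"},
    {"id": 1204, "name": "Krian", "salary": 40000, "designation": "Hr Admin", "dept": "HR"},
    {"id": 1205, "name": "Kranthi", "salary": 30000, "designation": "Op Admin", "dept": "Admin"},
]

# Models: SELECT * FROM employee WHERE salary > 30000;
result = [row for row in employee if row["salary"] > 30000]

for row in result:
    print(row["id"], row["name"], row["salary"], row["designation"], row["dept"])
```

Row 1205 is excluded because its salary equals 30000 and the condition is a strict comparison.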
Querying Data
• Hive offers a SQL-like querying language for ETL purposes.
• HQL offers a SQL-like environment for working with Hive tables, databases, and
queries.
• Several types of clauses can be combined in a Hive query to perform
different kinds of data manipulation and querying.
• For connectivity with clients outside the cluster, Hive supports JDBC
connections.
• Hive queries provide the following capabilities:
• Data modelling, such as creation of databases, tables, etc.
• ETL functionalities such as extraction, transformation, and loading of data into
tables.
• A faster querying tool on top of Hadoop.
• User-specific custom scripts for ease of coding.
• Different clauses used in Hive
• Cluster by
• Distribute by
• Group by
• Order by
• Sort by
• Group by query
• The GROUP BY clause groups the rows of a Hive table by the values of
the columns listed after GROUP BY. The query returns one result row for
each distinct value (or combination of values) of those columns, usually
together with an aggregate such as count(*).
• Eg:
• SELECT Department, count(*) FROM employees_guru GROUP BY
Department;
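What the GROUP BY query computes can be sketched in Python with a counter keyed by department. The contents of employees_guru are not shown in these notes, so the rows below are hypothetical sample data.

```python
from collections import Counter

# Hypothetical rows of employees_guru: (Name, Department).
employees_guru = [
    ("Gopal", "TP"),
    ("Manisha", "PR"),
    ("Masthanvali", "TP"),
    ("Krian", "HR"),
    ("Kranthi", "Admin"),
]

# Models: SELECT Department, count(*) FROM employees_guru GROUP BY Department;
counts = Counter(department for _, department in employees_guru)

for department, n in counts.items():
    print(department, n)
```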
• SORT BY
• - The SORT BY clause sorts the data per reducer (not globally).
• - If we have N number of reducers, we can have N number of sorted
output files
• - These files can have overlapping data ranges.
• SELECT Col1, Col2,……ColN FROM TableName SORT BY Col1 <ASC
| DESC>, Col2 <ASC | DESC>, …. ColN <ASC | DESC>
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales SORT BY SalesYear;
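The per-reducer behaviour of SORT BY can be modelled in Python. This is a sketch under stated assumptions: the tbl_Sales rows are hypothetical, two reducers are used, and rows are spread across reducers round-robin as a stand-in for however Hive actually assigns them when no DISTRIBUTE BY is given.

```python
# Hypothetical rows of tbl_Sales: (SalesYear, Amount).
sales = [(2017, 300), (2015, 100), (2016, 250), (2017, 50), (2015, 400), (2016, 150)]

NUM_REDUCERS = 2  # assumption for this sketch

# Assumption: rows reach reducers in round-robin order.
reducers = [[] for _ in range(NUM_REDUCERS)]
for i, row in enumerate(sales):
    reducers[i % NUM_REDUCERS].append(row)

# SORT BY SalesYear: each reducer sorts only its own rows.
output_files = [sorted(rows, key=lambda r: r[0]) for rows in reducers]

for n, rows in enumerate(output_files):
    print("file", n, rows)
```

Each output file is sorted on its own, but both files contain the years 2015 to 2017, which is what overlapping data ranges means here.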
• ORDER BY
• - The ORDER BY clause orders the data globally.
• - Because it ensures the global ordering of the data, all the data needs to
pass through a single reducer.
• - As a result, the ORDER BY clause outputs one single file only.
• - If the dataset is large, bringing all the data to one single reducer can
impact performance.
• So, ORDER BY should be avoided on large datasets in Hive queries.
• - If global ordering is required, we can apply ORDER BY once, on the final dataset.
• SELECT Col1, Col2,……ColN FROM TableName ORDER BY Col1
<ASC | DESC>, Col2 <ASC | DESC>, …. ColN <ASC | DESC>
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales ORDER BY SalesYear;
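The advice to order only the final dataset can be illustrated with a Python sketch: aggregate first, so that only a small result passes through the single global sort. The tbl_Sales rows are hypothetical, and the per-year total is an aggregation chosen for illustration.

```python
# Hypothetical rows of tbl_Sales: (SalesYear, Amount).
sales = [(2017, 300), (2015, 100), (2016, 250), (2017, 50), (2015, 400), (2016, 150)]

# Step 1: reduce the data first (here, total Amount per SalesYear).
totals = {}
for year, amount in sales:
    totals[year] = totals.get(year, 0) + amount

# Step 2: a single global sort on the small final dataset,
# which is what ORDER BY SalesYear does on one reducer.
ordered = sorted(totals.items(), key=lambda item: item[0])

print(ordered)
```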
• DISTRIBUTE BY
• - DISTRIBUTE BY clause is used to distribute the input rows among
reducers.
• - It ensures that all rows for the same key columns are going to the same
reducer.
• For example, the rows (2017, X), (2017, Y), and (2017, Z) share the key
2017, so all of them end up in the same reducer.
• Hence the same key value never appears in more than one output file.
• - However, the DISTRIBUTE BY clause does not sort the data either at the
reducer level or globally.
• - If we have N number of reducers, we can have N number of unsorted
output files
• SELECT Col1, Col2,……ColN FROM TableName DISTRIBUTE BY
Col1, Col2, ….. ColN
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales DISTRIBUTE BY SalesYear;
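A small Python model shows what DISTRIBUTE BY guarantees and what it does not. The rows are hypothetical, and a plain modulo on the key stands in for Hive's hash partitioning, which is an assumption made for this sketch.

```python
# Hypothetical rows of tbl_Sales: (SalesYear, Amount).
sales = [(2017, 300), (2015, 100), (2016, 250), (2017, 50), (2015, 400), (2016, 150)]

NUM_REDUCERS = 2  # assumption for this sketch

# DISTRIBUTE BY SalesYear: the key alone decides the reducer.
# Assumption: key % NUM_REDUCERS stands in for the real hash function.
output_files = [[] for _ in range(NUM_REDUCERS)]
for year, amount in sales:
    output_files[year % NUM_REDUCERS].append((year, amount))

for n, rows in enumerate(output_files):
    print("file", n, rows)
```

All rows for a given year land in one file, but within a file the rows stay in arrival order, so the second file holds 2017, 2015, 2017, 2015 unsorted.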
• CLUSTER BY
• - CLUSTER BY <col> is equivalent to DISTRIBUTE BY <col> + SORT BY
<col>
• - The CLUSTER BY clause distributes the data based on the key column
and then sorts the output data by putting the same key column values
adjacent to each other.
• - So, the output of the CLUSTER BY clause is sorted at the reducer level.
• - If we have N number of reducers, we can have N number of sorted
output files
• - Because it distributes the data based on the key column, it ensures that
each key value appears in only one of the final output files.
• SELECT Col1, Col2,……ColN FROM TableName CLUSTER BY Col1,
Col2, ….. ColN
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales CLUSTER BY SalesYear;
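CLUSTER BY can be modelled in Python as the two steps it combines: distribute by key, then sort within each reducer. As before, the rows are hypothetical and a modulo on the key is an assumed stand-in for Hive's hash partitioning.

```python
# Hypothetical rows of tbl_Sales: (SalesYear, Amount).
sales = [(2017, 300), (2015, 100), (2016, 250), (2017, 50), (2015, 400), (2016, 150)]

NUM_REDUCERS = 2  # assumption for this sketch

# Step 1 (DISTRIBUTE BY SalesYear): the key decides the reducer.
reducers = [[] for _ in range(NUM_REDUCERS)]
for year, amount in sales:
    reducers[year % NUM_REDUCERS].append((year, amount))

# Step 2 (SORT BY SalesYear): each reducer sorts its own rows by the key.
output_files = [sorted(rows, key=lambda r: r[0]) for rows in reducers]

for n, rows in enumerate(output_files):
    print("file", n, rows)
```

Rows with the same SalesYear now sit next to each other inside a single file, and no year is split across files.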
• The SORT BY and ORDER BY clauses are used to define the order of the
output data.
• DISTRIBUTE BY and CLUSTER BY clauses are used to distribute the
data to multiple reducers based on the key columns.
• Built-in functions
• These are functions that are already available in Hive. Once we know
what the application requires, we can call the matching built-in
function directly in our queries instead of writing our own.
• The syntax and types are mentioned in the following section.
• Types of Built-in Functions in HIVE
• Collection Functions
• Date Functions
• Mathematical Functions
• Conditional Functions
• String Functions
• Misc. Functions
• Collection Functions
• These functions operate on collections. A collection is a grouping of
elements (a map or an array); depending on the function, the result is
a single value or an array of elements, as given by its return type.
| Return Type | Function Name | Description |
| INT | size(Map<K,V>) | Returns the number of elements in the map type. |
| INT | size(Array<T>) | Returns the number of elements in the array type. |
| Array<K> | map_keys(Map<K,V>) | Returns an unordered array containing the keys of the input map. |
| Array<V> | map_values(Map<K,V>) | Returns an unordered array containing the values of the input map. |
| Array<T> | sort_array(Array<T>) | Sorts the input array in ascending order and returns it. |
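For readers who know Python, the collection functions in the table have close analogues on dicts and lists. The sketch below is only an analogy with hypothetical sample values; it does not call Hive.

```python
# Hypothetical sample collections.
sample_map = {"b": 2, "a": 1, "c": 3}
sample_array = [3, 1, 2]

size_of_map = len(sample_map)            # like size(Map<K,V>)
size_of_array = len(sample_array)        # like size(Array<T>)
keys = list(sample_map.keys())           # like map_keys(Map<K,V>); order not guaranteed in Hive
values = list(sample_map.values())       # like map_values(Map<K,V>); order not guaranteed in Hive
sorted_array = sorted(sample_array)      # like sort_array(Array<T>)

print(size_of_map, size_of_array, keys, values, sorted_array)
```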
• Date Functions
• These are used to perform Date Manipulations and Conversion of Date
types from one type to another type:
| Function Name | Return Type | Description |
| unix_timestamp() | BIGINT | Returns the current Unix timestamp in seconds. |
| to_date(string timestamp) | string | Returns the date part of a timestamp string. |
| year(string date) | INT | Returns the year part of a date or a timestamp string. |
| quarter(date/timestamp/string) | INT | Returns the quarter of the year for a date, timestamp, or string, in the range 1 to 4. |
| month(string date) | INT | Returns the month part of a date or a timestamp string. |
| hour(string date) | INT | Returns the hour of the timestamp. |
| minute(string date) | INT | Returns the minute of the timestamp. |
| date_sub(string startdate, int days) | string | Subtracts the given number of days from the starting date. |
| current_date | date | Returns the current date at the start of query evaluation. |
| last_day(string date) | string | Returns the last day of the month to which the date belongs. |
| trunc(string date, string format) | string | Returns the date truncated to the unit specified by the format. Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. |
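The date functions in the table can be understood through their Python datetime equivalents. The sketch below uses a hypothetical timestamp string and plain Python arithmetic; the variable names mirror the Hive function names for readability and the code does not call Hive.

```python
import calendar
from datetime import date, datetime, timedelta

# Hypothetical input timestamp string.
ts = datetime.strptime("2017-08-15 13:45:30", "%Y-%m-%d %H:%M:%S")
d = ts.date()

to_date = d.isoformat()                               # to_date(ts)
year = ts.year                                        # year(ts)
quarter = (ts.month - 1) // 3 + 1                     # quarter(ts), in the range 1 to 4
month = ts.month                                      # month(ts)
hour = ts.hour                                        # hour(ts)
minute = ts.minute                                    # minute(ts)
date_sub = (d - timedelta(days=20)).isoformat()       # date_sub(ts, 20)
last_day = date(d.year, d.month,
                calendar.monthrange(d.year, d.month)[1]).isoformat()  # last_day(ts)
trunc_month = d.replace(day=1).isoformat()            # trunc(ts, 'MM')
trunc_year = d.replace(month=1, day=1).isoformat()    # trunc(ts, 'YYYY')

print(to_date, year, quarter, month, hour, minute)
print(date_sub, last_day, trunc_month, trunc_year)
```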
• Mathematical Functions
• These functions are used for mathematical operations. Instead of
creating UDFs, we can use the built-in mathematical functions in Hive.
| Function Name | Return Type | Description |
| round(DOUBLE X) | DOUBLE | Returns the rounded BIGINT value of X. |
| round(DOUBLE X, INT d) | DOUBLE | Returns X rounded to d decimal places. |
| bround(DOUBLE X) | DOUBLE | Returns the rounded BIGINT value of X using HALF_EVEN rounding mode. |
| floor(DOUBLE X) | BIGINT | Returns the maximum BIGINT value that is equal to or less than X. |
| ceil(DOUBLE a), ceiling(DOUBLE a) | BIGINT | Returns the minimum BIGINT value that is equal to or greater than a. |
| rand(), rand(INT seed) | DOUBLE | Returns a random number that is distributed uniformly from 0 to 1. |
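The difference between round and bround is easiest to see on a tie such as 2.5. The Python sketch below models the two rounding modes with the decimal module, assuming round breaks ties upward (half-up) while bround uses HALF_EVEN as the table states. The helper names hive_round and hive_bround are hypothetical, and Python's random generator will not reproduce the numbers Hive's rand(seed) returns.

```python
import math
import random
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def hive_round(x, d=0):
    """Model of round(X, d), assuming half-up tie breaking."""
    step = Decimal(10) ** -d
    return float(Decimal(str(x)).quantize(step, rounding=ROUND_HALF_UP))

def hive_bround(x, d=0):
    """Model of bround(X): ties go to the nearest even digit (HALF_EVEN)."""
    step = Decimal(10) ** -d
    return float(Decimal(str(x)).quantize(step, rounding=ROUND_HALF_EVEN))

print(hive_round(2.5), hive_bround(2.5), hive_bround(3.5))
print(hive_round(3.14159, 2))
print(math.floor(-2.5), math.ceil(-2.5))   # like floor(X) and ceil(X)

r = random.Random(42).random()             # like rand(seed): a value in [0, 1)
print(0 <= r < 1)
```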
• Conditional Functions
• These functions are used to check conditional values.