Module 4: Big Data and Hadoop (Hive Concepts)

• Hive shell
• $HIVE_HOME/bin/hive is a shell utility which can be used to run Hive queries in either interactive or batch mode. HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline, which is a JDBC client based on SQLLine.
• Hive Command Line Options
• To get help, run "hive -H" or "hive --help". Usage (as of Hive 0.9.0):
• usage: hive
Option                        Explanation
-d,--define <key=value>       Variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B
-e <quoted-query-string>      SQL from command line
-f <filename>                 SQL from files
-H,--help                     Print help information
-h <hostname>                 Connect to Hive Server on a remote host
--hiveconf <property=value>   Use value for given property
--hivevar <key=value>         Variable substitution to apply to Hive commands, e.g. --hivevar A=B
-i <filename>                 Initialization SQL file
-p <port>                     Connect to Hive Server on a port number
-S,--silent                   Silent mode in interactive shell
-v,--verbose                  Verbose mode (echo executed SQL to the console)

• Examples
• Example of running a query from the command line:
• $HIVE_HOME/bin/hive -e 'select a.foo from pokes a'
• Example of setting Hive configuration variables:
• $HIVE_HOME/bin/hive -e 'select a.foo from pokes a' --hiveconf hive.exec.scratchdir=/opt/my/hive_scratch --hiveconf mapred.reduce.tasks=1
• Example of dumping data out from a query into a file using silent mode:
• $HIVE_HOME/bin/hive -S -e 'select a.foo from pokes a' > a.txt
• Example of running a script non-interactively from local disk:
• $HIVE_HOME/bin/hive -f /home/my/hive-script.sql
• Example of running a script non-interactively from a Hadoop-supported filesystem (starting in Hive 0.14):
• $HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
• Hive CLI is a legacy tool which had two main use cases. The first is that it served as a thick client for SQL on Hadoop; the second is that it served as a command-line tool for Hive Server (the original Hive server, now often referred to as "HiveServer1"). Hive Server has been deprecated and removed from the Hive code base as of Hive 1.0.0 and replaced with HiveServer2, so the second use case no longer applies. For the first use case, Beeline provides (or is supposed to provide) equal functionality, yet is implemented differently from Hive CLI.
• Ideally, Hive CLI should be deprecated, as the Hive community has long recommended using the Beeline plus HiveServer2 configuration; however, because of the wide use of Hive CLI, it is instead being replaced with a new Hive CLI built on top of Beeline plus an embedded HiveServer2, so that the Hive community only needs to maintain a single code path. In this way, the new Hive CLI is just an alias to Beeline at both the shell script level and the code level. The goal is that no or minimal changes are required for existing user scripts using Hive CLI.
• The hiverc File
• The CLI when invoked without the -i option will attempt to load
$HIVE_HOME/bin/.hiverc and $HOME/.hiverc as initialization files.
• Hive Batch Mode Commands
• When $HIVE_HOME/bin/hive is run with the -e or -f option, it executes SQL
commands in batch mode.
• hive -e '<query-string>' executes the query string.
• hive -f <filepath> executes one or more SQL queries from a file.
• Hive Interactive Shell Commands
• When $HIVE_HOME/bin/hive is run without either the -e or -f option, it enters interactive shell mode. Use ";" (semicolon) to terminate commands. Comments in scripts can be specified using the "--" prefix.
Command                                        Description
quit, exit                                     Use quit or exit to leave the interactive shell.
reset                                          Resets the configuration to the default values (as of Hive 0.10).
set <key>=<value>                              Sets the value of a particular configuration variable (key). Note: if you misspell the variable name, the CLI will not show an error.
set                                            Prints the list of configuration variables overridden by the user or Hive.
set -v                                         Prints all Hadoop and Hive configuration variables.
add FILE[S]/JAR[S]/ARCHIVE[S] <filepath>*      Adds one or more files, jars, or archives to the list of resources in the distributed cache.
add FILE[S]/JAR[S]/ARCHIVE[S] <ivyurl>*        As of Hive 1.2.0, adds one or more files, jars, or archives to the list of resources in the distributed cache using an Ivy URL of the form ivy://group:module:version?query_string.
list FILE[S]/JAR[S]/ARCHIVE[S]                 Lists the resources already added to the distributed cache.
list FILE[S]/JAR[S]/ARCHIVE[S] <filepath>*     Checks whether the given resources have already been added to the distributed cache.
delete FILE[S]/JAR[S]/ARCHIVE[S] <filepath>*   Removes the resource(s) from the distributed cache.
delete FILE[S]/JAR[S]/ARCHIVE[S] <ivyurl>*     As of Hive 1.2.0, removes the resource(s) which were added using the <ivyurl> from the distributed cache.
! <command>                                    Executes a shell command from the Hive shell.
dfs <dfs command>                              Executes a dfs command from the Hive shell.
<query string>                                 Executes a Hive query and prints results to standard output.
source <filepath>                              Executes a script file inside the CLI.

• Example
• hive> set mapred.reduce.tasks=32;
• hive> set;
• hive> select a.* from tab1;
• hive> !ls;
• hive> dfs -ls;
• Beeline – New Command Line Shell
• HiveServer2 supports a new command shell, Beeline, that works with HiveServer2. It is a JDBC client based on the SQLLine CLI. The Beeline shell works in both embedded and remote mode. In embedded mode, it runs an embedded Hive (similar to Hive CLI), whereas remote mode is for connecting to a separate HiveServer2 process over Thrift. Starting in Hive 0.14, when Beeline is used with HiveServer2, it also prints the log messages from HiveServer2 for the queries it executes to STDERR.
Option                      Description
-u <database URL>           The JDBC URL to connect to. Usage: beeline -u db_URL
-n <username>               The username to connect as. Usage: beeline -n valid_user
-p <password>               The password to connect with. Usage: beeline -p valid_password
-d <driver class>           The driver class to use. Usage: beeline -d driver_class
-e <query>                  Query that should be executed; enclose the query string in double or single quotes. This option can be specified multiple times. Usage: beeline -e "query_string"
-f <file>                   Script file that should be executed. Usage: beeline -f filepath
--hiveconf property=value   Use value for the given configuration property. Properties listed in hive.conf.restricted.list cannot be reset with hiveconf. Usage: beeline --hiveconf prop1=value1
--hivevar name=value        Hive variable name and value. This is a Hive-specific setting in which variables can be set at the session level and referenced in Hive commands or queries. Usage: beeline --hivevar var1=value1
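• Putting these options together, a typical remote-mode session might look like the following (the host, port, and credentials are illustrative placeholders):
• beeline -u jdbc:hive2://localhost:10000 -n hiveuser -p hivepassword -e "SHOW TABLES;"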
• Comparison with Traditional Databases
• Hive is intended to manage large-scale data analytics and querying on top of the Hadoop environment, while an RDBMS is generally used to manage structured databases.
• What is RDBMS?
• RDBMS stands for Relational Database Management System. RDBMS is
a type of database management system that is specifically designed for
relational databases. RDBMS is a subset of DBMS. A relational database
refers to a database that stores data in a structured format using rows and
columns, and that structured form is known as a table. There are certain rules defined for relational systems that are known as Codd's rules.
• Characteristics of RDBMS
• Structured Storage: Data is stored in a tabular format with rows and
columns.
• Fixed Schema: The schema is defined up front and cannot be altered dynamically.
• Data Normalization: To lessen dependencies and redundancies, data
must be kept in a normalized format.
• SQL-Based: Data is defined and altered using Structured Query
Language (SQL).
• What is Hive?
• Hive is a data warehouse software system that provides data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Hive helps query and manage large datasets quickly, and it serves as an ETL tool for the Hadoop ecosystem.
• Characteristics of Hive
• Data Warehouse Tool: Designed to manage and analyze large datasets
quickly
• Schema Flexibility: Schemas are flexible; they may change and be defined at runtime.
• Mixed Data: Can handle a combination of structured, semi-structured, and unstructured data; supports both normalized and denormalized data.
• HQL-Based: Makes use of the Hive Query Language (HQL), a SQL-like query language for data stored in Hadoop's distributed storage.
• The Hive Query Language (HiveQL) is a query language for Hive to
process and analyze structured data in a Metastore.
• How to use the SELECT statement with the WHERE clause
• The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression which fulfils the condition.
• Syntax
• Given below is the syntax of the SELECT query:
• SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [HAVING having_condition]
  [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
  [LIMIT number];
• Example
• Let us take an example for the SELECT...WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000.

| ID   | Name        | Salary | Designation       | Dept  |
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |

• The following query retrieves the employee details for the above scenario:
• hive> SELECT * FROM employee WHERE salary > 30000;
• On successful execution of the query, you get to see the following response:

| ID   | Name        | Salary | Designation       | Dept |
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Krian       | 40000  | Hr Admin          | HR   |
Querying Data
• Hive offers a SQL-like query language for ETL purposes.
• HQL offers a SQL-like environment for working with Hive tables, databases, and queries.
• We may combine several types of clauses in Hive queries in order to perform various kinds of data modification and querying, and to improve communication with external nodes.
• Hive supports JDBC connections.
• Hive queries provide the following capabilities:
• Data modelling, such as the creation of databases, tables, etc.
• ETL functionality, such as extraction, transformation, and loading of data into tables.
• Faster querying on top of Hadoop.
• User-specific custom scripts for ease of coding.
• Different clauses used in Hive
• Cluster by
• Distribute by
• Group by
• Order by
• Sort by
• Group by query
• The GROUP BY clause groups the rows of a Hive table by the values of the columns named after GROUP BY. The query then selects and displays its results aggregated per distinct value of those columns.
• Eg:
• SELECT Department, count(*) FROM employees_guru GROUP BY Department;
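• Groups can also be filtered after aggregation with the HAVING clause from the SELECT syntax above. A minimal sketch on the same employees_guru table (the threshold is illustrative):
• SELECT Department, count(*) FROM employees_guru GROUP BY Department HAVING count(*) >= 5;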
• SORT BY
• - The SORT BY clause sorts the data per reducer (not globally).
• - If we have N reducers, we can have N sorted output files.
• - These files can have overlapping data ranges.
• SELECT Col1, Col2, ..., ColN FROM TableName SORT BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ..., ColN <ASC | DESC>
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales SORT BY SalesYear;
• ORDER BY
• - The ORDER BY clause orders the data globally.
• - Because it ensures the global ordering of the data, all the data must pass through a single reducer.
• - As a result, the ORDER BY clause outputs one single file.
• - If the dataset is large, bringing all the data to a single reducer can hurt performance.
• - So we should avoid the ORDER BY clause on large datasets in Hive queries.
• - If global ordering is required, we can apply ORDER BY once on the final dataset.
• SELECT Col1, Col2, ..., ColN FROM TableName ORDER BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ..., ColN <ASC | DESC>
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales ORDER BY SalesYear;
• DISTRIBUTE BY
• - The DISTRIBUTE BY clause is used to distribute the input rows among reducers.
• - It ensures that all rows with the same key column values go to the same reducer.
• - For example, distributing tbl_Sales by SalesYear sends all 2017 rows (2017-X, 2017-Y, 2017-Z) to the same reducer.
• - Hence the output files will not contain overlapping key ranges.
• - However, the DISTRIBUTE BY clause does not sort the data, either at the reducer level or globally.
• - If we have N reducers, we can have N unsorted output files.
• SELECT Col1, Col2, ..., ColN FROM TableName DISTRIBUTE BY Col1, Col2, ..., ColN
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales DISTRIBUTE BY SalesYear;
• CLUSTER BY
• - CLUSTER BY <col> is equivalent to DISTRIBUTE BY <col> + SORT BY
<col>
• - The CLUSTER BY clause distributes the data based on the key column
and then sorts the output data by putting the same key column values
adjacent to each other.
• - So, the output of the CLUSTER BY clause is sorted at the reducer level.
• - If we have N number of reducers, we can have N number of sorted
output files
• - Because it distributes the data based on the key column, it ensures that we get non-overlapping data ranges in the final outputs.
• SELECT Col1, Col2, ..., ColN FROM TableName CLUSTER BY Col1, Col2, ..., ColN
• Eg:
• SELECT SalesYear, Amount FROM tbl_Sales CLUSTER BY SalesYear;
• The SORT BY and ORDER BY clauses are used to define the order of the
output data.
• DISTRIBUTE BY and CLUSTER BY clauses are used to distribute the
data to multiple reducers based on the key columns.
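• To see the CLUSTER BY equivalence concretely, the following two queries on tbl_Sales should produce identically partitioned and sorted output (a minimal sketch; the number of output files depends on the configured reducer count):
• SELECT SalesYear, Amount FROM tbl_Sales DISTRIBUTE BY SalesYear SORT BY SalesYear;
• SELECT SalesYear, Amount FROM tbl_Sales CLUSTER BY SalesYear;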
• Built-in functions
• These are functions that are already available in Hive. After checking the application requirements, we can call these built-in functions directly in our applications.
• The syntax and types are described in the following sections.
• Types of Built-in Functions in HIVE
• Collection Functions
• Date Functions
• Mathematical Functions
• Conditional Functions
• String Functions
• Misc. Functions
• Collection Functions
• These functions operate on collections. A collection is a grouping of elements; each function returns either a single element or an array, depending on the return type in its signature.
Return Type   Function Name           Description
INT           size(Map<K,V>)          Returns the number of elements in the map type
INT           size(Array<T>)          Returns the number of elements in the array type
Array<K>      map_keys(Map<K,V>)      Returns an unordered array containing the keys of the input map
Array<V>      map_values(Map<K,V>)    Returns an unordered array containing the values of the input map
Array<T>      sort_array(Array<T>)    Sorts the input array in ascending order according to the natural ordering of its elements and returns it
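• For example, using literal collections (the expected results are shown in comments):
• SELECT size(map('a', 1, 'b', 2));       -- 2
• SELECT map_keys(map('a', 1, 'b', 2));   -- ["a","b"] (order not guaranteed)
• SELECT sort_array(array(3, 1, 2));      -- [1,2,3]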
• Date Functions
• These are used to perform date manipulations and to convert date types from one type to another:

Function Name                             Return Type   Description
unix_timestamp()                          bigint        Returns the current Unix timestamp in seconds
to_date(string timestamp)                 string        Returns the date part of a timestamp string
year(string date)                         int           Returns the year part of a date or timestamp string
quarter(date/timestamp/string)            int           Returns the quarter of the year for a date, timestamp, or string, in the range 1 to 4
month(string date)                        int           Returns the month part of a date or timestamp string
hour(string date)                         int           Returns the hour of the timestamp
minute(string date)                       int           Returns the minute of the timestamp
date_sub(string starting_date, int days)  string        Returns the starting date minus the given number of days
current_date                              date          Returns the current date at the start of query evaluation
last_day(string date)                     string        Returns the last day of the month which the date belongs to
trunc(string date, string format)         string        Returns the date truncated to the unit specified by the format. Supported formats: MONTH/MON/MM, YEAR/YYYY/YY.
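• For example (results assume the given literals):
• SELECT to_date('2024-03-15 10:30:00');   -- '2024-03-15'
• SELECT year('2024-03-15');               -- 2024
• SELECT date_sub('2024-03-15', 10);       -- '2024-03-05'
• SELECT trunc('2024-03-15', 'MM');        -- '2024-03-01'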
• Mathematical Functions
• These functions are used for mathematical operations. Instead of creating UDFs, we can use these built-in mathematical functions in Hive.

Function Name                      Return Type   Description
round(DOUBLE x)                    DOUBLE        Returns the rounded BIGINT value of x
round(DOUBLE x, INT d)             DOUBLE        Returns x rounded to d decimal places
bround(DOUBLE x)                   DOUBLE        Returns the rounded BIGINT value of x using the HALF_EVEN rounding mode
floor(DOUBLE x)                    BIGINT        Returns the maximum BIGINT value that is equal to or less than x
ceil(DOUBLE a), ceiling(DOUBLE a)  BIGINT        Returns the minimum BIGINT value that is equal to or greater than a
rand(), rand(INT seed)             DOUBLE        Returns a random number distributed uniformly from 0 to 1
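• For example:
• SELECT round(3.567, 2);   -- 3.57
• SELECT floor(3.9);        -- 3
• SELECT ceil(3.1);         -- 4
• SELECT bround(2.5);       -- 2 (HALF_EVEN rounds halves to the nearest even value)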
• Conditional Functions
• These functions are used for conditional value checks.

Function Name                                               Return Type   Description
if(boolean testCondition, T valueTrue, T valueFalseOrNull)  T             Returns valueTrue when testCondition is true, and valueFalseOrNull otherwise
isnull(x)                                                   boolean       Returns true if x is NULL, and false otherwise
isnotnull(x)                                                boolean       Returns true if x is not NULL, and false otherwise
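• For example, using the employee table from earlier:
• SELECT Name, if(Salary > 40000, 'high', 'standard') FROM employee;
• SELECT isnull(NULL);      -- true
• SELECT isnotnull('x');    -- true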
• String Functions
• These functions are used for string manipulation and string operations.

Function Name                               Return Type          Description
reverse(string x)                           string               Returns the reversed string of x
rpad(string str, int len, string pad)       string               Returns str right-padded with pad to a length of len
rtrim(string x)                             string               Returns the string resulting from trimming spaces from the end (right-hand side) of x. For example, rtrim(' results ') results in ' results'.
space(int n)                                string               Returns a string of n spaces
split(string str, string pat)               array                Splits str around pat (pat is a regular expression)
str_to_map(text[, delimiter1, delimiter2])  map<string,string>   Splits text into key-value pairs using two delimiters
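• For example:
• SELECT reverse('hive');                       -- 'evih'
• SELECT rpad('hi', 5, '*');                    -- 'hi***'
• SELECT split('a,b,c', ',');                   -- ["a","b","c"]
• SELECT str_to_map('k1:v1,k2:v2', ',', ':');   -- {"k1":"v1","k2":"v2"}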
• What is a Hive UDF?
• Basically, we can use two different interfaces for writing Apache Hive User Defined Functions:
• Simple API
• Complex API
• As long as our function reads and returns primitive types, we can use the simple API (org.apache.hadoop.hive.ql.exec.UDF). In other words, it works with the basic Hadoop and Hive writable types, such as Text, IntWritable, LongWritable, DoubleWritable, etc.
• a. Simple API
• Basically, with the simpler UDF API, building a Hive User Defined Function involves little more than writing a class with one function (evaluate). Let's see an example to understand it well:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

class SimpleUDFExample extends UDF
{
    // Hive finds this method by reflection; it is called once per input row.
    public Text evaluate(Text input)
    {
        if (input == null) return null; // propagate SQL NULLs instead of throwing a NullPointerException
        return new Text("Hello " + input.toString());
    }
}
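• Once compiled and packaged into a jar, the function can be registered and used in a Hive session; the jar path and function name below are illustrative placeholders:
• hive> ADD JAR /path/to/simple-udf.jar;
• hive> CREATE TEMPORARY FUNCTION hello AS 'SimpleUDFExample';
• hive> SELECT hello(name) FROM employee;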
• i. TESTING SIMPLE Hive UDF
• Moreover, since the UDF is a single simple function, we can test it with regular testing tools like JUnit.
import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

public class SimpleUDFExampleTest
{
    @Test
    public void testUDF()
    {
        SimpleUDFExample example = new SimpleUDFExample();
        Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
    }
}
b. Complex API
• To write code for objects that are not primitive writable types, such as struct, map, and array types, the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API offers a way. It requires implementing three abstract methods:
• // This is like the evaluate method of the simple API. It takes the actual arguments and returns the result.
• abstract Object evaluate(GenericUDF.DeferredObject[] arguments);
• // The exact text doesn't really matter, but it should be a string representation of the function call.
• abstract String getDisplayString(String[] children);
• // Called once, before any evaluate() calls. You receive an array of object inspectors that represent the arguments of the function. This is where you validate that the function is receiving the correct argument types and the correct number of arguments.
• abstract ObjectInspector initialize(ObjectInspector[] arguments);
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF
{
    ListObjectInspector listOI;
    StringObjectInspector elementOI;

    @Override
    public String getDisplayString(String[] arg0)
    {
        return "arrayContainsExample()"; // this should probably be better
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException
    {
        if (arguments.length != 2)
        {
            throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
        }
        // 1. Check we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector))
        {
            throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;
        // 2. Check that the list contains strings.
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector))
        {
            throw new UDFArgumentException("first argument must be a list of strings");
        }
        // The return type of our function is a boolean, so we provide the correct object inspector.
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException
    {
        // Get the list and string from the deferred objects using the object inspectors.
        List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
        // Check for nulls.
        if (list == null || arg == null)
        {
            return null;
        }
        // See if our list contains the value we need.
        for (String s : list)
        {
            if (arg.equals(s)) return Boolean.TRUE;
        }
        return Boolean.FALSE;
    }
}
• ii. TESTING Complex Hive UDF
• However, the complex part of testing the function is in the setup: the UDF must be initialized with object inspectors before evaluate() can be called.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaBooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.Assert;
import org.junit.Test;

public class ComplexUDFExampleTest
{
    @Test
    public void testComplexUDFReturnsCorrectValues() throws HiveException
    {
        // Set up the object inspectors we need.
        ComplexUDFExample example = new ComplexUDFExample();
        ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        ObjectInspector listOI = ObjectInspectorFactory.getStandardListObjectInspector(stringOI);
        JavaBooleanObjectInspector resultInspector =
            (JavaBooleanObjectInspector) example.initialize(new ObjectInspector[]{listOI, stringOI});

        // Create the actual UDF arguments.
        List<String> list = new ArrayList<String>();
        list.add("a");
        list.add("b");

        // The value exists in the list.
        Object result = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("a")});
        Assert.assertEquals(true, resultInspector.get(result));

        // The value doesn't exist.
        Object result2 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("d")});
        Assert.assertEquals(false, resultInspector.get(result2));

        // Arguments are null.
        Object result3 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(null), new DeferredJavaObject(null)});
        Assert.assertNull(result3);
    }
}
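• As with the simple UDF, the compiled class can be registered and called from HiveQL; the jar path and function name are illustrative placeholders:
• hive> ADD JAR /path/to/complex-udf.jar;
• hive> CREATE TEMPORARY FUNCTION array_contains_example AS 'ComplexUDFExample';
• hive> SELECT array_contains_example(array('a', 'b'), 'a');   -- true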
