Apache Hive
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop™, it provides:
Tools to enable easy data extract/transform/load (ETL)
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
Query execution via MapReduce
Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs). Hive does not mandate that data be read or written in a "Hive format"; there is no such thing. Hive works equally well on Thrift, control-delimited, or your specialized data formats. Please see File Format and SerDe in the Developer Guide for details. Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with the MapReduce framework and UDF/UDAF/UDTF), fault tolerance, and loose coupling with its input formats.
User Documentation
Hive Tutorial
HiveQL Language Manual (Queries, DML, DDL, and CLI)
Hive Operators and Functions
Hive Web Interface
Hive Client (JDBC, ODBC, Thrift, etc.)
HiveServer2 Client
Hive Change Log
Avro SerDe
Administrator Documentation
Installing Hive
Configuring Hive
Setting Up Metastore
Setting Up Hive Web Interface
Setting Up Hive Server (JDBC, ODBC, Thrift, etc.)
Hive on Amazon Web Services
Hive on Amazon Elastic MapReduce
For more information, please see the official Hive website. Apache Hive, Apache Hadoop, Apache HBase, Apache HDFS, Apache, the Apache feather logo, and the Apache Hive project logo are trademarks of The Apache Software Foundation.
Table of Contents
Installation and Configuration
  Requirements
  Installing Hive from a Stable Release
  Building Hive from Source
  Compile Hive on Hadoop 23
Running Hive
  Configuration management overview
  Runtime configuration
  Hive, Map-Reduce and Local-Mode
Error Logs
DDL Operations
  Metadata Store
DML Operations
SQL Operations
  Example Queries
    SELECTS and FILTERS
    GROUP BY
    JOIN
    MULTITABLE INSERT
    STREAMING
Simple Example Use Cases
  MovieLens User Ratings
  Apache Weblog Data
DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now, although it may very well work on other similar platforms. It does not work on Cygwin. Most of our testing has been on Hadoop 0.20, so we advise running it against this version even though it may compile and work against other versions.
Running Hive
Hive uses Hadoop. That means you must either have hadoop in your path, or set:
export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive.
Commands to perform this setup:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
I also find it useful but not necessary to set HIVE_HOME:
$ export HIVE_HOME=<hive-install-dir>
To use the Hive command line interface (CLI) from the shell:
$ $HIVE_HOME/bin/hive
Runtime configuration
Hive queries are executed as map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables. The CLI command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;
The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.
Hive, Map-Reduce and Local-Mode
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful for running queries over small data sets; in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow when processing larger data sets.
Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can set:
hive> SET mapred.job.tracker=local;
In addition, mapred.local.dir should point to a path that is valid on the local machine (for example /tmp/<username>/mapred/local); otherwise the user will get an exception about allocating local disk space.
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local mode automatically. The relevant option is hive.exec.mode.local.auto, which is disabled (false) by default; it can be enabled with:
hive> SET hive.exec.mode.local.auto=true;
If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max (128MB by default)
The total number of map tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
The total number of reduce tasks required is 1 or 0
So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally. Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
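As a concrete illustration, automatic local-mode execution could be tried with settings like the following in the CLI (a minimal sketch; the threshold values are arbitrary choices for illustration, and the query assumes a table such as the invites table created in the DDL Operations section below):
hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
hive> SET hive.exec.mode.local.auto.tasks.max=4;
hive> SELECT COUNT(*) FROM invites WHERE ds='2008-08-15';
A query whose input stays under these limits would then be run in a child JVM on the client machine rather than being submitted to the cluster.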
Error Logs
Hive uses log4j for logging. By default, logs are not emitted to the console by the CLI. The default logging level is WARN and the logs are stored in the file /tmp/<user.name>/hive.log.
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
bin/hive -hiveconf hive.root.logger=INFO,console
Alternatively, the user can change the logging level only by using: bin/hive -hiveconf hive.root.logger=INFO,DRFA
Note that setting hive.root.logger via the 'SET' command does not change logging properties, since they are determined at initialization time. Hive also stores query logs on a per-Hive-session basis in /tmp/<user.name>/, but this can be configured in hive-site.xml with the hive.querylog.location property. Logging during Hive execution on a Hadoop cluster is controlled by the Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI. When using local mode (mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors. Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to [email protected].
DDL Operations
Creating Hive tables and browsing through them:
hive> CREATE TABLE pokes (foo INT, bar STRING);
creates a table called pokes with two columns, the first being an integer and the other a string.
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).
hive> SHOW TABLES;
lists all the tables.
hive> SHOW TABLES '.*s';
lists all the tables that end with 's'. The pattern matching follows Java regular expressions. See https://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html for documentation.
hive> DESCRIBE invites;
shows the list of columns.
As for altering tables, table names can be changed and additional columns can be added:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;
Dropping tables:
hive> DROP TABLE pokes;
Metadata Store
Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db. Right now, in the default configuration, this metadata can only be seen by one user at a time. The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to the JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model. In the future, the metastore itself can be a standalone server. If you want to run the metastore as a network server so it can be accessed from multiple nodes, try HiveDerbyServerMode.
DML Operations
Loading data from flat files into Hive:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
loads a file that contains two columns separated by ctrl-a into the pokes table. 'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS. The keyword 'overwrite' signifies that existing data in the table is deleted. If the 'overwrite' keyword is omitted, data files are appended to existing data sets.
NOTES:
NO verification of data against the schema is performed by the load command.
If the file is in HDFS, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
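As a quick check after loading, the new partition can be queried directly (a minimal sketch, assuming the invites table and the partitions created above):
hive> SELECT a.foo, a.bar FROM invites a WHERE a.ds='2008-08-15';
Because ds is a partition column, the predicate on ds restricts the query to the files of that single partition.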
SQL Operations
Example Queries
Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the hive sources at ql/src/test/queries/positive
SELECTS and FILTERS
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
computes the sum of a column; avg, min and max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
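Simpler SELECT and filter queries over the invites table defined earlier might look like the following (a minimal sketch, not taken verbatim from the original examples; the HDFS output directory is hypothetical):
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
The first query selects one column from a single partition of invites; the second writes the matching rows out to an HDFS directory.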
GROUP BY
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
MULTITABLE INSERT
FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
STREAMING
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).
MovieLens User Ratings
First, create the u_data table with tab-delimited text file format:
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Then, download and extract the data files:
wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
tar xvzf ml-data.tar+0.gz
And load it into the table that was just created:
LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;
Count the number of rows in table u_data:
SELECT COUNT(*) FROM u_data;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Now we can do some complex data analysis on the table u_data. Create weekday_mapper.py:
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
Use the mapper script:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
How do I import ASCII logfiles (HTTP, etc) into Hive?
Exporting Data from Hive
Hive Data Model
What is the difference between a native table and an external table?
What are dynamic partitions?
Can a Hive table contain data in more than one format?
Is it possible to set the data format on a per-partition basis?
JDBC Driver
Does Hive have a JDBC Driver?
ODBC Driver
Does Hive have an ODBC driver?
I see errors like: Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
Run the following commands:
cd ~/.ant/cache/hadoop/core/sources
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
I am using MySQL as the metastore and I see errors: "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"
This is usually caused by MySQL servers closing connections after the connection has been idle for some time. Running the following command on the MySQL server should solve the problem: "set global wait_timeout=120;"
When using MySQL as a metastore I see the error "com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes".
This is a known limitation of MySQL 5.0 and UTF8 databases. One option is to use another character set, such as 'latin1', which is known to work.
HiveQL
Are HiveQL identifiers (e.g. table names, column names, etc) case sensitive?
No. Hive is case insensitive. Executing
SELECT * FROM MyTable WHERE myColumn = 3
is strictly equivalent to
select * from mytable where mycolumn = 3
How do I import XML data into Hive?
How do I import CSV data into Hive?
How do I import JSON data into Hive?
How do I import Thrift data into Hive?
How do I import Avro data into Hive?
How do I import delimited text data into Hive?
How do I import fixed-width data into Hive?
How do I import ASCII logfiles (HTTP, etc) into Hive?
JDBC Driver
ODBC Driver
Does Hive have an ODBC driver?
Hive Tutorial
Hive Tutorial
Concepts
  What is Hive
  What Hive is NOT
  Data Units
Type System
  Primitive Types
  Complex Types
Built in operators and functions
  Built in operators
  Built in functions
Language capabilities
Usage and Examples
  Creating Tables
  Browsing Tables and Partitions
  Loading Data
  Simple Query
  Partition Based Query
  Joins
  Aggregations
  Multi Table/File Inserts
  Dynamic-partition Insert
  Inserting into local files
  Sampling
  Union all
  Array Operations
  Map (Associative Arrays) Operations
  Custom map/reduce scripts
  Co-Groups
  Altering Tables
  Dropping Tables and Partitions
Concepts
What is Hive
Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Data Units
In the order of granularity, Hive data is organized into:
Databases: Namespaces that separate tables and other data units from naming conflicts.
Tables: Homogeneous units of data which have the same schema. An example of a table could be a page_views table, where each row could comprise the following columns (schema):
  timestamp - which is of INT type and corresponds to a UNIX timestamp of when the page was viewed.
  userid - which is of BIGINT type and identifies the user who viewed the page.
  page_url - which is of STRING type and captures the location of the page.
  referer_url - which is of STRING type and captures the location of the page from where the user arrived at the current page.
  IP - which is of STRING type and captures the IP address from where the page request was made.
Partitions: Each table can have one or more partition keys which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy a certain criterion; for example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly (a small query sketch follows after this list). Note, however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience but it is the user's job to guarantee the relationship between partition name and data content. Partition columns are virtual columns; they are not part of the data itself but are derived on load.
Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. These can be used to efficiently sample the data.
Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
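For example, a query restricted to the partition described above might look like this (a minimal sketch; the page_views table and the date_partition/country_partition keys are the hypothetical examples from the text, not a real schema):
SELECT page_url, userid
FROM page_views
WHERE date_partition = '2009-12-23' AND country_partition = 'US';
Because the predicates are on partition columns, Hive only reads the files belonging to that single partition instead of scanning the whole table.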
Type System
Primitive Types
Types are associated with the columns in the tables. The following primitive types are supported:
Integers
  TINYINT - 1 byte integer
  SMALLINT - 2 byte integer
  INT - 4 byte integer
  BIGINT - 8 byte integer
Boolean type
  BOOLEAN - TRUE/FALSE
Floating point numbers
  FLOAT - single precision
  DOUBLE - double precision
String type
  STRING - sequence of characters in a specified character set
The types are organized in the following hierarchy (where the parent is a super type of all the children instances):
Type
  Primitive Type
    Number
      DOUBLE
        BIGINT
          INT
            TINYINT
        FLOAT
          INT
            TINYINT
    STRING
    BOOLEAN
This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows the special case of conversion from STRING to DOUBLE.
Explicit type conversion can be done using the cast operator as shown in the Built in functions section below.
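For instance, to force a string column to be treated as a number (a minimal sketch; the column and table names here are hypothetical):
SELECT CAST(price_str AS DOUBLE) + 1.0
FROM product_prices;
Implicit STRING-to-DOUBLE conversion would also apply in many numeric contexts, but an explicit cast makes the intent unambiguous.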
Complex Types
Complex types can be built up from primitive types and other composite types using:
Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
Maps (key-value tuples): the elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
Arrays (indexable lists): the elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields: gender - which is a STRING. active - which is a BOOLEAN.
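A table using such nested types and the access notations above might look like this (a minimal sketch; the table and column names are hypothetical, and this DDL form assumes a Hive release where complex types can be declared directly in CREATE TABLE):
CREATE TABLE users (
  name STRING,
  profile STRUCT<gender: STRING, active: BOOLEAN>,
  group_ids MAP<STRING, INT>,
  recent_pages ARRAY<STRING>);

SELECT profile.gender, group_ids['group'], recent_pages[1]
FROM users;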
Built in operators
Relational Operators - The following operators compare the passed operands and generate a TRUE or FALSE value depending on whether the comparison between the operands holds or not.
A != B (all primitive types): TRUE if expression A is not equal to expression B, otherwise FALSE.
A < B (all primitive types): TRUE if expression A is less than expression B, otherwise FALSE.
A <= B (all primitive types): TRUE if expression A is less than or equal to expression B, otherwise FALSE.
A > B (all primitive types): TRUE if expression A is greater than expression B, otherwise FALSE.
A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
A IS NULL (all types): TRUE if expression A evaluates to NULL, otherwise FALSE.
A IS NOT NULL (all types): FALSE if expression A evaluates to NULL, otherwise TRUE.
A LIKE B (strings): TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'.
A RLIKE B (strings): TRUE if string A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to FALSE whereas 'foobar' rlike '^f.*r$' evaluates to TRUE.
A REGEXP B (strings): Same as RLIKE.
Arithmetic Operators - The following operators support various common arithmetic operations on the operands. All of them return number types.
A + B (all number types): Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands, e.g. since every integer is a float, float is a containing type of integer, so the + operator on a float and an int will result in a float.
A - B (all number types): Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B (all number types): Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B (all number types): Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.
A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.
Logical Operators - The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.
A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE.
A && B (boolean): Same as A AND B.
A OR B (boolean): TRUE if either A or B or both are TRUE, otherwise FALSE.
A || B (boolean): Same as A OR B.
NOT A (boolean): TRUE if A is FALSE, otherwise FALSE.
!A (boolean): Same as NOT A.
Operators on Complex Types - The following operators provide mechanisms to access elements in complex types.
A[n] (A is an Array and n is an int): returns the nth element in the array A. The first element has index 0, e.g. if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V> and key has type K): returns the value corresponding to the key in the map, e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'.
S.x (S is a struct): returns the x field of S, e.g. for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.
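Putting a few of these operators together in a filter might look like the following (a small sketch using page_view style columns from this tutorial; the exact column and table names are illustrative):
SELECT userid
FROM page_views
WHERE ip IS NOT NULL
  AND page_url LIKE '%/checkout%'
  AND userid % 10 = 0;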
Built in functions
The following built in functions are supported in Hive (function list in source code: FunctionRegistry.java). Each entry lists the function signature, its return type, and a description:
round(double a) (BIGINT): returns the rounded BIGINT value of the double.
floor(double a) (BIGINT): returns the maximum BIGINT value that is equal to or less than the double.
ceil(double a) (BIGINT): returns the minimum BIGINT value that is equal to or greater than the double.
rand(), rand(int seed) (double): returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic.
concat(string A, string B, ...) (string): returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
substr(string A, int start) (string): returns the substring of A starting from the start position till the end of string A. For example, substr('foobar', 4) results in 'bar'.
substr(string A, int start, int length) (string): returns the substring of A starting from the start position with the given length, e.g. substr('foobar', 4, 2) results in 'ba'.
upper(string A) (string): returns the string resulting from converting all characters of A to upper case, e.g. upper('fOoBaR') results in 'FOOBAR'.
ucase(string A) (string): same as upper.
lower(string A) (string): returns the string resulting from converting all characters of A to lower case, e.g. lower('fOoBaR') results in 'foobar'.
lcase(string A) (string): same as lower.
trim(string A) (string): returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar'.
ltrim(string A) (string): returns the string resulting from trimming spaces from the beginning (left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '.
rtrim(string A) (string): returns the string resulting from trimming spaces from the end (right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'.
regexp_replace(string A, string B, string C) (string): returns the string resulting from replacing all substrings in A that match the Java regular expression syntax B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
size(Map<K, V>) (int): returns the number of elements in the map type.
size(Array<T>) (int): returns the number of elements in the array type.
cast(<expr> as <type>) (value of <type>): converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) will convert the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
from_unixtime(int unixtime) (string): converts the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
to_date(string timestamp) (string): returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
year(string date) (int): returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
month(string date) (int): returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
day(string date) (int): returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
get_json_object(string json_string, string path) (string): extracts the JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted JSON object. It returns NULL if the input JSON string is invalid.
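A quick query exercising a few of these functions might look like this (a sketch over the page_view columns used in this tutorial; the table and column names are illustrative):
SELECT
  concat(upper(substr(page_url, 1, 10)), '...'),
  to_date(from_unixtime(viewTime)),
  round(rand() * 100)
FROM page_view;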
The following built in aggregate functions are supported in Hive:
count(*), count(expr), count(DISTINCT expr[, expr_.]): count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) returns the number of rows for which the supplied expression(s) are unique and non-NULL.
sum(col), sum(DISTINCT col) (DOUBLE): returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
avg(col), avg(DISTINCT col) (DOUBLE): returns the average of the elements in the group or the average of the distinct values of the column in the group.
min(col) (DOUBLE): returns the minimum value of the column in the group.
max(col) (DOUBLE): returns the maximum value of the column in the group.
Language capabilities
The Hive query language provides the basic SQL-like operations. These operations work on tables or partitions. These operations are:
Ability to filter rows from a table using a WHERE clause.
Ability to select certain columns from the table using a SELECT clause.
Ability to do equi-joins between two tables.
Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
Ability to store the results of a query into another table.
Ability to download the contents of a table to a local (e.g., NFS) directory.
Ability to store the results of a query in a Hadoop DFS directory.
Ability to manage tables and partitions (create, drop and alter).
Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.
Creating Tables
An example statement that would create the page_view table mentioned above would be:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

In this example the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII 001 (ctrl-A) as the field delimiter and newline as the row delimiter. The field delimiter can be parametrized if the data is not in the above format, as illustrated in the following example:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed, since it is determined by Hadoop rather than by Hive. It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column, in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries with greater efficiency.

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

In this example the columns that comprise the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the Hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using Hadoop SequenceFiles) on HDFS. The values shown for the ROW FORMAT and STORED AS clauses in the above example represent the system defaults. Table names and column names are case insensitive.
Loading Data
There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert it into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

In the example above nulls are inserted for the array and map types in the destination table, but potentially these can also come from the external table if the proper row formats are specified. This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load the data from a file in the local file system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory it cannot contain subdirectories. Similarly, the wildcard must match file names only.

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS, the following syntax can be used to load the data into a Hive table:

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');

It is assumed that the array and map fields in the input .txt files are null fields for these examples.
Simple Query
For all the active users, one can use the query of the following form:

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on the Hive CLI:

SELECT user.* FROM user WHERE user.active = 1;

This will be internally rewritten to some temporary file and displayed to the Hive client side.
Joins
In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example:

INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to join more than one table, the user can use the following syntax:

INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

Note that Hive only supports equi-joins. Also, it is best to put the largest table on the rightmost side of the join to get the best performance.
Aggregations
In order to count the number of distinct users by gender one could write the following query:

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. E.g., the following is possible:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

However, the following query is not allowed:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
Dynamic-partition Insert
In the previous examples, the user has to know which partition to insert into, and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement as illustrated below.

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

In order to load data into all country partitions in a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient since you have to have prior knowledge of the list of countries in the input data and create the partitions beforehand. If the list changes for another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce job.

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. In a dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will be created automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload compared to the multiple-insert case. Below is an example of loading data into all country partitions using one insert statement:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

There are several syntactic differences from the multi-insert statement:
country appears in the PARTITION specification, but with no value associated. In this case, country is a dynamic partition column. On the other hand, dt has a value associated with it, which means it is a static partition column. If a column is a dynamic partition column, its value will be coming from the input column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause, because the partition column order indicates its hierarchical order (meaning dt is the root partition, and country is the child partition). You cannot specify a partition clause with (dt, country='US') because that means you need to update all partitions with any date whose country sub-partition is 'US'.
An additional pvs.country column is added in the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column because its value is already known in the PARTITION clause.
Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause.
When non-empty partitions already exist for the dynamic partition columns (e.g., country='CA' exists under some dt root partition), they will be overwritten if the dynamic partition insert saw the same value (say 'CA') in the input data. This is in line with the 'insert overwrite' semantics. However, if the partition value 'CA' does not appear in the input data, the existing partition will not be overwritten.
Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (URI in Java). Any character having a special meaning in a URI (e.g., '%', ':', '/', '#') will be escaped with '%' followed by 2 bytes of its ASCII value. If the input column is a type different than STRING, its value will first be converted to STRING to be used to construct the HDFS path.
If the input column value is NULL or the empty string, the row will be put into a special partition, whose name is controlled by the Hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose value is not a valid partition name. The caveat of this approach is that the bad value will be lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-1309 is a solution to let the user specify a "bad file" to retain the input partition column values as well.
Dynamic partition insert could potentially be a resource hog in that it could generate a large number of partitions in a short time. To protect against this, we define three parameters:
hive.exec.max.dynamic.partitions.pernode (default value being 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than the threshold, a fatal error will be raised from the mapper/reducer (through a counter) and the whole job will be killed.
hive.exec.max.dynamic.partitions (default value being 1000) is the total number of dynamic partitions that could be created by one DML. If each mapper/reducer did not exceed the limit but the total number of dynamic partitions does, then an exception is raised at the end of the job before the intermediate data are moved to the final destination.
hive.exec.max.created.files (default value being 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by each mapper/reducer updating a Hadoop counter whenever a new file is created. If the total number exceeds hive.exec.max.created.files, a fatal error will be thrown and the job will be killed.
Another situation we want to protect against is the user accidentally specifying all partitions to be dynamic partitions without specifying one static partition, while the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic-partition case. In strict mode, you have to specify at least one static partition. The default mode is strict. In addition, we have a parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partitions at all. The default value is false.
In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
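Before running a dynamic-partition insert, the relevant parameters are typically set in the session (a sketch; the values shown simply raise the knobs discussed above for illustration, they are not recommendations):
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.max.dynamic.partitions.pernode=1000;
For the earlier example, which keeps dt static, strict mode is already satisfied, so only the first setting is strictly required; nonstrict mode and a higher per-node limit only matter when every partition column is dynamic or many partitions are produced.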
Troubleshooting and best practices: As stated above, when too many dynamic partitions are created by a particular mapper/reducer, a fatal error could be raised and the job will be killed. The error message looks something like:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
      SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
             from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
...
2010-05-07 11:10:19,816 Stage-1 map = 0%, reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...

The problem is that one mapper will take a random set of rows and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions will be significantly reduced. The above example query could be rewritten to:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
      SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
             from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
      DISTRIBUTE BY ds, country;

This query will generate a MapReduce job rather than a map-only job. The SELECT clause will be converted to a plan for the mappers, and the output will be distributed to the reducers based on the value of the (ds, country) pairs. The INSERT clause will be converted to the plan in the reducer which writes to the dynamic partitions.
Sampling
The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

In general the TABLESAMPLE syntax looks like:

TABLESAMPLE(BUCKET x OUT OF y)

y has to be a multiple or divisor of the number of buckets in that table as specified at table creation time. The buckets chosen are determined by whether bucket_number modulo y is equal to x. So in the above example the following TABLESAMPLE clause
TABLESAMPLE(BUCKET 3 OUT OF 16)
would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0. On the other hand the TABLESAMPLE clause
TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)
would pick out half of the 3rd bucket.
Union all
The language also supports UNION ALL. For example, if we suppose there are two different tables that track which user has published a video and which user has published a comment, the following query joins the results of a UNION ALL with the user table to create a single annotated stream for all the video publishing and comment publishing events:

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON (u.id = actions.uid);
Array Operations
Array columns in tables can currently only be created programmatically. We will be extending this soon to be available as part of the CREATE TABLE statement. For the purpose of the current example assume that pv.friends is of the type ARRAY<INT>, i.e. it is an array of integers. The user can get a specific element in the array by its index as shown in the following command:

SELECT pv.friends[2] FROM page_views pv;

The select expression gets the third item in the pv.friends array. The user can also get the length of the array using the size function as shown below:

SELECT pv.userid, size(pv.friends) FROM page_view pv;
Maps provide collections similar to associative arrays. Such structures can currently only be created programmatically. We will be extending this soon. For the purpose of the current example assume that pv.properties is of the type MAP<STRING, STRING>, i.e. it is an associative array from strings to strings. Accordingly, the following query:

INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type']
FROM page_views pv;

can be used to select the 'page type' property from the page_views table. Similar to arrays, the size function can also be used to get the number of elements in a map as shown in the following query:

SELECT size(pv.properties) FROM page_view pv;
Schema-less map/reduce: If there is no "AS" clause after "USING map_script", Hive assumes the output of the script contains 2 parts: a key, which is before the first tab, and a value, which is the rest after the first tab. Note that this is different from specifying "AS key, value" because in that case the value will only contain the portion between the first tab and the second tab if there are multiple tabs. In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. The user still needs to know the reduce output schema because that has to match what is in the table that we are inserting to.

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.dt, map_output.uid
    USING 'reduce_script'
    AS date, count;

Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of the sort columns, but that is not required.

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS c1, c2, c3
    DISTRIBUTE BY c2
    SORT BY c2, c1) map_output
INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.c1, map_output.c2, map_output.c3
    USING 'reduce_script'
    AS date, count;
Co-Groups
Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain columns on the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we wanted to cogroup the rows from the actions_video and action_comments table on the uid column and send them to the 'reduce_script' custom reducer; the following syntax can be used by the user:

FROM (
    FROM (
        FROM action_video av
        SELECT av.uid AS uid, av.id AS id, av.date AS date
        UNION ALL
        FROM action_comment ac
        SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
    ) union_actions
    SELECT union_actions.uid, union_actions.id, union_actions.date
    CLUSTER BY union_actions.uid) map
INSERT OVERWRITE TABLE actions_reduced
    SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);
Altering Tables
To rename an existing table to a new name. If a table with the new name already exists, an error is returned:

ALTER TABLE old_table_name RENAME TO new_table_name;

To rename the columns of an existing table. Be sure to use the same column types, and to include an entry for each preexisting column:

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:

ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

Note that a change in the schema (such as the adding of columns) preserves the schema for the old partitions of the table in case it is a partitioned table. All the queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns. In later versions, the behavior of assuming certain values, as opposed to throwing an error in case the column is not found in a particular partition, can be made configurable.
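For example, after adding a column to a partitioned table, queries over partitions loaded before the change simply see NULLs for it (a sketch; the new column name is hypothetical):
ALTER TABLE page_view ADD COLUMNS (session_id STRING COMMENT 'added after some partitions were loaded');
SELECT dt, session_id FROM page_view WHERE dt='2008-06-08';
Rows coming from partitions written before the ALTER will show session_id as NULL.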
Select Group By Sort/Distribute/Cluster/Order By Transform and Map-Reduce Scripts Operators and User-Defined Functions XPath-specific Functions Joins Lateral View Union Sub Queries Sampling Explain Virtual Columns Locks Import/Export Configuration Properties Authorization Statistics Archiving
Features
Schema Browsing
An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web-based schema browser. The Hive metadata is presented in a hierarchical manner, allowing you to start at the database level and click through to get information about tables, including the SerDe, column names, and column types.
No local installation
Any user with a web browser can work with Hive. This has the usual web interface benefits. By contrast, a user wishing to interact with Hadoop or Hive directly requires access to many ports; a remote or VPN user would only require access to the Hive Web Interface, which runs by default on 0.0.0.0 tcp/9999.
Configuration
Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk, you already have it. You should not need to edit the defaults for the Hive Web Interface. HWI uses:

<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
  <description>This is the host address the Hive Web Interface will listen on</description>
</property>
<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
  <description>This is the port the Hive Web Interface will listen on</description>
</property>
<property>
  <name>hive.hwi.war.file</name>
  <value>${HIVE_HOME}/lib/hive_hwi.war</value>
  <description>This is the WAR file with the jsp content for Hive Web Interface</description>
</property>

You probably want to set up HiveDerbyServerMode to allow multiple sessions at the same time.
Start up
When Hive is invoked with no arguments, the CLI is started. Hive has an extension architecture used to start other Hive daemons. Jetty requires Apache Ant to start HWI, so you should define ANT_LIB as an environment variable or add it to the hive invocation:

export ANT_LIB=/opt/ant/lib
bin/hive --service hwi

Java has no direct way of daemonizing, so in a production environment you should create a wrapper script:

nohup bin/hive --service hwi > /dev/null 2> /dev/null &

If you want help on the service invocation or a list of parameters, you can run:

bin/hive --service hwi --help
Authentication
Hadoop currently uses environment properties to determine the user name and group vector. Thus Hive and the Hive Web Interface cannot enforce more stringent security than Hadoop can. When you first connect to the Hive Web Interface, you are prompted for a user name and groups. This feature was added to support installations using different schedulers. If you want to tighten up security, you will need to patch the Hive Session Manager source, or you may be able to tweak the JSP to accomplish this.
Accessing
In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi in your web browser.
Result file
The result file is local to the web server. A query that produces massive output should set the result file to /dev/null.
Debug Mode
The debug mode is used when the user wants the result file to contain not only the result of the Hive query but also the other messages.
Set Processor
In the CLI, a command like 'SET x=5' is not processed by the Query Processor; it is processed by the Set Processor. In the web interface, use the form 'x=5', not 'set x=5'.
Walk through
Authorize
(Screenshots: 1_hwi_authorize.png, 2_hwi_authorize.png)
Schema Browser
(Screenshots: 3_schema_table.png, 4_schema_browser.png)
Diagnostics
(Screenshot: 5_diagnostic.png)
Running a query
(Screenshots: 6_newsession.png, 7_session_runquery.png, 8_session_query_1.png, 9_file_view.png)
Command Line JDBC JDBC Client Sample Code Running the JDBC Sample Code JDBC Client Setup for a Secure Cluster
This page describes the different clients supported by Hive. The command line client currently only supports an embedded server. The JDBC and thrift-java clients support both embedded and standalone servers. Clients in other languages only support standalone servers. For details about the standalone server see Hive Server.
Command Line
Operates in embedded mode only, i.e., it needs to have access to the hive libraries. For more details see Getting Started.
JDBC
For embedded mode, the URI is just "jdbc:hive://". For a standalone server, the URI is "jdbc:hive://host:port/dbname", where host and port are determined by where the Hive server is running. For example, "jdbc:hive://localhost:10000/default". Currently, the only dbname supported is "default".
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
  private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }
    Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.executeQuery("drop table " + tableName);
    ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }
    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}
# To run the program, we need the following jars in the classpath
# from hive/build/dist/lib
#   hive_jdbc.jar
#   hive_metastore.jar
#   hive_service.jar
#   libfb303.jar
#   log4j-1.2.15.jar
#
# from hadoop/build
#   hadoop-*-core.jar
#
# To run the program in embedded mode, we need the following additional jars in the classpath
# from hive/build/dist/lib
#   antlr-runtime-3.0.1.jar
#   derby.jar
#   jdo2-api-2.1.jar
#   jpox-core-1.2.2.jar
#   jpox-rdbms-1.2.2.jar
#
# as well as hive/build/dist/conf

$ java -cp $CLASSPATH HiveJdbcClient

# Alternatively, you can run the following bash script, which will seed the data file
# and build your classpath before invoking the client.

#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive

echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt

HADOOP_CORE=`ls $HADOOP_HOME/hadoop-*-core.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf

for i in ${HIVE_HOME}/lib/*.jar ; do
  CLASSPATH=$CLASSPATH:$i
done

java -cp $CLASSPATH HiveJdbcClient
Python
Operates only on a standalone server. Set (and export) PYTHONPATH to build/dist/lib/py. The python modules imported in the code below are generated by building hive.
Please note that the generated python module names have changed in hive trunk.

#!/usr/bin/env python
import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('localhost', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()

    client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
    client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
    client.execute("SELECT * FROM r")
    while (1):
      row = client.fetchOne()
      if (row == None):
        break
      print row
    client.execute("SELECT * FROM r")
    print client.fetchAll()

    transport.close()

except Thrift.TException, tx:
    print '%s' % (tx.message)
PHP
Operates only on a standalone server.

<?php
// set THRIFT_ROOT to php directory of the hive distribution
$GLOBALS['THRIFT_ROOT'] = '/lib/php/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';

// Set up the transport/protocol/client
$transport = new TSocket('localhost', 10000);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHiveClient($protocol);
$transport->open();

// run queries, metadata calls etc
$client->execute('SELECT * from src');
var_dump($client->fetchAll());
$transport->close();
ODBC
Operates only on a standalone server. See Hive ODBC.
Example output from the Beeline command shell, which connects to HiveServer2 over JDBC:

0: jdbc:hive2://localhost:10000> show tables;
show tables;
+-------------------+
|     tab_name      |
+-------------------+
| primitives        |
| src               |
| src1              |
| src_json          |
| src_sequencefile  |
| src_thrift        |
| srcbucket         |
| srcbucket2        |
| srcpart           |
+-------------------+
9 rows selected (1.079 seconds)
JDBC
HiveServer2 has a new JDBC driver that supports both embedded and remote access to HiveServer2. The JDBC connection URL has the prefix jdbc:hive2:// and the driver class is org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer. For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db> (the default port for HiveServer2 is 10000). For an embedded server, the URL format is jdbc:hive2:// (no host or port).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }
    // replace "hive" here with the name of the user the queries should run as
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.execute("drop table if exists " + tableName);
    stmt.execute("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }
    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    stmt.execute(sql);
    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}
# To run the program in standalone mode, we need the following jars in the classpath
# from hive/build/dist/lib
#   hive-jdbc*.jar
#   hive-service*.jar
#   libfb303-0.9.0.jar
#   libthrift-0.9.0.jar
#   log4j-1.2.16.jar
#   slf4j-api-1.6.1.jar
#   slf4j-log4j12-1.6.1.jar
#   commons-logging-1.0.4.jar
#
# The following additional jars are needed for the kerberos secure mode
#   hive-exec*.jar
#   commons-configuration-1.6.jar
#   and from hadoop - hadoop-*core.jar
#
# To run the program in embedded mode, we need the following additional jars in the classpath
# from hive/build/dist/lib
#   hive-exec*.jar
#   hive-metastore*.jar
#   antlr-runtime-3.0.1.jar
#   derby.jar
#   jdo2-api-2.1.jar
#   jpox-core-1.2.2.jar
#   jpox-rdbms-1.2.2.jar
#
# from hadoop/build
#   hadoop-*-core.jar
#
# as well as hive/build/dist/conf, any HIVE_AUX_JARS_PATH set,
# and hadoop jars necessary to run MR jobs (eg lzo codec)

$ java -cp $CLASSPATH HiveJdbcClient

# Alternatively, you can run the following bash script, which will seed the data file
# and build your classpath before invoking the client. The script adds all the
# additional jars needed for using HiveServer2 in embedded mode as well.

#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive

echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt

HADOOP_CORE=`ls $HADOOP_HOME/hadoop-*-core.jar`
CLASSPATH=.:$HIVE_HOME/conf:`hadoop classpath`

for i in ${HIVE_HOME}/lib/*.jar ; do
  CLASSPATH=$CLASSPATH:$i
done

java -cp $CLASSPATH HiveJdbcClient
This page documents changes that are visible to users. Hive Trunk (0.8.0-dev) Hive 0.7.1 Hive 0.7.0 Hive 0.6.0 Hive 0.5.0
Avro SerDe

The AvroSerde:
- Supports arbitrarily nested schemas.
- Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types don't exist in Hive and are automatically converted by the AvroSerde.
- Understands compressed Avro files.
- Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T and returns null when appropriate.
- Writes any Hive table to Avro files.
- Has worked reliably against our most convoluted Avro schemas in our ETL process.
Requirements
The AvroSerde has been built and tested against Hive 0.9.1 and Avro 1.5.
Avro types are converted to Hive types as follows:

Avro type   Hive type
boolean     boolean
int         int
long        bigint
float       float
double      double
bytes       Array[smallint] (Hive converts these to signed bytes)
string      string
record      struct
map         map
list        array
union       union (see note below)
enum        string (Hive has no concept of enums)
fixed       Array[smallint] (Hive converts the bytes to signed int)

Note on unions: Unions of [T, null] transparently convert to a nullable T; other unions translate directly to Hive's unions of those types. However, unions were introduced in Hive 0.7 and are not currently able to be used in where/group-by statements; they are essentially look-at-only. Because the AvroSerde transparently converts [T, null] to a nullable T, this limitation only applies to unions of multiple types or unions that are not of a single type and null.
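For reference, a minimal sketch of declaring an Avro-backed table (the table name and schema location are placeholders; the serde and container format class names are the ones shown in the example later in this section):

CREATE TABLE my_avro_table
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schema.avsc');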
At this point, the Avro-backed table can be worked with in Hive like any other table.
Example
Consider the following Hive table, which covers all of Hive's data types, making it a good example:

CREATE TABLE test_serializer(
  string1 STRING,
  int1 INT,
  tinyint1 TINYINT,
  smallint1 SMALLINT,
  bigint1 BIGINT,
  boolean1 BOOLEAN,
  float1 FLOAT,
  double1 DOUBLE,
  list1 ARRAY<STRING>,
  map1 MAP<STRING,INT>,
  struct1 STRUCT<sint:INT,sboolean:BOOLEAN,sstring:STRING>,
  union1 UNIONTYPE<FLOAT, BOOLEAN, STRING>,
  enum1 STRING,
  nullableint INT,
  bytes1 ARRAY<TINYINT>,
  fixed1 ARRAY<TINYINT>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY ':'
  MAP KEYS TERMINATED BY '#'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

To save this table as an Avro file, create an equivalent Avro schema (the namespace and actual name of the record are not important):
"namespace": "com.linkedin.haivvreo", "name": "test_serializer", "type": "record", "fields": [ { "name":"string1", "type":"string" }, { "name":"int1", "type":"int" }, { "name":"tinyint1", "type":"int" }, { "name":"smallint1", "type":"int" }, { "name":"bigint1", "type":"long" }, { "name":"boolean1", "type":"boolean" }, { "name":"float1", "type":"float" }, { "name":"double1", "type":"double" }, { "name":"list1", "type":{"type":"array", "items":"string"} }, { "name":"map1", "type":{"type":"map", "values":"int"} }, { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [ { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } }, { "name":"union1", "type":["float", "boolean", "string"] }, { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} }, { "name":"nullableint", "type":["int", "null"] }, { "name":"bytes1", "type":"bytes" }, { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} } ] } If the table were backed by a csv such as: why 4 3 1 hell 2 0 o 0 the re ano 9 4 1 the 8 0 r 1 rec ord thir d rec ord 4 5 1 5 0 2 1412 341 tr u e 42 85.23423 .4 424 3 alpha:bet Earth#42:Contr 17:true:A 0:3.1 a:gamma ol#86:Bob#31 be 4145 Linkedin 9 BL UE 72 0:1:2 :3:4: 5 50:5 1:53
9999 999
fa 99 0.000000 ls .8 09 e 9
beta
Earth#101
RE D
N 6:7:8 UL :9:10 L
54:5 5:56
9999 9999 9
tr u e
one can write it out to Avro with:

CREATE TABLE as_avro
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');

INSERT OVERWRITE TABLE as_avro SELECT * FROM test_serializer;

The files written by the Hive job are valid Avro files; however, MapReduce doesn't add the standard .avro extension. If you copy these files out, you'll likely want to rename them with .avro. Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. Therefore, it is incumbent on the query writer to make sure the target column types are correct. If they are not, Avro may accept the type or it may throw an exception; this depends on the particular combination of types.
Use avro.schema.url
Specifies a URL from which to access the schema. For HTTP schemas, this works for testing and small-scale clusters, but as the schema will be accessed at least once from each task in the job, this can quickly turn the job into a DDoS attack against the URL provider (a web server, for instance). Use caution when using this parameter for anything other than testing. The schema can also point to a location on HDFS, for instance hdfs://your-nn:9000/path/to/avsc/file. The AvroSerde will then read the file from HDFS, which should provide resiliency against many reads at once. Note that the serde will read this file from every mapper, so it's a good idea to set the replication of the schema file to a high value to provide good locality for the readers. The schema file itself should be relatively small, so this does not add a significant amount of overhead to the process.
Alternatively, the schema JSON can be embedded directly in the table definition via the avro.schema.literal table property. Note that the value is enclosed in single quotes and just pasted into the create statement.
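A hedged sketch of this approach, with a deliberately trivial placeholder schema:

CREATE TABLE embedded_schema_example
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.literal'='{
    "namespace": "com.example",
    "name": "simple_record",
    "type": "record",
    "fields": [ { "name": "field1", "type": "string" } ]
  }');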
FAQ
Why do I get error-error-error-error-error-error-error and a message to check avro.schema.literal and avro.schema.url when describing a table or running a query against a table?
The AvroSerde returns this message when it has trouble finding or parsing the schema provided by either the avro.schema.literal or avro.schema.url value. It is unable to be more specific because Hive expects all calls to the serde config methods to be successful, meaning we are unable to return an actual exception. By signaling an error via this message, the table is left in a good state and the incorrect value can be corrected with a call to ALTER TABLE T SET TBLPROPERTIES.
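For example, a bad schema location can be corrected with a statement along these lines (the table name and path are placeholders):

ALTER TABLE my_avro_table SET TBLPROPERTIES ('avro.schema.url'='hdfs:///corrected/path/to/schema.avsc');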
Installing Hive
Installing Hive is simple and only requires having Java 1.6 and Ant installed on your machine. Hive is available via SVN at http://svn.apache.org/repos/asf/hive/trunk. You can download it by running the following command:

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive

To build Hive, execute the following command in the base directory:

$ ant package

It will create the subdirectory build/dist with the following contents:

README.txt: readme file
bin/: directory containing all the shell scripts
lib/: directory containing all required jar files
conf/: directory with configuration files
examples/: directory with sample input and query files

The subdirectory build/dist should contain all the files necessary to run Hive. You can run it from there or copy it to a different location, if you prefer. In order to run Hive, you must have hadoop in your path or have defined the environment variable HADOOP_HOME with the hadoop installation directory. Moreover, we strongly advise users to create the HDFS directories /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w before tables are created in Hive. To use the Hive command line interface (CLI), go to the Hive home directory (the one with the contents of build/dist) and execute the following command:

$ bin/hive

Metadata is stored in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db. Using Derby in embedded mode allows at most one user at a time. To configure Derby to run in server mode, see HiveDerbyServerMode.
Configuring Hive
A number of configuration variables in Hive can be used by the administrator to change the behavior for their installations and user sessions. These variables can be configured in any of the following ways, shown in order of preference:

Using the set command in the CLI to set session-level values of a configuration variable for all statements subsequent to the set command, e.g.:
set hive.exec.scratchdir=/tmp/mydir;
sets the scratch directory (which is used by Hive to store temporary output and plans) to /tmp/mydir for all subsequent statements.

Using the -hiveconf option on the CLI for the entire session, e.g.:
bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir

In hive-site.xml. This is used for setting values for the entire Hive configuration, e.g.:
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/mydir</value>
  <description>Scratch space for Hive jobs</description>
</property>
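A few related CLI commands are useful for inspecting configuration at the session level; a brief sketch (output omitted):

-- list the variables currently set in the session and Hive's configuration
set;
-- also include Hadoop configuration variables
set -v;
-- print the current value of a single variable
set hive.exec.scratchdir;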
hive-default.xml.template contains the default values for the various configuration variables that come prepackaged in a Hive distribution. In order to override any of the values, create hive-site.xml instead and set the values in that file as shown above. Please note that the template file is not used by Hive at all (as of Hive 0.9.0), so it might be out of date or out of sync with the actual values; the canonical list of configuration options is now only managed in the HiveConf java class. hive-default.xml.template is located in the conf directory in your installation root, and hive-site.xml should also be created in the same directory. Broadly, the configuration variables fall into the groups listed below: Hive configuration variables, Hadoop configuration variables read or set by Hive, and runtime variables passed to map/reduce jobs and user scripts.
Hive configuration variables:

hive.ddl.output.format: The data format to use for DDL output text (e.g. DESCRIBE table). One of "text" (for human readable text) or "json" (for a json object). (as of Hive 0.9.0)

hive.exec.script.wrapper: Wrapper around any invocations to the script operator, e.g. if this is set to python, the script passed to the script operator will be invoked as python <script command>. If the value is null or not set, the script is invoked as <script command>. Default: null

hive.exec.scratchdir: This directory is used by Hive to store the plans for the different map/reduce stages of the query, as well as to store the intermediate outputs of these stages. Default: /tmp/<user.name>/hive

hive.exec.submitviachild: Determines whether the map/reduce jobs should be submitted through a separate jvm in the non local mode. Default: false (by default jobs are submitted through the same jvm as the compiler)

hive.exec.script.maxerrsize: Maximum number of serialization errors allowed in a user script invoked through TRANSFORM, MAP or REDUCE constructs. Default: 100000

hive.exec.compress.output: Determines whether the output of the final map/reduce job in a query is compressed or not. Default: false

hive.exec.compress.intermediate: Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. Default: false

hive.jar.path: The location of hive_cli.jar that is used when submitting jobs in a separate jvm.

hive.aux.jars.path: The location of the plugin jars that contain implementations of user defined functions and serdes.

hive.partition.pruning: A strict value for this variable indicates that an error is thrown by the compiler in case no partition predicate is provided on a partitioned table. This is used to protect against a user inadvertently issuing a query against all the partitions of the table. Default: nonstrict

hive.map.aggr: Determines whether map side aggregation is on or not. Default: true

hive.join.emit.interval: Default: 1000

hive.map.aggr.hash.percentmemory: Default: (float)0.5

hive.default.fileformat: Default file format for CREATE TABLE statements. Options are TextFile, SequenceFile and RCFile. Default: TextFile

hive.merge.mapfiles: Merge small files at the end of a map-only job. Default: true

hive.merge.mapredfiles: Merge small files at the end of a map-reduce job. Default: false

hive.merge.size.per.task: Size of merged files at the end of the job. Default: 256000000

hive.merge.smallfiles.avgsize: When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true. Default: 16000000

hive.querylog.enable.plan.progress: Whether to log the plan's progress every time a job's progress is checked. These logs are written to the location specified by hive.querylog.location. (as of Hive 0.10) Default: true

hive.querylog.location: Directory where structured hive query logs are created. One file per session is created in this directory. If this variable is set to the empty string, structured logs will not be created. Default: /tmp/<user.name>

hive.querylog.plan.progress.interval: The interval to wait between logging the plan's progress, in milliseconds. If there is a whole number percentage change in the progress of the mappers or the reducers, the progress is logged regardless of this value. The actual interval will be the ceiling of (this value divided by the value of hive.exec.counters.pull.interval) multiplied by the value of hive.exec.counters.pull.interval, i.e. if it is not divided evenly by the value of hive.exec.counters.pull.interval it will be logged less frequently than specified. This only has an effect if hive.querylog.enable.plan.progress is set to true. (as of Hive 0.10) Default: 60000

hive.stats.autogather: A flag to gather statistics automatically during the INSERT OVERWRITE command. (as of Hive 0.7.0) Default: true

hive.stats.dbclass: The default database that stores temporary hive statistics. Valid values are hbase and jdbc, where jdbc should have a specification of the database to use, separated by a colon (e.g. jdbc:mysql). (as of Hive 0.7.0) Default: jdbc:derby

hive.stats.dbconnectionstring: The default connection string for the database that stores temporary hive statistics. (as of Hive 0.7.0) Default: jdbc:derby:;databaseName=TempStatsStore;create=true

hive.stats.jdbcdriver: The JDBC driver for the database that stores temporary hive statistics. (as of Hive 0.7.0) Default: org.apache.derby.jdbc.EmbeddedDriver

hive.stats.reliable: Whether queries will fail because stats cannot be collected completely accurately. If this is set to true, reading/writing from/into a partition may fail because the stats could not be computed accurately. (as of Hive 0.10.0) Default: false

hive.enforce.bucketing: If enabled, enforces inserts into bucketed tables to also be bucketed. Default: false

hive.variable.substitute: Substitutes variables in Hive statements which were previously set using the set command, system variables or environment variables. See HIVE-1096 for details. (as of Hive 0.7.0) Default: true

hive.variable.substitute.depth: The maximum replacements the substitution engine will do. (as of Hive 0.10.0) Default: 40

Hadoop configuration variables read or set by Hive:

hadoop.config.dir: Default: $HADOOP_HOME/conf

fs.default.name

map.input.file

mapred.job.tracker: The url to the jobtracker. If this is set to local then map/reduce is run in the local mode.

mapred.reduce.tasks: The number of reducers for each map/reduce stage in the query plan.

mapred.job.name: The name of the map/reduce job. Default: null

Runtime variables passed to map/reduce jobs and user scripts:

hive.session.id: The id of the Hive session.

hive.query.string: The query string passed to the map/reduce job.

hive.query.planid: The id of the plan for the map/reduce stage.

hive.jobname.length: The maximum length of the jobname. Default: 50

hive.table.name: The name of the hive table. This is passed to the user scripts through the script operator.

hive.partition.name: The name of the hive partition. This is passed to the user scripts through the script operator.

hive.alias: The alias being processed. This is also passed to the user scripts through the script operator.
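Most of the variables above can be overridden per session with the set command; a hedged sketch using a few of the variables listed (the values are illustrative only, not recommendations):

set hive.exec.compress.intermediate=true;
set hive.default.fileformat=SequenceFile;
set hive.merge.mapfiles=false;
-- confirm the change
set hive.default.fileformat;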
Temporary Folders
Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind. The configuration details are as follows:
- On the HDFS cluster, this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir.
- On the client machine, this is hardcoded to /tmp/<username>.
Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.
Log Files
The Hive client produces logs and history files on the client machine. Please see Error Logs for configuration details.

Introduction Embedded Metastore Local Metastore Remote Metastore
Introduction
All the metadata for Hive tables and partitions is stored in the Hive Metastore. Metadata is persisted using the JPOX ORM solution, so any datastore supported by it can be used: most of the commercial relational databases and many open source datastores are supported, and any datastore with a JDBC driver can probably be used. You can find an E/R diagram for the metastore here. There are 3 different ways to set up a metastore server using different Hive configurations. The relevant configuration parameters are:

javax.jdo.option.ConnectionURL: JDBC connection string for the data store which contains metadata
javax.jdo.option.ConnectionDriverName: JDBC driver class name for the data store which contains metadata
hive.metastore.uris: Hive connects to this URI to make metadata requests for a remote metastore
hive.metastore.local: local or remote metastore (removed as of Hive 0.10: if hive.metastore.uris is empty, local mode is assumed, remote otherwise)
hive.metastore.warehouse.dir: URI of the default location for native tables
These variables were carried over from old documentation without a guarantee that they all still exist:

hive.metastore.metadb.dir
hive.metastore.usefilestore
hive.metastore.rawstore.impl
org.jpox.autoCreateSchema: Creates the necessary schema on startup if one doesn't exist (e.g. tables, columns). Set to false after creating it once.
org.jpox.fixedDatastore: Whether the datastore schema is fixed.
hive.metastore.checkForDefaultDb
hive.metastore.ds.connection.url.hook: Name of the hook to use for retrieving the JDO connection URL. If empty, the value in javax.jdo.option.ConnectionURL is used as the connection URL.
hive.metastore.ds.retry.attempts: The number of times to retry a call to the backing datastore if there was a connection error. Default: 1
hive.metastore.ds.retry.interval: The number of milliseconds between datastore retry attempts. Default: 1000
hive.metastore.server.min.threads: Minimum number of worker threads in the Thrift server's pool. Default: 200
hive.metastore.server.max.threads: Maximum number of worker threads in the Thrift server's pool. Default: 10000
Default configuration sets up an embedded metastore which is used in unit tests and is described in the next section. More practical options are described in the subsequent sections.
Embedded Metastore
Mainly used for unit tests; only one process can connect to the metastore at a time, so it is not really a practical solution but it works well for unit tests.

javax.jdo.option.ConnectionURL: jdbc:derby:;databaseName=../build/test/junit_metastore_db;create=true (Derby database located at hive/trunk/build...)
javax.jdo.option.ConnectionDriverName: org.apache.derby.jdbc.EmbeddedDriver (Derby embedded JDBC driver class)
hive.metastore.uris: not needed since this is a local metastore
hive.metastore.local: true (embedded is local)
hive.metastore.warehouse.dir: unit test data goes in here on your local filesystem
If you want to run the metastore as a network server so it can be accessed from multiple nodes try HiveDerbyServerMode.
Local Metastore
In the local metastore setup, each Hive client opens a connection to the datastore and makes SQL queries against it directly. The following configuration sets up a metastore in a MySQL server. Make sure that the server is accessible from the machines where Hive queries are executed, since this is a local store, and that the JDBC client library is in the classpath of the Hive client.

javax.jdo.option.ConnectionURL: jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true (metadata is stored in a MySQL server)
javax.jdo.option.ConnectionDriverName: com.mysql.jdbc.Driver (MySQL JDBC driver class)
javax.jdo.option.ConnectionUserName: <user name> (user name for connecting to the MySQL server)
javax.jdo.option.ConnectionPassword: <password> (password for connecting to the MySQL server)
hive.metastore.uris: not needed because this is a local store
hive.metastore.local: true (this is a local store)
hive.metastore.warehouse.dir: default location for Hive tables
Remote Metastore
In the remote metastore setup, all Hive clients make a connection to a metastore server, which in turn queries the datastore (MySQL in this example) for metadata. The metastore server and client communicate using the Thrift protocol. Starting with Hive 0.5.0, you can start a Thrift server by executing the following command:

hive --service metastore

In versions of Hive earlier than 0.5.0, it is instead necessary to run the Thrift server via direct execution of Java:

$JAVA_HOME/bin/java -Xmx1024m -Dlog4j.configuration=file://$HIVE_HOME/conf/hms-log4j.properties -Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64/ -cp $CLASSPATH org.apache.hadoop.hive.metastore.HiveMetaStore

If you execute Java directly, then JAVA_HOME, HIVE_HOME, and HADOOP_HOME must be correctly set; CLASSPATH should contain Hadoop, Hive (lib and auxlib), and Java jars.

Server Configuration Parameters

javax.jdo.option.ConnectionURL: jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true (metadata is stored in a MySQL server)
javax.jdo.option.ConnectionDriverName: com.mysql.jdbc.Driver (MySQL JDBC driver class)
javax.jdo.option.ConnectionUserName: <user name> (user name for connecting to the MySQL server)
javax.jdo.option.ConnectionPassword: <password> (password for connecting to the MySQL server)
hive.metastore.warehouse.dir: default location for Hive tables

Client Configuration Parameters

hive.metastore.uris: thrift://<host_name>:<port> (host and port for the Thrift metastore server)
hive.metastore.local: false
hive.metastore.warehouse.dir: default location for Hive tables
If you are using MySQL as the datastore for metadata, put MySQL client libraries in HIVE_HOME/lib before starting Hive Client or HiveMetastore Server.
Background
This document explores the different ways of leveraging Hive on Amazon Web Services namely S3, EC2 and Elastic Map-Reduce. Hadoop already has a long tradition of being run on EC2 and S3. These are well documented in the links below which are a must read:
The second document also has pointers on how to get started using EC2 and S3. For people who are new to S3 - there's a few helpful notes in S3 for n00bs section below. The rest of the documentation below assumes that the reader can launch a hadoop cluster in EC2, copy files into and out of S3 and run some simple Hadoop jobs.
Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation, launch a Hadoop cluster with the desired Hadoop version etc. on EC2 and start running queries. Map-reduce scripts are automatically pushed by Hive into Hadoop's distributed cache at job submission time and do not need to be copied to the Hadoop machines. Hive Metadata can be stored on local disk painlessly.
However - the one downside of Option 2 is that jar files are copied over to the Hadoop cluster for each map-reduce job. This can cause high latency in job submission as well as incur some AWS network transmission costs. Option 1 seems suitable for advanced users who have figured out a stable Hadoop and Hive (and potentially external libraries) configuration that works for them and can create a new AMI with the same.
Considering these factors, the following makes sense in terms of Hive tables:
1. For long-lived tables, use S3 based storage mechanisms.
2. For intermediate data and tmp tables, use HDFS.
[Case Study 1] shows how to achieve such an arrangement using the S3N filesystem; a brief sketch also follows below. If the user is running the Hive CLI from their personal workstation, they can also use Hive's 'load data local' commands as a convenient alternative (to dfs commands) to copy data from their local filesystems (accessible from their workstation) into tables defined over either HDFS or S3.
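A hedged sketch of such an arrangement (the bucket, paths, table names, and columns are placeholders; S3 credentials are assumed to already be configured for the s3n filesystem):

-- long-lived table stored in S3 via the S3N filesystem
CREATE EXTERNAL TABLE page_views_s3 (userid BIGINT, page STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3n://my-bucket/warehouse/page_views/';

-- intermediate/tmp table kept in HDFS
CREATE TABLE page_views_tmp (userid BIGINT, page STRING);

-- copy a file from the local workstation into the HDFS-backed table
LOAD DATA LOCAL INPATH '/home/me/page_views.tsv' INTO TABLE page_views_tmp;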
Configuration variables such as mapred.job.tracker need to be changed to point the CLI from one Hadoop cluster to another. Beware, though, that tables stored in the previous HDFS instance will not be accessible as the CLI switches from one cluster to another. Again, more details can be found in [Case Study 1].
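A hedged sketch of repointing a running CLI session at a different cluster (host names and ports are placeholders):

set fs.default.name=hdfs://new-namenode:8020;
set mapred.job.tracker=new-jobtracker:8021;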
Case Studies
1. [Querying files in S3 using EC2, Hive and Hadoop ]
Appendix
S3 for n00bs
One of the useful things to understand is how S3 is normally used as a file system. Each S3 bucket can be considered the root of a file system. Different files within this filesystem become objects stored in S3, where the path name of the file (path components joined with '/') becomes the S3 key within the bucket and the file contents become the value. Different tools like S3Fox (https://addons.mozilla.org/en-US/firefox/addon/3247) and the native S3 FileSystem in Hadoop (s3n) show a directory structure that is implied by the common prefixes found in the keys. Not all tools are able to create an empty directory; in particular, S3Fox can (by creating an empty key representing the directory). Other popular tools like aws, s3cmd and s3curl provide convenient ways of accessing S3 from the command line, but don't have the capability of creating empty directories.
5. Providing deep integration, and optimized performance, with AWS services such as S3 and EC2 and AWS features such as Spot Instances, Elastic IPs, and Identity and Access Management (IAM) Please refer to the following link to view the Amazon Elastic MapReduce Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/ Amazon Elastic MapReduce provides you with multiple clients to run your Hive cluster. You can launch a Hive cluster using the AWS Management Console, the Amazon Elastic MapReduce Ruby Client, or the AWS Java SDK. You may also install and run multiple versions of Hive on the same cluster, allowing you to benchmark a newer Hive version alongside your previous version. You can also install a newer Hive version directly onto an existing Hive cluster.
Supported versions:
Hadoop Version   Hive Version
0.18             0.4
0.20             0.5, 0.7, 0.7.1
Hive Defaults
Thrift Communication port
Hive Version   Thrift port
0.4            10000
0.5            10000
0.7            10001
0.7.1          10002
Log File
Hive Version Log location
MetaStore
By default, Amazon Elastic MapReduce uses MySQL, preinstalled on the Master Node, for its Hive metastore. Alternatively, you can use the Amazon Relational Database Service (Amazon RDS) to ensure the metastore is persisted beyond the life of your cluster. This also allows you to share the metastore between multiple Hive clusters. Simply override the default location of the MySQL database to the external persistent storage location.
Hive CLI
EMR configures the master node to allow SSH access. You can log onto the master node and execute Hive commands using the Hive CLI. If you have multiple versions of Hive installed on the cluster, you can access each one of them via a separate command:

Hive Version   Hive command
0.4            hive
0.5            hive-0.5
0.7            hive-0.7
0.7.1          hive-0.7.1
EMR sets up a separate Hive metastore and Hive warehouse for each installed Hive version on a given cluster, so creating tables using one version does not interfere with the tables created using another installed version. Please note that if you point multiple Hive tables to the same location, updates to one table become visible to the other tables.
Hive Server
EMR runs a Thrift Hive server on the master node of the Hive cluster. It can be accessed using any JDBC client (for example, squirrel SQL) via Hive JDBC drivers. The JDBC drivers for different Hive versions can be downloaded via the following links:

Hive Version   Hive JDBC
0.5            http://aws.amazon.com/developertools/0196055244487017
0.7            http://aws.amazon.com/developertools/1818074809286277
0.7.1          http://aws.amazon.com/developertools/8084613472207189
Here is the process to connect to the Hive Server using a JDBC driver: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Hive.html#Hiv eJDBCDriver
Hive S3 Tables
An Elastic MapReduce Hive cluster comes configured for communication with S3. You can create tables and point them to your S3 location and Hive and Hadoop will communicate with S3 automatically using your provided credentials. Once you have moved data to an S3 bucket, you simply point your table to that location in S3 in order to read or process data via Hive. You can also create partitioned tables in S3. Hive on Elastic MapReduce provides support for dynamic partitioning in S3.
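For illustration, a hedged sketch of an S3-backed, partitioned table on EMR (the bucket, paths, columns, and partition values are placeholders):

CREATE EXTERNAL TABLE impressions (ad_id STRING, clicks INT)
  PARTITIONED BY (dt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://my-bucket/tables/impressions/';

ALTER TABLE impressions ADD PARTITION (dt = '2009-04-13')
  LOCATION 's3://my-bucket/tables/impressions/dt=2009-04-13/';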
Hive Logs
Hive application logs: All Hive application logs are redirected to /mnt/var/log/apps/ directory. Hadoop daemon logs: Hadoop daemon logs are available in /mnt/var/log/hadoop/ folder. Hadoop task attempt logs are available in /mnt/var/log/hadoop/userlogs/ folder on each slave node in the cluster.
Tutorials
The following Hive tutorials are available for you to get started with Hive on Elastic MapReduce:
1. Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce: http://aws.amazon.com/articles/Elastic-MapReduce/5249664154115844
2. Contextual Advertising using Apache Hive and Amazon Elastic MapReduce with High Performance Computing instances: http://aws.amazon.com/articles/Elastic-MapReduce/2855
3. Operating a Data Warehouse with Hive, Amazon Elastic MapReduce and Amazon SimpleDB: http://aws.amazon.com/articles/Elastic-MapReduce/2854
4. Running Hive on Amazon ElasticMap Reduce: http://aws.amazon.com/articles/2857
In addition, Amazon provides step-by-step video tutorials: http://aws.amazon.com/articles/2862
Support
You can ask questions related to Hive on Elastic MapReduce on Elastic MapReduce forums at: https://forums.aws.amazon.com/forum.jspa?forumID=52 Please also refer to the EMR developer guide for more information: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ Contributed by: Vaibhav Aggarwal