BDA-Unit 5-notes
UNIT – V
Pig Latin: Installing and Running Pig, Comparison with Databases, Different Pig Latin
expression, Ways of executing Pig Programs, Built-in functions in Pig, Data Processing
Operators, Pig Latin commands/ Pig in Practice.
Hive: Installing Hive, The Hive Shell, Comparison with Traditional Databases, HiveQL,
Tables, User Defined Functions.
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language
provides various operators using which programmers can develop their own functions for reading,
writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these
scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Programmers who are not comfortable with Java normally struggle when working with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code.
Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require 200 lines of code (LoC) in Java can be written in as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces development time by almost 16 times.
Pig Latin is a SQL-like language, so it is easy to learn Apache Pig when you are familiar with SQL.
Apache Pig provides many built-in operators to support data operations like joins, filters,
ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig:
Apache Pig comes with the following features −
Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are
good at SQL.
Optimization opportunities − The tasks in Apache Pig optimize their execution automatically,
so the programmers need to focus only on semantics of the language.
Extensibility − Using the existing operators, users can develop their own functions to read,
process, and write data.
UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
Pig vs. SQL:
Pig Latin is a procedural language, whereas SQL is a declarative language.
The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the
programmer’s job easy. The architecture of Apache Pig is shown below.
The main components of the Pig architecture are:
1. Parser – It checks the syntax of the script, does type checking and other checks, and produces a DAG (directed acyclic graph) of the Pig Latin statements.
2. Optimizer – It performs activities such as merge, split, joins, order by, group by, etc. It basically tries to reduce the amount of data being sent to the next stage.
3. Compiler – It compiles the optimized logical plan into a series of MapReduce jobs.
4. Execution Engine – It submits the MapReduce jobs to Hadoop, where they are executed to produce the desired results.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A
tuple is similar to a row in a table of RDBMS.
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag.
Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a
table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same
number of fields or that the fields in the same position (column) have the same type.
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be
unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are
processed in any particular order).
It’s convenient to add Pig’s binary directory to your command-line path. For example:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.
Try typing pig -help to get usage instructions.
To run Pig interactively in local mode, start the Grunt shell:
pig -x local
This opens the Grunt shell, where you can enter Pig Latin commands interactively.
To run a Pig Latin script in batch mode, pass the script file to the pig command:
pig myscript.pig
Now you’re ready to write and execute Pig Latin scripts for data analysis!
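As a minimal sketch, a script such as myscript.pig might look like the following (the input file, field names, and filter condition are hypothetical):
-- myscript.pig: load comma-separated records, keep the valid ones, and print them
records = LOAD '/pig_data/sample.txt' USING PigStorage(',') AS (id:int, name:chararray);
valid_records = FILTER records BY id > 0;
DUMP valid_records;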
The organizations looking for open source querying and programming to tame Big data have adopted
Hive and Pig widely. At the same time, it is vital to pick and choose the right platform and tool for
managing your data well. Hence it is essential to understand the differences between Hive vs Pig vs SQL
and choose the best suitable option for the project.
Does the comparison of Hive vs Pig vs SQL decide a single winner?
We have seen that there are significant differences among the three – Hive vs Pig vs SQL. All of these perform specific functions and meet unique requirements of the business. Also, all three require proper infrastructure and skills for their efficient use while working on data sets.
Pig vs. Hive vs. SQL:
Suitability – Pig is suitable for complex as well as nested data structures; Hive is ideal for batch processing – OLAP (Online Analytical Processing); SQL is ideal for more straightforward business demands for fast data analysis.
Compatibility – Pig works on top of MapReduce; Hive works on top of MapReduce; SQL is not compatible with MapReduce programming.
Local Mode
o It executes in a single JVM and is used for development, experimenting, and prototyping.
o Here, files are installed and run using localhost.
o The local mode works on a local file system. The input and output data are stored in the local file system.
1. $ pig -x local
MapReduce Mode
o In MapReduce mode, the Pig Latin statements are translated into MapReduce jobs that run on a Hadoop cluster, reading input from and writing output to HDFS.
1. $ pig
Or,
1. $ pig -x mapreduce
The following are the ways of executing a Pig program in local and MapReduce mode:
o Interactive Mode (Grunt) - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once in Grunt mode, we can enter Pig Latin statements and commands interactively at the command line.
o Batch Mode(Script) - In this mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions. These functions can
be called as UDF (User Defined Functions). Here, we use programming languages like
Java and Python.
The Load Operator: You can load data into Apache Pig from the file system (HDFS/ Local)
using LOAD operator of Pig Latin.
Relation name – We store the loaded data in the relation (schema) named student.
Input file path – We read data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function – We use the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter that separates each entity of a tuple; by default, the delimiter is ‘\t’.
Schema – We describe the loaded data with a schema (column names and types).
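Putting these together, a LOAD statement might look like this (the column names and the comma delimiter are assumptions for illustration):
grunt> student = LOAD '/pig_data/student_data.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);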
The Store Operator: You can store the loaded data in the file system using the store operator.
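For example, to write the student relation back to HDFS (the output path here is hypothetical):
grunt> STORE student INTO '/pig_output/student_out' USING PigStorage(',');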
i) The FILTER operator: is used to select the required tuples from a relation based on a
condition.
ii) The DISTINCT operator: is used to remove redundant (duplicate) tuples from a relation.
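For example (the field name and value used in the condition are assumptions):
grunt> filter_data = FILTER student_details BY city == 'Chennai';
grunt> distinct_data = DISTINCT student_details;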
3. Grouping and Joining Data: Group, Cogroup, Join and Cross operators
i) The GROUP operator is used to group the data in one or more relations. It collects the data having the
same key.
Syntax: grunt> Group_data = GROUP Relation_name BY age;
Example: grunt> group_data = GROUP student_details by age;
Grouping by Multiple Columns
Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
Group All
You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;
ii) The COGROUP operator : works more or less in the same way as the GROUP operator. The only
difference between the two operators is that the group operator is normally used with one relation, while
the cogroup operator is used in statements involving two or more relations.
Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS
directory /pig_data/
Ex: grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
grunt> Dump cogroup_data;
The cogroup operator groups the tuples from each relation according to age where each group depicts a
particular age value.
iii) The JOIN operator: is used to combine records from two or more relations. While performing a join
operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the
two particular tuples are matched, else the records are dropped. Joins can be of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two
files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS .
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least
one relation.
Syntax: grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Let us perform self-join operation on the relation customers, by joining the two
relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when
there is a match in both tables.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example:grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer
join operation is carried out in three ways −
Left outer join
Right outer join
Full outer join
The left outer Join operation returns all rows from the left table, even if there are no matches in the right
relation.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY
customer_id;
Example:grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
The right outer join operation returns all rows from the right table, even if there are no matches in the
left table.
Syntax: grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Example: grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
The full outer join operation returns rows when there is a match in one of the relations.
Syntax: grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Example: grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
iv) The CROSS operator computes the cross-product of two or more relations.
Syntax: grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example: grunt> cross_data = CROSS customers, orders;
i) The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION
operation on two relations, their columns and domains must be identical.
Syntax: grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example: grunt> student = UNION student1, student2;
ii) The SPLIT operator is used to split a relation into two or more relations.
Syntax: grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example:
SPLIT student_details into student_details1 if age<23, student_details2 if (22<age and age<25);
iii) The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax: grunt> Result = LIMIT Relation_name required_number_of_tuples;
Example: grunt> limit_data = LIMIT student_details 4;
The load statement will simply load the data into the specified relation in Apache Pig. To verify
the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin
provides four different types of diagnostic operators −
Dump operator
Describe operator
Explanation operator
Illustration operator
i) The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.
Syntax:grunt> Dump Relation_Name
Example: grunt> Dump student
ii)The describe operator is used to view the schema of a relation.
Syntax : grunt> Describe Relation_name
Example: grunt> describe student;
iii)The explain operator is used to display the logical, physical, and MapReduce execution plans of a
relation.
Syntax : grunt> explain Relation_name;
Example: grunt> explain student;
iv)The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax: grunt> illustrate Relation_name;
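Example (using the student relation loaded earlier): grunt> illustrate student;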
Parallelism
When running in MapReduce mode it’s important that the degree of parallelism
matches the size of the dataset. By default, Pig sets the number of reducers by
looking at the size of the input, using one reducer per 1 GB of input, up to a maximum
of 999 reducers. You can override these parameters by setting
pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes) and
pig.exec.reducers.max (default 999).
To explicitly set the number of reducers you want for each job, you can use a PARALLEL
clause for operators that run in the reduce phase. These include all the grouping and
joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and
ORDER. The following line sets the number of reducers to 30 for the GROUP:
grouped_records = GROUP records BY year PARALLEL 30;
Alternatively, you can set the default_parallel option, and it will take effect for all
subsequent jobs:
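For example, to use 30 reducers by default (the value here is just illustrative):
grunt> set default_parallel 30;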
Parameter Substitution
If you have a Pig script that you run on a regular basis, then it’s quite common to want
to be able to run the same script with different parameters. For example, a script that
runs daily may use the date to determine which input files it runs over. Pig supports
parameter substitution, where parameters in the script are substituted with values supplied at runtime.
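As a sketch (the script name, parameter name, and path are hypothetical), a script can reference a parameter as $input:
records = LOAD '$input' USING PigStorage(',');
and be run with the -param option:
% pig -param input=/user/tom/input/2024-01-01 daily.pig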
5.8.Hive: Introduction
What is Hive?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce
Features of Hive
o It stores schema in a database and processed data into HDFS.
o It is designed for OLAP.
o It provides an SQL-type language for querying called HiveQL or HQL.
o It is familiar, fast, scalable, and extensible.
Architecture of Hive
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types
of clients such as:-
o Thrift Server - It is a cross-language service provider platform that serves the request from all those
programming languages that supports Thrift.
o JDBC Driver - It is used to establish a connection between hive and Java applications. The JDBC Driver is
present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
Hive Services
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and
commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a web-based
GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different clients and
provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver.
It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the
different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of map-reduce tasks and
HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their
dependencies.
The Metastore
o The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data.
o By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration (see Figure 12-2).
Installing Hive is a matter of downloading a release, unpacking it, and adding Hive’s binary directory to your command-line path. For example:
% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_INSTALL/bin
You can then launch the Hive shell by typing hive, which gives you an interactive prompt (hive>) where you enter HiveQL commands.
Other useful Hive shell features include the ability to run commands on the host operating system by
using a ! prefix to the command and the ability to access Hadoop filesystems using the dfs command.
In a traditional database, a table’s schema is enforced at data load time. If the data being loaded doesn’t
conform to the schema, then it is rejected. This design is sometimes called schema on write, since the data
is checked against the schema when it is written into the database.
Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a query is issued. This
is called schema on read.
There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since
the data does not have to be read, parsed, and serialized to disk in the database’s internal format. The load
operation is just a file copy or move. It is more flexible, too: consider having two schemas for the same
underlying data, depending on the analysis being performed
Schema on write makes query time performance faster, since the database can index columns and perform
compression on the data. The trade-off, however, is that it takes longer to load data into the database.
Furthermore, there are many scenarios where the schema is not known at load time, so there are no
indexes to apply, since the queries have not been formulated yet. These scenarios are where Hive shines.
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these
features have not been considered a part of Hive’s feature set. This is because Hive was built to operate
over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by
transforming the data into a new table. For a data warehousing application that runs over large portions of
the dataset, this works well.
However, there are workloads where updates (or insert appends, at least) are needed, or where indexes
would yield significant performance gains. On the transactions front, Hive doesn’t define clear semantics
for concurrent access to tables, which means applications need to build their own application-level
concurrency or locking mechanism. The Hive team is actively working on improvements in all these
areas.
Change is also coming from another direction: HBase integration. HBase has different storage
characteristics to HDFS, such as the ability to do row updates and column indexing, so we can expect to
see these features used by Hive in future releases. It is already possible to access HBase tables from Hive.
RDBMS vs. Hive:
Schema Flexibility – In an RDBMS, the schema is fixed; in Hive, the schema varies (schema on read).
Partitioning Support – An RDBMS doesn’t support partitioning; Hive supports partitioning.
5.1.2. HiveQL
HiveQL (Hive Query Language) is the SQL-like query language used to communicate with Apache Hive, a data warehouse built on top of Hadoop. It is used for querying and managing massive distributed data sets held in Hadoop's HDFS (Hadoop Distributed File System) or other comparable storage systems. HiveQL is intended to make big data analysis and querying easier, especially for individuals who are already comfortable with SQL.
Hive provides a SQL-like environment for interacting with databases, tables, and queries through HiveQL. Different types of clauses are available for performing various kinds of data processing and querying. Hive also has JDBC connectivity.
Hive’s SQL dialect, called HiveQL, does not support the full SQL-92 specification. There are a number of
reasons for this. Being a fairly young project, it has not had time to provide the full repertoire of SQL-92
language constructs
Data Types
Hive data types are divided into the following categories:
Hive supports both primitive and complex data types. Primitives include numeric, boolean, string, and
timestamp types. The complex data types include arrays, maps, and structs. Hive’s data types are listed
in Table 12-3. Note that the literals shown are those used from within HiveQL; they are not the serialized
form used in the table’s storage format.
Operators and Functions
The usual set of SQL operators is provided by Hive:
relational operators (such as x = 'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for
pattern matching),
logical operators (such as x OR y for logical OR). The operators match those in MySQL, which deviates
from SQL-92 since || is logical OR, not string concatenation. Use the concat function for the latter in
both MySQL and Hive.
Hive comes with a large number of built-in functions—too many to list here—divided into categories
including mathematical and statistical functions, string functions, date functions (for operating on string
representations of dates), conditional functions, aggregate functions, and functions for working with
XML (using the xpath function) and JSON.
You can retrieve a list of functions from the Hive shell by typing SHOW FUNCTIONS. To get brief usage
instructions for a particular function, use the DESCRIBE command:
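For example, to list the available functions and see brief usage for one of them (the length function is used here only as an example):
hive> SHOW FUNCTIONS;
hive> DESCRIBE FUNCTION length;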
If there is no built-in function that does what you want, you can write your own; see “User-Defined Functions” below.
5.1.3. Tables
A Hive table is logically made up of the data being stored and the associated metadata describing the
layout of the data in the table. The data typically resides in HDFS, although it may reside in any Hadoop
filesystem, including the local filesystem or S3. Hive stores the metadata in a relational database—and
not in HDFS.
Multiple Database/Schema Support
Many relational databases have a facility for multiple namespaces, which allow users and applications to be segregated into different
databases or schemas. Hive supports the same facility, and provides commands such as CREATE DATABASE dbname, USE dbname, and DROP
DATABASE dbname. You can fully qualify a table by writing dbname.tablename. If no database is specified, tables belong to the default
database.
Create Table
We use the CREATE TABLE statement to create a table; a typical form is shown below.
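A minimal sketch (the table name, columns, and delimiter are illustrative, not prescribed by these notes):
hive> CREATE TABLE student (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';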
Now that the table has been created, it’s time to load data into it. We can load the data from any local file on our system, as shown below.
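For example (the file path is hypothetical):
hive> LOAD DATA LOCAL INPATH '/home/user/student.csv' INTO TABLE student;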
Alter Table
In Hive, we can make multiple modifications to existing tables, such as renaming a table or adding more columns to it. The commands to alter a table are very similar to the SQL commands.
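For example (assuming the illustrative student table sketched above):
hive> ALTER TABLE student RENAME TO student_info;
hive> ALTER TABLE student_info ADD COLUMNS (phone STRING);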
Drop Table
Dropping a table from the Hive metastore deletes the entire table with all rows and columns.
Syntax:
DROP TABLE table_name;
When you create a table in Hive, by default Hive will manage the data, which means that Hive moves the
data into its warehouse directory. Alternatively, you may create an external table, which tells Hive to
refer to the data that is at an existing location outside the warehouse directory.
The difference between the two types of table is seen in the LOAD and DROP semantics. Let’s consider a
managed table first.
When you load data into a managed table, it is moved into Hive’s warehouse directory. For example:
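The statements might look like the following (the dummy column is a placeholder assumed for illustration):
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;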
will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for the managed_table
table, which is hdfs://user/hive/warehouse/managed_table.
If the table is later dropped, using a statement such as DROP TABLE managed_table; then the table, including its metadata and its data, is deleted.
An external table behaves differently. You control the creation and deletion of the data. The location of
the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
Hive assumes that it owns the data for managed tables. For external tables, Hive assumes that it does not manage
the data.
Managed vs. external tables:
Managed table: If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. External table: Dropping the table does not delete the data, although the metadata for the table will be deleted.
Managed table: Hive stores the data in its warehouse directory. External table: Hive stores the data in the LOCATION specified during creation of the table (generally not in the warehouse directory).
Managed table: Provides ACID/transactional support. External table: Does not provide ACID/transactional support.
Tables or partitions may further be subdivided into buckets, to give extra structure to the data that may
be used for more efficient queries. For example, bucketing by user ID means we can quickly evaluate a
user-based query by running it on a randomized sample of the total set of users.
Partitions
A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date,
we might also subpartition each date partition by country to permit efficient queries by location.
Partitions are defined at table creation time using the PARTITIONED BY clause, which takes a list of
column definitions.
For the hypothetical log files example, we might define a table with records comprising a timestamp
and the log line itself:
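A sketch of such a table definition (the table and column names are illustrative):
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);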
Buckets
Hive bucketing, a.k.a. clustering, is a technique to split the data into more manageable files by specifying the number of buckets to create. The value of the bucketing column is hashed into a user-defined number of buckets.
Bucketing is created on just one column; you can also create bucketing on a partitioned table to further split the data, which further improves the query performance of the partitioned table.
Each bucket is stored as a file within the table’s directory or the partition’s directories. Note that a partition creates a directory, and you can have a partition on one or more columns; these are some of the differences between Hive partitions and buckets.
From our example, we already have a partition on state, which leads to around 50 subdirectories under the table directory, and creating 10 buckets on the zipcode column creates 10 files within each partition subdirectory.
First, let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to
specify the columns to bucket on and the number of buckets:
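A sketch (the table name, column, and bucket count are illustrative):
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;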
Partitioning vs. bucketing:
Partitioning: a directory is created on HDFS for each partition. Bucketing: a file is created on HDFS for each bucket.
Partitioning: you can have one or more partition columns. Bucketing: you can have only one bucketing column.
Partitioning: you can’t manage the number of partitions to create. Bucketing: you can manage the number of buckets to create by specifying the count.
There are two dimensions that govern table storage in Hive: the row format and the
file format. The row format dictates how rows, and the fields in a particular row, are
stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word
for a Serializer-Deserializer
The file format dictates the container format for fields in a row. The simplest format is
a plain text file, but there are row-oriented and column-oriented binary formats available,
too.
The default storage format: Delimited text
When you create a table with no ROW FORMAT or STORED AS clauses, the default format is
delimited text, with a row per line.
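For reference, the explicit equivalent of this default is roughly the following (the table name and columns are illustrative; the control characters shown are Hive’s default delimiters):
CREATE TABLE my_table (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;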
Binary storage formats: Sequence files and RCFiles
Hadoop’s sequence file format is a general-purpose binary format for sequences of records (key-value pairs). You can use sequence files in Hive by using the declaration STORED AS SEQUENCEFILE in the CREATE TABLE statement.
One of the main benefits of using sequence files is their support for splittable compression.
If you have a collection of sequence files that were created outside Hive, then Hive will read them with
no extra configuration. If, on the other hand, you want tables populated from Hive to use compressed
sequence files for their storage, you need to set a few properties to enable compression :
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compress=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> INSERT OVERWRITE TABLE ...;
Sequence files are row-oriented. What this means is that the fields in each row are stored together, as the
contents of a single sequence file record.
Hive provides another binary storage format called RCFile, short for Record Columnar File. RCFiles are similar to sequence files, except that they store data in a column-oriented fashion. RCFile breaks up the table into row splits; then, within each split, it stores the values for each row in the first column, followed by the values for each row in the second column, and so on. This is shown diagrammatically in Figure 12-3.
Figure 12-3. Row-oriented versus column-oriented storage
A column-oriented layout permits columns that are not accessed in a query to be skipped.
Consider a query of the table in Figure 12-3 that processes only column 2. With row-oriented storage, like
a sequence file, the whole row (stored in a sequence file record) is loaded into memory, even though only
the second column is actually read.
Importing Data
We’ve already seen how to use the LOAD DATA operation to import data into a Hive table (or partition)
by copying or moving files to the table’s directory. You can also populate a table with data from another
Hive table using an INSERT statement, or at creation time using the CTAS construct, which is an
abbreviation used to refer to CREATE TABLE...AS SELECT.
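For example, a CTAS statement might look like this (reusing the source/target names from the examples below; the columns are illustrative):
CREATE TABLE target
AS
SELECT col1, col2
FROM source;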
INSERT OVERWRITE TABLE
Here’s an example of an INSERT statement:
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;
You can specify the partition dynamically, by determining the partition value from the SELECT
statement:
INSERT OVERWRITE TABLE target
PARTITION (dt)
SELECT col1, col2, dt
FROM source;
This is known as a dynamic-partition insert. This feature is off by default, so you need to enable it by
setting hive.exec.dynamic.partition to true first.
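For example (the second property may also be needed, depending on configuration, when all partition columns are dynamic):
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;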
Multitable insert
In HiveQL, you can turn the INSERT statement around and start with the FROM clause, for the same
effect:
FROM source
INSERT OVERWRITE TABLE target
SELECT col1, col2;
The reason for this syntax becomes clear when you see that it’s possible to have multiple INSERT clauses
in the same query. This so-called multitable insert is more efficient than multiple INSERT statements.
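A sketch of a multitable insert, reusing the source name from above (target1, target2, and col3 are hypothetical):
FROM source
INSERT OVERWRITE TABLE target1
SELECT col1, col2
INSERT OVERWRITE TABLE target2
SELECT col1, col3;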
Altering Tables
You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
Hive allows you to change the definition for columns, add new columns, or even replace all existing
columns in a table with a new set.
For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);
Dropping Tables
The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only
the metadata is deleted—the data is left untouched.
If you want to delete all the data in a table, but keep the table definition (like DELETE or TRUNCATE in MySQL), then you can simply delete the data files. For example:
hive> dfs -rmr /user/hive/warehouse/my_table;
Hive treats a lack of files (or indeed no directory for the table) as an empty table.
5.1.4. User-Defined Functions:
Sometimes the query you want to write can’t be expressed easily (or at all) using the built-in
functions that Hive provides.
By writing a user-defined function (UDF), Hive makes it easy to plug in your own processing
code and invoke it from a Hive query.
UDFs have to be written in Java, the language that Hive itself is written in.
For other languages, consider using a SELECT TRANSFORM query, which allows you to stream
data through a user-defined script.
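As a sketch (the script name and column are hypothetical), a TRANSFORM query streams rows through an external script:
hive> ADD FILE /path/to/my_script.py;
hive> SELECT TRANSFORM(col1) USING 'python my_script.py' AS transformed_col FROM source;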
There are three types of UDF in Hive: (regular) UDFs, UDAFs (user-defined aggregate functions), and
UDTFs (user-defined table-generating functions). They differ in the numbers of rows that they accept as
input and produce as output:
• A UDF operates on a single row and produces a single row as its output. Most functions, such as
mathematical functions and string functions, are of this type.
• A UDAF works on multiple input rows and creates a single output row. Aggregate functions include
such functions as COUNT and MAX.
• A UDTF operates on a single row and produces multiple rows—a table—as output.
1.Writing a UDF:
To illustrate the process of writing and using a UDF, we’ll write a simple UDF to trim characters from the
ends of strings. Hive already has a built-in function called trim, so we’ll call ours strip. The code for the
Strip Java class is shown in Example 12-2.
Example 12-2. A UDF for stripping characters from the ends of strings
package com.hadoopbook.hive;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class Strip extends UDF {
  private Text result = new Text();
  // Strip whitespace from both ends of the string
  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }
  // Strip the given set of characters from both ends of the string
  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}
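Once compiled and packaged into a JAR, the UDF can be registered in the Hive shell and used in queries (the JAR path and the one-row helper table dummy are assumptions):
hive> ADD JAR /path/to/hive-examples.jar;
hive> CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
hive> SELECT strip('  bee  ') FROM dummy;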
2. Writing a UDAF:
An aggregate function is more difficult to write than a regular UDF, since values are aggregated in
chunks (potentially across many Map or Reduce tasks), so the implementation has to be capable of
combining partial aggregations into a final result.
Figure 12-4. Data flow with partial results for a UDAF
The init() method initializes the evaluator and resets its internal state.
The iterate() method is called every time there is a new value to be aggregated.
The terminatePartial() method is called when Hive wants a result for the partial aggregation.
The merge() method is called when Hive decides to combine one partial aggregation with another.
The terminate() method is called when the final result of the aggregation is needed.
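As a minimal sketch of how these methods fit together, here is an illustrative maximum-of-integers UDAF (the class and package names are assumptions, not a definitive implementation):
package com.hadoopbook.hive;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;
public class Maximum extends UDAF {
  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
    private IntWritable result;
    // init() resets the evaluator's internal state
    public void init() {
      result = null;
    }
    // iterate() folds each new value into the running maximum
    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }
    // terminatePartial() returns the partial aggregation computed so far
    public IntWritable terminatePartial() {
      return result;
    }
    // merge() combines a partial aggregation produced by another task
    public boolean merge(IntWritable other) {
      return iterate(other);
    }
    // terminate() returns the final result of the aggregation
    public IntWritable terminate() {
      return result;
    }
  }
}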