InfoSphere DataStage Hive Connector to Read Data from Hive Data Sources
This article describes a solution based on the integration of IBM InfoSphere DataStage with
Apache Hive. Data can be fetched from various Hive data sources into Information Server
modules for further processing. You will learn how IBM InfoSphere Information Server can be
used to perform read operations on a Hive data source. This step-by-step guide helps you create,
configure, compile, and run DataStage Hive Connector jobs that read data from Apache Hive.
Introduction
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. Apache Hive supports analysis of large datasets stored in
Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem. It supports queries
expressed in a language called HiveQL, which automatically translates SQL-like queries into
MapReduce jobs executed on Hadoop. An efficient solution is therefore needed to move data from
Hive data sources into the ETL space, where further operations can be performed.
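As an illustration, a HiveQL query reads much like ordinary SQL. The table and column names in
this minimal sketch are hypothetical, not taken from the article's sample job:

    -- Summarize order counts per region; Hive compiles this
    -- SQL-like query into one or more MapReduce jobs.
    SELECT region, COUNT(*) AS order_count
    FROM sales_orders
    GROUP BY region;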
The integration of IBM InfoSphere DataStage with Apache Hive is achieved by the InfoSphere Hive
Connector, a DataStage component. The Hive Connector stage fetches data from Hive and passes
it to other Information Server modules for further ETL processing. This solution helps Hive users
make intelligent business decisions based on their data.
This section demonstrates a sample use case that performs a read operation on Hive using the
Hive Connector stage. The DataStage job includes a Hive Connector stage, which specifies the
details for accessing Apache Hive, and a Sequential File stage, to which the data is extracted. In
read mode, the Hive Connector stage supports only one output link.
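The examples that follow refer to a partitioned Hive table named part_test4 with columns c1 and
c2 and a primary partition column pc1. A table of that shape could be created with DDL along the
following lines; the column types here are assumptions, since the article does not state them:

    -- Hypothetical DDL for the sample table; only the names
    -- part_test4, c1, c2, and pc1 come from this article.
    CREATE TABLE part_test4 (
      c1 INT,
      c2 STRING
    )
    PARTITIONED BY (pc1 INT);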
1. Generated SQL
The following steps describe how to read data from Hive using generated SQL mode.
Figure 1. Hive Connector Read job
1. On the Properties tab, set "Generated SQL at run time" to Yes and provide a value for "Table
name", as shown below.
2. If the table is partitioned and you want to take advantage of parallelism in the form of
partitioned reads, set "Enable Partitioned Reads" to Yes.
Figure 2. Generated SQL Read properties
3. The connector uses the primary partition column to achieve parallelism. In this case, pc1 is
the primary partition column, and the generated statements have the following format, where
the predicate on pc1 selects one partition (see the expanded example after this list):
select c1, c2 from part_test4 where pc1=1
4. Note that the statement generated by the Hive Connector is in regular SQL format, not in
HiveQL format. The conversion from SQL to HiveQL is handled internally by the driver.
5. Under Output, provide the names and types of the columns that you want to extract, as
follows:
Figure 3. Column Properties
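To expand on step 3: with partitioned reads enabled, the connector generates one statement per
partition value so that different partitions can be read in parallel. Assuming, purely for
illustration, that part_test4 has partitions pc1=1, pc1=2, and pc1=3, the generated statements
would look like this:

    -- One generated statement per partition value (values assumed):
    select c1, c2 from part_test4 where pc1=1
    select c1, c2 from part_test4 where pc1=2
    select c1, c2 from part_test4 where pc1=3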
2. User-defined SQL
The steps for reading data from Hive using user-defined SQL mode are similar, except that
instead of having the connector generate the statement at run time, you supply your own SELECT
statement in the stage properties.
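As a minimal sketch, a user-defined statement for the same hypothetical table could be an
ordinary SELECT, optionally with a filter; the predicate below is purely illustrative:

    -- Hypothetical user-defined query; the filter on c1
    -- is illustrative, not part of the original article.
    SELECT c1, c2
    FROM part_test4
    WHERE c1 > 100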
Resources
• IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.hive.usage.doc/topics/hive_connector_top_of_nav.html
Alekhya Telekicherla
Pallavi Koganti
Srinivas Mudigonda