InfoSphere DataStage Hive Connector to read data from Hive data sources


22 March 2017

Alekhya Telekicherla
Software Developer
IBM

Pallavi Koganti
Software Developer
IBM

Srinivas Mudigonda
Lead Software Developer
IBM India Pvt Ltd

Sunil Kumar Mogulla
Application Developer
IBM

This article describes a solution based on the integration of IBM InfoSphere DataStage
with Apache Hive. Data can be fetched from various Hive data sources into Information Server
modules for further processing. You will learn how IBM InfoSphere Information Server can be
used to perform read operations on a Hive data source. This step-by-step guide helps you create,
configure, compile, and run DataStage Hive Connector jobs that read data from
Apache Hive.

Introduction
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data
summarization, query, and analysis. It supports the analysis of large datasets stored in
Hadoop's HDFS and in compatible file systems such as Amazon S3. Queries are
expressed in a SQL-like language called HiveQL, which Hive automatically translates into
MapReduce jobs executed on Hadoop. An efficient solution is needed to move information from
different Hive data sources into the ETL space for further operations.

The integration of IBM InfoSphere DataStage with Apache Hive is achieved by the InfoSphere Hive
Connector, which is a DataStage component. The Hive Connector stage fetches data
from Hive and passes it to other Information Server modules for further ETL processing.
This solution helps Hive users make informed business decisions based on their data.

© Copyright IBM Corporation 2017 Trademarks


developerWorks® ibm.com/developerWorks/

Configuring Hive Connector in Read mode


Hive Connector supports normal reads and partitioned reads, with either generated SQL or
user-defined SQL.

This section demonstrates a sample use case that performs a read operation on Hive using the Hive
Connector stage. The DataStage job includes a Hive Connector stage, which specifies the details for
accessing Apache Hive, and a Sequential File stage, to which the extracted data is written. In read
mode, the Hive Connector stage in a DataStage job supports only one output link.

1. Generated SQL

The steps to read data from Hive in generated SQL mode are as follows.
Figure 1. Hive Connector Read job

Setting up Hive Connector properties

1. In the Properties tab, set "Generated SQL at run time" to Yes and provide a value for "Table
name" as shown below.
2. If the table is partitioned and you want to use parallelism in the form of partitioned reads,
set "Enable Partitioned Reads" to Yes.
Figure 2. Generated SQL Read properties


3. The connector uses the primary partition key to parallelize the read. In this case, pc1 is the
primary partition column of the table, and the generated statements have the following format:
select c1, c2 from part_test4 where pc1=1
4. Note that the statement generated by Hive Connector is in regular SQL format, not in HiveQL
format. The driver handles the conversion from SQL to HiveQL internally.
5. Under Output, provide the name and type of each column that you want to
extract, as follows:
Figure 3. Column Properties

6. Provide the file name details in the Sequential File stage.
7. Compile and run the job.
Figure 4. Job Execution 1

8. The output is as follows:

Figure 5. Output Rows
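
To make the statement format from step 3 concrete, the following Python sketch builds one select statement per partition value, in the same shape as the example above. It is only an illustration, not the connector's implementation: the helper names sql_literal and build_partition_queries, and the partition values 1, 2, and 3, are assumptions introduced here. The table and column names (part_test4, c1, c2, pc1) come from the example in this article.

```python
from typing import Iterable, List, Union

PartitionValue = Union[int, str]


def sql_literal(value: PartitionValue) -> str:
    """Render a partition value as a SQL literal.

    Hypothetical helper: integers are written as-is, strings are wrapped
    in single quotes. No escaping is attempted, so values that contain a
    quote character are rejected.
    """
    if isinstance(value, int):
        return str(value)
    if "'" in value:
        raise ValueError("quote characters are not supported in this sketch")
    return "'" + value + "'"


def build_partition_queries(
    table: str,
    columns: List[str],
    partition_column: str,
    partition_values: Iterable[PartitionValue],
) -> List[str]:
    """Build one select statement per value of the primary partition column."""
    column_list = ", ".join(columns)
    return [
        f"select {column_list} from {table} "
        f"where {partition_column}={sql_literal(value)}"
        for value in partition_values
    ]


# Assumed partition values; a real job would take them from the table.
queries = build_partition_queries("part_test4", ["c1", "c2"], "pc1", [1, 2, 3])
for query in queries:
    print(query)
```

Each resulting statement reads exactly one partition, which is what would allow the statements to run independently of one another in a partitioned read.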

2. User-defined SQL

The steps to read data from Hive in user-defined SQL mode are as follows.


Figure 6. Hive Connector Read job 2

Setting up Hive Connector properties

1. In the Properties tab, set "Generated SQL at run time" to No.
2. Provide the read statement to be executed in the "Select Statement" property.
3. If the table is partitioned and you want to use parallelism in the form of partitioned reads,
set "Enable Partitioned Reads" to Yes.
Figure 7. User Defined SQL Read properties

i) For a partitioned read, provide the "Select Statement" in the following format:
"select c1,c2 from part_test4 where pc1=[[part-value]]", where pc1 is the primary partition
column and [[part-value]] is a placeholder that is replaced with the values of the
partition column when the job runs.
ii) Note that the connector accepts only the primary (first) partition column of the table as the
partition column for the select statement.
iii) If the table is not partitioned, the job aborts because the user-defined query is no
longer valid.
4. Under Output, provide the name and type of each column that you want to
extract, as follows:


Figure 8. Column Properties

5. Provide the file name details in the Sequential File stage.
6. Compile and run the job.
Figure 9. Job Execution 2

7. The output is as follows:

Figure 10. Output Rows2
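
The placeholder substitution described in step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the connector's actual code: the function name expand_select_statement and the partition values 1 and 2 are hypothetical, and raising an error for a statement without the placeholder is a design choice of this sketch. The statement template is the one shown in the article.

```python
from typing import Iterable, List

# Placeholder text as shown in the article's select statement.
PLACEHOLDER = "[[part-value]]"


def expand_select_statement(
    statement: str, partition_values: Iterable[object]
) -> List[str]:
    """Return one statement per partition value.

    Hypothetical helper: each occurrence of the placeholder is replaced
    with the string form of the partition value.
    """
    if PLACEHOLDER not in statement:
        # Assumption of this sketch: a partitioned read needs the placeholder.
        raise ValueError("select statement does not contain " + PLACEHOLDER)
    return [statement.replace(PLACEHOLDER, str(value)) for value in partition_values]


template = "select c1,c2 from part_test4 where pc1=[[part-value]]"
# Assumed partition values; a real job would take them from the table.
statements = expand_select_statement(template, [1, 2])
for statement in statements:
    print(statement)
```

Checking for the placeholder up front surfaces a malformed statement before any query is issued, rather than after a partial read.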


Resources
• Infocenter link: https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.hive.usage.doc/topics/hive_connector_top_of_nav.html


About the authors

Alekhya Telekicherla

Alekhya Telekicherla is a software developer on the IBM InfoSphere
Information Server Connectivity team. She has around 7 years of experience at IBM
in the Data Integration domain and has worked on the development of various connectors, such as
MDM, Hive, ODBC, and Sybase. She has a Bachelor's degree in Computer Science
Engineering from IIT Guwahati.

Pallavi Koganti

Pallavi Koganti is a developer working in the Data Integration portfolio of
IBM InfoSphere Information Server. She has 11 years of experience in software
development. Having worked in domains ranging from Network and Systems
Management to Data Integration, she is always interested in working with the latest
technologies. She holds a Master's degree (MCA) from Andhra University.

Srinivas Mudigonda

Srinivas Mudigonda is a lead developer working in the Data Integration portfolio of
IBM InfoSphere Information Server. He has over 16 years of experience in the IT
industry, ranging from Distributed File Systems to the
Data Integration domain. He is fascinated by the latest technologies and is
keen on using them to solve complex customer problems.
He has a Bachelor's degree in Electrical and Electronics Engineering (Hons.) from
BITS Pilani.

Sunil Kumar Mogulla

Sunil K Mogulla has around 6 years of experience as a Senior QA in IBM Information
Server, handling various DataStage connectors such as Hive, File, JDBC, Oracle, ODBC,
and Streams. He has been involved in implementing Hadoop solutions using Information
Server. He worked as an Oracle PL/SQL developer for 3 years, supporting
performance tuning and design using Oracle Database. He is a Certified Oracle
Associate in the Developer Track, which includes SQL and PL/SQL.


© Copyright IBM Corporation 2017 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)
