White Paper - Working With Informatica-Teradata Parallel Transporter
1. Introduction
Today's growing data warehouses demand fast, reliable tools that help acquire and
manage data, and the flexibility to load large volumes of data from any source at any
time. Challenges come from everywhere: more data sources, growing data volumes,
dynamically changing business requirements and user demands for fresher data.
PowerCenter uses the following techniques when extracting data from and loading
data to the Teradata database:
ETL
(Extract, transform, and load). This technique extracts data from the source systems,
transforms the data within PowerCenter, and loads it to target tables.
ELT
(Extract, load, and then transform). This technique extracts data from the source
systems, loads it to user-defined staging tables in the target database, and
transforms the data within the target system using generated SQL. The SQL queries
include a final insert into the target tables.
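As a hedged illustration of the ELT technique, the generated SQL typically ends in an insert-select from the user-defined staging tables into the target; the table and column names below are hypothetical:

```sql
-- Hypothetical example of SQL generated for an ELT flow:
-- the transformation (an aggregation) runs inside Teradata,
-- and the final statement inserts into the target table.
INSERT INTO tgt_daily_sales (sale_dt, store_id, total_amt)
SELECT sale_dt, store_id, SUM(sale_amt)
FROM   stg_sales
GROUP BY sale_dt, store_id;
```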
ETL-T
(ETL and ELT hybrid). This technique extracts data from the source systems,
transforms the data within PowerCenter, loads the data to user-defined staging tables
in the target database, and further transforms the data within the target system using
generated SQL. The SQL queries include a final insert into the target tables. The ELT-T
technique is optimized within PowerCenter, so that the transformations that better
perform within the database system can be performed there and the Integration
Service performs the other transformations.
Teradata TPump is a highly parallel utility that can continuously move data from
data sources into Teradata tables without locking the affected table. TPump
supports inserts, updates, deletes, and data-driven updates. TPump acquires row
hash locks on a database table instead of table-level locks, so multiple TPump
instances can load data simultaneously to the same table. TPump is often used to
“trickle-load” a database table. Use TPump for low volume, online data loads.
· We can encrypt the data transfer between FastExport and the Teradata
server.
· FastExport is available for sources and pipeline lookups.
· Data encryption.
Enable this attribute to encrypt the data transfer between FastExport and
the Teradata server so that unauthorized users cannot access the data being
transferred across the network.
· Fractional seconds.
· A staging file or named pipe.
PowerCenter creates a staging file or named pipe for data transfer based on
how we configure the connection. Named pipes are generally faster than
staging files because data is transferred as soon as it appears in the pipe. If
we use a staging file, data is not transferred until all data appears in the file.
· A control file.
The Integration Service generates the control file from the session and
connection properties; it contains the instructions the load or unload utility
uses to process the data.
· A log file.
The load or unload utility creates a log file and writes error messages to it.
The PowerCenter session log indicates whether the session ran successfully,
but does not contain load or unload utility error messages. Use the log file to
debug problems that occur during data loading or extraction.
White Paper|Working with Informatica-Teradata Parallel Transporter 6
By default, loader staging, control, and log files are created in the target file
directory. The FastExport staging, control, and log files are created in the
PowerCenter temporary files directory.
All of these load and unload utilities are included in the Teradata Tools and Utilities
(TTU), available from Teradata. PowerCenter supports all of these standalone load
and unload utilities. Before we can configure a session to use a load or unload utility,
we create a loader or FastExport (application) connection in the PowerCenter Workflow
Manager and enter a value for the TDPID in the connection attributes.
For more information about creating connection objects in PowerCenter, see
APPENDIX-A, APPENDIX-B, and the PowerCenter Workflow Basics Guide.
· Export. Exports large data sets from Teradata tables or views and imports the
data to a file or an Informatica pipe using the FastExport protocol.
· Load. Bulk loads large volumes of data into empty Teradata database tables using
the FastLoad protocol.
· Update. Batch updates, inserts, upserts, and deletes data in Teradata database
tables using the MultiLoad protocol.
· Stream. Continuously updates, inserts, upserts, and deletes data in near real-time
using the TPump protocol.
Teradata PT is up to 20% faster than the standalone Teradata load and unload
utilities, even though it uses the underlying protocols from the standalone
utilities.
Teradata PT supports recovery for sessions that use the Stream operator when
the source data is repeatable. This feature is especially useful when running
real-time sessions and streaming the changes to Teradata.
Users can invoke Teradata PT through a set of open APIs that communicate with
the database directly, eliminating the need for a staging file or pipe and a control
file.
Teradata PT includes added features; some are enabled by users, and some are
improvements made behind the scenes. They include:
Allowing multiple instances of a single operator to read from the same input
file takes advantage of the file I/O cache monitored and maintained by the
operating system (OS). As long as all instances run at the same relative
speed, the OS will read a block from the file into memory and all instances
will be accessing the same memory. Dramatic improvements in performance
have been seen when using up to five or six instances reading from a single
(very large) file.
The normal operating procedure for consumer operators is for the first
instance to get as much work as possible, provided it can keep up with the
rate at which data arrives through the data streams. The default behavior is
to only write to a single output file.
If multiple instances are specified for the DataConnector operator (as a file
writer) and each instance is directed to write to a different file, then
there can be some parallelism in data output. However, when exporting data
to flat files, this results in an uneven distribution of data across output files.
The new parallel writing feature allows switching to a round-robin fashion of
file writing. This can evenly distribute data across multiple files, thereby
improving the performance of writing to disk even more. That performance
improvement can be further enhanced by placing each output file on a
separate disk.
For loading data into a table, we use different Teradata load utilities depending
on the data volume. The table below explains the best load utility to use in
Informatica for each data-volume range.
3.3. Using TPT connection with Load and Update operator in session:
i. The first connection defines the Teradata PT connection. The PowerCenter
Integration Service uses this connection to extract data from or load data to
Teradata.
ii. The second connection defines an optional ODBC connection to the target
database. The PowerCenter Integration Service uses the target ODBC
connection to drop log, error, and work tables, truncate target tables, and
create recovery tables in the target database. The PowerCenter Integration
Service does not use the ODBC connection to extract from or load data to
Teradata.
3. Select appropriate TPT and ODBC connections for the two types as shown
below
4. Make sure the following session properties are set when using TPT
connections:
i. Specify the database and table name for the work table.
ii. The truncate table option can be used with the Load, Stream, and Update
system operators.
5. We can select any of the following values for the 'Mark Missing Rows' option,
as per requirement, which specifies how Teradata PT handles rows that are not
present in the target table:
i. None - If Teradata PT API receives a row marked for update or delete but it
is missing in the target table, it does not mark the row in the error table.
ii. For Update - If Teradata PT API receives a row marked for update but it is
missing in the target table, it marks the row as an error row.
iii. For Delete - If Teradata PT API receives a row marked for delete but it is
missing in the target table, it marks the row as an error row.
iv. Both - If Teradata PT API receives a row marked for update or delete but it
is missing in the target table, it marks the row as an error row.
6. Similarly, we can select any of the following values for the 'Mark Duplicate
Rows' option:
i. None - If Teradata PT API receives a row marked for insert or update that
causes a duplicate row in the target table, it does not mark the row in the
error table.
ii. For Insert - If Teradata PT API receives a row marked for insert but it exists
in the target table, it marks the row as an error row.
iii. For Update - If Teradata PT API receives a row marked for update that
causes a duplicate row in the target table, Teradata PT API marks the row as
an error row.
iv. Both - If Teradata PT API receives a row marked for insert or update that
causes a duplicate row in the target table, Teradata PT API marks the row as
an error row.
7. Specify the database name and the table name for the log table to be created.
8. Specify the database name and the table name for the error tables to be created.
9. The drop work/log/error tables option can be checked when we wish to drop
the intermediate tables that are created during the session run.
10. The Pack, Pack Minimum, and Buffers options are used only with the Stream
operator.
11. Tracing options can be selected as needed; usually we use the default
values.
A connection that uses the Export operator can be used only as a source connection.
The image below shows how to select the TPT Export operator for Teradata
source tables in an Informatica session.
If a session that uses the Update, Export, or Stream operator fails, the
session can be rerun successfully once all the intermediate tables are
dropped.
The Teradata RDBMS offers several tools to load external data into the database. The
Teradata load utilities can be divided into two main groups:
The first group of utilities makes use of the transient journal, so loading is
slower; the second group is designed to bring the data into the target tables as
fast as possible by bypassing the transient journal.
The first group includes BTEQ, TPump, and relational/native connections to
load the data into Teradata tables. These insert the data record by record into
the target table, allowing full transaction handling and rollback functionality.
They are still quite useful for loading small amounts of data, but they lack
some important features, such as the use of several parallel load sessions.
Fig5 shows the Informatica Workflow Monitor load statistics for a workflow in
which the source and target are Teradata tables and relational connections are
used. The relational connection uses the transient journal and loads the data
record by record into the target table, so target-loading throughput is low: it
took around 5 hours to load 1.7 million records.
The second group includes the Teradata standalone load and unload utilities.
These are the fastest way of getting external data into Teradata: they bypass
transaction handling, and data is loaded in big chunks, namely blocks of 64
kilobytes.
· It does not support target tables with unique secondary indexes (USI), join
indexes (JI), referential integrity (RI) or triggers. FastLoad even goes a step
further and requires the target table to be empty.
· In the case of a SET target table, a duplicate row check would be required.
FastLoad by default removes record-level duplicates.
· MultiLoad, the second tool in the set of fast loading utilities, can take over
additional tasks such as UPDATE, DELETE, and UPSERT, again all at block
level. MultiLoad uses temporary work tables for this purpose and does not
reach the load performance of FastLoad, but it is still a good choice for
getting huge amounts of data in very fast.
Fig6 shows the Informatica Workflow Monitor load statistics for a workflow in
which the source and targets are Teradata tables and a TPT connection is used.
The TPT operators load data at block level, so target-loading throughput is
high: it took only a few seconds to load 1.7 million records.
Each group has its own advantages and disadvantages. The decision of which
tool to use for loading depends on the number of data records and the number
of bytes each record consumes.
3.8. Loading the Teradata target table using Informatica update else
insert logic by using TPT connection.
In an Informatica session, we can update the target table against the source
table data. Generally, we use two flows for the insert-else-update operation:
one flow for bulk insert and another for update.
The TPT Update operator (MultiLoad) locks the target table to load the data.
When we use two or more target instances in a single session, TPT locks the
target table while loading data through one target instance; when the session
tries to load data through another target instance, the MultiLoad job tries to
lock the same target table, which is already locked by the insert instance, so
the session fails with a TPT error.
To avoid this kind of error, we need to use the Informatica Update Else Insert
option in the session properties and treat source rows as update. We update
the record based on key columns and do not update some fields, such as
Create_Timestamp. If we want to use a single instance to insert and update
records, then for the fields we do not want to overwrite with new values
(e.g., Create_Timestamp), we need to update the record with the previous
(target) values.
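In SQL terms, the single-instance update simply omits the protected fields from the SET list, which is equivalent to re-asserting their previous target values. The table and column names below are illustrative, and the joined-update form shown is a sketch of what the utility effectively performs:

```sql
-- Illustrative update for the single-instance flow: the key
-- column drives the match, Update_Timestamp moves forward, and
-- Create_Timestamp keeps its existing (target) value because it
-- is omitted from the SET list.
UPDATE tgt_customer
FROM   src_customer AS src
SET    cust_name        = src.cust_name,
       update_timestamp = CURRENT_TIMESTAMP(0)
WHERE  tgt_customer.cust_id = src.cust_id;
```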
3.9. Loading the PPI indexed Teradata table with Informatica by using
TPT connection.
Whenever we need to load and select incremental data from a huge table, we can
use MultiLoad jobs to load the data and create an index to select the data. But
when we create an index such as a secondary index or a join index, the Teradata
bulk load utilities do not work. We can overcome this issue by using a
partitioned primary index on the table: Teradata MultiLoad jobs work on
partitioned primary indexed tables.
For a MultiLoad job, the table should contain a primary index to update the
table records, and we need another index on the table for fast selection of
incremental data. We can use the PI for the MultiLoad job and the PPI for
selecting incremental data. For partitioned primary index tables, Teradata
MultiLoad tasks require all values of the primary index column set and all
values of the partitioning column set for deletes and updates. The table
creation statement will be:
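A table definition of this shape, using the CREATE_DT and UPDATE_DT columns described in this section but otherwise illustrative names and date ranges, could be:

```sql
-- Illustrative DDL: a primary index for the MultiLoad job plus
-- daily partitioning on the date column used for incremental
-- selects (together forming the partitioned primary index).
CREATE MULTISET TABLE sales_db.daily_txn
(
    txn_id    INTEGER NOT NULL,
    txn_amt   DECIMAL(18,2),
    create_dt DATE,
    update_dt DATE
)
PRIMARY INDEX (txn_id)
PARTITION BY RANGE_N (
    update_dt BETWEEN DATE '2012-01-01'
              AND     DATE '2020-12-31'
              EACH INTERVAL '1' DAY
);
```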
We load data with the CREATE_DT and UPDATE_DT values set to CURRENT_DATE for
inserted records; for updated records, UPDATE_DT is updated with
CURRENT_DATE.
This will select both inserted and updated records for the day.
This section describes issues we might encounter when we move data between
PowerCenter and Teradata.
Sessions that perform lookups on Teradata tables must use Teradata relational
connections. If we experience performance problems when running a session that
performs lookups against a Teradata database, we might be able to increase
performance in the following ways:
· Use FastExport to extract data to a flat file and perform the lookup on the flat file.
· Enable or disable the Lookup Cache.
1. Create a simple, pass-through mapping to pass the lookup data to a flat file.
Configure the session to extract data to the flat file using FastExport.
2. Configure the original mapping to perform the lookup on the flat file.
Note: If we redesign the mapping using this procedure, we can further increase
performance by specifying an ORDER BY clause in the FastExport SQL and
enabling the Sorted Input property for the lookup file. This prevents
PowerCenter from having to sort the file before populating the lookup cache.
To recover from a failed MultiLoad job, we must release the target table from the
MultiLoad state and drop the MultiLoad log, error, and work tables.
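The release and cleanup can be done from BTEQ or any SQL client. The table names below follow common naming conventions and are illustrative; the actual names depend on how the job or session was configured:

```sql
-- Release the target table from the MultiLoad (MLOAD) state.
RELEASE MLOAD sales_db.daily_txn;
-- If the job failed in the application phase, the stronger form
-- may be required:
RELEASE MLOAD sales_db.daily_txn IN APPLY;

-- Drop the restart log, error, and work tables left behind.
DROP TABLE sales_db.ML_daily_txn;   -- log table
DROP TABLE sales_db.ET1_daily_txn;  -- acquisition error table
DROP TABLE sales_db.ET2_daily_txn;  -- application error table
DROP TABLE sales_db.WT_daily_txn;   -- work table
```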
While Teradata MultiLoad loads data to a database table, it locks the table.
MultiLoad requires that all instances handle wait events so they do not try to
access the same table simultaneously.
If we have multiple PowerCenter sessions that load to the same Teradata table
using MultiLoad, set the Tenacity attribute for the session to a value that is
greater than the expected run time of the session. The Tenacity attribute controls
the amount of time a MultiLoad instance waits for the table to become available.
Also configure each session to use unique log file names.
For more information about the Tenacity, see the PowerCenter Advanced
Workflow Guide.
MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All
data will be routed to the first partition.
If we do not route the data to a single file, the session fails with the following
error:
WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not
support partitioned sessions.
WRITER_1_*_1> Thu Jun 16 11:58:21 2005
WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.
For more information about loading from partitioned sessions, see the
PowerCenter Advanced Workflow Guide.
FastLoad loads data during two main phases: acquisition and application. In the
acquisition phase, FastLoad transfers the input data to the Teradata database.
In the application phase, it writes the blocks of data to the target table and
unlocks it. FastLoad requires an exclusive lock on the target table during the
loading phase.
MultiLoad also loads data during two main phases: acquisition and application.
In the acquisition phase, MultiLoad reads the input data and writes it to a
temporary work table. In the application phase, MultiLoad writes the data from
the work table to the actual target table. MultiLoad requires an exclusive lock on
the target table during the application phase.
TPump loads data in a single phase. It converts the SQL in the control file into a
database macro and applies the macro to the input data. TPump uses standard
SQL and standard table locking.
The following table lists the error tables we can check to troubleshoot load or
unload utility errors:

Utility    Data Loading Phase    Default Error Table Name    Error Types
When a load fails, check the “ET1_” error table first for specific information. The
Error Field or Error Field Name column indicates the column in the target table
that could not be loaded. The Error Code field provides details that explain why
the column could not be loaded, for example:
· 2689: Trying to load a null value into a non-null field
· 2665: Invalid date format
In the MultiLoad “ET2_” error table, we can also check the DBC Error Field
column and DBC Error Code field. The DBC Error Field column is not initialized in
the case of primary key uniqueness violations. The DBC Error Code that
corresponds to a primary key uniqueness violation is 2794.
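To inspect the failures, we can query the error table directly. The table name follows the defaults described above and is illustrative, and the exact column names can vary by utility version:

```sql
-- Summarize load errors by error code and offending column
-- (column names assumed; some versions use ErrorField instead
-- of ErrorFieldName).
SELECT ErrorCode, ErrorFieldName, COUNT(*) AS error_rows
FROM   sales_db.ET1_daily_txn
GROUP BY ErrorCode, ErrorFieldName
ORDER BY error_rows DESC;
-- 2689 = null value loaded into a non-null column,
-- 2665 = invalid date format (see the list above).
```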
5. Conclusion
APPENDIX-A
Before we run sessions that move data between PowerCenter and Teradata, we might
want to install Teradata client tools. We also need to locate the Teradata TDPID.
2. BTEQ.
A general-purpose, command-line utility (similar to Oracle SQL*Plus) that enables
us to communicate with one or more Teradata databases.
4. TDPID
The Teradata TDPID indicates the name of the Teradata instance and defines the
name a client uses to connect to a server. When we use Teradata Parallel
Transporter or a standalone load or unload utility with PowerCenter, we must
specify the TDPID in the connection properties. The Teradata TDPID appears in
the hosts file
on the machines on which the Integration Service and PowerCenter Client run. By
default, the hosts file appears in the following location:
UNIX: /etc/hosts
Windows: %SystemRoot%\system32\drivers\etc\hosts (the location is defined in the
registry value
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\DataBasePath)
The hosts file contains client configuration information for Teradata. In a hosts file
entry, the TDPID precedes the string “cop1.”
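A hosts file configured this way might contain entries of the following form; the IP addresses of the “td_1” and “td_2” nodes are illustrative:

```
# TDPID "demo1099": a single local Teradata instance
127.0.0.1    localhost    demo1099cop1
# TDPID "cust": two nodes, tried in order for load balancing
10.10.10.1   td_1         custcop1
10.10.10.2   td_2         custcop2
```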
The first entry has the TDPID “demo1099.” This entry tells the Teradata database that
when a client tool references the Teradata instance “demo1099,” it should direct requests
to “localhost” (IP address 127.0.0.1). The following entries have the same TDPID, “cust.”
Multiple hosts file entries with the same TDPID indicate the Teradata instance is
configured for load balancing among nodes. When a client tool attempts to reference
Teradata instance “cust,” the Teradata database directs requests to the first node in the
entry list, “td_1.” If it takes too long for the node to respond, the database redirects the
request to the second node, and so on. This process prevents the first node, “td_1” from
becoming overloaded.
APPENDIX-B
a. Teradata Connections
Teradata relational connections use ODBC to connect to Teradata. PowerCenter
uses the ODBC Driver for Teradata to retrieve metadata and read and write to
Teradata. To establish ODBC connectivity between Teradata and PowerCenter,
install the ODBC Driver for Teradata on each PowerCenter machine that
communicates with Teradata. The ODBC Driver for Teradata is included in the
Teradata Tools and Utilities (TTU). We can download the driver from the Teradata
web site. PowerCenter works with the ODBC Driver for Teradata available in the
following TTU versions:
Informatica Version    Teradata Version(s)
9.5                    TD 14.0, TD 13.10
9.1                    TD 13.10, TD 13.0, TD 12.0
9.0.1                  TD 13.10, TD 13.0, TD 12.0
8.6.1                  TD 13.0, TD 12.0
8.6.0                  TD 12.0
8.1.1 SP1-SP5          TD 12.0
Driver=/usr/odbc/drivers/tdata.so
Description=running Teradata V14
DBCName=intdv14
SessionMode=Teradata
CharacterSet=UTF8
StCheckLevel=0
DateTimeFormat=AAA
LastUser=
Username=
Password=
Database=
DefaultDatabase=
ODBC is a native interface for Teradata. Teradata provides 32- and 64-bit ODBC
drivers for Windows and UNIX platforms. The driver bit mode must be compatible with
the bit mode of the platform on which the PowerCenter Integration Service runs. For
example, 32-bit PowerCenter runs only with 32-bit drivers.
For more information about configuring odbc.ini, see the PowerCenter Configuration
Guide and the ODBC Driver for Teradata User Guide.
Fig8. Teradata relational connection creation
When we choose Teradata as the connection type, the Integration Service still uses
Teradata ODBC to connect to Teradata. Although both ODBC and Teradata connection
types might work, the Integration Service communicates with the Teradata database
more efficiently when we choose the Teradata connection type.
2. Configure the connection as per the requirement as shown in the screen shot
below:
Here we can specify the TDPID, the database name, and the system operator that we
need for a specific connection.