Overview of DataStage Architecture
I. Executive Summary
In today’s information age, organizations have grown beyond the question "Should we build a data warehouse?" to "Can
we consolidate our global business data such that it is reliable and makes sense?" With many organizations implement-
ing e-business, customer relationship management (CRM), enterprise applications and decision support systems, the
process of collecting, cleansing, transforming, and preparing business data for analysis has become an enormous inte-
gration challenge. A solid data integration environment, by definition, will attract more data consumers since the ultimate
goal of business intelligence is to discover new sales or cross-selling opportunities as well as reduce operating costs.
However, more users translate into more requirements in terms of performance, features, overall support, as well as dif-
ferent business intelligence tools. Very quickly developers find themselves having to incorporate more complex data
sources, targets, applications and meta data repositories into an existing data environment.
How an organization chooses to integrate its business data has an impact on costs in terms of time, money, and the
resources required to expand and maintain the data environment. If an organization chooses the wrong tool for their
environment they may find that they have to write custom code or manually enter data in order to integrate it with the
existing infrastructure and toolset. Writing custom code is often the simplest approach at first, but quickly exhausts the
organization’s ability to keep up with demand. Change is very difficult to manage when the transformation rules exist
buried in procedural code or when meta data has to be manually entered again and again. Purchasing a full solution
from an RDBMS vendor can be an attractive initial choice but it requires a commitment to their database and tool suite.
Ideally, the best approach should allow flexibility in the data integration architecture so that the organization can easily
add other best-of-breed tools and applications over time. The most flexible and cost effective choice over the long term
is to equip the development team with a toolset powerful enough to do the job as well as allow them to maintain and
expand the integrated data environment.
Providing developers with power features is only part of the equation. Ascential Software has long supported developers
in their efforts to meet tight deadlines, fulfill user requests, and maximize technical resources. DataStage offers a devel-
opment environment that:
• Works intuitively thereby reducing the learning curve and maximizing development resources
• Reduces the development cycle by providing hundreds of pre-built transformations as well as encouraging code re-
use via APIs
• Helps developers verify their code with a built-in debugger thereby increasing application reliability as well as reducing
the amount of time developers spend fixing errors and bugs
• Allows developers to quickly "globalize" their integration applications by supporting single-byte character sets and
multi-byte encoding
• Enhances developer productivity by adhering to industry standards and using certified application interfaces
Another part of the equation involves the users of integrated data. Ascential Software is committed to maximizing the
productivity of the business users and decision-makers. An integrated data environment built with DataStage offers data
users:
• Quality management which gives them confidence that their integrated data is reliable and consistent
• Meta data management that eliminates the time-consuming, tedious re-keying of meta data as well as provides accu-
rate interpretation of integrated data
• A complete integrated data environment that provides a global snapshot of the company’s operations
• The flexibility to leverage the company’s Internet architecture for a single point of access as well as the freedom to
use the business intelligence tool of choice
Ascential’s productivity commitment stands on the premise that data integration must have integrity. To illustrate this point, consider company XYZ, which uses Erwin and Business Objects to derive its business intelligence. A developer decides that it is necessary to change an object in Erwin, but making that change means that a query run in Business Objects will no longer function as constructed. This is a common situation that leaves the business user and the developer puzzled and frustrated. Who, or rather what tool, is responsible for identifying or reconciling these differences? Ascential Software’s DataStage family addresses these issues and makes data integration transparent to the user. In fact, it is the only product today designed to act as an "overseer" for all data integration and business intelligence processes.
In the end, building a data storage container for business analysis is not enough in today’s fast-changing business cli-
mate. Organizations that recognize the true value in integrating all of their business data, along with the attendant com-
plexities of that process, have a better chance of withstanding and perhaps anticipating significant business shifts.
Individual data silos will not help an organization pinpoint potential operation savings or missed cross-selling opportuni-
ties. Choosing the right toolset for your data environment can make the difference between being first to market and
capturing large market share or just struggling to survive.
Table of Contents
I. EXECUTIVE SUMMARY
II. INTRODUCTION
GETTING READY FOR ANALYSIS
THE PITFALLS
NOT ALL TOOLS ARE CREATED EQUAL
III. THE DATASTAGE PRODUCT FAMILY
DATASTAGE XE
Data Quality Assurance
Meta Data Management
Data Integration
DATASTAGE XE/390
DATASTAGE XE PORTAL EDITION
DATASTAGE FOR ENTERPRISE APPLICATIONS
IV. THE ASCENTIAL APPROACH
DEVELOPMENT ADVANTAGES
Application Design
Debugging
Handling Multiple Sources and Targets
Native Support for Popular Relational Systems
Integrating Complex Flat Files
Applying Transformations
Merge Stage
Graphical Job Sequencer
ADVANCED MAINTENANCE TOOLS
Object Reuse and Sharing
Version Control
Capturing Changed Data
National Language Support
EXPEDITING DATA TRANSFER
Named Pipe Stage
FTP Stage
USING EXTERNAL PROCESSES
The Plug-In API
Invoking Specialized Tools
PERFORMANCE AND SCALABILITY
Exploiting Multiple CPUs
Linked Transformations
In-memory Hash Tables
Sorting and Aggregation
Bulk Loaders
SUPPORT FOR COMPLEX DATA TYPES
XML Pack
ClickPack
Legacy Sources
MQSeries
INTEGRATING ENTERPRISE APPLICATIONS
DataStage Extract PACK for SAP R/3
DataStage Load PACK for SAP BW
DataStage PACK for Siebel
DataStage PACK for PeopleSoft
V. NATIVE MAINFRAME DATA INTEGRATION
VI. END-TO-END META DATA MANAGEMENT
META DATA INTEGRATION
Semantic Meta Data Integration
Bidirectional Meta Data Translation
Integration of Design and Event Meta Data
META DATA ANALYSIS
Cross Tool Impact Analysis
Data Lineage
Built-In Customizable Query and Reporting
META DATA SHARING, REUSE, AND DELIVERY
Publish and Subscribe Model for Meta Data Reuse
Automatic Change Notification
On-Line Documentation
Automatic Meta Data Publishing to the DataStage XE Portal Edition
VII. EMBEDDED DATA QUALITY ASSURANCE
VIII. USING THE WEB TO ACCESS THE DATA INTEGRATION ENVIRONMENT
IX. CONCLUSION
X. APPENDIX A
PLATFORMS
STAGES
METABROKERS
Sybase PowerDesigner
CHANGED DATA CAPTURE SUPPORT ENVIRONMENTS
II. Introduction
In today’s information age, organizations have grown beyond the question "Should we build a data warehouse?" to "Can
we consolidate our global business data such that it is reliable and makes sense and what tools or toolsets are available
to help us?" With many organizations implementing e-business, CRM, enterprise applications and decision support sys-
tems, the process of collecting, cleansing, transforming, and preparing business data for analysis has become an enor-
mous integration challenge. Unfortunately, the headlong rush to get these systems up and running has resulted in
unforeseen problems and inefficiencies in managing and interpreting the data. How were these problems created and
better yet, how can they be avoided?
In the beginning, the organizational requirement for the data environment was to provide easy and efficient transfer of
usable data ready for analysis from the transaction systems to the data consumers. This requirement has expanded to:
• Merging single data silos into a reliable, consistent "big picture"
• Developing and maintaining the infrastructure’s meta data such that developers, administrators, and data consumers
understand the data and all its inter-relationships
• Leveraging the Internet and intranet infrastructure so that administrators and data consumers can access systems and
data, anytime and anywhere
• Allowing the data consumers to use their business intelligence (BI) tool of choice
In order to meet all these requirements, developers have had three options for cleansing and integrating enterprise data
such that it is reliable and consistently interpreted. They could:
• Write custom code
• Purchase a packaged solution from an RDBMS vendor
• Equip the infrastructure and data designers with powerful, "best of breed" tools
Choosing a development option without considering the organization’s data requirements or the complex issues involv-
ing data integration is the first pitfall. From there the problems start to build because organizations grow, data users
become more sophisticated, business climates change, and systems and architectures evolve. Before building or updat-
ing their data environment, developers and managers need to ask:
• How will the development options and data environment architecture affect overall performance?
• How will the environment handle data complexities such as merging, translating and maintaining meta data?
• How will we correct invalid or inaccurate data?
• How will we integrate data and meta data especially when it comes from obscure sources?
• How will we manage the data environment as the organization expands and changes?
Simply building a storage container for the data is not enough. Ultimately, the integrated environment should have reli-
able data that is "analysis ready".
Second, since the data coming from online systems is usually full of inaccuracies, it has to be "cleansed" as it is moved to the integrated data environment. Company mergers and acquisitions that bring diverse operational systems together further compound the problem. The data needs to be transformed in order to aggregate detail, expand acronyms, identify redundancies, calculate fields, change data structures, and impose standard formats for currency, dates,
names, and so forth. In fact, data quality is a key factor in determining the accuracy and reliability of the integrated data
environment as a whole.
Third, the details about the data also need to be consolidated into classifications and associations that turn them into
meaningful, accessible information. The standard data definitions, or meta data, must accompany the data in order for
users to recognize the full value of the information. Realizing the data’s full potential means that the users must have a
clear understanding of what the data actually means, where it originated, when it was extracted, when it was last
refreshed, and how it has been transformed.
Fourth, the "cleansed" integrated data and meta data should be accessible to the data users from an Internet portal, an
internal company network or direct system connection. Ideally, they also should be able to mine the data and update the
meta data with any BI tool they choose. In the end, report generation should support the organization’s decision-making
process and not hinder or confuse it.
Ideally, the best approach should allow flexibility in the integrated data architecture so that the organization can easily
add other best-of-breed tools over time. The most flexible and cost effective choice over the long term is to equip the
development team with a toolset powerful enough to do the job as well as allow them to maintain and expand the data
environment.
In addition, the toolset must offer high performance for moving data at run-time. It should pick the data up once and put
it down once, performing all manipulations in memory and using parallel processing wherever possible to increase
speed. Since no tool stands alone, it must have an integration architecture that provides maximum meta data sharing
with other tools without imposing additional requirements upon them. Indeed, each tool should be able to retain its own
meta data structure yet play cooperatively in the environment.
And what about the web? Today’s development team must be able to integrate well-defined, structured enterprise data
with somewhat unstructured e-business data. Ideally, the toolset should accommodate the meta data associated with the
"e-data" as well as provide integration with the organization’s business intelligence (BI) toolset. Without this integration,
the organization runs the risk of misinterpreting the effect of e-commerce on their bottom line. Further, most global enter-
prises have taken advantage of the Internet infrastructure to provide business users with information access anytime,
anywhere. Following suit, the integrated data environment needs to be accessible to users through their web browser
or personalized portal.
Finally, the toolset’s financial return must be clearly measurable. Companies using a toolset that fulfills these require-
ments will be able to design and run real-world integrated data environments in a fraction of the average time.
Since each enterprise has its own data infrastructure and management requirements, Ascential Software offers different
configurations of the DataStage Product Family:
• DataStage XE — the core product package that includes data quality, meta data management, and data integration
• DataStage XE/390 — for full mainframe integration with the existing data environment
• DataStage XE Portal Edition — for web-enabled data integration development, deployment, administration, and access, delivering a complete, personalized view of business intelligence and information assets across the enterprise
• DataStage Connectivity for Enterprise Applications — for integrating specific enterprise systems such as SAP, PeopleSoft and Siebel
Figure 1: The DataStage Product Family
DataStage XE
The foundation of the DataStage Product Family is DataStage XE. DataStage XE consists of three main components: data quality assurance, meta data management, and data integration. It enables an organization to consolidate, collect and centralize high volumes of data from various, complex data structures. DataStage XE incorporates the industry’s premier meta data management solution that enables businesses to better understand the business rules behind their information. Further, DataStage XE validates enterprise information, ensuring that it is complete, accurate and retains its superior quality over time. Using DataStage XE, companies derive more value from their enterprise data, allowing them to reduce project costs and make smarter decisions, faster.
DATA QUALITY ASSURANCE
The data quality assurance component gives development teams and business users the ability to audit, monitor, and certify data quality at key points throughout the data integration lifecycle. They can identify a wide range of data quality problems and business rule violations that can inhibit data integration efforts as well as generate data quality metrics for projecting financial returns.
By improving the quality of the data going into the transformation process, organizations improve the data quality of the
target. The end result is validated data and information for making smart business decisions and a reliable, repeatable,
and accurate process for making sure that quality information is constantly used throughout the enterprise.
DATA INTEGRATION
DataStage XE optimizes the data collection, transformation and consolidation process with a client/server data integra-
tion architecture that supports team development and maximizes the use of hardware resources. The client is an inte-
grated set of graphical tools that permit multiple designers to share server resources and design and manage individual
transformation mappings. The client tools are:
• Designer — Provides the design platform via a "drag and drop" interface to create a visual representation of the data
flow and transformations in the creation of ‘jobs’ that execute the data integration tasks
• Manager — Catalogs and organizes the building blocks of every project including objects such as table definitions,
centrally coded transformation routines, and meta data connections
• Director — Provides interactive control for starting, stopping, and monitoring the jobs
The server is the actual data integration workhorse where parallel processes are controlled at runtime to send data
between multiple disparate sources and targets. The server can be installed in either NT or UNIX environments and
tuned to take advantage of investments in multiple CPUs and real memory. Using the many productivity features includ-
ed in DataStage XE, an organization can reduce the learning curve, simplify administration and maximize development
resources thereby shrinking the development and maintenance cycle for data integration applications.
Figure 2: The combination of data integration, meta data management and data quality assurance makes DataStage XE the industry's most complete data integration platform.
DataStage XE/390
DataStage XE/390 is the industry’s only data integration product to offer full support for client-server and mainframe
environments in a single cross-platform environment. By allowing users to determine where to execute jobs—at the
source, the target, or both—DataStage XE/390 provides data integration professionals with capabilities unmatched by
competitive offerings. No other tool provides the same quick learning curve and common interface to build data integra-
tion processes that will run on both the mainframe and the server. This capability gives the enterprise greater operational
efficiency.
Furthermore, by extending the powerful meta data management and data quality assurance capabilities to the industry-
leading OS/390 mainframe environment, users are ensured that the processed data is reliable and solid for business
analysis.
DataStage XE Portal Edition
For the business user, Ascential Software offers the first portal product that integrates enterprise meta data. With its unique Business Information Directory (BID), DataStage XE Portal Edition links, categorizes, personalizes, and distributes a wide variety of data and content from across the enterprise: applications, business intelligence reports, presentations, financial information, text documents, graphs, maps, and other images.
The Portal Edition gives end users a single point of web access for business intelligence. Users get a full view of the
enterprise by pulling together all the key information for a particular subject area. For example, the business user can
view a report generated by Business Objects without ever leaving their web browser. Furthermore, the business user
can also access the report’s meta data from the web browser thereby increasing the chances of interpreting the data
accurately. The end result is web access to an integrated data environment as well as remote monitoring for developers,
administrators, and users, anytime and anywhere.
DataStage Connectivity for Enterprise Applications
Integrating data from enterprise applications is a challenging and daunting task for developers. These systems usually consist of proprietary applications and complex data structures. Integrating them often requires writing low-level API code to read, transform and then output the data in the other vendor’s format. Writing this code is difficult, time-consuming and costly. Keeping the interfaces current as the packaged application changes presents yet another challenge.
To ease these challenges, Ascential Software has developed a set of packaged application connectivity kits (PACKs) to
collect, consolidate and centralize information from various enterprise applications to enhance business analysis using
open target environments. These PACKs are the DataStage Extract PACK for SAP R/3, DataStage Load PACK for SAP
BW, DataStage PACK for Siebel and the DataStage PACK for PeopleSoft.
IV. The Ascential Approach
DataStage XE’s value proposition stems from Ascential’s underlying technology and its approach to integrating data and
meta data. The product family is a reflection of the company’s extensive experience with mission critical applications and
leading technology. Supporting such systems requires an intimate understanding of operating environments, perfor-
mance techniques, and development methodologies. Combining this understanding with leading technology in areas
such as memory management, user interface design, code generation, data transfer, and external process calls gives
developers an edge in meeting data user requirements quickly and efficiently. The sections that follow detail key ele-
ments that make Ascential’s approach to data integration unique.
Development Advantages
With a premium on IT resources, DataStage XE offers developers specific advantages in building and integrating data
environments. Starting with designing and building applications through integrating data sources, transforming data, run-
ning jobs, and debugging, DataStage XE provides short cuts and automation designed to streamline the development
process and get integration applications up and running reliably.
APPLICATION DESIGN
The graphical palette of the DataStage Designer is the starting place for data integration job design. It allows developers
to easily diagram the movement of data through their environment at a high level. Following a workflow style of thinking,
designers select source, target, and process icons and drop them onto a "drafting table" template that appears initially
as an empty grid. These icons called "stages" exist for each classic source/target combination. The Designer connects
these representative icons via arrows called "links" that illustrate the flow of data and meta data when execution begins.
Users can annotate or write notes onto the design canvas as they create DataStage jobs, enabling comments, labels or
other explanations to be added to the job design. Further, DataStage uses the graphical metaphor to build table lookups,
keyed relationships, sorting and aggregate processing.
Designing jobs in this fashion leads to faster completion of the overall application and simpler adoption of logic by other members of the development team, thereby providing easier long-term maintenance. Developers drag and drop icons that represent data sources, data targets, and intermediate processes into the diagram. These intermediate processes include sorting, aggregating, and mapping individual columns. The Designer also supports drag-and-drop objects from the Manager to the Designer palette. Further, developers can add details such as table names and sort keys via this same graphical drag-and-drop interface. DataStage pictorially illustrates lookup relationships by connecting primary and foreign keys, and it makes function calls by simply selecting routine names from a drop-down menu after clicking the right mouse button. Developers can select user-written routines from a centrally located drop-down list. Sites may import pre-existing ActiveX routines or develop their own in the integrated editor/test facility. At run-time DataStage does all of the underlying work required to move rows and columns through the job, turning the graphical images into reality.
The Designer helps take the complexity out of long-term maintenance of data integration applications. Team members
can share job logic and those who inherit a multi-step transformation job will not be discouraged or delayed by the need
to wade through hundreds of lines of complex programming logic.
DEBUGGING
The DataStage Debugger is fully integrated into the Designer. The Debugger is a testing facility that increases productiv-
ity by allowing data integration developers to view their transformation logic during execution. On a row-by-row basis,
developers set break points and actually watch columns of data as they flow through the job. Developers can immediately detect and correct errors in logic or unexpected legacy data values. This error detection enhances development productivity, especially when working out date conversions or when debugging a complex transformation, such as tracing rows that belong to a specific cost center or determining why certain dimensions in the star schema come up blank or zero. Laboring through extensive report output or cryptic log files is unnecessary because the DataStage Debugger follows the ease-of-use metaphor of the DataStage Designer by graphically identifying user-selected break points where logic can be verified.
HANDLING MULTIPLE SOURCES AND TARGETS
The DataStage design supports the writing of multiple target tables. In the same job, tables can concurrently:
• Be of mixed origin (different RDBMS or file types)
• Have multiple network destinations (copies of tables sent to different geographic regions)
• Receive data through a variety of loading strategies such as bulk loaders, flat file I/O, and direct SQL
By design, rows of data are selectively divided or duplicated when the developer initially diagrams the job flows.
DataStage can optionally replicate rows along multiple paths of logic, as well as split rows both horizontally according to
column value, and vertically with different columns sent to each target. Further, DataStage can drive separate job flows
from the same source. The Server engine from one graphical diagram can split single rows while in memory into multiple
paths or invoke multiple processes in parallel to perform the work. There are no intermediate holding areas required to
meet this objective. The resulting benefit is fewer passes of the data, less I/O, and a clear, concise illustration of real-
world data flow requirements.
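As an illustration of the splitting behavior described above, the following Python sketch routes rows horizontally by a column value and vertically by sending a subset of columns to a second target, all in a single pass with no intermediate staging file. The file and column names are hypothetical; DataStage itself performs this work inside its Server engine rather than in user code.

```python
# Illustrative sketch (not DataStage code): splitting one in-memory stream of rows
# horizontally (by column value) and vertically (different columns per target)
# in a single pass, without an intermediate staging file.
import csv

def split_rows(source_path):
    with open(source_path, newline="") as src, \
         open("sales_east.csv", "w", newline="") as east, \
         open("sales_west.csv", "w", newline="") as west, \
         open("customer_dim.csv", "w", newline="") as cust:
        east_w = csv.writer(east)
        west_w = csv.writer(west)
        cust_w = csv.writer(cust)
        for row in csv.DictReader(src):
            # Horizontal split: route the full row by a column value.
            target = east_w if row["region"] == "EAST" else west_w
            target.writerow([row["order_id"], row["region"], row["amount"]])
            # Vertical split: send only selected columns to a second target.
            cust_w.writerow([row["customer_id"], row["customer_name"]])

split_rows("orders.csv")
```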
NATIVE SUPPORT FOR POPULAR RELATIONAL SYSTEMS
Connecting natively to source and target databases gives the best performance with the least amount of overhead. DataStage supports ODBC and provides drivers for many of the popular RDBMSs. It also concentrates on delivering direct API access, allowing the data integration developer to bypass ODBC and "talk" natively to the source and target structures using direct calls. Wherever possible, DataStage exploits special performance settings offered by vendor APIs and avoids the management overhead associated with the configuration and support of additional layers of software.
APPLYING TRANSFORMATIONS
DataStage includes an extensive library of functions and routines for column manipulation and transformation.
Developers may apply functions to column mappings directly or use them in combination within more complicated trans-
formations and algorithms. After applying functions and routines to desired column mappings, the developer clicks on
the "compile" icon. First, the complier analyzes the job and then determines how to best manage data flows. Next, it cre-
ates object code for each of the individual transformation mappings and uses the most efficient method possible to
manipulate bytes of data as they move to their target destinations.
Within DataStage there are more than 120 granular routines available for performing everything from sub-string opera-
tions through character conversions and complex arithmetic. The scripting language for transformations is an extended
dialect of BASIC that is well suited for data warehousing as it contains a large number of functions dedicated to string
manipulation and data type conversion. Further, this language has a proven library of calls and has been supporting not
only data integration applications but also on-line transaction processing systems for more than 15 years. In addition,
DataStage has more than 200 more complex transformations that cover functional areas such as data type conversions,
data manipulation, string functions, utilities, row processing, and measure conversions, including distance, time, and
weight included in DataStage.
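To make the idea of granular, composable column functions concrete, here is a small Python sketch (not DataStage BASIC) in which a few single-purpose routines are combined into per-column mapping expressions and applied to each row; the column names and conversion choices are illustrative only.

```python
# Illustrative sketch (Python, not DataStage BASIC): granular column functions
# composed into per-column mappings, applied to each row as it flows through.
from datetime import datetime

def trim(s):           return s.strip()
def upcase(s):         return s.upper()
def to_iso_date(s):    return datetime.strptime(s, "%m/%d/%Y").date().isoformat()
def miles_to_km(s):    return round(float(s) * 1.609344, 2)

# One mapping expression per target column, as in a transformer stage.
column_mappings = {
    "CUST_NAME":  lambda r: upcase(trim(r["name"])),
    "ORDER_DATE": lambda r: to_iso_date(r["order_date"]),
    "DIST_KM":    lambda r: miles_to_km(r["distance_miles"]),
}

def transform(row):
    return {col: fn(row) for col, fn in column_mappings.items()}

print(transform({"name": " acme corp ", "order_date": "07/04/2001", "distance_miles": "12.5"}))
```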
MERGE STAGE
Included in DataStage is the Merge Stage. It allows developers to perform complex matching of flat files without having
to first load data into relational tables or use cryptic SQL syntax. The Merge Stage supports the comparison of flat files
with the option of selecting up to seven different Boolean set matching combinations. More versatile than most SQL
joins, the Merge Stage allows developers to easily specify the intersection of keys between the files, as well as the
resulting rows they wish to have processed by the DataStage job. Imagine comparing an aging product listing to the
results of annual revenue. The Merge Stage lets data users ask questions such as "Which products sold in what territo-
ries and in what quantities?" and "Which products have not sold in any territory?" The Merge Stage performs a compari-
son of flat files, which is often the only way to flag changed or newly inserted records, to answer questions such as
"Which keys are present in today’s log that were not present in yesterday’s log?" When flat files are the common
sources, the Merge Stage provides processing capability without having to incur the extra I/O of first loading the struc-
tures into the relational system.
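The kind of flat-file matching the Merge Stage performs can be sketched in a few lines of Python using set operations on keys; the file and key names below are hypothetical, and the real stage offers the matching combinations as point-and-click options rather than code.

```python
# Illustrative sketch (not the Merge Stage itself): matching two flat files on a key
# to answer "which keys are in today's log but not in yesterday's?" and
# "which keys appear in both?" without loading either file into an RDBMS.
import csv

def keyed(path, key="product_id"):
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

today = keyed("log_today.csv")
yesterday = keyed("log_yesterday.csv")

new_keys     = today.keys() - yesterday.keys()   # inserted since yesterday
matched_keys = today.keys() & yesterday.keys()   # present in both files
unsold_keys  = yesterday.keys() - today.keys()   # e.g. products with no sales today

for k in new_keys:
    print("new record:", today[k])
```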
GRAPHICAL JOB SEQUENCER
Within the same graphical interface, "stages" which represent activities (e.g. run a job) and "links" which represent control flow between activities (i.e. conditional sequencing) are used for constructing batch jobs or job sequences.
Control of job sequencing is important because it supports the construction of a job hierarchy complete with inter-
process dependencies. For instance in one particular data integration application, "AuditRevenue" may be the first job
executed. No other jobs can proceed until (and unless) it completes with a zero return code. When "AuditRevenue" fin-
ishes, the Command Language signals DataStage to run two different dimension jobs simultaneously. A fourth job,
"ProcessFacts" waits on the two dimension jobs, as it uses the resulting tables in its own transformations. These jobs
would still function correctly if run in a linear fashion, but the overall run time of the application would be much longer.
The two dimension jobs run concurrently and in their own processes. On a server box configured with enough resources they will run on separate CPUs, even as the series of sequences becomes quite complicated.
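The sequence described above can be sketched as follows; this is illustrative Python, not the DataStage job control facility, and the job names other than "AuditRevenue" and "ProcessFacts", as well as the run_job.sh wrapper, are assumptions.

```python
# Illustrative sketch of the sequencing described above (not the DataStage job
# control language): AuditRevenue must finish with return code 0, the two
# dimension jobs then run in parallel, and ProcessFacts waits on both.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_job(name):
    # Stand-in for invoking a DataStage job; here each job is a shell command.
    rc = subprocess.call(["./run_job.sh", name])
    print(f"{name} finished with return code {rc}")
    return rc

if run_job("AuditRevenue") == 0:                     # gate on a zero return code
    with ThreadPoolExecutor(max_workers=2) as pool:
        dims = [pool.submit(run_job, j) for j in ("DimCustomer", "DimProduct")]
        if all(f.result() == 0 for f in dims):       # wait for both dimension jobs
            run_job("ProcessFacts")
else:
    raise SystemExit("AuditRevenue failed; downstream jobs not started")
```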
Advanced Maintenance Tools
Once up and running, the integrated data environment presents another set of challenges in terms of maintenance and updates. Industry experts say that upwards of 70 percent of the total cost associated with data integration projects lies in maintaining the integrated environment. Ascential’s approach takes maintenance challenges into account right from the start. Providing functions such as object reuse, version control and changed data capture helps developers meet ongoing data user requirements quickly and without burdening IT resources.
VERSION CONTROL
In addition, DataStage offers version control that saves the history of all the data integration development. It preserves
application components such as table definitions, transformation rules, and source/target column mappings within a two-
part numbering scheme. Developers can review older rules and optionally restore entire releases that can then be
moved to remote locations. The DataStage family of products allows developers to augment distributed objects locally as
they are needed. The objects are moved in a read-only fashion so as to retain their "corporate" identity. Corporate IT
staff can review local changes and "version" them at a central location. Version control tracks source code for develop-
ers as well as any ASCII files such as external SQL scripts, shell command files and models. Version control also pro-
tects local read-only versions of the transformation rules from inadvertent network and machine failures, so one point of
failure won’t bring down the entire worldwide decision support infrastructure.
[Figure: Corporate Development, Production Sites, and Branch Offices & Other Locations linked by DataStage XE]
CAPTURING CHANGED DATA
Network overhead and lengthy load times are unacceptable in volume-intensive environments. Support for Changed Data Capture minimizes the load times required to refresh the target environments.
Changed Data Capture describes the methodologies and tools required to support the acquisition, modification, and migration of recently entered or updated records. Ascential Software realizes the inherent complexities and pitfalls of capturing changed data and has created a multi-tiered strategy to assist developers in building the most effective solution based on logs, triggers, and native replication. There are two significant problems to solve when dealing with changed data:
• The identification of changed records in the operational system
• Applying the changes appropriately to the data integration infrastructure
To assist in the capture of changed data, Ascential Software provides tools that obtain new rows from the logs and trans-
actional systems of mainframe and popular relational databases such as DB2, IMS, and Oracle. In fact, each version of
Changed Data Capture is a unique product that reflects the optimal method of capturing changes given the technical
structure of the particular database. The Changed Data Capture stage specifically offers properties for timestamp check-
ing and the resolution of codes that indicate the transaction type such as insert, change, or delete. At the same time, its
design goals seek to select methods that leverage the native services of the database architecture, adhere to the data-
base vendor’s document formats and APIs as well as minimize the invasive impact on the OLTP system.
See Appendix A for a list of supported environments for Changed Data Capture.
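A minimal sketch of those two problems, assuming a hypothetical source_changes table with a timestamp column and an insert/update/delete code, might look like this (plain Python and SQLite, not Ascential's Changed Data Capture products):

```python
# Illustrative sketch of the two CDC problems named above: identify changed rows
# by timestamp, then apply them to the target according to a transaction-type
# code (I = insert, U = update, D = delete). Table and column names are assumed.
import sqlite3

def apply_changes(conn, last_run):
    cur = conn.cursor()
    cur.execute(
        "SELECT id, name, change_type FROM source_changes WHERE changed_at > ?",
        (last_run,),
    )
    for row_id, name, change_type in cur.fetchall():
        if change_type == "D":
            cur.execute("DELETE FROM target WHERE id = ?", (row_id,))
        elif change_type == "I":
            cur.execute("INSERT INTO target (id, name) VALUES (?, ?)", (row_id, name))
        else:  # "U"
            cur.execute("UPDATE target SET name = ? WHERE id = ?", (name, row_id))
    conn.commit()

# apply_changes(sqlite3.connect("warehouse.db"), "2001-06-30T00:00:00")
```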
NATIONAL LANGUAGE SUPPORT
Ascential has designed the DataStage product family for internationalization from the ground up, adopting a unified and industry-standard approach. The cornerstones of the implementation are as follows:
Windows Client Standards — All DataStage client interfaces fully support the 32-bit Windows code page standards in terms
of supporting localized versions of Windows NT and Windows 2000. For example, developers using the DataStage
Designer in Copenhagen, Denmark; Moscow, Russia and Tel Aviv, Israel would all function correctly using their various
Danish, Russian and Israeli Windows versions.
DataStage Server — DataStage servers are not dependent upon the underlying operating system for their internationalization support. Instead, each server can be enabled to use a single, unified internal character set (Unicode) for all likely character mapping, and POSIX-based national conventions (locales) for sort, up-casing, down-casing, currency, time (date), collation and numeric representations.
Localization — All DataStage XE components have the necessary client infrastructure in place to enable full localization of the GUI into a local language. This includes help messages, documentation, menus, install routines, and individual components such as the Transformer stage. DataStage is already available as a fully localized Kanji version.
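As a small illustration of the locale-driven behavior described above (standard-library Python rather than the DataStage NLS layer, and assuming a Danish locale is installed on the machine):

```python
# Minimal sketch of the locale idea: string data is held in Unicode, while a
# POSIX locale drives collation. Assumes the da_DK.UTF-8 locale is installed.
import locale

names = ["Østergaard", "Andersen", "Ågren"]

print(sorted(names))                      # plain code-point order

locale.setlocale(locale.LC_COLLATE, "da_DK.UTF-8")
print(sorted(names, key=locale.strxfrm))  # Danish collation: Ø and Å sort after Z
```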
Expediting Data Transfer
Moving data within the integrated environment and across corporate networks can have a large impact on performance
and application efficiency. Ascential’s approach takes this challenge into account and as a result has built in additional
stages specifically designed for data transfers.
FTP STAGE
Preparing data for loading into the target environment can involve the use of file transfer protocol to move flat files
between machines in the corporate network. This process is supported generally by shell programs and other user-written scripts and requires that there be enough disk space on the target machine to retain a duplicate copy of the file.
While this process works, it does not efficiently use available storage. DataStage provides FTP support in a stage that
uses file transfer protocol, but skips the time consuming I/O step needed when executing a stand-alone command pro-
gram. While blocks of data are sent across the network, the Server engine pulls out the pre-mapped rows, moving them
directly into the transformation process. In addition, FTP can be used to write flat files to remote locations. The FTP
Stage provides great savings in time as no extra disk I/O is incurred before or after job execution.
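The idea behind the FTP Stage, processing rows as the blocks arrive instead of landing a duplicate file first, can be sketched with Python's standard ftplib; the host, credentials, path, and delimiter below are placeholders:

```python
# Illustrative sketch of the idea behind the FTP Stage: pull a remote flat file
# and feed rows straight into the transformation step as blocks arrive, without
# first landing a duplicate copy on local disk.
from ftplib import FTP

def transform(line: str):
    print("row:", line.split("|"))        # stand-in for the real transformation

def stream_remote_file(host, user, password, path):
    buffer = b""
    def on_block(block: bytes):
        nonlocal buffer
        buffer += block
        while b"\n" in buffer:            # emit complete rows as they arrive
            line, buffer = buffer.split(b"\n", 1)
            transform(line.decode("utf-8"))
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.retrbinary(f"RETR {path}", on_block)

# stream_remote_file("ftp.example.com", "etl", "secret", "/extracts/orders.dat")
```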
Performance and Scalability
Every aspect of the DataStage Server engine is optimized to move the maximum number of rows per second through
each process. By default, jobs are designed to pick up data just once from one or more sources, manipulate it as neces-
sary, and then write it once to the desired target. Consequently DataStage applications do not require intermediate files
or secondary storage locations to perform aggregation or intermediate sorting. Eliminating these steps avoids excessive
I/O, which is one of the most common performance bottlenecks. Data does not have to be loaded back into a relational
system to be reprocessed via SQL. As a result, DataStage jobs complete in a shorter period of time.
If multiple targets are critical to the application, DataStage will split rows both vertically and horizontally and then send
the data through independent paths of logic—each without requiring a temporary staging area. DataStage makes it a pri-
ority to do as much as possible in-memory and to accomplish complex transformations in the fewest possible passes of
the data. Again, the benefit of this approach is fewer input/output operations. The less disk activity performed by the job,
the faster the processes will complete. Target systems can be on-line and accessible by end users in a shorter period of
time. The following sections describe the core concepts within DataStage that support this methodology.
BULK LOADERS
Ascential Software is committed to supporting the high speed loading technologies offered by partners and other indus-
try-leading vendors. Bulk loaders or "fast load" utilities permit high-speed insertion of rows into a relational table by typi-
cally turning off logging and other transactional housekeeping performed in more volatile on-line applications. DataStage
supports a wide variety of such bulk load utilities either by directly calling a vendor’s bulk load API or generating the con-
trol and matching data file for batch input processing. DataStage developers simply connect a Bulk Load Stage icon to
their jobs and then fill in the performance settings that are appropriate for their particular environment. DataStage man-
ages the writing and re-writing of control files especially when column names or column orders are adjusted. In the case
of bulk API support, DataStage takes care of data conversion, buffering, and all the details of issuing low-level calls. The
developer gets the benefit of bulk processing without having to incur the learning curve associated with tricky coding
techniques.
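For the control-file path, a rough sketch of what gets generated might look like the following, shown in the style of Oracle's SQL*Loader with placeholder table and column names; DataStage produces and maintains equivalent files automatically:

```python
# Illustrative sketch of the "control file plus matching data file" path a bulk
# loader can take (in the style of Oracle SQL*Loader; names are placeholders).
import csv

def write_bulk_load_files(rows, table, columns):
    with open(f"{table}.dat", "w", newline="") as data_file:
        writer = csv.writer(data_file)
        for row in rows:
            writer.writerow([row[c] for c in columns])

    control = (
        "LOAD DATA\n"
        f"INFILE '{table}.dat'\n"
        f"APPEND INTO TABLE {table}\n"
        "FIELDS TERMINATED BY ','\n"
        f"({', '.join(columns)})\n"
    )
    with open(f"{table}.ctl", "w") as ctl_file:
        ctl_file.write(control)

write_bulk_load_files(
    rows=[{"order_id": 1, "region": "EAST", "amount": 19.95}],
    table="sales_fact",
    columns=["order_id", "region", "amount"],
)
```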
Support for Complex Data Types
XML PACK
XML Pack provides support for business data exchange using XML. With XML Pack, users can read XML-formatted text
and write out data into an XML format, then exchange both business-to-business and application-to-application data and
meta data.
Specifically, the Pack’s XML Writer and XML Reader enable companies to leverage their legacy investments for data exchange over the Internet without the need for writing specific point-to-point interchange programs and interfaces. Using XML Writer, DataStage can read data from legacy applications, ERP systems, or operational
databases, transform or integrate the data with other sources, and then write it out to an XML document. The XML Reader
allows the reading or import of XML documents into DataStage. Further, DataStage can integrate data residing in the XML
document with other data sources, transform it, and then write it out to XML or any other supported data target.
Users can employ XML not only as a data encapsulation format but also as a standard mechanism for describing meta
data. By means of an XML Pack Document Type Definition (DTD) for extracting meta data, DataStage supports the
exporting and importing of all meta data related to the data integration process. When exporting the meta data into an
XML document, DataStage supplies the DTD thus enabling other tools to understand the structure and contents of the
exported document. In addition, XML Pack enables users to export XML meta data from any report, including catalogs,
glossaries, and lineage reports, to an XML document. From there users can display the documents in an XML-capable browser or import them into any other application that supports XML.
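A minimal sketch of the two directions, rows written out as an XML document and an XML document read back into rows, using plain Python rather than the XML Pack itself (element names are placeholders):

```python
# Illustrative sketch (not the XML Pack): writing rows out as a simple XML
# document and reading them back in for further transformation.
import xml.etree.ElementTree as ET

rows = [
    {"customer_id": "C001", "name": "Acme Corp", "balance": "125.40"},
    {"customer_id": "C002", "name": "Globex", "balance": "87.00"},
]

# "XML Writer" direction: rows -> XML document
root = ET.Element("customers")
for row in rows:
    rec = ET.SubElement(root, "customer")
    for col, value in row.items():
        ET.SubElement(rec, col).text = value
ET.ElementTree(root).write("customers.xml", encoding="utf-8", xml_declaration=True)

# "XML Reader" direction: XML document -> rows
for rec in ET.parse("customers.xml").getroot():
    print({child.tag: child.text for child in rec})
```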
CLICKPACK
ClickPack, an intrinsic part of DataStage, extracts and integrates data from web server log files and email servers and
transforms it as part of a complex data integration task. It speeds up the development time for extracting important visi-
tor behavior information. Specifically ClickPack provides:
Web server log file readers — The content of web server log files reveals who has visited a web site, where they travel from click to click, and what they search for or download. ClickPack provides two plug-in stages for the DataStage Designer to extract data from these log files and, using its built-in transforms and routines, process the data for integration elsewhere. These stages optimize performance by allowing preprocessing of the log file to filter out unwanted data.
ClickPack provides basic log file reading capability using the LogReader plug-in stage, and advanced log reading using
the WebSrvLog stage. In the former, the format is determined by tokenizing patterns in each column definition.
Tokenizing patterns use Perl regular expressions to specify the format of each data field read from the web log file. The
pre-supplied templates enable reading of generic format web log server files in the following formats: Common log for-
mat (CLF), Extended Common Log Format (ECLF or NCSA format) and Microsoft IIS log format (W3C format).
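The tokenizing-pattern approach can be illustrated with a single regular expression for the Common Log Format; this is plain Python, not the LogReader stage, and the filtering rule at the end is just an example of the preprocessing mentioned above:

```python
# Illustrative sketch of the tokenizing-pattern idea: a regular expression per
# field parses Common Log Format (CLF) lines into columns for downstream use.
import re

CLF = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

sample = '192.0.2.10 - alice [10/Oct/2001:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

match = CLF.match(sample)
if match:
    row = match.groupdict()
    # Filter out unwanted data before further processing, e.g. image requests.
    if not row["request"].split(" ")[1].endswith((".gif", ".png")):
        print(row)
```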
The WebSrvLog stage is a more advanced toolset that includes a plug-in stage and sample jobs. These provide building
blocks that enable the definition of jobs that provide more performant web log analysis, particularly where the log for-
mats have been customized. Either stage is inserted into the Designer canvas in the same way as all other stages. The
WebSrvLog stage represents the directory in which the log files are held.
POP3 compliant Email server reader — ClickPack provides a plug-in stage to read email messages from mail servers
that support the POP3 protocol. It can separate the messages into columns representing different parts of the message,
for example, the From, To, and Subject fields. New ClickPack transforms support both Mail ID parsing and the extraction
of tagged head or body parts from messages.
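A rough equivalent using only the Python standard library, with placeholder server name and credentials, might look like this:

```python
# Illustrative sketch (standard-library Python, not the ClickPack stage): reading
# messages from a POP3 server and splitting them into From/To/Subject columns.
import poplib
from email import message_from_bytes

def fetch_mail_rows(host, user, password):
    rows = []
    mailbox = poplib.POP3_SSL(host)
    mailbox.user(user)
    mailbox.pass_(password)
    count, _ = mailbox.stat()
    for i in range(1, count + 1):
        _, lines, _ = mailbox.retr(i)
        msg = message_from_bytes(b"\r\n".join(lines))
        rows.append({"From": msg["From"], "To": msg["To"], "Subject": msg["Subject"]})
    mailbox.quit()
    return rows

# for row in fetch_mail_rows("pop.example.com", "etl", "secret"): print(row)
```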
ClickPack Utilities — ClickPack provides various transforms, routines, table definitions and sample jobs, which can be
used in processing the output from the web log reader stages and the email reader stage.
LEGACY SOURCES
Legacy data sources should be as transparent to the data integration developer as the computer system that they run
on. Ascential enables developers to effortlessly and efficiently pull data and the appropriate meta data from legacy
sources. Once collected, the data is easily represented graphically.
IMS
Sophisticated gateway technology provides the DataStage transformation engine with SQL-like access to IMS, VSAM, flat
files, as well as other potential legacy sources. This connectivity software consists of programs residing locally on the
mainframe, usually in one or more long running address spaces, a listener and a data access component. The listener
waits for communication from DataStage running on Unix or NT connected to the mainframe via TCP/IP. The Data Access
component receives SQL upon connection from the remote application, and translates that SQL into lower level IMS calls
for retrieval and answer set processing. Once an answer set is processed, the resulting rows are sent down to the awaiting
client application. These resulting rows are sent down a "link" for further processing within the DataStage engine and ulti-
mately placed into appropriate targets or used for other purposes. Specifically, Ascential’s IMS connectivity features:
- Meta data import from COBOL FDs, IMS DBDs and PSBs as well as PCBs within PSBs.
- Full support of IMS security rules. RACF and Top Secret security classes are supported in addition to userid access to datasets that may be allocated. Security logon is checked by the DataStage job at run time as well as when obtaining meta data. The data integration developer can easily switch PSBs and access multiple IMS databases within the enterprise if necessary. New PSBs do not have to be generated.
- Exploitation of all performance features. Full translation of SQL WHERE clauses into efficient SSAs, the screening conditions in IMS. All indexes are supported. In addition, IMS fields specifically mentioned in the DBD itself and non-sensitive fields or "bytes" in the DBD that are then further defined by a COBOL FD or other source are also supported.
- Efficient support for multiple connections. This could be multiple data integration jobs running on the NT or Unix client,
or simply one job that is opening up multiple connections for parallel processing purposes. Ascential’s IMS connectivi-
ty uses DBCTL, which is the connection methodology developed by IBM for CICS to talk to IMS. No additional
address spaces are required and new thin thread connections to IMS are established dynamically when additional
jobs or connections are spawned within DataStage.
- Support for all types of IMS databases, including Fast Path, partitioned and tape based HSAM IMS database types.
- Ability to access and join VSAM, QSAM, and/or DB2 tables in the same SQL request.
ADABAS
An option for DataStage called DataStage ADABAS Connect offers developers native access to ADABAS, a powerful
mainframe based data management system that is still used heavily in OLTP systems. DataStage ADABAS Connect
provides two important tools for retrieving data locked in ADABAS:
- NatQuery, which automatically generates finely tuned requests using Natural, the standard ADABAS retrieval
language
- NatCDC, which further expands the power of Ascential’s Changed Data Capture toolset
Using NatQuery, developers import meta data from ADABAS DDMs and construct queries by selecting desired columns
and establishing selection criteria. NatQuery understands how to retrieve complex ADABAS constructs such as:
Periodic-Group fields (PEs), Multiple-Valued fields (MUs), as well as MUs in PEs. Additionally, NatQuery fully exploits all
ADABAS key structures, such as descriptors and super-descriptors. NatQuery is a lightweight solution that generates
Natural and all associated runtime JCL.
NatCDC is a highly optimized PLOG processing engine that works by providing developers with "point in time" snap-
shots using the ADABAS Protection Logs. This assures that there is no burden on the operational systems, and simpli-
fies the identification and extraction of new and altered records. NatCDC recognizes specific ADABAS structures within
the PLOG, performs automatic decompression, and reads the variable structure of these transaction specific recovery
logs.
MQSERIES
Many enterprises are now turning to message-oriented, "near" real-time transactional middleware as a way of moving
data between applications and other source and target configurations. The IBM MQSeries family of products is currently one of the most popular examples of messaging software because it enables enterprises to integrate their business
processes.
The DataStage XE MQSeries Stage treats messages as another source or target in the integrated environment. In other
words, this plug-in lets DataStage read from and write to MQSeries message queues. The MQSeries Stage can be used
as:
• An intermediary between applications, transforming messages as they are sent between programs
• A conduit for the transmission of legacy data to a message queue
• A message queue reader for transmission to a non-messaging target
Message-based communication is powerful because two interacting programs do not have to be running concurrently as
they do when operating in a classic conversational mode. Developers can use DataStage to transform and manipulate
message contents. The MQSeries Stage treats messages as rows and columns within the DataStage engine like any
other data stream. The MQSeries Stage fully supports using MQSeries as a transaction manager, browsing messages
under the "syncpoint control" feature of MQ and thus guaranteeing that messages between queues are delivered to their
destination.
The MQSeries Stage is the first step in providing real-time data integration and business intelligence. With the MQSeries
Stage, developers can apply all of the benefits of using a data integration tool to application integration.
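To illustrate the messages-as-rows idea, the sketch below uses an in-process queue as a stand-in for an MQSeries queue; the real stage connects to actual message queues and can do so under syncpoint control, which this toy example does not attempt:

```python
# Illustrative sketch of treating messages as rows and columns. An in-process
# queue stands in for an MQSeries queue; the MQSeries Stage reads from and
# writes to real message queues.
import json
import queue

incoming = queue.Queue()   # stand-in for the source message queue
outgoing = queue.Queue()   # stand-in for the target message queue

incoming.put(json.dumps({"order_id": 42, "amount": "19.95", "currency": "USD"}))

while not incoming.empty():
    message = json.loads(incoming.get())           # message -> row
    message["amount_cents"] = int(round(float(message["amount"]) * 100))
    outgoing.put(json.dumps(message))              # transformed row -> message

print(outgoing.get())
```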
Integrating Enterprise Applications
Figure 6 depicts the architecture for integrating data from multiple enterprise applications including SAP, PeopleSoft, and Siebel.
DATASTAGE EXTRACT PACK FOR SAP R/3
The DataStage Extract PACK for SAP R/3 enables developers to extract SAP R/3 data and integrate it with data from
other enterprise sources thereby providing an enterprise-wide view of corporate information. Once extracted, DataStage
can manipulate the R/3 data by mapping, aggregating and reformatting it for the enterprise data environment. Further, it
generates native ABAP code (Advanced Business Application Programming) thereby eliminating manual coding and
speeding up deployment of SAP data integration. To optimize environment performance and resources, developers can
opt to have the generated ABAP code uploaded to the R/3 system via a remote function call (RFC). Or the developer can manually move the R/3 script into the system using FTP and the R/3 import function. Developers can choose to control job scheduling by using the DataStage Director or SAP’s native scheduling services. The DataStage Extract PACK for SAP R/3 employs SAP’s RFC library and IDocs, the two primary data interchange mechanisms for SAP R/3 access, to conform to SAP interfacing standards.
Further, the Extract PACK integrates R/3 meta data and ensures that it is fully shared and re-used throughout the data
environment. Using DataStage’s Meta Data object browser, developers and users can access transparent, pool, view,
and cluster tables.
With over 15,000 SAP database tables and their complex relationships, the Meta Data Object Browser provides easy navigation through the Info Hierarchies before joining multiple R/3 tables. Overall, this makes the meta data integration process much simpler as well as error-free.
DATASTAGE LOAD PACK FOR SAP BW
The DataStage Load PACK for SAP BW offers automated data transformation and meta data validation. Browsing and selecting SAP BW source system and information structure meta data is simplified. Data integration is automated as the RFC Server listens and responds to SAP BW requests for data. The Load PACK for SAP BW uses high-speed bulk BAPIs (Business Application Programming Interfaces), SAP’s strategic technology for linking components into the business framework. In fact, SAP has included the DataStage Load PACK for SAP BW in its mySAP product offering. The end result is an accurate, enterprise-wide view of the organization’s data environment.
DATASTAGE PACK FOR SIEBEL
DataStage’s PACK for Siebel eliminates tedious meta data entry because it extracts the meta data too. It captures the business logic and associated expressions, resulting in easy maintenance as well as faster deployment. DataStage allows developers to integrate customer data residing in Siebel with other ERP and legacy systems, thus providing a complete business view. True enterprise data integration means harnessing valuable customer information that will in turn lead to improved customer satisfaction, leveraged cross-selling and up-selling opportunities, and increased profitability.
DATASTAGE PACK FOR PEOPLESOFT
With the DataStage PACK for PeopleSoft, organizations can make strategic business decisions based on accurate information that is easy to interpret. The PeopleSoft PACK allows developers to view and select PeopleSoft meta data using familiar PeopleSoft terms and menu structures. The developer selects the meta data and DataStage then extracts, transforms, and loads the data into the target environment.
In addition, developers can use any PeopleSoft business process or module as a data source as well as map data ele-
ments to target structures in the warehouse. Once in the warehouse, PeopleSoft data can be integrated with mainframe
data, relational data or data from other ERP sources. For ongoing maintenance, this PeopleSoft PACK allows develop-
ers to capture and share meta data as well as view and manipulate the meta data definitions. All of these features give
developers a clear view of the warehouse design, where and when data changed, and how those changes originated.
The PeopleSoft PACK offers a consistent user interface for a short learning curve, support for multiple platforms (NT, UNIX), and globalization (multi-byte support and localization) for cost-effective flexibility. The end result is a consistent, reliable infrastructure that includes PeopleSoft data without months of waiting or the extra expense of custom development.
V. Native Mainframe Data Integration
DataStage XE/390 provides data developers with a capability unmatched by competitive offerings. No other tool provides the same quick learning curve and common interface to build data integration processes that will run on both the mainframe and the server.
Once the job design is completed, DataStage XE/390 will generate three files that are moved onto the mainframe: a COBOL program and two JCL jobs. One JCL job compiles the program and the other runs the COBOL program.
Typically, an organization using mainframes will follow one of two deployment philosophies:
• Both their transaction system and integrated data will be on the mainframe or
• Minimum processing is done on their mainframe based transactional systems and the integrated data is deployed
on a Unix or NT server
DataStage XE/390 excels in both instances. In the first instance, all processing is kept on the mainframe while data inte-
gration processes are created using the DataStage GUI to implement all the complex transformation logic needed.
In the second instance, DataStage XE/390 can be used to select only the rows and columns of interest. These rows can
have complex transformation logic applied to them before being sent via FTP to the UNIX or NT server running
DataStage. Once the file has landed, the DataStage XE server will continue to process and build or load the target envi-
ronment as required.
VI. End-to-End Meta Data Management
Typically, there are two common methods to integrate meta data. The theoretical "quickest and easiest" method is to
write a bridge from one tool to another. Using this approach, a developer must understand the schema of the source tool
and the schema of the target tool before writing code to extract the meta data from the source tool’s format, transform it,
and put it into the target tool’s format. If the exchange needs to be bidirectional then the process must also be done in
reverse. Unfortunately this approach breaks down as the number of integrated tools increases. Two bridges support the
exchange between two tools. It takes six to integrate three tools, twelve to integrate four tools, and so forth. If one of the
four integrated tools changes its schema then the developer must change six integration bridges to maintain the integra-
tion. Writing and maintaining tool-to-tool bridges is an expensive and resource-consuming proposition over time for any IS organization or software vendor.
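The bridge counts quoted above follow directly from the arithmetic of ordered tool pairs, as the short sketch below shows; the contrast with one MetaBroker per tool is discussed in the hub-based approach that follows.

```python
# With n tools and bidirectional exchange, every ordered pair of tools needs its
# own bridge, i.e. n * (n - 1) bridges, versus one MetaBroker per tool in the
# hub approach.
def point_to_point_bridges(n_tools: int) -> int:
    return n_tools * (n_tools - 1)

for n in (2, 3, 4, 10):
    print(n, "tools:", point_to_point_bridges(n), "bridges vs", n, "MetaBrokers")
# 2 tools -> 2 bridges, 3 -> 6, 4 -> 12, 10 -> 90
```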
Another integration method is the common meta data model, which generally takes one of two forms. The first form,
characterized as a "least common denominator" model, contains a set of meta data that a group of vendors agree to use
for exchange. However, the exchange is limited to the items described in the common model. The more vendors
involved in the agreement, the smaller the agreed set tends to become. Alternatively, a vendor can establish their own
meta data model as "the standard" and require all other vendors to support that model in order to achieve integration.
This method benefits the host vendor’s product, but it typically does not cover all the meta data that could be shared across tool suites. Sometimes a common model offers extension capabilities to overcome this shortcoming, but the extensions tend to be private or tool-specific: they are either not available to be shared at all, or shared only through a custom interface to that portion of the model, which brings the discussion back to writing custom bridges.
Ascential takes a different approach to meta data sharing in the DataStage Product Family. The core components of the
technology are the Meta Data Integration Hub, a patented transformation process, and the MetaBroker production
process. The Meta Data Integration Hub is the integration schema and is vendor neutral in its approach. It is composed
of granular, atomic representations of the meta data concepts that comprise the domain of software development. As
such, it represents a union of all of the meta data across tools, not a least common denominator subset. To use an anal-
ogy from the science of chemistry, it could be considered a "periodic table of elements" for software development. It is
this type of model that allows all the meta data of all tools to be expressed and shared.
Software tools express their meta data not at this kind of granular level, but at a more "molecular" level. They "speak" in
terms of tables, columns, entities, and relationships, just as we refer to one of our common physical molecules as salt,
not as NaCl. In an ideal integration situation the tools do not need to change their meta data representation in any way
nor be cognizant of the underlying integration model. The MetaBroker transformation process creates this result. The
MetaBroker for a particular tool represents the meta data just as it is expressed in the tool’s schema. It accomplishes the
exchange of meta data between tools by automatically decomposing the meta data concepts of one tool into their atomic
elements via the Meta Data Integration Hub and recomposing those elements to represent the meta data concepts from
the perspective of the receiving tool. In this way, all meta data and their relationships are captured and retained for use
by any of the tools in the integrated environment.
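A highly simplified sketch of the decompose/recompose idea follows. The object kinds and attribute names below are invented for illustration and are not Ascential's actual hub schema; the point is only that tool-level concepts are broken into shared atomic elements and rebuilt from them.

    # Hypothetical atomic hub store (names are illustrative, not the real schema).
    ATOMIC_HUB = []

    def decompose_table(tool_name, table):
        # Break a tool-level "table" concept into atomic hub elements.
        ATOMIC_HUB.append({"kind": "DataCollection", "name": table["name"], "source": tool_name})
        for col in table["columns"]:
            ATOMIC_HUB.append({"kind": "DataItem", "name": col,
                               "owner": table["name"], "source": tool_name})

    def recompose_entity(table_name):
        # Reassemble the same elements as an "entity" for a modeling tool.
        attributes = [e["name"] for e in ATOMIC_HUB
                      if e["kind"] == "DataItem" and e["owner"] == table_name]
        return {"entity": table_name, "attributes": attributes}

    decompose_table("DataStage", {"name": "CUSTOMER", "columns": ["CUST_ID", "CUST_NAME"]})
    print(recompose_entity("CUSTOMER"))
    # {'entity': 'CUSTOMER', 'attributes': ['CUST_ID', 'CUST_NAME']}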
[Diagram: Tools A, B, and C exchanging meta data through the Meta Data Integration Hub database]
Ascential’s MetaBroker production process is a model-based approach that automatically generates the translation
process run-time code—a translation engine (.dll) that converts tool objects into Meta Data Integration Hub objects and
vice versa for the purpose of sharing meta data. It uses a point-and-click graphical interface to create granular semantic maps from a tool’s meta data model to the Meta Data Integration Hub. This process allows the meta data rules encoded in a MetaBroker to be captured, retained, and updated easily, providing significant time savings when creating and updating data sharing solutions. The current set of MetaBrokers facilitates meta data exchange between
DataStage and popular data modeling and business intelligence tools. See Appendix A for a listing of specific
MetaBrokers provided by Ascential Software. Further, the Custom MetaBroker Development facility allows the Ascential
Services organization to deliver customer-specified MetaBrokers quickly and reliably.
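Conceptually, a MetaBroker is driven by a declarative semantic map rather than hand-written conversion code, which is why updating the map updates the translation. The sketch below is a loose Python analogue of that idea; the map contents and field names are assumptions, and a real MetaBroker is generated as a compiled translation engine.

    # Hypothetical semantic map: tool concept -> hub concept plus field-to-field mappings.
    MODELING_TOOL_MAP = {
        "Entity":    {"hub_kind": "DataCollection", "fields": {"Name": "name"}},
        "Attribute": {"hub_kind": "DataItem",       "fields": {"Name": "name", "Entity": "owner"}},
    }

    def translate(tool_object, semantic_map):
        # Generic translator driven entirely by the semantic map.
        rule = semantic_map[tool_object["type"]]
        hub_object = {"kind": rule["hub_kind"]}
        for tool_field, hub_field in rule["fields"].items():
            hub_object[hub_field] = tool_object[tool_field]
        return hub_object

    print(translate({"type": "Attribute", "Name": "CUST_ID", "Entity": "CUSTOMER"}, MODELING_TOOL_MAP))
    # {'kind': 'DataItem', 'name': 'CUST_ID', 'owner': 'CUSTOMER'}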
The DataStage Director extends ease of use at run-time by providing tools that limit resources and allow validation of
jobs to check for subtle syntax and connectivity errors. It immediately notifies developers that passwords have changed
rather than discovering such failures the following morning when the batch window has closed. The Director also includes a Monitor window that displays real-time information on job execution: the current number of rows processed and an estimate of the throughput in rows per second. Developers often use the Monitor to locate bottlenecks in the migration process.
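The rows-per-second figure the Monitor displays is, in essence, rows processed divided by elapsed time. The following is a trivial, hypothetical Python sketch of that kind of estimate, not the Director's actual implementation:

    import time

    def monitored(rows, report_every=10000):
        # Pass rows through while printing a running rows-per-second estimate.
        start = time.monotonic()
        count = 0
        for row in rows:
            count += 1
            yield row
            if count % report_every == 0:
                elapsed = time.monotonic() - start
                print(f"{count} rows processed, ~{count / elapsed:.0f} rows/sec")

    # Usage (read_source_rows and load_row are placeholders for a job's real stages):
    #   for row in monitored(read_source_rows()):
    #       load_row(row)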
The Meta Data Integration Hub gives the environment a single, canonical representation of meta data, one that the MetaBrokers can transform and express into a variety of different languages, either specific to a particular tool or aligned to industry standards.
The Object Connector facility allows the developer or administrator to specify which items in the Meta Data Integration
Hub are equivalent to other objects for the purpose of impact analysis, by "connecting" them. The GUI enables users to set object connections globally or individually on an object-by-object basis. Cross-tool impact analysis shows its power again by exposing the location and definitions of standard corporate calculated information such as profit or cost of goods sold. Starting from a transformation routine definition, designers can quickly identify which columns within a stage use the routine, allowing developers to implement any change to the calculation quickly. End-
users have the ability to trace the transformation of a data item on their reports back through the building process that
created it. They can now have a complete understanding of how their information is defined and derived.
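Conceptually, cross-tool impact analysis is a traversal over connected meta data objects. The hypothetical Python sketch below, with invented object names and relationships, shows how a change to a routine definition can be traced forward to dependent columns and reports; read in the opposite direction, the same graph supports the end-user trace from a report item back to its origin.

    # Hypothetical dependency graph: object -> objects that consume it.
    USED_BY = {
        "routine:CalcProfit":       ["column:SALES_FACT.PROFIT"],
        "column:SALES_FACT.PROFIT": ["report:Regional Profit (Business Objects)"],
    }

    def impact(obj, graph=USED_BY):
        # Return every downstream object affected by a change to obj.
        affected = []
        for dependent in graph.get(obj, []):
            affected.append(dependent)
            affected.extend(impact(dependent, graph))
        return affected

    print(impact("routine:CalcProfit"))
    # ['column:SALES_FACT.PROFIT', 'report:Regional Profit (Business Objects)']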
Figure 8: Cross-Tool Impact Analysis diagram illustrating where the Table Definition "Customer Detail" is used in ERwin, DataStage, and Business Objects
DATA LINEAGE
Data lineage is the reporting of events that occur against data during the data integration process. When a DataStage
job runs, it reads and writes data from source systems to target data structures and sends data about these events to
the Meta Data Integration Hub automatically, where the event meta data is linked to the corresponding job design meta data. Data Lineage reporting captures meta data about the production processes: the time and duration of production runs, sources, targets, row counts, and read and write activities. It answers questions for the Operations staff: What tables are being read from or
written to, and from what databases on whose servers? What directory was used to record the reject file from last night’s
run? Or, exactly which version of the FTP source was used for the third load? Data lineage reporting allows managers to
have an overall view of the production environment from which they can optimize their resources and correct processing
problems.
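A minimal, hypothetical sketch of the kind of event record such reporting depends on is shown below; the field names and values are assumptions for illustration, not the hub's actual schema.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class LineageEvent:
        # One read or write performed by a job run (illustrative, not the real schema).
        job: str
        run_started: datetime
        duration_sec: float
        action: str        # "read" or "write"
        dataset: str       # table or file touched
        server: str
        row_count: int

    events = [
        LineageEvent("LoadCustomers", datetime(2002, 1, 15, 2, 0), 312.4,
                     "read",  "ORDERS.CUSTOMER", "prod-db01", 48210),
        LineageEvent("LoadCustomers", datetime(2002, 1, 15, 2, 0), 312.4,
                     "write", "DW.CUSTOMER_DIM", "dw-srv02", 48002),
    ]

    # "What tables were written to last night, and on which servers?"
    print([(e.dataset, e.server) for e in events if e.action == "write"])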
Data modeling, data integration, and business intelligence tools can share meta data seamlessly. This flexible layer of
management encourages meta data re-use while preventing accidental overwriting or erasure of critical business logic.
ON-LINE DOCUMENTATION
Developers can create on-line documentation for any set of objects in the Meta Data Integration Hub. DataStage can
output documentation to a variety of useful formats: .htm and .xml for web-based viewing, .rtf for printed documentation,
.txt files, and .csv files that can be imported into spreadsheets or other data tools. Users can select sets of business
definitions or technical definitions for documentation. Path diagrams for Impact Analysis and Data Lineage can be auto-
matically output to on-line documentation as well as printed as graphics. The document set can result from a meta data
query, an import, or a manual selection process. It is vital to the management of the information asset that meta data is
made available to anyone who needs it, tailored to his or her needs.
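As a loose illustration of the .csv output path, the hypothetical Python sketch below writes a selected set of business definitions to a file that a spreadsheet can open; the object names and definitions are invented.

    import csv

    # Hypothetical selection of business definitions pulled from the hub.
    definitions = [
        {"object": "CUSTOMER_DIM", "type": "Table",  "definition": "One row per active customer"},
        {"object": "PROFIT",       "type": "Column", "definition": "Revenue minus cost of goods sold"},
    ]

    with open("business_definitions.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["object", "type", "definition"])
        writer.writeheader()
        writer.writerows(definitions)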
First, through built-in automation features the developers can quickly discover the values within the data and can answer the question: Is the data complete and valid? From meta data derived in this initial discovery process, the developers can then build simple or sophisticated tests, known as Filters, against specific combinations of data fields and sources to ensure that data complies with business rules and database constraints (a small illustrative sketch follows the capability list below). They can run a series of Filters as a job stream or, using the Macro feature, schedule the job stream from within or outside the quality component. The incidents and history of defective data are stored, allowing developers to export this information to other tools, send it by email, or post it on intranet sites. The automated meta data facility reviews and updates meta data information for further analysis and for generating DataStage transformations. Throughout, the data quality assurance component of DataStage XE offers Metric capabilities to measure, summarize, represent, and display the business cost of data defects. The developer can produce output in a variety of formats, including HTML and XML reports, analysis charts showing the relative impact and cost of data quality problems, and trend charts that illustrate data quality improvement over time, which is important for on-going management of the data environment. The data quality assurance capabilities in the DataStage Product Family outlined below ensure that only the highest quality data influences your organization’s decision-making process.
• Automates Data Quality Evaluation — Performs automated domain analysis that determines exactly what is present in
the actual data values within a data environment. The evaluation process further classifies data into its semantic or
business use—a key element in determining what types of data validation and integrity checking apply. Using this
information, DataStage can determine the degree of completeness and validity within these data values.
• Evaluates Structural Integrity — Looking within and across tables and databases, the data quality services of the
DataStage Product Family uncover the degree of conformance of the data environment to the implicitly or explicitly
defined data structure.
• Ensures Rule Conformance — Enables the testing of the data within and across data environments for the adherence
to the organization’s rules and standards. DataStage looks at the interrelationships of data elements to determine if
they correctly satisfy the true data relationships that the business requires.
• Measures and Presents the Impact of Data Quality Defects — Evaluates the condition of the data and the processes
responsible for defective data in an effort to improve either or both. A key feature of the data quality component of the
DataStage Product Family is the metric facility that allows the quantification of relative impacts and costs of data
quality defects. This information provides a means to prioritize and determine the most effective improvement pro-
gram.
• Assesses Data Quality Meta Data Integration — Creates, incorporates, and retains extensive data quality-related meta data. Data integration processes access and use this meta data, which provides a detailed and complete understanding of the strengths and weaknesses of source data.
• On-going Monitoring — Monitors the data quality improvement process to ensure quality improvement goals are met.
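Returning to the Filters introduced above, the sketch below shows in hypothetical Python form the kind of rule test and defect metric such a component applies. The column names, rules, and cost-per-defect figure are assumptions for illustration only.

    # Hypothetical business rules expressed as filters over customer records.
    FILTERS = {
        "missing_customer_id": lambda r: not r.get("CUST_ID"),
        "negative_balance":    lambda r: float(r.get("BALANCE", 0)) < 0,
    }
    COST_PER_DEFECT = 5.00   # assumed business cost, for the metric illustration

    def run_filters(rows):
        # Count rule violations and estimate their business cost.
        incidents = {name: 0 for name in FILTERS}
        for row in rows:
            for name, test in FILTERS.items():
                if test(row):
                    incidents[name] += 1
        total_defects = sum(incidents.values())
        return incidents, total_defects * COST_PER_DEFECT

    sample = [{"CUST_ID": "1001", "BALANCE": "250.00"},
              {"CUST_ID": "",     "BALANCE": "-40.00"}]
    print(run_filters(sample))
    # ({'missing_customer_id': 1, 'negative_balance': 1}, 10.0)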
Depending on the data environment, ensuring data quality can range from the very simple to the very complex. For example, companies dealing with global data need to be localization-ready, so the DataStage Product Family enables data quality analysis of double-byte data. Other organizations require production-mode processing, so it also allows users to run a server version of the SQL-based data quality analysis in a monitoring environment. In addition, DataStage allows developers to plug in complementary partner products such as Trillium and FirstLogic for name and address scrubbing.
The DataStage XE Portal Edition also enhances user productivity. First, keeping track of important information that has
changed because of content modifications, additions or deletions is time-consuming and tedious. The Portal Edition
automatically notifies users of important updates via email, a notification web page, or customized alerts. Further, it
enhances productivity by allowing managers and authorized users to share information with other co-workers or external
business associates such as customers, vendors, partners or suppliers. The Portal Edition integrates with the Windows desktop and best-of-breed business intelligence tools, thereby enabling permission-based direct content publishing from desktop applications such as Microsoft Word, Excel, and PowerPoint. It also automatically links the data environment’s meta data to BI reports. This flexibility enables business users to take advantage of their Internet browser to view Business
Intelligence and meta data reports.
A common obstacle to user productivity is searching through and categorizing the large amounts of data in the data
environment. The DataStage XE Portal Edition features an XML-based Business Information Directory (BID) that provides search and categorization capabilities across all content sources. The BID uses corporate vocabularies to correctly
model the enterprise’s industry-based information, processes, and business flows. Because of the BID’s semantic mod-
eling, non-technical business users can quickly locate specific enterprise data. It also provides users with a way to re-
categorize data because of a significant business event such as a merger or acquisition.
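As a rough, hypothetical illustration of vocabulary-driven categorization (the vocabulary and titles below are invented), the sketch assigns content items to categories by matching terms from a corporate vocabulary; re-categorizing after a business event amounts to re-running the same matching with an updated vocabulary.

    # Hypothetical corporate vocabulary: category -> terms used in the business.
    VOCABULARY = {
        "Sales":        {"revenue", "quota", "cross-selling"},
        "Supply Chain": {"supplier", "inventory", "shipment"},
    }

    def categorize(title):
        # Assign a content item to every category whose terms appear in its title.
        words = set(title.lower().split())
        return [category for category, terms in VOCABULARY.items() if words & terms]

    print(categorize("Q4 revenue and quota report"))          # ['Sales']
    print(categorize("Supplier inventory levels by region"))  # ['Supply Chain']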
In addition to the maintenance and development features of the DataStage XE Portal Edition for managing the complexi-
ties of a distributed heterogeneous environment, the Portal Edition offers administrators convenience and flexibility. As
part of the organization’s intranet infrastructure, the Portal Edition allows administrators to customize the user interface
to support their corporate branding standards. From within the Portal Edition, administrators can handle sensitive tasks
including initial setup, BID configuration, user and group administration, and LDAP directory synchronization, as well as everyday maintenance such as create, modify, backup, and restore operations. Further, administrators have the option to delegate
certain responsibilities to specific users such as the ability to publish content in folders, set permissions on folders, or set
subscriptions to content on behalf of other users. They can also assign permissions to individual users or groups of
users. Furthermore, for organizations that have implemented network and Internet security such as secure socket layer
(SSL) and digital certificates (X.509), the Portal Edition leverages this technology to secure end-user and administrative
access to the data environment.
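At its core, the delegation and permission model described above amounts to an access check per user or group, per folder, per action. The following minimal Python sketch uses invented users, groups, folders, and action names purely to illustrate the idea:

    # Hypothetical permission store: (principal, folder) -> set of allowed actions.
    PERMISSIONS = {
        ("group:analysts", "/reports/sales"): {"read"},
        ("user:jsmith",    "/reports/sales"): {"read", "publish", "set_permissions"},
    }
    GROUP_MEMBERSHIP = {"user:jsmith": ["group:analysts"], "user:alee": ["group:analysts"]}

    def allowed(user, folder, action):
        # Check the user's own grants, then grants on any group the user belongs to.
        principals = [user] + GROUP_MEMBERSHIP.get(user, [])
        return any(action in PERMISSIONS.get((p, folder), set()) for p in principals)

    print(allowed("user:alee",   "/reports/sales", "read"))     # True, via group membership
    print(allowed("user:alee",   "/reports/sales", "publish"))  # False
    print(allowed("user:jsmith", "/reports/sales", "publish"))  # True, delegated to this user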
The DataStage XE Portal Edition integrates easily into any organization’s computing environment without requiring mas-
sive retooling. By using industry standard eXtensible Markup Language (XML), JavaScript, and Java, the Portal Edition
delivers an information integration framework and cross-platform portability. Further, while it provides out-of-the-box con-
tent integration, the Portal Edition allows developers to add content types, repositories, and resources seamlessly
through its extensible SDKs. The software’s underlying design supports GUI customization and true device independence, meaning that users can access the data environment via any browser or wireless device such as a PDA or mobile
phone. The Portal Edition preserves the organization’s investment in the existing data infrastructure while simultaneously
providing the flexibility to incorporate new technologies.
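The device independence described here, and pictured in Figure 9 below, rests on keeping one XML representation of the content and translating it per device. A minimal sketch using the third-party lxml library is shown; the stylesheet file names are placeholders and would need to exist for the code to run.

    from lxml import etree

    # One XML representation of the content, rendered differently per target device.
    content = etree.XML("<report><title>Regional Profit</title><total>1204551</total></report>")

    for stylesheet in ("render_html.xsl", "render_wml.xsl"):   # placeholder stylesheet files
        transform = etree.XSLT(etree.parse(stylesheet))        # compile the per-device stylesheet
        print(str(transform(content)))                         # emit HTML for browsers, WML for phones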
Figure 9: DataStage XE Portal Edition (diagram of the XSLT Translation Server linking the data integration, meta data management, and data quality services with content sources such as Reuters, ZDNet, web sites, .DOC/.PDF/.PPT/.XLS documents, OLTP databases including Oracle, Sybase, and DB2, a data warehouse, and business intelligence tools such as Cognos, Crystal, Brio, and Business Objects)
IX. Conclusion
Rushing headlong to build a useful data environment has left many developers and administrators dealing with unanticipated problems. Even a successful integration project will attract more users who will add and change data requirements. More importantly, the business climate will change and systems and architectures will evolve, leaving developers to figure out how they are going to "make it all work" with the resources they have.
Data integration developers who consider key factors such as performance, network traffic, structured and unstructured
data, data accuracy, integrating meta data, and overall data integration maintenance will stay one step ahead. But is
there a toolset that can help them deal with these factors without creating long learning curves or diminishing
productivity?
With the DataStage Product Family, administrators and developers will have the flexibility to:
• Provide data quality and reliability for accurate business analysis
• Collect, integrate, and transform complex and simple data from diverse and sometimes obscure sources for building
and maintaining the data integration infrastructure without overloading available resources
• Integrate meta data across the data environment crucial for maintaining consistent analytic interpretations
• Capitalize on mainframe power for extracting and loading legacy data as well as maximize network performance
Providing developers with power features is only part of the equation. Ascential Software has long supported developers
in their efforts to meet tight deadlines, fulfill user requests, and maximize technical resources. DataStage offers a devel-
opment environment that:
• Works intuitively thereby reducing the learning curve and maximizing development resources
• Reduces the development cycle by providing hundreds of pre-built transformations as well as encouraging code re-
use via APIs
• Helps developers verify their code with a built-in debugger thereby increasing application reliability as well as reducing
the amount of time developers spend fixing errors and bugs
• Allows developers to quickly "globalize" their integration applications by supporting single-byte character sets and
multi-byte encoding
• Offers developers the flexibility to develop data integration applications on one platform and then package and deploy
them anywhere across the environment
• Enhances developer productivity by adhering to industry standards and using certified application interfaces
Another part of the equation involves the users of integrated data. Ascential Software is committed to maximizing the
productivity of the business users and decision-makers. An integrated data environment built with DataStage offers data
users:
• Quality management which gives them confidence that their integrated data is reliable and consistent
• Meta data management that eliminates the time-consuming, tedious re-keying of meta data as well as provides
accurate interpretation of integrated data
• A completed integrated data environment that provides a global snapshot of the company’s operations
• The flexibility to leverage the company’s Internet architecture for a single point of access as well as the freedom to
use the business intelligence tool of choice
Ascential’s productivity commitment stands on the premise that data integration must have integrity. Ascential Software
has constructed the DataStage family to address data integration issues and make data integration transparent to the
user. In fact, it is the only product today designed to act as an "overseer" for all data integration and business intelli-
gence processes.
In the end, building a data storage container for business analysis is not enough in today’s fast-changing business cli-
mate. Organizations that recognize the true value in integrating all of their business data, along with the attendant com-
plexities of that process, have a better chance of withstanding and perhaps anticipating significant business shifts.
Individual data silos will not help an organization pinpoint potential operation savings or missed cross-selling opportuni-
ties. Choosing the right toolset for your data environment can make the difference between being first to market and
capturing large market share or just struggling to survive.
X. Appendix
Client Tools
Stages
Direct:
- Oracle OCI
- Sybase Open Client
- Informix CLI
- OLE/DB
- ODBC
- Teradata
- IBM DB2
- Universe
- Sequential
Process:
- Transformation
- Aggregation
- Sort
- Merge
- Pivot
Bulk Loaders:
- Oracle OCI
- Oracle Express
- Informix XPS
- Sybase Adaptive Server and Adaptive Server IQ
- Microsoft SQL Server 7
- BCPLoad
- Informix Redbrick
- Teradata
- IBM UDB
Special Purpose Stages:
- Web Log Reader
- Web Server Log Reader
- Container
- POP3 Email Reader
- Folder
- XML Reader
- XML Writer
- IBM MQ Series
- Complex Flat File
- Hashed File
- Perl Language Plug-in
- Named Pipe
- FTP
About Ascential Software
Ascential Software Corporation is the leading provider of Information Asset Management solutions to the Global 2000.
Customers use Ascential products to turn vast amounts of disparate, unrefined data into reusable information assets that
drive business success. Ascential’s unique framework for Information Asset Management enables customers to easily
collect, validate, organize, administer and deliver information assets to realize more value from their enterprise data,
reduce costs and increase profitability. Headquartered in Westboro, Mass., Ascential has offices worldwide and supports
more than 1,800 customers in such industries as telecommunications, insurance, financial services, healthcare,
media/entertainment and retail. For more information on Ascential Software, visit http://www.ascentialsoftware.com.
© 2002 Ascential Software Corporation. All rights reserved. Ascential™ is a trademark of Ascential Software Corporation and may be registered in other jurisdictions. The following are trademarks of Ascential Software Corporation, one or more of which may be registered in the U.S. or other jurisdictions: Axielle™, DataStage®, MetaBrokers™, MetaHub™, XML Pack™, Extract Pack™ and Load PACK™. Other trademarks are the property of the owners of those marks.
WP-3002-0102 Printed in USA 1/02. All information is as of January 2002 and is subject to change.