DataStage Best Practices
CONTENTS
1. INTRODUCTION
1.1 OBJECTIVE
1.2 REFERENCES
1.3 AUDIENCE
1.4 DOCUMENT USAGE
2. DATASTAGE OVERVIEW
9.5.1 In-line Notification of Rejects
9.5.2 Cross Functional Notification of Rejects
10. ENVIRONMENT
10.1 DEFAULT ENVIRONMENT VARIABLES STANDARDS
10.2 JOB PARAMETER FILE STANDARDS
10.3 DIRECTORY PATH PARAMETERS
10.4 DEFAULT DIRECTORY PATH PARAMETERS
10.5 DIRECTORY & DATASET NAMING STANDARDS
10.5.1 Functional Area Input Files
10.5.2 Functional Area Output Tables
10.5.3 Functional Area Staging Tables
10.5.4 Internal Module Tables
10.5.5 Datasets Produced from Import Processing
11. METADATA MANAGEMENT
11.1 SOURCE AND TARGET METADATA
11.2 INTERNAL METADATA
16.13 jst_unload
16.14 jbt_abort_threshold
1. INTRODUCTION
1.1 Objective
This document will serve as a source of standards for use of the DataStage software as
employed by the Dummy Transformation project.
The standards described below must be followed by all developers. It is understood that this
document, while setting the standards, cannot cover every development scenario. In such
cases, the developer must contact the appropriate authority to seek clarification and ensure
that such missing items are subsequently added to this document.
It will therefore be an evolving document, updated continually to reflect the changing needs
and thoughts of the development team, and hence continue to represent best practices as the
project progresses.
An initial review and sign-off process will therefore be followed within this context.
It will be referenced by developers initially for familiarisation and as required during the course
of the project. Use of the document will therefore reduce over time as developers become
familiar with the practices described. The Offshore Build Manager will maintain the document
(in collaboration with the development team – through weekly developer meetings) and will be
responsible for distributing the document to developers (and explaining its content) initially and
after updates have been applied, ensuring that the standards it describes are communicated
and understood. Such communication will highlight the areas of change.
The best practices will also form the basis for QA and peer testing within the development
environment.
2. DATASTAGE OVERVIEW
DataStage is a powerful Extraction, Transformation, and Loading tool. DataStage has the
following features to aid the design and processing:
- Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements
- Extracts data from any number or type of database
- Handles all the metadata definitions required to define your data warehouse or migration. You can view and modify the table definitions at any point during the design of your application
- Aggregates data. You can modify SQL SELECT statements used to extract data
- Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
3. DATASTAGE DEVELOPMENT WORKFLOW
Within DataStage, a project is the entity in which all related material to a development is stored
and organised.
Development will use three projects through which all code will move: Dummy_Dev, Version and
Dummy_Promo. Developers will develop code in the Dummy_Dev project and, after unit testing,
promote it to the Version project, where version control is managed. After the code is base-lined,
the DataStage administrator will collate all code in the Dummy_Promo project, from where the
DMCoE will move it for unit and end-to-end testing on the Test server. Finally, the code will be
moved by the DMCoE to production. Please refer to the Dummy Transform Code Migration
Strategy document for further details.
[Figure: DataStage development workflow - code created under the developer role is promoted through the Version DS project by the DataStage administrator and then, as an onshore DMCoE activity, moved on to the production server (Ranch_Prod).]
4. DATASTAGE JOB DESIGN CONSIDERATIONS
[Figure: DataStage job design overview - a scheduler process delivers Hogan and non-Hogan extract data, provided as complex flat files, into the DataStage environment along with staging and lookup data; target data is produced as flat files. The import job checks for zero-byte files, validates header and trailer details, reads the file in its specific format and creates output datasets of records to be processed.]
Datasets created by import jobs will be processed by transform jobs. A transform job will join two or
more datasets and look up data as per the functional design specification. Finally the records will be
split as per the destination file design and a destination dataset will be created. All data errors will be
captured in an exception log for future reference.
[Figure: transform job - the driving data flow is joined to and looked up against other datasets, then transformed and split; records failing the join or a lookup, and data held for future jobs, are captured separately, while exceptions resulting from data inconsistencies are written to the exception log during job execution.]
5. USE OF STAGES
A brief description as to when to use these stages is provided in the following table:
The Lookup stage is most appropriate when the reference data for all lookup stages in a job is
small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of physical memory. If the datasets are larger than available resources, the JOIN or
MERGE stage should be used.
Commonly used aggregation functions include:
- Count
- Sum
- Mean
- Min / Max.
Several others are available to process business logic, however it is most likely that
aggregations will be used as part of a calculation to determine the number of rows in an output
table for inclusion in header and footer records for unload files.
5.2 Sorting
There are two options for sorting data within a job, either on the input properties page of many
stages (a simple sort) or using the explicit Sort stage. The explicit Sort stage has additional
properties, such as the ability to generate a key change column and to specify the memory usage
of the stage.
5.3 Data Manipulation
The Transformer rejects NULL derivation results because the rules for arithmetic and string
handling of NULL values are by definition undefined. For this reason, always test for null values
before using a column in an expression, for example:
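A representative test, using an illustrative column name and substitute value, would be:
If IsNull(DSLink1.col1) Then "UNKNOWN" Else DSLink1.col1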
For example, TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the
incoming column must be type VarChar before it is evaluated in the Transformer.
The stage variables and the columns within a link are evaluated in the order in which they are
displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in
which they are displayed.
From this sequence, it can be seen that there are certain constructs that will be inefficient to
include in output column derivations, as they will be evaluated once for every output column that
uses them. Such constructs are:
Where the same part of an expression is used in multiple column derivations
For example, suppose multiple columns in output links want to use the same substring of an
input column, then the following test may appear in a number of output column derivations:
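Such a test might look like the following (the comparison and result values are illustrative only):
If DSLink1.col1[1,3] = "001" Then "A" Else "B"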
In this case, the substring DSLink1.col1[1,3] is evaluated for each column that uses it.
This can be made more efficient by moving the substring calculation into a stage variable. By
doing this, the substring is evaluated just once for every input row. In this case, the stage
variable definition will be:
DSLINK1.col1[1,3]
This example could be improved further by also moving the string comparison into the stage
variable. The stage variable will be:
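For instance, with the same illustrative comparison value:
DSLink1.col1[1,3] = "001"
Each output column derivation can then simply test the stage variable, which holds the true/false result of the comparison.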
This reduces both the number of substring functions evaluated and string comparisons made in
the Transformer.
For example, a column definition may include a function call that returns a constant value, such
as:
Str(" ",20)
This returns a string of 20 spaces. In this case, the function will be evaluated every time the
column derivation is evaluated. It will be more efficient to calculate the constant value just once
for the whole Transformer.
This can be achieved using stage variables. This function could be moved into a stage variable
derivation. However in this case, the function will still be evaluated once for every input row.
The solution here is to move the function evaluation into the initial value of a stage variable.
A stage variable can be assigned an initial value from the Stage Properties dialog/Variables tab
in the Transformer stage editor. In this case, the variable will have its initial value set to:
Str(" ",20)
You will then leave the derivation of the stage variable on the main Transformer page empty.
Any expression that previously used this function will be changed to use the stage variable
instead.
The initial value of the stage variable is evaluated just once, before any input rows are
processed. Then, because the derivation expression of the stage variable is empty, it is not re-
evaluated for each input row. Therefore, its value for the whole of the Transformer processing
remains unchanged from the initial value.
In addition to a function call returning a constant value, another example would be a constant part
of an expression such as:
"abc" : "def"
As with the function call example, this concatenation is evaluated every time the column
derivation is evaluated. Since the subpart of the expression is actually constant, this constant
part of the expression could again be moved into a stage variable, using the initial value setting
to perform the concatenation just once.
A further example involves an unnecessary data type conversion within an expression, such as:
DSLink1.col1+"1"
In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it
must be converted from a string to an integer each time the expression is evaluated. The
solution in this case is just to change the constant from a string to an integer:
DSLink1.col1+1
In this example, if DSLINK1.col1 were a string field, then a conversion will be required every
time the expression is evaluated. If this just appeared once in one output column expression,
this will be fine. However, if an input column is used in more than one expression, where it
requires the same type conversion in each expression, it will be more efficient to use a stage
variable to perform the conversion once. In this case, you will create, for example, an integer
stage variable, specify its derivation to be DSLINK1.col1, and then use the stage variable in
place of DSLink1.col1, where that conversion would have been required.
It should be noted that when using stage variables to evaluate parts of expressions, the data
type of the stage variable should be set correctly for that context, otherwise needless
conversions are required wherever that variable is used.
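For example (the stage variable name is illustrative):
svCol1 (Integer), derivation: DSLink1.col1
Output derivations then use svCol1 + 1 in place of DSLink1.col1 + 1, so the string-to-integer conversion happens once per row rather than once per output expression.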
6. GUI STANDARDS
Job Description Fields – the description annotation is mandatory for each job. Note that the
description annotation updates the job short description.
The full description should include the job version number, developer name, date and a brief
reference to the design document including the version number the job has been coded up to,
plus the main job annotation and any modifications to the job. Where the job has not yet entered
Version Control, the initial version should be referred to as 0.1.
When using DataStage Version Control, the Full Description field in job properties is also used
by DS Version control to append revision history. This is packaged and maintained with the job
and will be visible when the jobs are deployed to test, promo and production. It does not stop
developers from using Full Description as a method of maintaining the relevant documentation,
but information maintained by the developer will get appended to by the Version Control tool.
Naming conventions must be enforced on links, transforms and source and target files.
Annotations are also used to further describe the functionality of jobs and stages.
Two types of annotation, a blue job description (description annotation) and a yellow operator
specific description (standard annotation), are used. The detailed description is also updated
automatically by the DataStage Version Control process following the first initialization into
Version Control.
Entries put in the detailed description by Version Control must not be modified manually.
8. RUNTIME COLUMN PROPAGATION (RCP)
One of the aims/benefits of RCP is to enable jobs that have variable metadata that is
determined at run time. An example would be a generic job that reads a flat file and stores the
data in a dataset, where the file name itself is a job parameter. In this case it is not possible to
determine the column definitions during the build.
Conversely, a feature that sometimes confuses developers is that, in jobs where RCP is not
desired but is switched on, additional columns can appear in the output dataset that the
developer may have thought were dropped.
For these reasons developers must turn off RCP within each job unless the feature is explicitly
required in the job, as in the example above. In any event, RCP should be enabled within the
Project Properties (providing the flexibility to use RCP at job level), and where RCP is required it
can be turned on at job / stage level. An annotation should make this clear on the job.
9. REJECT HANDLING
Reject handling must make it possible to:
- locate the rejection message and understand the format of the message
- locate and diagnose the reason for rejections
- set tolerances for the number of rejects permitted
- allow for the re-processing of rejected rows.
Reject processing is not provided as standard within DataStage Enterprise (Parallel) across the
majority of stages. There is a reject link on the Lookup stage; however, a standard approach
must be introduced for the remaining stages and adopted across all stages.
This will be achieved by the introduction of a bespoke element (in the form of example stages
within template jobs) and through the use of a standardised reject component made available to
all developers via a DataStage wrapper.
[Figure: example job stages - records to be processed are looked up against a lookup dataset; data is held for future jobs, and rejected rows are passed via a Transform stage to the bespoke reject component, which receives the Address, Card, Customer and Account keys together with a standard reject message.]
All stages (where a row might be rejected) must include a reject link. Three such stages are
shown in the diagram (i.e. Join, Lookup and Transform). In the example above, the Lookup
stage is shown with a reject link, though this is just as applicable to Join, Transform and other
stages. For instance, data flowing down the reject link from a Lookup or Join stage might result
from an inability to match keys, while rejects from a Transform stage might result from the
validation of data items, for instance where an unexpected value or a null is encountered. In
each of these cases, the rejected row is passed down a reject link to a bespoke component that:
1. Passes the row to a dataset in order to facilitate the re-processing of the rejected rows
2. Identifies the key of the rejected row and passes this down the relevant link (depending
on the key type) to the standardised reject handling component. Where there is no key,
i.e. a file is empty or there is a mismatch between the number of rows read and the
information provided on the footer record, zeros are passed down all links intended for
key information
3. Compiles and passes a standard message (see table below) describing the rejection to
the standardised reject handling component.
This approach assumes that a key uniquely identifying each failing row is present on driving
flows.
The standardised reject component takes two inputs (over a possible five input links) and
creates a surrogate key, uniquely defining each reject and writes the message along with the
two keys to a dataset. This reject dataset therefore holds the key from the rejected row (that
can be used to cross reference to the dataset of rejected rows) and a message that will help
identify the reason for the rejection.
The paths to which reject datasets are written are automatically date stamped within a common
reject and log directory. Reject datasets are uniquely named and created each time the module
runs (see below).
The reject component will be used with every stage that can fail due to data discrepancies
(e.g. Join and Lookup). The Join stage requires further processing, whilst the error link from the
Lookup stage can be linked directly to the custom error component.
In order to facilitate reject handling within the Join stage, further processing is required. This
processing requirement is shown in the following diagram:
[Figure: additional processing required to capture rejects from the Join stage via its secondary link.]
This component will be made available to all developers for use in reject handling as a job
template.
Developers must intercept rejects in the code they produce and generate a standard reject
message that contains accurate data and relevant information from the record. The job, field
and stage names must be inserted into the message.
A description of rejects and messages should be made available to operational support to help
diagnose problems encountered when running the batch.
The reject limit will be a variable between 0 and 99. A reject limit of 0 (zero) means ABORT ON
FIRST REJECT, whilst a reject limit of 99 means NEVER ABORT (on reject).
This allows central control of the level of rejects allowed across all modules and jobs used in the
Dummy batch.
9.5 Notifications
A notification is the method by which:
- operations are informed of a reject, i.e. in-line notifications
- rejects are communicated between functional streams and/or retained to support the re-running of modules, i.e. cross functional notifications.
These are described in the following sections.
9.5.2 Cross Functional Notification of Rejects
This type of notification is the means by which rejects are communicated between functional
streams. This ‘communication’ is built around the feedback from the load process (prompting a
rerun between migration steps, i.e. T-14 to T) and an understanding of the dependencies
between functional areas, i.e. transactions being dependent on accounts, etc. In this instance,
rejected accounts will be incorporated into the transaction processing, therefore limiting the
transactions processed to those where an account has also been successfully processed.
10. ENVIRONMENT
The following DataStage Environment variables must exist in all jobs. (Note that DataStage
Environment Variables are different to Standard Parameters)
The template job, held under Users/Template in DataStage, will have these parameters
defined.
10.5 Directory & Dataset naming standards
UNIX directory paths are set using the following convention, based on the parameters defined
above. Note the final subdirectories (i.e. “Deliver” and “Internal”) are hard coded in the jobs.
This is fine because if the developer mistypes the value the job will fail immediately as the
mistyped directory will not exist.
10.5.1 Functional Area Input Files
Source files will be pushed by the Extract system to the ETL server into a holding area ‘Hold’ via
Connect:Direct software.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Hold/<source_file_name>
10.5.2 Functional Area Output Tables
Datasets that are defined in the Detailed Design as output tables for a functional area are stored
in a “Product” directory. This is the directory that downstream Functional Areas (including the
Unload process) will go to find input tables from previous areas.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Product/<datasetname>.ds
10.5.3 Functional Area Staging Tables
Datasets that are defined in the Detailed Design as staging tables within an area are stored in a
“Staging” directory. This is the directory that other modules within the same Functional Area will
go to find staging tables from previous modules.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Staging/<datasetname>.ds
10.5.4 Internal Module Tables
Datasets produced within a module and used only internally within that module will be stored in
an “Internal” directory. Datasets in this directory are only used within jobs.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Internal/<datasetname>.ds
10.5.5 Datasets Produced from Import Processing
Datasets that are produced by Pre-Processing are stored in a “Source” directory. This is the
directory that Functional Areas will go to find input tables from the source.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Source/<datasetname>.ds
Reference Datasets that are produced by Import Processing are stored in a “Reference”
directory. Reference data is not split into iterations.
#pDSPATH#/Reference/<datasetname>.ds
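As a worked illustration of the convention (the parameter values and dataset name below are hypothetical, not actual project settings), a functional area output table defined as
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Product/<datasetname>.ds
might resolve at run time to
/wload/dqad/app/data/Dummy_dev/itr01/run001/Product/accounts.ds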
11. METADATA MANAGEMENT
11.1 Source and Target Metadata
Record formats will have been pre-defined within the DataStage Repository describing the
record formats of files that form inputs to import jobs and outputs from unload jobs. This
metadata will therefore only be used by import and unload jobs.
These record formats are for the convenience of developers (they are described in the FDs and
are therefore fixed) and help maintain consistency in the way data is interpreted across all jobs
(define once, use many times), therefore having a positive impact on quality.
Should a change be required to this metadata, it should first be impact-assessed to determine
the potential effect of the change on jobs that use the metadata, and then processed through
standard change control.
This metadata will define the outputs of import jobs, be used by all transform jobs and define
the inputs to unload jobs, and must be stored in the repository with a name that matches the
name of the dataset it describes.
Should it be necessary or more efficient to process data in a different way from the way it is
presented within the pre-defined metadata, developers may create a job specific version of the
metadata which must be clearly identified as a variant on the original and saved within the
repository.
The use of templates and common components will:
- Increase the quality of the code, since the most optimal method will be used for a function which is to be achieved in multiple jobs
- Promote reuse; productivity is increased and developers can spend more time on tasks which are specific to individual jobs
- Reduce the complexity of common tasks.
DataStage allows a developer to:
- Create a template for a server or parallel job. This can subsequently be used to create new jobs; new jobs will be copies of the original job
- Create a new job from a previously created template
- Create a simple parallel data migration job, which extracts data from a source and writes it to a target.
Not only will the use of templates help in standardization, it will also create reusable components
which need not be coded again. Certain elements will be common to many jobs, namely
parameters, annotations and reject handling, which can be implemented through the use of
templates.
The Dummy project will have templates, each of which is a job with stages that follow the
naming standards. These template jobs will assist developers in developing new jobs according
to the standards described.
12.1.1 Import Jobs
Each source file will be read into persistent datasets by separate jobs called import jobs. These
jobs will perform sanity checks on the received file, e.g. that the data file is not empty and that
header and trailer details are consistent with the file properties. In the Dummy project the same
files are used repeatedly by different functionality, so each file will be read only once to create
DataStage datasets. These datasets will then be used by the respective functionality. Since the
logic for importing and validating files will be the same, we will build and test one such job and
use this architecture for the rest.
The files that are used in multiple instances are described below:
12.2 Containers
A container is a group of stages and links. Containers simplify and modularize server job
designs by replacing complex areas of the diagram with a single container stage. DataStage
provides two types of container:
Local containers. These are created within a job and are only accessible by that job. A
local container is edited in a tabbed page of the job’s Diagram window. Local containers
can be used in server jobs or parallel jobs.
Shared containers. These are created separately and are stored in the Repository in the
same way as other jobs. There are two types of shared container:
o Server shared containers are used in server jobs. They can also be used in
parallel jobs, though this can cause bottlenecks in processing as they are serial
only and should be avoided if possible
o Parallel shared container is used in parallel jobs.
You can also include server shared containers in parallel jobs as a way of incorporating server
job functionality into a parallel stage (for example, you could use one to make a server plug-in
stage available to a parallel job).
Containers are the means by which standard DataStage processes are captured and made
available to many users. They are used just as a developer would use a standard stage. Some
work needs to be done to identify opportunities for reuse within the overall design; however,
once identified, reusable components will be built and delivered into the DataStage repository
as shared components.
Identified containers in Dummy transform project are described in the table below:
13. DEBUGGING A JOB
The following techniques will assist when debugging a job. Debugging essentially involves
viewing the data in order to isolate the fault. There are a number of techniques including:
- Adding a Peek stage, which will output certain rows to the job log
- Adding a filter to the start of the job to filter out all rows except those with the attributes that the developer wishes to test or debug
- Adding an additional output to a Transformer with the relevant constraints and storing the data in a sequential file to be used as part of the investigation; the use of the Copy stage would also be an option
- A variant of the above is to add a parameter pDEBUG with a value of 1 or 0 that is used as part of the constraint, so that the resulting debug sequential file only contains data when pDEBUG=1 (see the example after this list).
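A minimal sketch of such a debug constraint on the additional Transformer output link is shown below; the link, column and test value are illustrative only:
pDEBUG = 1 And lnk_in.sort_code = "123456"
With this constraint, the debug sequential file receives rows only when the job is run with pDEBUG set to 1.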
All changes to code made for debugging (including peeks, extra stages and extra parameters)
must be removed prior to final unit test. Final unit testing must occur on the exact version of
code that is to be promoted to Integration Test.
In order to ensure trouble free scaling, jobs are built 1-way and unit tested 1-way and n-way.
This ensures that there has been no functional impact in making the switch to parallel
processing. Jobs will run n-way when live in order to achieve the benefits of parallel processing
provided by DataStage Enterprise.
Problems to do with scaling usually become evident when comparing record counts between 1-
way and n-way runs. Clearly, these counts and the physical records involved should be the
same. If there is a difference, the reasons for this must be examined and corrected.
There are many possible reasons for variations in record counts, for instance:
- One of the most common arises when the Join, Lookup and Merge stages (and others) are used. In these situations care must be taken to ensure that incoming data streams are not only sorted but partitioned the same way. If not, join conditions may not be met because records (with keys that would otherwise match 1-way) sit in different partitions and therefore go unmatched. In these situations, records may be unnecessarily rejected (either down a reject link or omitted altogether) and will therefore not flow down the main output link to subsequent stages or into an output dataset, causing a variation between the actual rows processed and the anticipated number
- It should be ensured that a dataset used as input on the lookup link to a Lookup stage is partitioned as Entire, so that the entire dataset is available for lookup across all partitions of the main input link; otherwise the lookup may fail simply because the dataset was partitioned incorrectly for the lookup
- An incoming dataset may have been created by another job or module, which may also have been written by another developer. In this case it might contain the required data, but may not be correctly partitioned for the needs of your job. Good practice, unless you can be absolutely sure that the datasets you are using are partitioned correctly for your needs, is therefore to repartition at the start of a job. This might be less efficient, but more effective in terms of retaining control over your jobs and the quality of the output data flows. Where possible, partitioning will be considered within the overall solution design, therefore minimising the need for repartitioning.
Configuration files are provided for 1-way and 4-way running on the Development server, with 1-
way being the default. 4-way processing is specified at job level as an override. The developer
must ensure that overrides are removed from their jobs prior to promotion to the Test server.
Duplicates will often be identified when the output data (from a DataStage job), perhaps in the
form of a flat file, is loaded into a target database table. This load process will most likely fail if
there are duplicate keys in the data, particularly if the target table is uniquely keyed.
Another sign that there may be duplicates in the data is when the output of a job or stage (within
a job) has more rows in the output stream than would have been thought possible from the
inputs.
For these reasons, care must be taken at the unit test stage and it is always a good idea to have
a general understanding of the anticipated throughput of a job before starting the build.
The key to solving problems related to duplicates is to understand how the duplicates are being
generated. Here are some examples:
- An incoming data stream, i.e. a data source or internal dataset (for instance the source system itself, or the output from another job or module); in other words the problem may be inherited, and a more extensive search may be required in order to find it. If the problem lies with the source system, it may need to be raised as a data quality issue and corrected at source
- A 1-way/n-way issue. Scaling from 1-way to n-way processing will often cause problems. Essentially this is because when running with a single node, all data flows through a single partition (where processing rules apply to all the data), usually giving correct results. Running with multiple nodes means that partitioning comes into play and issues therefore arise from applying processing rules across multiple partitions. This effect may be desirable, however in many cases it can also lead to incorrect results. For instance, if a job is generating a unique key column, the same key may be generated across all partitions and therefore duplicated when the data is collected for output. A sign that this is the case is a final record count that is the single-node record count multiplied by the number of nodes. To avoid this kind of issue, a stage can be forced to run sequentially (though this may become a bottleneck) or, alternatively, particularly when defining keys, the partition number can be built into the algorithm for generating the key, therefore ensuring uniqueness across partitions (see the sketch after this list)
- A Cartesian join.
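A common way of building the partition number into a generated key within a parallel Transformer is sketched below. This is a general technique rather than a project-specific standard, and the stage variable name is illustrative:
svUniqueKey derivation: (@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1
Because keys generated within a given partition always leave the same remainder when divided by the number of partitions, keys from different partitions can never collide.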
Since DataStage Enterprise (DataStage) starts one Unix process per node (nodes are defined
in the configuration file and can be thought of as a logical processor) per stage, the effective use
of available processors and to an extent the total memory usage is determined by the operating
system rather than DataStage, though generally the more resource (processors and memory)
the better.
Clearly, this can lead to an explosion of running processes, with the operating system eventually
spending more time managing than executing code, having a detrimental effect on performance.
The key is to run a number of performance tests to determine the optimum number of nodes. A
starting point will usually be around 50% of actual CPUs.
Within DataStage, the optimum use of parallel (partitioned and piped) data streams is clearly
essential, as is the appropriate use of stages within jobs and the elimination of unnecessary
repartitioning and sorting.
As a general rule of thumb, incoming data streams should be partitioned and sorted as far up
stream as possible and maintained for as long as possible. Partitioning and sorting will take
considerable amounts of time during job execution, so where possible these activities should be
minimized. The sort order of the data within a partition in a data stream will be maintained
throughout a job, even when included as an input link to sort dependent stages such as Dedupe
and Join. It is always tempting to sort on the input links of these stages, however this is
completely unnecessary (providing the data is in the correct order already) and time consuming.
Similarly, it is also tempting to repartition on the input links of stages when specifying Same will
suffice (again, providing the data is correctly partitioned already).
Within DataStage, the Transform stage was inherited from the DataStage Server product and is
less efficient than other native Parallel stages. The jury is out as far as the use of Transform is
concerned, with arguments for and against. For users of DataStage Server it will be familiar
and easy to use, read and maintain. The native Modify stage is a good alternative but is not
consistent with the user interface implemented for other stages, though Transform also differs
slightly. Common sense is the key: too many Transforms will slow your jobs down, and in this
situation Modify should be considered for simple type conversions. Using several Transforms in
sequence is also undesirable; quite often they will ‘look’ good but could be combined, therefore
reducing the overhead.
Finally, the Lookup stage: This stage differs from Merge and Join in that it requires the whole of
the lookup dataset to be held in memory. The upper limit is large, though this needs to be
considered in the context of the total memory available and what else will be running at the time.
Total memory usage will be hard to estimate and will be best left until a point when the runtime
batch has been designed and run – be prepared to increase memory and split jobs if the usage
is too great.
Likewise, to improve runtimes, be prepared to add further processors to facilitate scaling.
- Ensure that a job does not become overly complex. If there are too many stages (more than 10) in a job, divide it into two or more jobs on a functional basis.
- Use containers where stages in the jobs can be grouped together.
- Use annotations to describe the steps performed at each stage. Use a Description Annotation as the job title, as the Description Annotation also appears in Job properties > Short Job Description and in the Job Report when generated.
- When using string functions on a decimal, always use the Trim function, as string functions otherwise interpret an extra space used for the sign of the decimal.
- When you need to get a substring (e.g. the first 2 characters from the left) of a character field, use <Field Name>[1,2]. Similarly, for a decimal field, use Trim(<Field Name>)[1,2].
- Always use Hash partitioning in Join and Aggregator stages. The hash key should be the same as the key used to join/aggregate.
- If Join/Aggregator stages do not produce the desired results, try running in sequential mode (verify the results; if they are still incorrect the problem is with the data/logic) and then run in parallel using Hash partitioning.
- Use the Column Generator stage to create sequence numbers or to add columns with hard-coded values.
- In job sequences, always use the “Reset if required, then run” option in Job Activity stages. (Note: this is not a default option.)
- When mapping a decimal field to a char field or vice versa, it is always better to convert the value in the field using the ‘Type Conversion’ functions “DecimalToString” or “StringToDecimal” as applicable while mapping (see the example after this list).
- The “Clean-up on failure” property of Sequential File stages must be enabled (it is enabled by default).
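As an illustration of the substring and type conversion guidance above (the column names used here are hypothetical):
Trim(lnk_in.balance_dec)[1,2]            substring of a decimal field
DecimalToString(lnk_in.balance_dec)      mapping a decimal field to a char field
StringToDecimal(lnk_in.amount_str)       mapping a char field to a decimal field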
15. REPOSITORY STRUCTURE
The DataStage repository is the resource available to developers that helps organise the
components they are developing or using within their development. It consists of metadata,
i.e. table definitions, the jobs themselves, and specific routines and shared containers.
The anticipated repository structure is described in the following sections. However, the
structure may change during development, usually evolving into its most usable form.
- Import Jobs: import jobs will be the starting point for the transformation. Sanity checks on the file and validation of external properties (e.g. size) will be done here. The source file will be read into datasets as per the source record layout. An exception log will be created containing records that do not follow the file layout. The source data will then be filtered to identify the records to process, and unprocessed data will be retained in a dataset for future reference. Finally one or more datasets will be created which will be the input to the actual transform process.
- Transform Jobs: datasets created by import jobs will be processed by the actual transform job. The transform will join two or more datasets and look up data as per the given functionality. Finally the records will be split as per the destination file and a destination dataset will be created. All data errors will be captured in an exception log for future reference.
- Unload Jobs: unload jobs will take transform datasets as a source and create the final files required by the load team in the given format.
- Source/Target Flat-files: the source and target files are included in this category. These files will be converted into datasets by DataStage jobs and then, after the transformation process is complete, converted back to target flat files.
- Datasets: datasets are used as intermediate storage for the various processes. A dataset can store data being operated on in a persistent form, which can then be used by other DataStage jobs. Datasets can be either sequential or parallel. These datasets will be created from the external data by the ‘Import’ job and whenever intermediate datasets need to be created for further single/multiple jobs to process.
15.3 Routines
Before and after routines (should they be needed) will be described here.
16.1 jbt_sc_join
jbt_sc_join is a common component built to meet a specific requirement in the Dummy project:
to capture 3 types of records from a Join stage, whereas the DataStage Join stage offers just
2 outputs.
For example, take file A (master) and file B (child).
The Join stage of DataStage will give 2 outputs in this case:
- A + B (join records)
- A not in B (reject records)
The common component jbt_sc_join will give 3 outputs in this case:
- A + B (join records)
- A not in B (reject records)
- B not in A (non-join records)
This functionality is illustrated in the flow diagram below:
[Figure: jbt_sc_join flow - File ‘A’ (master) and File ‘B’ (child) are fed via links lnk_A and lnk_B into the join stage jn_A_B; the output links carry the joined records (A + B), the reject records (A not in B) and the non-join records (B not in A).]
16.2 jbt_sc_srt_cd_lkp
Sort code lookup is functionality required in many places (in various FDs in Dummy), so a
common component providing it has been built.
It takes a file as input and divides it into 2 files, for north and south respectively.
[Figure: jbt_sc_srt_cd_lkp flow - the input file ‘A’ is passed via link lnk_A into the sort code lookup component sc_srt_cd_lkp, which writes its output to the lnk_A_north and lnk_A_south links.]
16.3 jbt_env_var
This is a template job with the commonly used environment variables imported. It can be used
for all jobs being developed with this set of common environment variables, rather than
importing them again and again.
These Environment variables are as shown below:
$ADTFILEDIR: This would contain the Audit file and reconciliation reports.
$BASEDIR: This folder is the base directory.
$DSEESCHEMADIR: DSEE Schemas that are used by EE jobs using RCP/schema files.
$ITERATION: Current Iteration number
$JOBLOGDIR: This would contain all the Error log files generated in DataStage jobs.
$PARMFILEDIR: This folder will contain parameter files that will be looked up by jobs/routines
triggered from a common parameter file. These parameter values will be set as per the
development environment.
$REJFILEDIR: This would contain all the reject files generated in DataStage jobs.
$SCRIPTDIR: This will contain routine UNIX scripts used for processing files, copying, taking
file backup etc.
$SRCDATASET: All the input files will be partitioned and imported into DataStage datasets.
This folder will store all the input datasets.
$SRCFILEDIR: This folder will contain all the input files from the Extract team. All files will be
manually copied into this folder.
$SRCFORMATDIR: This folder will contain the copybook formats for input source files. These
copybook formats are as per functional specifications.
$TMPDATASET: This folder will be used to store all the intermediate files created during
transform job.
$TRGDATASET: This folder will be used for storing output DataStage datasets files.
$TRGFILEDIR: These folders will contain all the transformed output files which can be loaded
to Bank B’s mainframe.
$TRGFORMATDIR: This folder will contain the copybook formats for output source files.
16.4 jbt_annotation
This is a template job in which annotations are used to describe the steps performed at each
stage. A Description Annotation is also used as the job title, as the Description Annotation
appears in Job properties > Short Job Description and in the Job Report when generated.
16.5 Job Log Snapshot
JobLogSnapShot.ksh is a script which will create the log file (as seen in DataStage Director) of a
job's latest run.
The following parameters need to be hard coded in the script as per environment:
DSHOME=/wload/dqad/app/Ascential/DataStage/DSEngine
PROJDIR=/wload/dqad/app/Ascential/DataStage/Projects/Dummy_dev
LOGDIR=/wload/dqad/app/data/Dummy_dev/itr01/errfile
The script will be called from the after job subroutine of a job.
ksh /wload/dqad/app/data/Dummy_dev/com/script/JobLogSnapShot.ksh $1
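A minimal sketch of how such a script might capture the latest log using the dsjob command line is shown below; this is illustrative only, and the actual JobLogSnapShot.ksh may differ.
#!/bin/ksh
# Illustrative sketch only - writes a summary of the latest job log to the log directory.
JOBNAME=$1
DSHOME=/wload/dqad/app/Ascential/DataStage/DSEngine
LOGDIR=/wload/dqad/app/data/Dummy_dev/itr01/errfile
. $DSHOME/dsenv                                        # set up the DataStage environment
$DSHOME/bin/dsjob -logsum Dummy_dev $JOBNAME > $LOGDIR/${JOBNAME}_log.txt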
16.6 Reconciliation Report
Reconcilation.ksh is a script which will create the Reconciliation Report of the respective
functional area (FD).
The script will be called from an Execute Command stage of a Job Sequence.
ksh /wload/dqad/app/data/Dummy_dev/com/script/Reconcilation.ksh $1 $2
16.7 Script template
All scripts are written according to this template script, which has a script description section and
a section for maintaining the modification history of the script.
The script name is /wload/dqad/app/data/Dummy_dev/com/script/ScriptTemplate.ksh
16.8 SplitFile.ksh
ksh /wload/dqad/app/data/Dummy_dev/com/script/SplitFile.ksh $1
This requires the file name to have a .dat extension. The header, detail and trailer files created
will be $1_hdr.dat, $1_det.dat and $1_trl.dat respectively.
The input file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat
All these files ($1_hdr.dat, $1_det.dat and $1_trl.dat) will be output to
/wload/dqad/app/data/Dummy_dev/itr01/opfile/.
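A minimal sketch of how such a split might be implemented is shown below; this is illustrative only, and the actual SplitFile.ksh may differ.
#!/bin/ksh
# Illustrative sketch only - splits <name>.dat into header, detail and trailer files.
DIR=/wload/dqad/app/data/Dummy_dev/itr01/opfile
head -1 $DIR/$1.dat      > $DIR/${1}_hdr.dat    # first record         = header
sed '1d;$d' $DIR/$1.dat  > $DIR/${1}_det.dat    # intermediate records = detail
tail -1 $DIR/$1.dat      > $DIR/${1}_trl.dat    # last record          = trailer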
16.9 Make_File.ksh
ksh /wload/dqad/app/data/Dummy_dev/com/script/Make_File.ksh $1
All these files ($1_hdr.dat, $1_dtl.dat and $1_trl.dat) will have to be present in
/wload/dqad/app/data/Dummy_dev/itr01/opfile/.
The output file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat
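Conversely, a minimal sketch of the concatenation performed by the make-file step (again illustrative only; the actual Make_File.ksh may differ):
#!/bin/ksh
# Illustrative sketch only - rebuilds <name>.dat from its header, detail and trailer parts.
DIR=/wload/dqad/app/data/Dummy_dev/itr01/opfile
cat $DIR/${1}_hdr.dat $DIR/${1}_dtl.dat $DIR/${1}_trl.dat > $DIR/$1.dat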
16.10 jbt_import
This template job processes the header, detail and trailer records created by SplitFile.ksh as
described in 16.8.
The header and trailer data is validated.
The validations done on the header are:
- The file header identifier must contain the value ‘HDR-TDAACCT’
- The file header date must equal the T-14 migration date
The validations done on the trailer are:
- The file trailer identifier must contain the value ‘TRL-TDAACCT’
- The file trailer creation date must equal the file header creation date
- The file trailer record count must equal the total number of records on the input file, including the header and trailer records
- The file trailer record amount must equal the sum of the Closing Balance field from every record on the input file, excluding the header and trailer records. The accumulation of the Closing Balance field must be performed using an integer data format, allowing for overflow.
If any of the above checks fail, processing should be aborted immediately with a relevant fatal
error message. This is implemented using the subroutine AbortOnCall.
Note: these header/trailer validations are for FD01. They will vary slightly for other FDs, but the
common approach shown in this template can be taken.
The detail records are written to a dataset to be processed in the transform job.
16.11 jst_import
This template job sequence calls the following components:
- SplitFile.ksh, as described in 16.8
- jbt_import, as described in 16.10
This sequence template will split the source file into 3 different files (header, detail and trailer)
and call the import job, which will do the necessary validation and create a detail dataset.
16.12 jbt_unload
This template job illustrates the creation of header and trailer records. The trailer consists of a
record count and a hash count.
This template mainly implements the following logic:
16.13 jst_unload
This template job sequence calls the following components:
- jbt_unload, as described in 16.12
- Make_File.ksh, as described in 16.9
- The reconciliation report, as described in 16.6
This sequence template will create 3 different files (header, detail and trailer) and call the script
which will combine these 3 files to create the target file. The reconciliation report is also created.
16.14 jbt_abort_threshold
The Abort Threshold template will abort a job based on a threshold value passed as a job
parameter. It uses a common routine called “AbortOnThreshold”. This routine has to be called
from a BASIC Transformer:
AbortOnThreshold (@INROWNUM, <Threshold Value>, DSJ.ME)
Here <Threshold Value> is the job parameter. For example, if you give a Threshold Value of 5,
the job will abort after 4 records have passed through the BASIC Transformer.
This is used in places where a job needs to be aborted once a particular number of reject
records is reached.
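For example, with a job parameter supplying the threshold (the parameter name pREJECT_THRESHOLD is assumed here for illustration), the derivation would read:
AbortOnThreshold (@INROWNUM, pREJECT_THRESHOLD, DSJ.ME)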