DataStage Handling COBOL Source File

Version: 1.3
Status: Revised Version
Control Sheet

Document Location

This document is located at:

Document Review

Name: Probal Banerjee
Title: Sr Associate
Organization: Cognizant

Document Approval

Name: Aditi Sanyal
Title: ETL Specialist
Organization: Cognizant
Signature: AS
Date: 22nd Sep, 2015

Document Author

Name: Sourav Guha
Title: ETL Developer
Organization: Cognizant
Signature: SG
Date: 30th May, 2015
Version History

Version  Amendment/Reason  Who          Date
1.0                        Sourav Guha  30th May, 2015
1.1                        Sourav Guha
1.2                        Sourav Guha
1.3                        Sourav Guha
Table of Contents

1.0  Introduction
2.0  Context
     Target audience
     Assumptions
     Business Scenario
3.0  Processing the COBOL Source File in DataStage
4.0  Appendix
1.0 Introduction

IBM InfoSphere DataStage is a powerful data integration tool. It is capable of integrating data on demand across multiple, high-volume data sources and target applications using a high-performance parallel framework, and it also facilitates extended metadata management and enterprise connectivity. In simple words, DataStage is an ETL tool: it performs the extraction, transformation and load operations in an EDW framework. The scalable platform provides flexible integration of all types of data, including big data at rest (Hadoop-based) or in motion (stream-based), on distributed and mainframe platforms. Source files may appear in different formats, e.g. ASCII fixed-length text, EBCDIC fixed-length text, hierarchical XML data, or comma-separated (.csv) files, and extracting the data from the source side can be tricky at times.

What if the source is a mainframe-generated COBOL source file that needs to be processed using DataStage? Consider a mainframe-generated, multi-record, fixed-length file that contains different sorts of information in separate records of equal length, e.g. employee basic information in one record and payroll information in another. This is exactly the scenario the rest of this document works through.
2.0 Context

Target audience

This document is intended for DataStage developers looking for simple development guidelines when they encounter a mainframe-generated COBOL source file during job development. Apart from the process flow, they can also gather a few important concepts about such source files from this document.

Assumptions

The following assumptions have been made before proceeding with the context:

- Basic concepts of ETL (Extract, Transform, and Load) are known to the reader.
- Basic stages of IBM DataStage (copy, sort, join, lookup, aggregate etc.) and their operations are known to the reader.
- The COBOL metadata definition file (Info.cfd), which holds the metadata of the source file, is stored on the local system.
Business Scenario

[Process flow diagram: Mainframe system - the source file gets generated. Extraction - DataStage reads data from the source file. Transformation - necessary transformation is applied in the parallel job. Load - data is stored in a dataset and subsequently in permanent storage. Reporting tool - the reporting team handles the report generation.]
The portion highlighted in grey depicts the ETL process flow. This is the second layer of the project, and the data processed in this layer is made available to the Reporting layer at the end of the flow. The Reporting layer is the final layer, which generates the necessary reports for the end user.
1.1.1 COBOL File Layout

Consider the following fragment of a COBOL copybook:

11  NAME.
    13  LASTNAME     PIC X(0005).
    13  FIRSTNAME    PIC X(0005).
Notice that NAME does not have a type (PIC clause) of its own, which signifies that it is a Group, not a Field. A Group can contain both another sub-group and Fields. The level numbers identify which group or sub-group a field or sub-group belongs to.
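As a small additional sketch (EMPLOYEE and EMPID are assumed names here, not taken from the scenario's copybook), a Group containing both a sub-group and a Field:

11  EMPLOYEE.
    13  NAME.
        15  LASTNAME     PIC X(0005).
        15  FIRSTNAME    PIC X(0005).
    13  EMPID            PIC X(0005).

Here EMPLOYEE and NAME are Groups, since neither has a PIC of its own, while LASTNAME, FIRSTNAME and EMPID are Fields.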
REDEFINES Clause

The REDEFINES clause allows the same memory area/block to be described by different data items. If two or more data items are never used simultaneously, i.e. if we are sure their values will not be in use at the same time, we can use the REDEFINES clause. It is basically done to save and reuse memory blocks. In the example given below, NAME and OFFICIAL_NAME are two data items that make use of the REDEFINES clause.
11  NAME.
    13  LASTNAME     PIC X(0005).
    13  FIRSTNAME    PIC X(0005).
11  OFFICIAL_NAME REDEFINES NAME    PIC X(0010).
There are, however, some rules for using this clause. Make a note of them:

- The redefining item must have the same level number as the item it redefines.
- The REDEFINES entry must immediately follow the item it redefines.
- A level 01 entry in the FILE SECTION cannot carry a REDEFINES clause.
- The redefining item should not contain a VALUE clause.
OCCURS Clause

Suppose a record has to hold a sales figure for each of the 12 months. One way is to declare twelve separate fields:

11  MONTHLY_SALES_1     PIC 9(0020).
11  MONTHLY_SALES_2     PIC 9(0020).
...
11  MONTHLY_SALES_12    PIC 9(0020).
But there is an easier way in COBOL: the field can be specified once and declared to repeat 12 times. This is done with the OCCURS clause, like this:

11  MONTHLY_SALES OCCURS 12 TIMES    PIC 9(0020).
This specifies 12 fields, all of which have the same PIC, and is called a table (also called an array). The individual fields are referenced in COBOL by using subscripts, such as MONTHLY_SALES (1).
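For instance, a one-line sketch of reading the table with a subscript (WS-TOTAL is an assumed working-storage field, not part of the declaration above):

ADD MONTHLY_SALES (3) TO WS-TOTAL.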
OCCURS DEPENDING ON is an OCCURS, as above, except that the number of times the field occurs can vary from record to record (between some limits). The number of times it actually occurs in any particular record is given by a value in another field of that record. This creates records that vary in size from record to record. E.g.

11  VALID_MONTHS                        PIC 9(0005).
11  MONTHLY_SALES OCCURS 0 TO 12 TIMES
        DEPENDING ON VALID_MONTHS       PIC 9(0020).
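For example, if VALID_MONTHS holds 3 in a given record, only MONTHLY_SALES (1) to MONTHLY_SALES (3) are present, so that record occupies 5 + 3 × 20 = 65 bytes instead of the maximum 5 + 12 × 20 = 245 bytes.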
Apart from getting familiar with the COBOL layout, there are a few more things the reader should have at least a brief idea about before diving into the given business scenario and how to solve it with DataStage.
1.1.2 Mainframe Access Methods
An access method (Type) defines the technique that is used to store and
retrieve data. Access methods have their own data set structures to organize
data, system-provided programs (or macros) to define data sets, and utility
programs to process data sets.
There are times when an access method identified with one organization can be used to process a data set organized in a different manner. For example, UNIX files can be processed using BSAM, QSAM, the basic partitioned access method (BPAM), or the virtual storage access method (VSAM).
Normally, to read a COBOL source file, the mainframe access type should be set to QSAM_SEQ_COMPLEX in DataStage. Quite a few other access types are available; we pick the file access type based on the business requirements. For more information, refer to the Appendix.
1.1.3 Single and Multiple Record Types

COBOL-generated data files can be of either single record type or multiple record types. In a single record type file, all records have the same structure; this is what we normally see when we use a sequential file as a source. In real life, however, this may not be the case: there can be multiple record types in a single file. In our case, although the records for employee basic information and payroll information arrive in a single file, they naturally have different structures.
To describe multiple record types, more than one record description is used in the file's identifier entry. Since record descriptions begin with a 01 level entry, a 01 level entry is specified for each record type.
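For our business scenario, the copybook might look broadly like the sketch below. The identifier field DJY6001 is taken from the sections that follow; every other field name and length here is an illustrative assumption, not the actual content of Info.cfd:

01  EMP-REC.
    05  DJY6001          PIC X(0001).
    05  EMPID            PIC X(0005).
    05  NAME             PIC X(0010).
    05  ADDRESS          PIC X(0020).
01  SAL-REC.
    05  DJY6001          PIC X(0001).
    05  EMPID            PIC X(0005).
    05  SALARY           PIC 9(0010).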
1.1.4 Master and Detail Records
The relationship between the master and detail records is inherent only in the physical record order: payroll records correspond to the employee record they follow. However, if this is the only means of relating detail records to their masters, the relationship is lost when the ETL tool loads each record into its target separately. For an idea of how to keep the relationship maintained, please refer to the Appendix.
3.0 Processing the COBOL Source File in DataStage

Importing the COBOL File Definition

- In DataStage Designer, click Import > Table Definitions and then select COBOL File Definitions.
- A list of all the tables will appear in the Tables section. If not, the copybook might have an error; refer to the FAQs.
- In the To Folder box, specify the location where the table definition is to be stored. In our example we will store it in Table Definitions > COBOL FD > Info.
- Select all the tables and click Import. If there is single record metadata, only one table definition will be imported.
- In the Start position field of the metadata import page, we have to specify the position of level 01 in the copybook. For multiple record types, each of the table definitions should have level 01 specified at the same position; otherwise the metadata import may remain incomplete or may fail. In our example it is 15, but this can differ for other copybooks.
FAQ2: What if I don't have a level 01 specified in the metadata available?
Open the EMP file definition. On the General tab, we can find basic information about the table definition. On the Format tab, we can view the options used if the table definition is applied to a sequential file in a server job; further information on this is out of scope for this document.
Even after the import is done, the metadata can be altered. In the Columns tab, double-click on the left side of the column that needs to be worked upon. This opens the properties window of the column, where we can alter the definition accordingly. But normally we should follow what is given in the copybook unless told otherwise.
As an example, let us change the length of the ADDRESS field in the EMP file definition.

1) Double-click on the left-hand side of the ADDRESS field on the Columns tab.
2) Specify the length to be 100 (earlier it was 20).
Take a close look at the different types of records contained in the data file. Records are delimited by the | character, and the first character of each record defines its record type: 1 implies an EMP type record, 2 a SAL type record. Two or more consecutive SAL type records following an EMP type record are all related to the EMP record they follow.
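Purely as an illustration (the values are made up and the records are truncated here), a fragment of such a file could look like:

1 00001 JOHN SMITH ...|2 00001 0000050000|2 00001 0000052000|1 00002 JANE DOE ...

The two SAL records (type 2) between the two EMP records (type 1) both belong to employee 00001.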
Create a parallel job with a simple structure as shown below. The Complex Flat File stage reads the data file at hand; the Transformer stage is useful if any kind of basic logical transformation of the data is required; and finally the processed records are moved to separate datasets.
1.2.1 Stage Section
- Record type should be set to Fixed from the drop-down, as we are dealing with a fixed-length data file.
- Normally this source stage reads records sequentially, but we have the option of making it read records in parallel: either check the Read from multiple nodes box or specify a value larger than 1 in the Number of readers per node field.
- This tab also gives us the luxury of fetching records by applying a filter, or fetching only the first N records. Both of these fields can be parameterized.
- A navigation control called Fast path can be found at the bottom left of the tab. It helps navigate to a few of the most essential tabs; alternatively, we can choose which tab we want to work on by clicking it directly.
- In the Record Type segment of the page, you can choose different options from a drop-down according to your case.
- For EBCDIC formatting, please select accordingly from the drop-downs of each option.
- Optionally, you can specify a few decimal properties in this tab. But as we are only dealing with character type values in our file, we will ignore this.
Records Tab

- Now we will import the metadata we saved earlier, in order to read the data file.
- First uncheck the Single record box (bottom right-hand side): our file is of multiple record types, as our copybook contains two record types, namely EMP and SAL.
- Right-click on the left pane of the tab and select Add new record.
- Change the name of the record type from the default NEWRECORD to EMP.
- Click Load. Select the EMP table definition saved earlier, and all the columns from the table.
- Specify a Field width in the Extended attributes for every CHARACTER field, equal to the length of the field (e.g. 10 for a field declared PIC X(0010)). Unless you do this, View Data may fail, causing the parallel job to abort.
- Add another record by right-clicking on the left pane. Alternatively, there are icons at the bottom of the pane for adding new records; hover the mouse over each icon and a prompt will appear describing its function. The image below shows the Insert new record after current record option.
- Rename the new record type to SAL. Repeat the process to define and load its table definitions.
- Select the EMP record type and click the right-most icon (Toggle Master Record).
- Specify record identifiers for both the EMP and SAL type records: for the EMP record type the identifier field DJY6001 = 1, and for the SAL record type it should be 2.
Layout Tab

- The next tab is used to view the column layout for each of the record types.
- Select COBOL Layout, then click on the record type in the left-hand pane for which you want to view the column layout.
1.2.2 Output Section
Selection Tab

- Move to Fast path 3 of 4, which navigates to the Selection tab of the Output section.
- Click on the >> box. This nominates all the columns to be propagated to the next stage via the output link.
- If any column needs to be dropped, select that column on the right-hand side and click the < box. In our case, we will skip this step.
Constraint Tab

- Move to the next tab.
- Define output constraints for each record type by clicking the Default button, as sketched below. This ensures that only records of these two types flow into the output link. Note that for a single record type, nothing needs to be specified here.
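For instance, with the record identifiers defined earlier, the constraints would look broadly like this (a sketch; the exact expression grammar may differ slightly between versions):

EMP_records link:  DJY6001 = '1'
SAL_records link:  DJY6001 = '2'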
Columns Tab

- In the Columns tab, the columns flowing into the output link can be seen.
- Click on View Data to see the source data. You may notice that the columns from all record types are populated with data. Keep in mind, however, that the data in a column of a given record type is valid only when the record identifier field DJY6001 holds that record type's identifier character (1 for EMP, 2 for SAL).
The EMP_records output link should get only basic EMP information, and the SAL_records output link should get employee payroll information. Map the input columns accordingly.
4.0 Appendix

- http://www01.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_150.htm
- http://coboltuts.blogspot.com/p/multiple-record-types-in-earlier.html
- http://docs.oracle.com/cd/B28359_01/owb.111/b31278/concept_etl_performance.htm#i1143526
- http://www01.ibm.com/support/knowledgecenter/SSZJPZ_8.5.0/com.ibm.swg.im.iis.ds.design.help.doc/topics/CFF_stage_page_records_tab.html