Study Guide for DataStage Certification
This entry is a comprehensive guide to preparing for the DataStage 7.5 Certification exam.
This page can be reached via the shortened URL tinyurl.com/dsexam. All the
details on the DataStage 8 Certification exam can be found at
tinyurl.com/ds8exam.
Regular readers may be feeling a sense of déjà vu. Haven’t we seen this post before? I
originally posted this in 2006; this is the Director’s Cut - I’ve added some deleted scenes, a
commentary for DataStage 7.0 and 8.0 users, and generally improved the entry. By
reposting I retain the links from other sites such as my DataStage Certification Squidoo lens
with links to my certification blog entries and IBM certification pages.
This post shows all of the headings from the IBM exam Objectives and describes how to
prepare for that section.
Before you start, work out how to add environment variables to a job as job parameters,
as they are handy for exercises and testing. See the DataStage Designer Guide
for details.
Versions: Version 7.5.1 and 7.5.2 are the best to study and run exercises on. Version 6.x is
risky but mostly the same as 7. Version 8.0 is no good for any type of installation and
configuration preparation as it has a new approach to installation and user security.
Reading: Read the Installation and Upgrade Guide for DataStage, especially the section on
parallel installation. Read the pre-requisites for each type of install such as users and
groups, the compiler, project locations, kernel settings for each platform. Make sure you
know what goes into the dsenv file. Read the section on DataStage for USS as you might get
one USS question. Do a search for threads on dsenv on the dsxchange forum to become
familiar with how this file is used in different production environments.
Exercise: installing your own copy of DataStage Server Enterprise Edition is the best preparation -
getting it to connect to Oracle, DB2 and SQL Server is also beneficial. Run the DataStage
Administrator and create some users and roles and give them access to DataStage
functions.
I’ve moved sections 4 and 9 up to the front, as you need to study them before you run exercises
and read about parallel stages in the other sections. Understanding how to use and monitor
parallel jobs is worth a whopping 20%, so it’s a good one to know well.
Versions: you can study this using DataStage 6, 7 and 8. Version 8 has the best definition of
the parallel architecture with better diagrams.
Reading: the opening chapters of the Parallel Job Developers Guide cover what the parallel engine
and job partitioning are all about. Read about each partitioning type. Read how sequential file stages
partition or repartition data and why datasets don’t. The Parallel Job Advanced Developers
Guide has sections on environment variables to help with job monitoring, read about every
parameter with the word SCORE in it. The DataStage Director Guide describes how to run job
monitoring - use the right mouse click menu on the job monitor window to see extra parallel
information.
Try creating one-node, two-node and four-node config files and see how jobs behave under
each one. Try the remaining exercises on a couple of different configurations by adding the
configuration environment variable to the job. Try some pooling options. I’ve got to admit I
guessed my way through some of the pooling questions as I didn’t do many exercises.
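If you have not built one before, a minimal two-node configuration file looks something like this (the hostname and directory paths are illustrative; see the Parallel Job Developers Guide for the full syntax):

```
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/datasets" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/datasets" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
}
```

Point the $APT_CONFIG_FILE job parameter at different versions of this file to switch a job between one, two and four nodes without recompiling.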
Generate a set of rows into a sequential file for testing out various partitioning types. One
column with unique ids 1 to 100 and a second column with repeating codes such as A, A, A,
A, A, B, B, B, B, B etc. Write a job that reads from the input, sends it through a partitioning
stage such as a Transformer and writes it to a Peek stage. The Director log shows which
rows went where. You should also view the Director monitor and expand and show the row
counts on each instance of each stage in the job to see how stages are split and run on each
node and how many rows each instance gets.
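The test file described above can also be generated outside DataStage with a short script (a sketch; the file name and column names are illustrative):

```python
# Generate a two-column test file for partitioning exercises:
# unique ids 1 to 100, and a code column that repeats every five rows
# (A, A, A, A, A, B, B, B, B, B, ... through to T).
codes = "ABCDEFGHIJKLMNOPQRST"  # 20 codes x 5 repeats = 100 rows

with open("partition_test.txt", "w") as f:
    f.write("id,code\n")
    for i in range(1, 101):
        f.write(f"{i},{codes[(i - 1) // 5]}\n")
```

Import the file with a sequential file stage and switch the downstream partitioning type (round robin, hash on code, range on id) between runs to see how the rows are distributed.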
Use a filter stage to split the rows down two paths and bring them back together with a
funnel stage, then replace the funnel with a collector stage. Compare the two.
Test yourself on estimating how many processes will be created by a job and check the
result after the job has run using the Director monitor or log messages. Do this throughout
all your exercises across all sections as a habit.
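As a rough rule of thumb for those estimates (my own sketch, not an official formula, and it assumes no operator combining and no node pool constraints - both of which reduce the count):

```python
# Rule-of-thumb process count for a parallel job: one conductor process,
# one section leader per node, and one player per operator per node.
# Operator combining and constrained stages will lower the real number.
def process_estimate(operators, nodes):
    conductor = 1
    section_leaders = nodes
    players = operators * nodes
    return conductor + section_leaders + players
```

So a job whose score shows 3 operators on a two-node configuration would start around 9 processes; check your guess against the $APT_DUMP_SCORE output or the Director monitor.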
Section 2 - Metadata
I’ve merged these into one. Both sections talk about sequential files, datasets, XML files and
Cobol files.
Exercises: Step through the IBM XML tutorial to learn the tricky parts of reading XML files. Find
an XML file and do various exercises reading it and writing it to a sequential file. Switch
between different key fields to see the impact of the key on the flattening of the XML
hierarchy. Don’t worry too much about XML Transform.
Import from a database using the Orchestrate Import and the Plugin Import and compare
the table definitions. Run an exercise on column propagation using the Peek stage where a
partial schema is written to the Peek stage to reveal the propagated columns.
Create a job using the Row Generator stage. Define some columns, on the columns tab
doubleclick on a column to bring up the advanced column properties. Use some properties
to generate values for different data types. Get to know the advanced properties page.
Create a really large sequential file, dataset and fileset and use each as a reference in a
lookup stage. Monitor the Resource and Scratch directories as the job runs to see how these
lookup sources are prepared prior to and during a job run. Get to know the difference
between the lookup fileset and other sources for a lookup stage.
This is one of the more difficult topics if you get questions about a database you are not familiar with.
I got one database parallel connectivity question that I still can’t find the answer to in any of
the manuals.
Versions: DataStage 7.x any version or at a pinch DataStage 8. Earlier versions do not have
enough database stages and DataStage 8 has a new approach to database connections.
Reading: read the plugin guide for each enterprise database stage: Oracle, SQL Server, DB2
and ODBC. In version 8 read the improved Connectivity Guides for these targets. If you have
time you can dig deeper, the Parallel Job Developers Guide and/or the Advanced Developers
Guide has a section on the Oracle/DB2/Informix/Teradata/Sybase/SQL Server interface
libraries. Look for the section called "Operator action" and read it for each stage. It’s got
interesting bits like whether the stage can run in parallel, how it converts data and handles
record sizes.
Exercise: Add each Enterprise database stage to a parallel job as both an input and output
stage. Go in and fiddle around with all the different types of read and write options. You
don’t need to get a connection working or have access to that database, you just need to
have the stage installed and add it to your job. Look at the differences between
insert/update/add/load etc. Look at the different options for each database stage. If you
have time and a database try some loads to a database table.
Section 6 - Data Transformation (15%)
If you’ve used DataStage for longer than a year this is probably the topic you are going to
ace - as long as you have made some use of stage variables.
Reading: there is more value in using the transformation stages than in reading about them.
The Transformer stage is easier to navigate and understand when you are using it than when
you are only reading about it. If you have to make do with reading, visit the dsxchange
and look for threads on stage variables, the FAQ on the parallel number generator, removing
duplicates using a Transformer, and questions in the parallel forum on null handling. The
threads are better than the manuals because they are full of practical examples. Read the
Parallel Job Advanced Developers Guide section on "Specifying your own parallel stages".
Exercises: Focus on Transformer, Modify Stage (briefly), Copy Stage (briefly) and Filter
Stage.
Create some mixed-up source data with duplicate rows and a multi-field key. Try to
remove duplicates using a Transformer with a Sort stage and a combination of stage variables
that hold the prior row’s key value to compare against the new row’s key value.
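The stage-variable pattern being exercised here can be sketched outside DataStage (pure Python, with illustrative data; in the real job the comparison lives in stage variable derivations):

```python
# Sketch of the Transformer stage-variable de-duplication pattern:
# rows must arrive sorted on the key; a "prior key" variable is compared
# with the current row's key, and only rows whose key differs pass through.
rows = [
    ("A", 1), ("A", 1), ("B", 2), ("B", 2), ("B", 3), ("C", 4),
]

def dedupe_sorted(rows):
    prior_key = object()      # sentinel: never equal to a real key
    out = []
    for row in rows:
        if row != prior_key:  # the stage-variable comparison
            out.append(row)
        prior_key = row
    return out
```

Note the pattern only works on sorted input - which is why the exercise pairs the Transformer with a Sort stage.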
Process some data that has nulls in it. Use the null column in a Transformer concatenate
function with and without a nulltovalue function and with and without a reject link from the
Transformer. This gives you an understanding of how rows get dropped and/or trapped from
a transformer. Explore the right mouse click menu in the Transformer, output some of the
DS Macro values and System Variables to a peek stage and think of uses for them in various
data warehouse scenarios. Ignore DS Routine, it’s not on the test.
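The null-handling behaviour this exercise reveals can be sketched as follows (function names are illustrative stand-ins for the Transformer’s NullToValue and a concatenation derivation):

```python
# Sketch of null handling in a parallel Transformer derivation:
# a derivation that uses a null column without protection fails, and the
# row is dropped - or sent down the reject link if the job has one.
def null_to_value(value, default):
    """Illustrative stand-in for the Transformer NullToValue function."""
    return default if value is None else value

def derive(first, last):
    """Unprotected concatenation: fails (None) when either input is null."""
    if first is None or last is None:
        return None  # row would be dropped or rejected
    return first + " " + last

def derive_safe(first, last):
    """Protected concatenation: substitutes a default so no row is lost."""
    return null_to_value(first, "?") + " " + null_to_value(last, "?")
```

Running the same null-bearing rows through both derivations in a job, with and without a reject link, shows exactly which rows are dropped, rejected, or kept.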
Don’t spend much time on the Modify stage - it would take forever to memorize all its functions.
Just do an exercise on handle_null, string_trim and converting a string to a number. It can be
tricky to get working, and you might not even get a question about it.
I’ve combined these since they overlap. Don’t underestimate this section: it covers a very
narrow range of functionality, so it is an easy set of questions to prepare for and get right.
There are easy points on offer here.
Versions: Any version 7.x is best, version 6 has a completely different lookup stage, version
8 can be used but remember that the Range lookup functionality is new.
Reading: The Parallel Job Developers Guide has a table showing the differences between the
lookup, merge and join stages. Try to memorize the parts of this table about inputs and
outputs and reject links. This is a good place to learn about some more environment
variables. Read the Parallel Job Advanced Developers Guide looking for any environment
variables with the word SORT or SCORE in them.
Exercises: Compare the way join, lookup and merge work. Create a job that switches
between each type.
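To make the comparison concrete, here is a rough sketch of one key semantic difference - how an inner join and a lookup with a reject link treat unmatched rows (pure Python with illustrative data, not DataStage code):

```python
# Inner join: unmatched stream rows simply disappear from the output.
# Lookup with a reject link: unmatched stream rows are captured separately.
reference = {1: "Books", 2: "Music"}        # key -> description
stream = [(1, 9.99), (2, 5.00), (3, 1.50)]  # (key, amount); key 3 unmatched

def inner_join(stream, reference):
    return [(k, amt, reference[k]) for k, amt in stream if k in reference]

def lookup_with_reject(stream, reference):
    matched, rejected = [], []
    for k, amt in stream:
        if k in reference:
            matched.append((k, amt, reference[k]))
        else:
            rejected.append((k, amt))
    return matched, rejected
```

The manuals’ comparison table adds the other dimensions worth memorizing: the number of input links each stage accepts, its reject link rules, and its input sorting requirements.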
Create an annotation and a description annotation and explore the differences between the
two. Use the Multiple Compile tool in the DataStage Manager (version 6, 7) or Designer
(version 8). Create and use shared containers.
Versions: most deployment methods have remained the same from version 6, 7 and 8.
Version 8 has the same import and export functions. DataStage 8 parameter sets will not be
in the version 7 exam.
Reading: you don’t need to install or use the Version Control tool to pass this section;
however, you should read the PDF that comes with it to understand the IBM
recommendations for deployment. It covers the move from dev to test to prod. Read the
Server Job Developers Guide section on command-line calls to DataStage such as dsjob,
dssearch and dsadmin.
Exercises: practice a few dsjob and dssearch commands. Practice saving a log to a text file.
Create some job specific environment variables.
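A few dsjob commands worth practising are sketched below (the project name dstage1 and job name MyJob are placeholders; check the Server Job Developers Guide for the full option list):

```shell
dsjob -lprojects                     # list projects on the server
dsjob -ljobs dstage1                 # list jobs in a project
dsjob -run -jobstatus dstage1 MyJob  # run a job and wait for its finishing status
dsjob -logsum dstage1 MyJob > MyJob_log.txt  # save a log summary to a text file
```

Run these from the DataStage engine bin directory (after sourcing dsenv on Unix) so the client libraries resolve.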