
IBM WebSphere QualityStage


Version 8 Release 1

Tutorial

SC18-9925-01
Note
Before using this information and the product that it supports, read the information in “Notices” on page 51.

© Copyright International Business Machines Corporation 2004, 2008.


US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents

Chapter 1. About WebSphere QualityStage . . . 1
  About DataStage Projects . . . 1
  About QualityStage jobs . . . 1
  WebSphere DataStage and QualityStage stages . . . 2
  Server and client components . . . 2

Chapter 2. Tutorial project goals . . . 3

Chapter 3. Setting up the tutorial . . . 5
  Creating a folder for the tutorial files . . . 5
  Copying tutorial data . . . 5
  Creating the tutorial project . . . 5
  Starting a project . . . 6
  Creating a job . . . 6
  Importing tutorial components . . . 6

Chapter 4. Module 1: Investigating source data . . . 9
  Lesson 1.1: Setting up and linking an Investigate job . . . 9
    Lesson checkpoint . . . 10
  Lesson 1.2: Renaming links and stages in an Investigate job . . . 11
    Lesson checkpoint . . . 12
  Lesson 1.3: Configuring the source file . . . 12
    Lesson checkpoint . . . 13
  Lesson 1.4: Configuring the Copy stage . . . 13
    Lesson checkpoint . . . 13
  Lesson 1.5: Configuring the Investigate stage . . . 14
    Lesson summary . . . 15
  Lesson 1.6: Configuring the InvestigateStage2 icon . . . 15
    Lesson summary . . . 16
  Lesson 1.7: Configuring target reports . . . 17
    Lesson checkpoint . . . 17
  Lesson 1.8: Compiling and running jobs . . . 17
    Lesson checkpoint . . . 18
  Module 1: Summary . . . 18

Chapter 5. Module 2: Standardizing data . . . 19
  Lesson 2.1: Setting up a Standardize job . . . 19
    Lesson checkpoint . . . 21
  Lesson 2.2: Configuring the Standardize job stage properties . . . 21
    Configuring the Customer file properties . . . 21
    Configuring the Standardize stage . . . 22
    Configuring the Transformer stage . . . 24
    Configuring the Copy stage . . . 25
    Configuring the Match Frequency stage . . . 25
    Lesson checkpoint . . . 26
  Lesson 2.3: Configuring the target data sets . . . 26
    Lesson checkpoint . . . 27
  Module 2: Summary . . . 27

Chapter 6. Module 3: Grouping records with common attributes . . . 29
  Lesson 3.1: Setting up an Unduplicate Match job . . . 29
    Lesson checkpoint for the Unduplicate Match job . . . 31
  Lesson 3.2: Configuring the Unduplicate Match job stage properties . . . 31
    Configuring the Unduplicate Match stage . . . 32
    Configuring the Funnel stage . . . 33
    Lesson 3.2 checkpoint . . . 34
  Lesson 3.3: Configuring Unduplicate job target files . . . 34
    Lesson checkpoint . . . 35
  Module 3: Summary . . . 35

Chapter 7. Module 4: Creating a single record . . . 37
  Lesson 4.1: Setting up a Survive job . . . 37
    Lesson checkpoint . . . 38
  Lesson 4.2: Configuring Survive job stage properties . . . 38
    Configuring the Survive stage . . . 39
    Configuring the target file . . . 40
    Lesson checkpoint . . . 40
  Module 4: Summary . . . 41

Chapter 8. WebSphere QualityStage Tutorial: summary . . . 43

Product documentation . . . 45
  Contacting IBM . . . 45

How to read syntax diagrams . . . 47

Product accessibility . . . 49

Notices . . . 51
  Trademarks . . . 53

Index . . . 55
Chapter 1. About WebSphere QualityStage
IBM® WebSphere® QualityStage is a data cleansing component that is part of the
WebSphere DataStage™ and QualityStage Designer (Designer client).

The Designer client provides a common user interface in which you design your
data quality jobs. In addition, you have the power of the parallel processing engine
to process large stores of source data.

The integrated stages available in the Repository provide the basis for
accomplishing the following data cleansing tasks:
v Resolving data conflicts and ambiguities
v Uncovering new or hidden attributes from free-form or loosely controlled source
columns
v Conforming data by transforming data types into a standard format
v Creating one unique result
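The idea behind conforming data to a standard format can be sketched outside the product in a few lines of Python. The abbreviation table below is a hypothetical example for illustration only; QualityStage performs this work with full rule sets rather than a lookup table.

```python
# Minimal sketch of standardization: conform common address
# abbreviations to one standard form. STANDARD_FORMS is a tiny
# illustrative table, not a QualityStage rule set.
STANDARD_FORMS = {
    "ST": "STREET", "ST.": "STREET",
    "AVE": "AVENUE", "AVE.": "AVENUE",
    "APT": "APARTMENT", "APT.": "APARTMENT",
}

def standardize(address: str) -> str:
    words = address.upper().split()
    return " ".join(STANDARD_FORMS.get(w, w) for w in words)

print(standardize("123 Main St. Apt 4"))
```

With this sketch, "123 Main St. Apt 4" and "123 MAIN STREET APARTMENT 4" conform to the same standard string, which is the precondition for the matching tasks described later.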

Learning objectives
The key points that you should keep in mind as you complete this tutorial include
the following concepts:
v How the processes of standardization and matching improve the quality of the
data
v The ease of combining both QualityStage and DataStage stages in the same job
v How the data flows in an iterative process from one job to another
v The surviving data results in the best available record

About DataStage Projects


The IBM WebSphere DataStage and QualityStage Designer client provides projects
as a method for organizing your re-engineered data. You define data files and
stages and you build jobs in a specific project. WebSphere QualityStage uses these
projects to create and store files on the client and server.

Each QualityStage project contains the following components:


v QualityStage jobs
v Stages that are used to build each job
v Match specification
v Standardization rules
v Table definitions

In this tutorial, you will create a project by using the data that is provided.

About QualityStage jobs


IBM WebSphere QualityStage uses jobs to process data.

To start a QualityStage job, you open the Designer client and create a new Parallel
job. You build the QualityStage job by adding stages, source and target files, and
links from the Repository, and placing them onto the Designer canvas. The

Designer client compiles the Parallel job and creates an executable file. When the
job runs, the stages process the data by using the data properties that you defined.
The result is a data set that you can use as input for the next job.

In this tutorial, you build four QualityStage jobs. Each job is built around one of
the Data Quality stages and additional DataStage stages.

WebSphere DataStage and QualityStage stages


A stage in IBM WebSphere DataStage and QualityStage performs an action on
data. The type of action depends on the stage that you use.

The Designer client stages are stored in the Designer tool palette. You can access all
the QualityStage stages in the Data Quality group in the palette. You configure
each stage to perform the type of actions on the data that obtain the required
results. Those results are used as input data to the next stage. The following stages
are included in QualityStage:
v Investigate stage
v Standardize stage
v Match Frequency stage
v Unduplicate Match stage
v Reference Match stage
v Survive stage

In this tutorial, you use most of the QualityStage stages.

You can also add any of the DataStage stages to your job. In some of the lessons,
you add DataStage stages to enhance the types of tools for processing the data.

Server and client components


You load the client and server components to process the data.

The following server components are installed on the server:


Repository
A central store that contains all the information required to build a
QualityStage job.
IBM WebSphere DataStage server
Runs the QualityStage jobs.

The following DataStage client components are installed on any personal computer:
v WebSphere DataStage and QualityStage Designer
v DataStage Director
v DataStage Administrator

In this tutorial, you use all of these components when you build and run your
QualityStage project.

Chapter 2. Tutorial project goals
The goal of this tutorial is to use Designer client stages to cleanse customer data by
removing all the duplicates of customer addresses and providing a best case for
the correct address.

In this tutorial, you have the role of a database analyst for a bank that provides
many financial services. The bank has a large database of customers; however,
there are problems with the customer list because it contains multiple names and
address records for a single household. Because the marketing department wants
to market additional services to existing customers, you need to find and remove
duplicate addresses.

For example, a married couple has four accounts, each in their own names. The
accounts include two checking accounts, an IRA, and a mutual fund.

In the bank’s existing system, customer information is tracked by account number rather than customer name, number, or address. For this one customer, the bank has four address entries.

To save money on the mailing, the bank wants to consolidate the household
information so that each household receives only one mailing. In this tutorial, you
are going to use IBM WebSphere QualityStage to standardize all customer
addresses. In addition, you need to locate and consolidate all records of customers
who are living at the same address.
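The household consolidation described above amounts to grouping records that share a normalized address key. The record layout and the crude normalization below are illustrative assumptions; QualityStage's Standardize and Unduplicate Match stages do this with rule sets and probabilistic matching rather than exact string keys.

```python
# Sketch: group customer accounts that share the same address so each
# household receives only one mailing. The records are hypothetical.
from collections import defaultdict

records = [
    {"account": "1001", "name": "JOHN SMITH",  "addr": "12 Oak St"},
    {"account": "1002", "name": "MARY SMITH",  "addr": "12 OAK STREET"},
    {"account": "1003", "name": "J & M SMITH", "addr": "12 Oak Street"},
]

def address_key(addr: str) -> str:
    # Deliberately crude normalization for illustration only.
    return addr.upper().replace("STREET", "ST").replace(".", "")

households = defaultdict(list)
for rec in records:
    households[address_key(rec["addr"])].append(rec["account"])

print(dict(households))
```

All three accounts collapse to one household key, so one mailing goes to the Smith household instead of three.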

Learning objectives

The purpose of this tutorial is to provide a working knowledge of the QualityStage process flow through the jobs. In addition, you learn how to do the following tasks:
v Set up each job in the project
v Configure each stage in the job
v Assess results of each job
v Apply those results to your business practices

After you complete these tasks, you should understand how QualityStage stages
restructure and cleanse the data by using applied business rules.

This tutorial will take approximately 2.5 hours to complete.

Skill level

To use this tutorial, you need an intermediate to advanced level of understanding of data analysis.

Audience

This tutorial is intended for business analysts and systems analysts who are
interested in understanding QualityStage.

System requirements
v IBM Information Server Suite
v Microsoft® Windows® XP or Linux® operating systems

Prerequisites

To complete this tutorial, you need to know how to use:
v IBM WebSphere DataStage and QualityStage Designer
v Personal computers

Expected results

Upon the completion of this tutorial, you should be able to use the Designer client
to create your own QualityStage projects to meet the business requirements and
data quality standards of your company.

Chapter 3. Setting up the tutorial
The setup process for the tutorial includes creating a folder, copying the job and input data file, and creating and starting a project. You must complete the setup tasks before you begin Module 1 of the tutorial.

Creating a folder for the tutorial files


Copy the folder with the tutorial files from the installation DVD to your IBM
WebSphere QualityStage client computer.
1. Insert the DVD into the CD or DVD drive of the client computer.
2. Locate the TutorialData folder on the DVD.
3. Copy the TutorialData folder from the DVD to the C: drive of the client computer (for example, C:\TutorialData).
4. Open the TutorialData\QualityStage folder to locate the tutorial data files.

Copying tutorial data


Copy the tutorial data files from the tutorial folder you created on the client
computer to the project folder or directory on the IBM WebSphere QualityStage
computer where the engine tier is installed.

The IBM WebSphere QualityStage server may be installed on the same Windows
computer as the clients, or it may be on a separate Windows, UNIX®, or Linux
computer. Sometimes the engine tier is referred to as the DataStage and
QualityStage server. When you created the project for the tutorial, you
automatically created a folder or directory for that project on the computer where
the engine tier is installed.
1. Open the tutorial folder TutorialData\QualityStage that you created on the client
computer and locate the input.csv file.
2. Open the project folder on the computer where the engine tier is installed for
the tutorial project you created. Where tutorial_project is the name of the
project you created, examples of path names are:
v For a Windows server: C:\IBM\InformationServer\Server\Projects\
tutorial_project
v For a UNIX or Linux server: /opt/IBM/InformationServer/Server/Projects/tutorial_project
3. On the client computer, right-click the input.csv file in the tutorial folder and
select Copy from the shortcut menu.
4. Move to the project folder on the server computer, right-click and select Paste
from the shortcut menu.

Creating the tutorial project


Create a new project for the tutorial to keep your tutorial exercises separate from
the other work on IBM WebSphere QualityStage.

You must have QualityStage Administrator privileges.


1. Select Start → All Programs → IBM Information Server → IBM WebSphere
DataStage and QualityStage Administrator.

2. In the Attach window, type your user name and password.
3. In the Projects tab, click Add to open the Add New Project window.
4. In the Name field, specify the name of the new project (for example, Tutorial).
5. Click OK to create the new project.
6. Click Close to close the Administrator client.

Starting a project
Use a Designer client project as a container for your QualityStage jobs.

Open the DataStage Designer client to begin the tutorial. The DataStage Designer
Parallel job provides the executable file that runs your QualityStage jobs.
1. Click Start → All Programs → IBM Information Server → IBM WebSphere
DataStage and QualityStage Designer. The Attach to Project window opens.
2. In the Domain field, type the name of the server that you are connected to.
3. In the User name field, type your user name.
4. In the Password field, type your password.
5. In the Project field, select the project you created (for example, Tutorial).
6. Click OK. The New Parallel job opens in the Designer client.

Creating a job
The Designer client provides the interface to the parallel engine that processes the
QualityStage jobs. You are going to save a job to a folder in the DataStage
repository.

If it is not already open, open the DataStage Designer client.

To create a new DataStage job:


1. From the New window, select the Jobs folder in the left pane and then select
the Parallel Job icon in the right pane.
2. Click OK. A new empty job design window opens in the job design area.
3. Click File → Save.
4. In the Save Parallel Job As window, right-click the Jobs folder and select New →
Folder from the shortcut menu.
5. Type in a name for the folder (for example, MyTutorial).
6. Click the new folder (MyTutorial) and in the Item name field, type
Investigate1.
7. Click Save to save the job.

You have created a new parallel job named Investigate1 and saved it in the folder Jobs\MyTutorial in the repository. Using the same procedure, create three more jobs in this folder and name them Standardize1, Unduplicate1, and Survive1.

Import the tutorial data into your project now.

Importing tutorial components


Use the Designer client to import the sample metadata, which includes a job and
table definition, into the tutorial project.

Import the tutorial sample metadata to begin the tutorial lessons:

1. Select Start → All Programs → IBM Information Server → IBM WebSphere
DataStage and QualityStage Designer.
2. In the Attach window, type your user name and password.
3. Select the Tutorial project from the Project list and click OK. The Designer
client opens and displays the New window.
4. Click Cancel to close the New window because you are opening an existing
job, not creating a new job or object.
5. Select Import → DataStage Components.
6. In the Import from file field, go to C:\TutorialData\QualityStage on the client
computer and select the QualityStage_Tutorial.dsx file.
7. Click Open to open the file.
8. Click Select All.
9. Click on the items that already exist (Status = Already Exists) to deselect
them.
10. Click OK to import the sample job and sample table definition into a
repository folder named QualityStage Tutorial.

The sample tutorial jobs are displayed in the repository under the Jobs/QualityStage Tutorial folder. You can open each job and look at how it is designed on the canvas. Use these jobs as a reference when you create your own jobs.

You can begin Module 1.

Chapter 4. Module 1: Investigating source data
This module explains how to set up and process an Investigate job to provide data
from which you can create reports in the IBM Information Server Web console.

You can use the information in the reports to make basic assumptions about the
data and the steps you must take to attain the goal of providing a legitimate
address for each customer in the database.

Learning objectives

After completing the lessons in this module, you should know how to do the
following tasks:
1. Add QualityStage or DataStage stages and links to a job
2. Configure stage properties to specify which action they take when the job is
run
3. Load and process customer data and metadata
4. Compile and run a job
5. Produce data for reports

This module should take approximately 30 minutes to complete.

Lesson 1.1: Setting up and linking an Investigate job


Create each QualityStage job by adding Data Quality stages and DataStage
sequential files and stages to the Designer canvas. Each icon on the canvas is
linked together to allow the data to flow from the source file to each stage.

If you have not already done so, open the Designer client.
1. From the left pane of the Designer, go to the MyTutorial folder you created for
this tutorial and double click on Investigate1 to open the job.
2. Click Palette → Data Quality to select the Investigate stage.
If you do not see the palette, click View → Palette.
3. Drag the Investigate stage onto the Designer canvas and drop it in the middle
of the canvas.
4. Drag a second Investigate stage and drop it beneath the first Investigate
stage. You must use two investigate stages to create the data for the reports.
5. Click Palette → File and select Sequential File.
6. Drag the Sequential File onto the Designer canvas and drop it to the left of
the first Investigate stage. This sequential file is the source file.
7. Click Palette → Processing and select the Copy stage. This stage duplicates the
data from the source file and copies it to the two Investigate stages.
8. Drag the Copy stage onto the Designer canvas and drop it between the
Sequential File and the first Investigate stage.
9. Click Palette → File, and drag a second Sequential File onto the Designer
canvas and drop it to the right of the first Investigate stage.
The data from the Investigate stage is sent to the second Sequential File
which is the target file.

10. Drag a third Sequential File onto the Designer canvas and drop it to the right
of the Investigate stage and beneath the second Sequential File. You now
have a source file, a Copy stage, two Investigate stages, and two target files.
11. Drag a fourth Sequential File onto the Designer canvas and drop it beneath
the third Sequential File as the final target file. Now, link all the stages
together.
12. Click Palette → General → Link.
a. Right-click and drag a link from the source file to the Copy stage.
If the link is red, click to activate the link and drag it until it meets the
stage. It should turn black.
When all the icons on the canvas are linked, you can click on a stage and
drag it to change its position.
b. Continue linking the other stages. The following figure shows the
completed Investigate job with the names you will assign to the stages and
links in the next lesson.

Lesson checkpoint
When you set up the Investigate job, you are connecting the source file and its
source data and metadata to all the stages and linking the stages to the target files.

In completing this lesson, you learned the following about the Designer:
v How to add stages to the Designer canvas
v How to combine Data Quality and Processing stages on the Designer canvas
v How to link all the stages together

Lesson 1.2: Renaming links and stages in an Investigate job
When creating a large job in the Designer client, it is important to rename each
stage, file, and link with meaningful names to avoid confusion when selecting
paths during stage configuration.

When you rename the links and stages, do not use spaces. The Designer client
resets the name back to the generic value if you enter spaces. The goal of this
lesson is to replace the generic names for the icons on the canvas with more
appropriate names.

To rename icons on the canvas:


1. To rename a stage, complete the following steps:
a. Click the name of the source SequentialFile until a highlighted box appears
around the name.
b. Type SourceFile in the box.
c. Click outside the box to deselect the box.
2. To rename a link, complete the following steps:
a. Right-click the generic link name DSLinkXX that connects SourceFile to the
Copy stage and select Rename from the shortcut menu. A highlighted box
appears around the default name.
b. Type Customerdata and click outside the box. The default link name changes
to Customerdata.
3. Right-click the generic link name that connects the Copy stage to the
Investigate stage.
4. Repeat step 2, except type Name in the box.
5. Right-click the generic link name that connects the Copy stage to the second
Investigate stage.
6. Repeat step 2, except type AddrCityState in the box.
7. Click on the names of the following stages and type the new stage name in the
highlighted box:

Stage          Change to
Copy           CopyStage
Investigate    InvestigateStage
Investigate    InvestigateStage2

8. Rename the three target files from the top in the following order:
a. NameTokenReport
b. AreaTokenReport
c. AreaPatternReport
9. Right click on the names of the following links, select Rename and type the
new link name in the highlighted box:

Link                                         Change to
From InvestigateStage to NameTokenReport     TokenData
From InvestigateStage2 to AreaTokenReport    AreaTokenData
From InvestigateStage2 to AreaPatternReport  AreaPatternData

Renaming the elements on the Designer canvas provides better organization to the
Investigate job.

Lesson checkpoint
In this lesson, you changed the generic stages and links to names appropriate for
the job.

You learned the following tasks:


v How to select the default name field in order to edit it
v The correct method to use in changing the name

Lesson 1.3: Configuring the source file


The source data and metadata are attached to the SourceFile as the source data for
the job.

The goal of this lesson is to attach the input data of customer names and addresses
and load the metadata.

To add data and metadata to the Investigate job, configure the source file to locate
the input data file input.csv stored on your computer and load the metadata
columns.

To configure the source file:


1. Double-click the SourceFile icon to open the Properties tab on the SourceFile -
Sequential File window.
2. Select the tutorial data file:
a. Click Source → File to activate the File field.
b. Click in the File field and select Browse for File.
c. Locate the directory on the server where you copied the input.csv file (for example, C:\IBM\InformationServer\Server\Projects\tutorial).
d. Click input.csv to select the file, then click OK.
3. Click the Columns tab.
4. Click Load.
5. From the Table Definitions window, click the QualityStage Tutorial → Table
Definitions folder. This folder was created when you imported the tutorial
sample metadata.
6. Click Input, then click OK to load the sample metadata.
7. Click OK to close the SourceFile - Sequential File window.
8. Click View Data to inspect the quality of the input data.
9. At the first Investigate1 window, select the number of rows to display and
click OK.
10. At the second Investigate1 window, you see bank customer names and addresses. The addresses are shown in a disorganized way, making it difficult for the bank to analyze the data.
11. Click Close to close the Investigate1 window.
12. Click OK to download the input data to your system.
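Outside the Designer, you can get a similar quick look at input.csv with Python's standard csv module. The path in the commented call is an assumption; use wherever you copied the file.

```python
# Sketch: preview the first few rows of the tutorial input file,
# similar to clicking View Data in the Designer.
import csv

def preview(path, rows=5):
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= rows:
                break
            print(row)

# Example (path is an assumption):
# preview(r"C:\TutorialData\QualityStage\input.csv")
```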

Lesson checkpoint
In this lesson, you attached the input data (customer names and addresses) and
loaded the metadata.

You learned how to do the following tasks:


v Attaching source data to the source file
v Adding column metadata to the source file

Lesson 1.4: Configuring the Copy stage


The Copy stage duplicates the source data and sends it to the two Investigate
stages.

This lesson explains how to configure a DataStage Processing stage, the Copy
stage, to a QualityStage job to duplicate the metadata and send the output
metadata to the two Investigate stages.

To configure a Copy stage:


1. Double-click the CopyStage icon to open the Properties tab on the CopyStage -
Copy window.
2. Click the Input → Columns tab. The metadata you loaded in the SourceFile has
propagated to the CopyStage.
3. Click the Output → Mapping tab.
4. Map the columns that display in the left Columns pane to the right Name
pane.
5. In the Output name field above the Columns pane of the screen, select Name if it is not already selected. Selecting the correct output link ensures that the data goes to the correct Investigate stage, InvestigateStage or InvestigateStage2.
6. Copy the data from the Columns pane to the Name pane:
a. Place your cursor in the Columns pane, right click and select Select All
from the shortcut menu.
b. Right click and select Copy from the shortcut menu.
c. Place your cursor in the Name pane, right click and select Paste Column from the shortcut menu. The column metadata is copied into the Name pane and lines are displayed to show the linking from the Columns pane to the Name pane.
7. In the Output name field above the Columns pane, select AddrCityState from
the drop-down menu.
8. Repeat step 6 to map the Columns pane to the AddrCityState pane.
9. Click OK to save the updated CopyStage.

This procedure shows you how to map columns to two different outputs.
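What the Copy stage does here, one input with identical records sent down two output links, can be sketched as a simple fan-out. The list-based "links" below are a stand-in for the Name and AddrCityState links, not QualityStage internals.

```python
# Sketch of a Copy-style fan-out: each input record is duplicated to
# two independent output streams.
def copy_stage(records):
    name_link, area_link = [], []
    for rec in records:
        name_link.append(dict(rec))   # independent copy per link
        area_link.append(dict(rec))
    return name_link, area_link

names, areas = copy_stage([{"Name": "JOHN SMITH", "City": "BOSTON"}])
print(names, areas)
```

Each link gets its own copy of every record, so downstream stages can modify their input without affecting each other.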

Lesson checkpoint
In this lesson, you mapped the input metadata to the two output links to continue
the propagation of the metadata to the next two stages.

You learned how to do the following tasks:


v Adding a DataStage stage to a QualityStage job
v Propagating metadata to the next stage
v Mapping metadata to two output links

Lesson 1.5: Configuring the Investigate stage
The Word Investigate option of the Investigate stage parses name and address data
into recognizable patterns by using a rule set that classifies personal names and
addresses.

The Investigate stage analyzes each record from the source file. In this lesson, you select the USNAME rule set to apply USPS® standards.
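The pattern analysis that Word Investigate performs can be sketched as classifying each word of a name into a token class and emitting a pattern string. The token classes and tiny lexicon below are simplified assumptions for illustration, not the USNAME rule set.

```python
# Sketch of word investigation: classify each word of a name into a
# token class and build a pattern string. Simplified classes:
# I = single initial, F = known first name, + = unknown alpha, ^ = numeric.
FIRST_NAMES = {"JOHN", "MARY", "ROBERT"}  # tiny illustrative lexicon

def classify(word: str) -> str:
    if word.isdigit():
        return "^"
    if len(word) == 1:
        return "I"
    if word in FIRST_NAMES:
        return "F"
    return "+"

def pattern(name: str) -> str:
    return "".join(classify(w) for w in name.upper().split())

print(pattern("John Q Public"))
```

Records that produce the same pattern (here "FI+") can be handled by the same parsing rule, which is how a token or pattern report helps you assess free-form data.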

To configure the Investigate stage:


1. Double-click the InvestigateStage icon.
2. Click the Word Investigate tab to open the Word Investigate window. The
Name column that was propagated to the InvestigateStage from the
CopyStage is shown in the Available Data Columns section.

3. Select Name from the Available Data Columns section and click the arrow button to move the Name column into the Standard Columns pane. The InvestigateStage analyzes the Name column by using the rule set that you select in step 4.

4. In the Rule Set field, click the browse button to select a rule set for the InvestigateStage.
a. In the Rule Sets window, double-click the Standardization Rules folder to
open the Standardization Rules tree.
b. Double-click the USA folder, double-click the USNAME folder, then select USNAME. The USNAME rule set parses the Name column according to United States Postal Service standards for names.
c. Right-click USNAME and select Provision All from the shortcut menu.
d. Click OK to exit the Rule Sets window.
Your Name map should look like the following figure:

5. Click the Token Report check box in the Output Dataset section of the
window.
6. Click the Stage Properties → Output → Mapping tab.

7. Map the output columns:
a. Click the Columns pane.
b. Right-click and select Select All from the shortcut menu.
c. Right-click and select Copy from the shortcut menu.
d. Click in the TokenData pane.
e. Right-click and select Paste Column from the shortcut menu. The columns
on the left side map to the columns on the right side.
8. Click the Columns tab. Notice that the Output columns are populated when
you map the columns in the Mapping tab. Click OK.
9. Click OK, then click File → Save to save the updated Investigate stage.

Lesson summary
This lesson explained how to configure the Investigate stage by using the
USNAME rule set.

You learned how to configure the Investigate stage in the Investigate job by doing
the following tasks:
v Selecting the columns to investigate
v Selecting a rule from the rules set
v Mapping the output columns

Lesson 1.6: Configuring the InvestigateStage2 icon


The Word Investigate option of the Investigate stage parses name and address data
into recognizable patterns by using a rule set that classifies personal names and
addresses.

The Investigate stage analyzes each record from the source file. In this lesson, you apply the USAREA rule set, which follows USPS® standards.

To configure the InvestigateStage2 icon:


1. Double-click the InvestigateStage2 icon.
2. Click the Word Investigate tab to open the Word Investigate window. The
address columns that were propagated to the second Investigate stage from
the CopyStage are shown in the Available Data Columns pane.
3. Select the following columns in the Available Data Columns pane to move to
the Standard Columns pane. The second Investigate stage analyzes the
address columns by using the rule set that you select in step 5.
v City
v State
v Zip5
v Zip4

4. Click the arrow button to move each selected column to the Standard Columns pane.
5. In the Rule Set field, click the browse button to locate a rule set for the InvestigateStage2.
a. In the Rule Sets window, double-click the Standardization Rules folder to
open the Standardization Rules tree.

b. Double-click the USA folder, double-click the USAREA folder, and select USAREA. The USAREA rule set parses the City, State, Zip5, and Zip4 columns according to United States Postal Service standards.
c. Right-click and select Provision All from the shortcut menu.
d. Click OK to exit the Rule Sets window. USAREA.SET is shown in the Rule
Set field.
6. Click the Token Report and Pattern Report check boxes in the Output Dataset section of the window. When you assign data to two outputs, you must verify that the link ordering is correct. Link ordering ensures that the data is sent to the correct reports through the assigned links that you named in Lesson 1.2. The Link Ordering tab is not displayed if there is only one link.
7. Click the Stage Properties → Link Ordering tab and select the output link to
move, if you need to change the display order of the links.
8. Move the links up or down:
v Click the up arrow to move the link name up a level.
v Click the down arrow to move the link name down a level.


The following figure shows the correct order for the links.

9. Click the Output → Mapping tab. Since there are two output links from the
second Investigate stage, you must map the columns to each link:
a. In the Output name field above the Columns pane, select
AreaPatternData.
b. Select the Columns pane.
c. Right-click and select Select All from the shortcut menu.
d. Right-click and select Copy from the shortcut menu.
e. Select the AreaPatternData pane, right-click and select Paste Column from
the shortcut menu. The columns are mapped to the AreaPatternData
output link.
f. In the Output name field above the Columns pane, select AreaTokenData.
g. Repeat steps b through e, except select the AreaTokenData pane in step e.
10. Click OK to close the InvestigateStage2 window.

Lesson summary
This lesson explained how to configure the second Investigate stage to use the
USAREA rule set.

You learned how to configure the second Investigate stage in the Investigate job by
doing the following tasks:
v Selecting the columns to investigate
v Selecting a rule from the rules set
v Verifying the link ordering for the output reports

16 IBM WebSphere QualityStage Tutorial


v Mapping the output columns to two output links

Lesson 1.7: Configuring target reports


The source data information and column metadata are propagated to the target
data files for later use in creating Investigation reports.

The Investigate job modifies the unformed source data into readable data which is
later configured into Investigation reports.

To configure the data files:


1. Double-click the NameTokenReport icon on the Designer client canvas.
2. Click Target → File.

3. In the File field, click the browse button and go to the folder on the
server computer where the input data file resides.
4. In the File name field, type tokrpt.csv to display the path and file name in
the File field (for example, C:\IBM\InformationServer\Server\Projects\
tutorial\tokrpt.csv).
5. Click OK to close the stage.
6. Double-click the AreaPatternReport icon.
7. Repeat steps 2 to 5 except type areapatrpt.csv.
8. Double-click the AreaTokenReport icon.
9. Repeat steps 2 to 5 except type areatokrpt.csv.

Lesson checkpoint
This lesson explained how to configure the target files for use as reports.

You configured the three target data files by linking the data to each report file.

Lesson 1.8: Compiling and running jobs


Test the Investigate job by compiling it and then running it to process the data
for the reports.

Compile the Investigate job in the Designer client. After the job compiles
successfully, open the Director client and run the job.

To compile and run the job:


1. Click File → Save to save the Investigate job on the Designer canvas.

2. Click the Compile button to compile the job. The Compile Job window opens and the job
begins to compile. When the compiler finishes, the following message is shown:
Job successfully compiled with no errors.
3. Click Tools → Run Director. The Director application opens with the job shown
in the Director Job Status View window.

4. Click the Run button to open the Job Run Options window.


5. Click Run.


After the job runs, Finished is shown in the Status column.

Lesson checkpoint
In this lesson, you learned how to compile and process an Investigate job.

You processed the data into three output files by doing the following tasks:
v Compiling the Investigate job
v Running the Investigate job in the Director

Module 1: Summary
In Module 1, you set up, configured, and processed an IBM WebSphere DataStage
and QualityStage Investigate job.

An Investigate job looks at each record column-by-column and analyzes the data
content of the columns that you select. The Investigate job loads the name and
address source data stored in the database of the bank, parses the columns into a
form that can be analyzed, and then organizes the data into three data files.

The Investigate job modifies the unformed source data into readable data that you
can configure into Investigation reports by using the IBM Information Server Web
console. You select QualityStage Reports to access the reports interface in the
Web console.

The next module organizes the unformed data into standardized data that provides
usable data for matching and survivorship.

Lessons learned

By completing this module, you learned about the following concepts and tasks:
v How to correctly set up and link stages in a job so that the data propagates from
one stage to the next
v How to configure the stage properties to apply the correct rule set to analyze the
data
v How to compile and run a job
v How to create data for analysis


Chapter 5. Module 2: Standardizing data
This module explains how to set up and process a Standardize job to standardize
name and address information derived from the database of the bank.

When you worked on the data in Module 1, some addresses were free form and
nonstandard. Removing duplicates of customer addresses and guaranteeing that a
single address is the correct address for that customer would be very difficult
without standardizing the data.

Standardizing or conditioning ensures that the source data is internally consistent;
that is, each type of data has the same type of content and format. When you use
consistent data, the system can match address data with greater accuracy during
the Match Frequency stage.
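The idea can be shown with a toy example: two free-form spellings of the same address standardize to one content and format, which is what makes later matching reliable. The abbreviation table below is a small illustrative sample, not the USADDR rule set:

```python
# Toy standardizer: uppercase, strip a punctuation variant, and map
# street-type spellings onto one standard form. The table is a small
# illustrative sample, not a real rule set.

STREET_TYPES = {"STREET": "ST", "ST.": "ST", "AVENUE": "AVE", "AVE.": "AVE"}

def standardize_address(line: str) -> str:
    """Return the line with consistent case and street-type tokens."""
    tokens = line.upper().replace(",", " ").split()
    return " ".join(STREET_TYPES.get(t, t) for t in tokens)

# Two variants of the same address now agree exactly:
print(standardize_address("123 Main Street"))  # -> 123 MAIN ST
print(standardize_address("123 main st."))     # -> 123 MAIN ST
```

Because both variants produce the same standardized string, a later match pass can treat them as the same address.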

Learning objectives

After completing the lessons in this module, you should know how to do the
following tasks:
1. Add stages and links to a Standardize job
2. Configure the various stage properties to correctly process the data when the
job is run
3. Handle nulls by using derivations
4. Generate frequency and standardized data

This module should take approximately 60 minutes to complete.

Lesson 2.1: Setting up a Standardize job


Standardizing data is the first step in data cleansing. In Lesson 2.1, you add a
variety of stages to the Designer canvas. These stages include the Transformer
stage, which applies derivations to handle nulls, and the Match Frequency stage,
which adds frequency data.

If you have not already done so, open the Designer client.

As you learned in Lesson 1.1, you must add stages and links to the Designer
canvas to create a standardize job. The Investigate job that you completed helped
you determine how to formulate a business strategy by using Investigation reports.
The Standardize job applies rule sets to the source data to condition it for
matching.

To set up a Standardize job:


1. From the left pane of the Designer, go to the MyTutorial folder you created for
this tutorial and double-click Standardize1 to open the job.
2. Drag the following icons onto the Designer canvas from the palette.
v Data Quality → Standardize icon to the middle of the canvas
v File → Sequential File icon to the left of the Standardize stage
v File → Data Set icon to the right of the Standardize stage
v Processing → Transformer icon between the Standardize stage and the Data
Set file
© Copyright IBM Corp. 2004, 2008 19
v Processing → Copy icon between the Transformer stage and the Data Set file
v Data Quality → Match Frequency icon below the Copy stage
v Second File → Data Set icon to the right of the Match Frequency stage
After linking the stages and files, you can adjust their location on the canvas.
3. Right-click the Sequential File icon and drag to create a link from the
Sequential File icon to the Standardize stage icon.
4. Drag links to the remaining stages like you did in step 3.
If the link is red, click to activate the link and drag it until it meets the stage. It
should turn black.
When all the icons on the canvas are linked, you can click on the stages and
drag them to change their positions.
5. Click on the names of the following stages and type the new stage name in the
highlighted box:

Stage Change to
SequentialFile Customer
Standardize stage Standardize
Transformer stage CreateAdditionalMatchColumns
Copy stage Copy
Data_Set file Stan
Match Frequency stage MatchFrequency
Data_Set file Frequencies

6. Right click on the names of the following links, select Rename and type the
new link name in the highlighted box:

Link Change to
From Customer to Standardize Input
From Standardize to CreateAdditionalMatchColumns Standardized
From CreateAdditionalMatchColumns to Copy ToCopy
From Copy to Stan StandardizedData
From Copy to MatchFrequency ToMatchFrequency
From MatchFrequency to Frequencies ToFrequencies

The following figure shows the Standardize job stages and links.


Lesson checkpoint
In this lesson, you learned how to set up a Standardize job. The Standardize stage
generates the type of data that can then be used in a match
job.

You set up and linked a Standardize job by doing the following tasks:
v Adding Data Quality and Processing stages to the Designer canvas
v Linking all the stages
v Renaming the links and stages

Lesson 2.2: Configuring the Standardize job stage properties


The properties for each of the stages in the Standardize job must be configured on
the Designer canvas.

Complete the following tasks to configure the Standardize job:


v Load the source data and metadata
v Add compliant rule sets for United States names and addresses
v Apply derivations to handle nulls
v Copy data to the two output links
v Create frequency data

Configuring the Customer file properties


To configure the Customer (source file) stage properties:
1. Double-click the Customer source file icon to open the Properties tab on the
Customer - Sequential File window.
2. Select the File property under the Source section of the window.
3. Click Source → File.


4. In the File field, click the browse button and go to the folder on the
server computer where the input data file resides.
5. Click input.csv. This is the source file the Standardize stage reads when the
job runs.
6. Click the Columns tab and click Load.
7. From the Table Definitions window, click the QualityStage Tutorial folder.
This folder was created when you imported the tutorial sample metadata.
8. Click Table Definitions → Tutorial → Input. The table definitions load into the
Columns tab of the Customer source file.
9. Click OK to close the Table Definitions window.
10. Click OK to close the Customer source file.

The source data is attached to the Customer source file and table definitions are
loaded to organize the data into standard address columns.

Configuring the Standardize stage


The Standardize stage applies rules to name and address data to parse the data
into a standard column format.

To configure the Standardize stage:


1. Double-click the Standardize stage icon to open the Standardize Stage
window.
2. Click the New Process tab to open the Standardize Rule Process window.
3. In the Rule Set field, click Standardization Rules → USA. These rule sets are
domain-specific for the Standardize job. You select them to create
consistent, industry-standard data structures and matching structures.
4. Open the USNAME folder.
a. Select the USNAME rule set and click OK. USNAME.SET is shown in the
Rule Set field. You select this rule set because the name and address
data is from the United States.
b. In the Available Columns pane, select Name.

c. Click the arrow button to move the Name column into the Selected Columns pane.
The Optional NAMES Handling field is activated.
d. Click OK.


5. Click the New Process tab to open the Standardize Rule Process window.
6. In the Rule Set field, click Standardization Rules → USA and select the
USADDR rule set.
7. Select the following column names in the Available Columns pane and move
them to the Selected Columns pane:
v AddressLine1
v AddressLine2
8. Click OK.
9. Click the New Process tab to open the Standardize Rule Process window.
10. In the Rule Set field, click Standardization Rules → USA and select the
USAREA rule set.
11. Select the following column names in the Available Columns pane and move
them to the Selected Columns pane:
v City
v State
v Zip5
v Zip4
12. Click OK.
13. Map the Standardize stage output columns and save the table definitions to
the Table Definitions folder.
a. Click the Stage Properties tab.
b. Click the Output → Mapping tab.
c. In the Columns pane, right-click and select Select All from the shortcut
menu.
d. Right-click and select Copy from the shortcut menu.
e. Move to the Standardized pane, right-click and select Paste Column from
the shortcut menu.
f. Click OK to close the window.
14. Create the identifier for the table definitions.
a. Click the Columns tab.


b. Click Save to open the Save Table Definitions window with the file name
displayed in the Table/file name field.
c. In the Data source type field, type Table Definitions.
d. In the Data source name field, type QualityStage.
e. Click OK.
f. In the Table/file name field, type Standardized.
g. Click OK.
h. Click Save to close the Save Table Definitions window.
i. Click OK to close the Standardize Stage window.

You configured the Standardize stage to apply the USNAME, USADDR, and
USAREA rule sets to the customer data and saved the table definitions.

Configuring the Transformer stage


The Transformer stage increases the number of columns that the matching stage uses to
select matches. The Transformer stage also applies derivations to handle nulls.

To configure the transformer properties:


1. Double-click the CreateAdditionalMatchColumns stage icon to open the
Transformer Stage window.
2. Right-click in the Standardized pane and select Select All from the shortcut
menu to highlight all the columns from the Standardize stage.
3. Right-click and select Copy from the shortcut menu.
4. Move to the ToCopy pane, right-click and select Paste Column from the
shortcut menu.
5. In the lower right section of the window, select the top row of the ToCopy
pane and add three columns with derivations to the
CreateAdditionalMatchColumns stage:
a. Right-click the row and select Insert row from the shortcut menu.
b. Add two more rows using the procedure explained in step a.
c. Right-click the top inserted row and select Edit row from the shortcut menu
to open the Edit Column Meta Data window.
d. In the Column name field, type MatchFirst1.
e. In the SQL type field, select VarChar.
f. In the Length field, select 1.
g. In the Nullable field, select Yes.
h. Click Apply, then click Close to close the window.
i. Right-click the next row and select Edit row from the shortcut menu.
j. In the Column name field, type HouseNumberFirstChar.
k. Repeat substeps e to h.
l. Right-click the last new row and select Edit row from the shortcut menu.
m. In the Column name field, type ZipCode3.
n. Repeat substeps e to h, except in the Length field, select 3.
6. Add derivations to the columns:
a. Double-click the Derivation cell for the MatchFirst1
column and type the derivation: If
isNull(Standardized.MatchFirstName_USNAME) then Setnull() Else
Standardized.MatchFirstName_USNAME[1,1]. This expression detects whether


the MatchFirstName column contains a null. If it does, it handles it. If it
contains a string, it extracts the first character and writes it to the
MatchFirst1 column.
b. Repeat substep a for the HouseNumberFirstChar column and type the
derivation: If isNull(Standardized.HouseNumber_USADDR) then Setnull()
Else Standardized.HouseNumber_USADDR[1,1].
c. Repeat substep a for the ZipCode3 column and type the derivation: If
isNull(Standardized.ZipCode_USAREA) then Setnull() Else
Standardized.ZipCode_USAREA[1,3].
7. Map the input columns for the three derivations:
a. Scroll the Standardized pane until you locate MatchFirstName_USNAME.
b. Click and drag the cell into the cell of the same name in the ToCopy pane.
c. When prompted to override existing data, click Yes.
d. Repeat substeps a and b for HouseNumber_USADDR and
ZipCode_USAREA.
e. Click OK to close the Transformer Stage window.
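Each derivation in this procedure follows one pattern: if the source column is null, propagate a null; otherwise keep a leading substring (QualityStage's [1,n] substring is 1-based and counts n characters). A Python equivalent of that pattern, with the sample values invented for illustration:

```python
# Each derivation above tests for null first, then keeps a leading
# substring: [1,1] is the first character, [1,3] the first three.
# Python equivalent of that null-then-substring pattern.

from typing import Optional

def leading_chars(value: Optional[str], n: int) -> Optional[str]:
    """Return the first n characters, or None when the input is null."""
    if value is None:
        return None
    return value[:n]

print(leading_chars("GEORGE", 1))      # MatchFirst1      -> G
print(leading_chars(None, 1))          # null stays null  -> None
print(leading_chars("60302-1234", 3))  # ZipCode3         -> 603
```

Truncated columns like these give the match stage short, stable keys even when the full values disagree in spelling.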

Configuring the Copy stage


The Copy stage duplicates data and writes it to more than one output link. In this
lesson, the Copy stage duplicates the metadata from the Transformer stage and
writes it to the Match Frequency stage and the target file.

The metadata from the Standardize and Transformer stages is duplicated and
written to two output links.

To configure the Copy stage:


1. Double-click the Copy stage icon to open the Copy Stage window.
2. Click the Output → Mapping tab.
3. Copy the data to the StandardizedData output link:
a. In the Output name field above the Columns pane, select
StandardizedData.
b. Right-click in the Columns pane and select Select All from the shortcut
menu.
c. Right-click and select Copy from the shortcut menu.
d. Move to the StandardizedData pane, right-click and select Paste Column
from the shortcut menu.
4. To copy the data to the ToMatchFrequency output link, repeat steps 2 and 3
except select ToMatchFrequency in the Output name field above the Columns
pane.
5. Click OK to copy the data and close the Copy stage.

Configuring the Match Frequency stage


The Match Frequency stage generates frequency information by using any data
that provides the columns that a match requires.

The Match Frequency stage processes frequency data independently from
executing a match. The output link of this stage carries four columns:
v qsFreqVal
v qsFreqCounts
v qsFreqColumnID


v qsFreqHeaderFlag
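The kind of per-column frequency data this stage produces can be sketched as a simple value count. The tuple fields below only echo the qsFreqVal, qsFreqCounts, and qsFreqColumnID column names listed above; the layout and sample rows are illustrative, not the stage's actual record format:

```python
# Sketch of frequency generation: count how often each value occurs in a
# column. The tuple fields echo the qsFreqVal, qsFreqCounts, and
# qsFreqColumnID columns; the layout is illustrative only.

from collections import Counter

def match_frequencies(rows, column):
    """Yield (value, count, column) tuples, most frequent first."""
    counts = Counter(row[column] for row in rows)
    for value, count in counts.most_common():
        yield (value, count, column)

rows = [{"State": "IL"}, {"State": "IL"}, {"State": "WI"}]
print(list(match_frequencies(rows, "State")))
# -> [('IL', 2, 'State'), ('WI', 1, 'State')]
```

Frequencies like these let a match weigh rare values (which are strong evidence of a match) more heavily than common ones.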

To configure the Match Frequency stage:


1. Double-click the Match Frequency stage icon to open the Match Frequency
Stage window.
2. Select the Do not use a Match Specification check box. At this point you do
not know which columns are used in the match specification.
3. Click the Stage Properties tab.
4. Click the Output → Mapping tab.
a. In the Output name field, select ToFrequencies.
b. Right-click in the Columns pane and select Select All from the shortcut
menu.
c. Right-click and select Copy from the shortcut menu.
d. Move to the ToFrequencies pane, right-click and select Paste Column from
the shortcut menu.
5. Create the Match Frequency table definitions:
a. Click the Columns tab.
b. Click Save. The Save Table Definitions window opens.
c. Click OK to open the Save Table Definitions As window.
d. Select QualityStage Tutorial → Table Definitions.
e. In the File name field, type ToFrequencies and click Save.
6. Click OK to close the Stage Properties window.
7. Click OK to close the stage.

Lesson checkpoint
This lesson explained how to configure the source file and all the stages for the
Standardize job.

You have now applied settings to each stage and mapped the output files to the
next stage for the Standardize job. You learned how to do the following tasks:
v Configure the source file to load the customer data and metadata
v Apply United States Postal Service-compliant rule sets to the customer name and
address data
v Add additional columns for matching and create derivations to handle nulls
v Write data to two output links and associate the data to the correct links
v Create frequency data

Lesson 2.3: Configuring the target data sets


The two target data sets in the Standardize job store the standardized and
frequency data that you can use as source data in the Unduplicate Match job.

Complete the following tasks to configure the target data sets:


v Attach the file to the Stan target data set
v Attach the file to the Frequencies data set

To configure the target data sets:


1. Double-click the Stan target data set icon to open the Data Set window.
2. Click Input → Properties and select Target → File.


3. In the File field, click the browse button and go to the folder on the server
computer where the input data file resides.
4. In the File name field, type Stan to display the path and file name in the File
field (for example, C:\IBM\InformationServer\Server\Projects\tutorial\Stan).
5. Save the Table Definitions.
a. Click the Columns tab.
b. Click Save to open the Save Table Definitions window.
c. In the Data source type field, type Table Definitions.
d. In the Data source name field, type StandardizedData1.
e. In the Table/file name field, type StandardizedData1.
f. Click OK to open the Save Table Definition As window.
g. Click Save to save the table definition and close the Save Table Definition
As window.
h. Click OK to close the stage window.
6. Double-click the Frequencies target data set icon.
7. Repeat steps 2 through 5 for the Frequencies file except replace
StandardizedData1 with ToFrequencies1 in the appropriate fields. The Stan
file and the Frequencies file are the source data sets for the Unduplicate Match
job.
8. Click File → Save to save the Standardize job.

9. Click the Compile button to compile the job in the Designer client.
10. Click the Run button to run the job.

The job standardizes the data according to applied rules and adds additional
matching columns to the metadata. The data is written to two target data sets as
the source files for a later job.

Lesson checkpoint
This lesson explained how to attach files to the target data sets to store the
processed standardized customer name and address data and frequency data.

You have configured the Stan and Frequencies target data set files to accept the
data when it is processed.

Module 2: Summary
In Module 2, you set up and configured a Standardize job.

Running a Standardize job conforms the data to ensure that all the customer name
and address data has the same content and format. The Standardize job loads the
name and address source data stored in the database of the bank and adds table
definitions to organize the data into a format that can be analyzed by the rule sets.
Further processing by the Transformer stage increases the number of columns and
frequency data is generated for input into the match job.


Lessons learned

By completing this module, you learned about the following concepts and tasks:
v How to create standardized data to match records effectively
v How to run DataStage and Data Quality stages together in one job
v How to apply country or region-specific rule sets to analyze the address data
v How to use derivations to handle nulls
v How to create the data that can be used as source data in a later job


Chapter 6. Module 3: Grouping records with common
attributes
This module explains how to set up and process an Unduplicate job to use
standardized data and frequency data to match records and remove duplicate
records.

The Unduplicate Match stage is one of two stages that match records while
identifying duplicates and residuals. The other matching stage is the Reference
Match stage.

The Unduplicate Match stage groups records that share common attributes. The
Match specification that you apply was configured to separate all records with
weights above a certain match cutoff as duplicates. The master record is then
identified by selecting the record within the set that matches to itself with the
highest weight.

Any records that are not part of a set of duplicates are residuals. These records,
along with the master records, are used for the next pass. Duplicates are not
included because you want them to belong to only one set.

Using a matching stage ensures data integrity because you are applying
probabilistic matching technology. This technology is applied to any relevant
attribute for evaluating the columns, parts of columns, or individual characters that
you define. In addition, you can apply agreement or disagreement weights to key
data elements.
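As a rough sketch of the grouping logic described above: records that share a key form a set, the highest-weight record becomes the master, the rest become duplicates, and records outside any set are residuals. The key and weight functions below are placeholders, not probabilistic matching:

```python
# Simplified sketch of unduplication. The key and weight functions are
# placeholders for real blocking and probabilistic match weighting.

from collections import defaultdict

def unduplicate(records, key, weight, cutoff):
    """Split records into (masters, duplicates, residuals)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    masters, duplicates, residuals = [], [], []
    for group in groups.values():
        scored = [r for r in group if weight(r) >= cutoff]
        if len(scored) <= 1:
            residuals.extend(group)     # no set formed: residual records
            continue
        scored.sort(key=weight, reverse=True)
        masters.append(scored[0])       # highest weight becomes master
        duplicates.extend(scored[1:])
    return masters, duplicates, residuals

recs = [
    {"id": 1, "name": "JOHN SMITH", "w": 0.9},
    {"id": 2, "name": "JOHN SMITH", "w": 0.7},
    {"id": 3, "name": "MARY JONES", "w": 0.8},
]
masters, dups, residuals = unduplicate(
    recs, key=lambda r: r["name"], weight=lambda r: r["w"], cutoff=0.5)
print([r["id"] for r in masters], [r["id"] for r in dups],
      [r["id"] for r in residuals])  # -> [1] [2] [3]
```

In the real stage, the weight comes from comparing column values pair by pair against the cutoffs in the Match specification rather than from a stored per-record score.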

Learning objectives

After completing the lessons in this module, you should know how to do the
following tasks:
1. Add DataStage links and stages to a job
2. Add standardize and frequencies data as the source files
3. Configure stage properties to specify which action they take when the job is
run
4. Remove duplicate addresses after the first pass
5. Apply a Match specification to determine how matches are selected
6. Funnel the common attribute data to a separate target file

This module should take approximately 30 minutes to complete.

Lesson 3.1: Setting up an Unduplicate Match job


Sorting records into related attributes is the next step in data cleansing. In this
lesson, you add the Data Quality Unduplicate Match stage and a Funnel stage to
match records and remove duplicates.

If you have not already done so, open the Designer client.

As you learned in the previous module, you must add stages and links to the
Designer canvas to create an Unduplicate Match job. The Standardize job you just


completed created a Stan data set and a Frequencies data set. The information
from these data sets is used as the input data when you design the Unduplicate
Match job.

To set up an Unduplicate Match job:


1. From the left pane of the Designer, go to the MyTutorial folder you created for
this tutorial and double-click Unduplicate1 to open the job.
2. Drag the following icons to the Designer canvas from the palette.
v Data Quality → Unduplicate Match icon to the middle of the canvas.
v File → Data Set icon to the top left of the Unduplicate Match icon.
v A second File → Data Set icon to the lower left of the Unduplicate Match
icon.
v Processing → Funnel icon to the upper right of the Unduplicate Match icon.
v Three File → Sequential File icons, one to the right of the Funnel stage and
the other two to the right of the Unduplicate Match stage.
3. Right-click the top Data Set icon and drag to create a link from this data set to
the Unduplicate Match stage.
4. Drag links to the remaining stages as explained in step 3. Drag two links from the
Unduplicate Match stage to the Funnel stage.
5. Click on the names of the following stages and type the new stage name in the
highlighted box:

Stage Change to
top left Data Set Frequencies
lower left Data Set Standardized
Unduplicate Match Unduplicate
Funnel CollectMatched
top right Sequential File MatchedOutput_csv
middle right Sequential File ClericalOutput_csv
lower right Sequential File NonMatchedOutput_csv

6. Right click on the names of the following links, select Rename from the
shortcut menu and type the new link name in the highlighted box:

Links Change to
From Frequencies to Unduplicate FrequencyData
From Standardized to Unduplicate StandardizedData
From Unduplicate to CollectMatched MatchedData
From Unduplicate to CollectMatched Duplicates
From CollectMatched to MatchedOutput_csv MatchedOutput
From Unduplicate to ClericalOutput_csv Clerical
From Unduplicate to NonMatchedOutput_csv NonMatched

7. Click File → Save to save the job.


Lesson checkpoint for the Unduplicate Match job
In this lesson, you learned how to set up an Unduplicate Match job. During the
processing of this job, the records are matched using the Match specification
created for this tutorial. The records are then sorted according to their attributes
and written to a variety of output links.

You set up and linked an Unduplicate Match job by doing the following tasks:
v Adding Data Quality and Processing stages to the Designer canvas
v Linking all the stages
v Renaming the links and stages with appropriate names

Lesson 3.2: Configuring the Unduplicate Match job stage properties


Configure the properties for each stage of the Unduplicate Match job on the
Designer canvas.

Complete the following tasks to configure the Unduplicate Match job:


v Load data and metadata for two source files
v Apply a Match specification to the Unduplicate Match job and select output
links
v Combine unsorted records

To configure the Frequencies and Standardized data sets:


1. Double-click the Frequencies data set icon to open the Properties tab on the
Frequencies - Data Set window.
2. Click Source → File.

3. In the File field, click the browse button and go to the folder on the
server computer where the input data file resides.


4. In the File name field, type Frequencies. (For example, C:\IBM\
InformationServer\Server\Projects\tutorial\Frequencies).
5. Click OK to close the stage.
6. Click the Columns tab and click Load. The Table Definitions window opens.
7. Click the QualityStage Tutorial → Table Definitions → ToFrequencies1 file.
The table definitions load into the Columns tab of the source file.
8. Click OK at the Select Columns window.
9. Click OK to close the Frequencies - Data Set window.
10. Double-click the Standardized data set icon.
11. Repeat steps 2 to 9 except type Stan in step 4 and select the
StandardizedData1 file in step 7.

The data from the Standardize job is loaded into the source files for the
Unduplicate Match job.

Configuring the Unduplicate Match stage


The Unduplicate Match stage groups records with common attributes.

To configure the Unduplicate Match stage:


1. Double-click the Unduplicate stage icon.

2. Click the Match Specification button.


3. From the Repository window, click the Match Specifications folder.
4. Right-click on NameandAddress and select Provision All from the shortcut
menu.
5. Click OK to attach the Unduplicate Match specification for the tutorial.
6. Click the check boxes for the following Match Output options:
v Match - Sends matched records as output data.
v Clerical - Separates those records that require clerical review.
v Duplicate - Includes duplicate records that are above the match cutoff.
v Residual - Separates records that are not duplicates as residuals.
7. Keep the default Dependent setting in the Dependent Match Type field. After
the first pass is run, duplicates are removed with every additional pass.
8. Click the Stage Properties → Link Ordering tab. Make sure the links are
displayed in the following order. If necessary, use the up and down arrow
buttons to move the links into the correct order.

Link label Link name


Match MatchedData
Clerical Clerical
Duplicate Duplicates
Residual NonMatched

9. Click the Output → Mapping tab and map the following columns to the
correct links:
a. In the Output name field above the Columns pane, select MatchedData.
b. Right-click in the Columns pane and select Select All from the shortcut
menu.


c. Right-click and select Copy from the shortcut menu.
d. Move to the MatchedData pane, right-click and select Paste Column from
the shortcut menu.
e. Select Duplicates from the Output name field above the Columns pane.
f. Repeat steps b through d for the Duplicates data.
g. Select Clerical from the Output name field above the Columns pane.
h. Repeat steps b through d for the Clerical data.
i. Select NonMatched from the Output name field above the Columns pane.
j. Repeat steps b through d for the Nonmatched data.
10. Click OK to close the Stage Properties window.
11. Click OK to close the stage.

Configuring the Funnel stage


The Funnel stage combines records when they are received in an unordered
format.

To configure the Funnel stage:


1. Double-click the CollectMatched stage icon and click the Stage → Properties
tab.
2. In the Options tree, select Funnel Type.
3. In the Funnel Type field, select Sequence from the drop-down list.
4. Click the Stage → Advanced tab.
5. In the Execution mode field, select Sequential.
6. Click the Input → Partitioning tab. Set the Sort function from this page.
a. Click Collector type → Sort Merge.
b. In the Sorting section of the window, click Perform sort, then click Stable
to preserve any previously sorted data sets.
c. In the Available pane, select the qsMatchSetID sort key.

7. Click the Output → Mapping tab.
8. Right-click in the Columns pane and select Select All from the shortcut menu.
9. Right-click and select Copy from the shortcut menu.
10. Move to the MatchedOutput column, right-click and select Paste Column
from the shortcut menu.
11. Click OK to close the stage window.
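The sort-merge collection configured above can be sketched in a few lines. This is an assumed-behavior illustration, not DataStage internals: Python's `heapq.merge` stands in for the Sort Merge collector, and its stable handling of equal keys mirrors the Stable sort option chosen in step 6b.

```python
# A sketch (assumed behavior, not DataStage internals) of the collection this
# stage performs: merge the sorted Matched and Duplicates inputs into one
# stream ordered by the qsMatchSetID sort key. heapq.merge keeps equal keys
# in input order, which mirrors the Stable sort option. Sample data invented.
import heapq

matched = [{"qsMatchSetID": 1, "name": "JOHN DOE"},
           {"qsMatchSetID": 3, "name": "MARY ROE"}]
duplicates = [{"qsMatchSetID": 1, "name": "J DOE"},
              {"qsMatchSetID": 2, "name": "ANN LEE"}]

combined = list(heapq.merge(matched, duplicates,
                            key=lambda rec: rec["qsMatchSetID"]))
print([rec["qsMatchSetID"] for rec in combined])  # [1, 1, 2, 3]
```

Keeping master and duplicate records adjacent within each qsMatchSetID group is what lets the Survive job later process each group as a unit.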

Lesson 3.2 checkpoint


In Lesson 3.2, you configured the source files and stages of the Unduplicate Match
job.

You learned how to do the following tasks:


v Load data and metadata generated in a previous job
v Apply a Match specification to process the data according to matches and
duplicates
v Combine records into a single file

Lesson 3.3: Configuring Unduplicate job target files


To configure the Unduplicate target files, you attach a file name to each output
link. The records in the MatchedOutput file become the source records for the
next job.

To configure the target files:


1. Double-click the MatchedOutput_csv icon to open the Properties tab on the
MatchedOutput_csv - Sequential File window. You are attaching a file name to
the match records.
2. Click Target → File.

3. In the File field, click the browse button and browse to the folder on the
server computer where the input data file resides.
4. In the File name field, type MatchedOutput.csv to display the path and file
name in the File field (for example, C:\IBM\InformationServer\Server\
Projects\tutorial\MatchedOutput.csv).
5. Click the Formats tab, right-click and select Field Defaults from the menu.
6. Click Add sub-property from the menu.
7. Click Null field value and type double quotes (no spaces) in the Null field
value field.
8. Save the Table Definitions.
a. Click the Columns tab.
b. Click Save to open Save Table Definitions window.
c. In the Data source type field, type Table Definitions.
d. In the Data source name field, type MatchedOutput1.
e. In the Table/file name field, type MatchedOutput1.
f. Click OK to open the Save Table Definition As window.
g. Click Save to save the table definition and close the Save Table Definition
As window.
h. Click OK to close the stage window.
9. Repeat steps 1 through 8 for each of the following target files:

v For the ClericalOutput_csv file, type \ClericalOutput.csv.
v For the NonMatchedOutput_csv file, type \NonMatchedOutput.csv.
10. Click File → Save to save the job.

11. Click the Compile button to compile the job in the Designer client.


12. Click Tools → Run Director to open the DataStage Director. The Director opens
with the Unduplicate Match job visible in the Director window with the
Compiled status.
13. Click Run.

You configured the target files.
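The Null field value format option set in steps 5 through 7 can be illustrated with a short sketch: when a column value is null, the Sequential File stage writes the literal "" (two double quotation marks) in its place. The rows below are invented sample data, not the tutorial's real records.

```python
# A sketch of the "Null field value" format option: a null column value is
# written as the literal "" in the output file. Rows are invented sample data.
rows = [["JOHN DOE", "555-0100"],
        ["MARY ROE", None]]        # null phone number

lines = [",".join('""' if field is None else field for field in row)
         for row in rows]
print("\n".join(lines))
# JOHN DOE,555-0100
# MARY ROE,""
```

Writing an explicit marker for nulls keeps the column count stable in every row, so the file can be reloaded without ambiguity between "empty" and "missing".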

Lesson checkpoint
In this lesson, you combined the matched and duplicate address records into one
file. The nonmatched and clerical output records were separated into individual
files. The clerical output records can be reviewed manually for matching records.
The nonmatched records are used in the next pass. The matched and duplicate
address records are used in the Survive job.

You learned how to separate the output records from the Unduplicate Match stage
to the various target files.

Module 3: Summary
In Module 3, you set up and configured an Unduplicate stage job to isolate
matched and duplicate name and address data into one file.

In creating an Unduplicate stage job, you added a Match specification to apply the
blocking and matching criteria to the standardized and frequency data created in
the Standardize job. After applying the Match specification, the resulting records
were sent out through four output links, one for each type of record. The matches
and duplicates were sent to a Funnel stage that combined the records into one
output, which was written to a file. The unmatched, or residual, records were sent
to a file, as were the clerical output records.

Lessons learned

By completing Module 3, you learned about the following concepts and tasks:
v How to apply a Match specification to the Unduplicate stage
v How the Unduplicate stage groups records with similar attributes
v How to ensure data integrity by applying probability matching technology

Chapter 7. Module 4: Creating a single record
In this module, you design a Survive job that isolates the best record for the name
and address of each customer.

The Unduplicate job identifies groups of records with similar attributes. In the
Survive job, you specify which columns and column values from each group
create the output record for the group. The output record can include the
following information:
v An entire input record
v Selected columns from the record
v Selected columns from different records in the group

Select column values based on rules for testing the columns. A rule contains a set
of conditions and a list of targets. If a column tests true against the conditions, the
column value for that record becomes the best candidate for the target. After
testing each record in the group, the columns declared best candidates combine to
become the output record for the group. Column survival is determined by the
target. Column value survival is determined by the rules.

Learning objectives

After completing the lessons in this module, you should know how to do the
following tasks:
1. Add stages and links to a Survive job
2. Choose the selected column
3. Add the rules
4. Map the output columns

This module should take approximately 20 minutes to complete.

Lesson 4.1: Setting up a Survive job


The Survive job is the last job in the data cleansing process. It creates the best
results record: the name and address with the highest probability of being correct
for each bank customer.

In this lesson, you add the Data Quality Survive stage, the source file of combined
data from the Unduplicate Match job, and the target file for the best records.

To set up a Survive job:


1. From the left pane of the Designer client, go to the MyTutorial folder you
created for this tutorial and double-click Survive1 to open the job.
2. Drag the following icons to the Designer canvas from the palette:
v Data Quality → Survive icon to the middle of the canvas
v File → Sequential File icon to the left of the Survive stage
v A second File → Sequential File icon to the right of the Survive stage
3. Right-click the left Sequential File icon and drag a link to the Survive stage.
4. Drag a second link from the Survive stage to the output Sequential File icon.

5. Click on the names of the following stages and type the new stage name in the
highlighted box:

Stage Change to
left Sequential file MatchedOutput
Survive stage Survive
right Sequential file Survived_csv

6. Right-click the names of the following links, select Rename from the
shortcut menu, and type the new link name in the highlighted box:

Links Change to
From MatchedOutput to Survive Matchesandduplicates
From Survive to Survived_csv Survived

Lesson checkpoint
In this lesson, you learned how to set up a Survive job by adding as source data
the results of the Unduplicate Match job, the Survive stage, and the target file as
the output record for the group.

You have learned that the Survive stage takes one input link and one output link.

Lesson 4.2: Configuring Survive job stage properties


To configure Survive job stage properties, load matched and duplicates data from
the Unduplicate Match job, configure the Survive stage with rules that test
columns to a set of conditions, and configure the target file.

In the Survive job, you are testing column values to determine which columns are
the best candidates for that record. These columns are combined to become the
output record for the group. In selecting a best candidate, you can specify that
these column values be tested:
v Record creation date
v Data source
v Length of data in a column
v Frequency of data in a group

To configure the source file:


1. Double-click the MatchedOutput file icon to access the Properties page.
2. Click File → Source.

3. In the File field, click the browse button and browse to the path name of the
folder on the server computer where the input data file resides.
4. Click on the MatchedOutput.csv file.

5. Click the Columns tab and click Load. The Table Definitions window opens.
6. Click the QualityStage Tutorial → Table Definitions → MatchedOutput1 file.
The table definitions load into the Columns tab of the source file.
7. Click OK to close the MatchedOutput window.
8. Click OK to close the MatchedOutput file stage.

You attached the MatchedOutput.csv file and loaded the Table Definitions into
the MatchedOutput file.

Configuring the Survive stage


Configure the Survive stage with rules to compare the columns against a best case.

To configure the Survive stage:


1. Double-click the Survive stage icon.
2. Click New Rule to open the Survive Rules Definition window. The Survive
stage requires a rule that contains one or more targets and a TRUE condition
expression.
Define the rule by specifying the following elements:
v Target column or columns
v Column to analyze
v Technique to apply to the column being analyzed

3. In the AllColumns pane, select AllColumns and click the arrow button to move
AllColumns to the Target column. When you select AllColumns, you are
assigning the first record in the group as the best record.
4. From the Survive Rule (Pick one) section of the window, click the Analyze
Column tab and select qsMatchType from the Use Target drop-down menu.
You are selecting qsMatchType as the target to which to compare other
columns.
5. In the Technique drop-down menu, click Equals.
6. In the Data field, type MP.
7. Click OK to close the Survive Rule Definition window.
8. Follow steps 2 to 5 to add the following columns and rules. Do not enter
values in the Data field.

Targets               Analyze Column        Technique

GenderCodeUSNAME      GenderCodeUSNAME      Most Frequent (Non-blank)
FirstNameUSNAME       FirstNameUSNAME       Most Frequent (Non-blank)
MiddleNameUSNAME      MiddleNameUSNAME      Longest
PrimaryNameUSNAME     PrimaryNameUSNAME     Most Frequent (Non-blank)

You can view the rules you added in the Survive grid.
9. In the Select the group identification data column section, select
qsMatchSetID from the Selected Column list.
10. Click the Stage Properties → Output → Mapping tab.
11. Right-click in the Columns pane and select Select All from the shortcut menu.
12. Select Copy from the shortcut menu.
13. Move to the Survived pane, right-click and select Paste Column from the
shortcut menu.

14. Click OK to close the Mapping tab.
15. Click OK to close the Survive window.
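The rules defined above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Survive stage's actual algorithm: the column names and the MP match-type value come from the tutorial steps, but the group data and the "DA" duplicate code are invented, and the selection logic is reduced to its essentials.

```python
# A simplified illustration (not the Survive stage's algorithm) of the rules
# defined above, applied to one invented group of matched records.
from collections import Counter

group = [
    {"qsMatchType": "MP", "FirstName": "JON",  "MiddleName": "Q"},
    {"qsMatchType": "DA", "FirstName": "JOHN", "MiddleName": "QUINCY"},
    {"qsMatchType": "DA", "FirstName": "JOHN", "MiddleName": ""},
]  # "DA" is a hypothetical code marking a duplicate record

def most_frequent_nonblank(values):
    """The Most Frequent (Non-blank) technique: commonest non-empty value."""
    nonblank = [v for v in values if v.strip()]
    return Counter(nonblank).most_common(1)[0][0]

def longest(values):
    """The Longest technique: the value with the most characters."""
    return max(values, key=len)

# The AllColumns rule (qsMatchType Equals "MP") seeds the survivor with the
# whole master record...
survivor = dict(next(r for r in group if r["qsMatchType"] == "MP"))
# ...and the per-column rules then overwrite individual column values.
survivor["FirstName"] = most_frequent_nonblank([r["FirstName"] for r in group])
survivor["MiddleName"] = longest([r["MiddleName"] for r in group])
print(survivor["FirstName"], survivor["MiddleName"])  # JOHN QUINCY
```

The point of the sketch is the two-level behavior: the Equals rule decides which record seeds the survivor, and each column rule then lets a better value from any record in the group replace that column.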

Configuring the target file


You are configuring the target file for the Survive stage.
1. Double-click the Survived_csv target file icon and click Target → File to activate
the File field.

2. In the File field, click the browse button and browse to the folder on the server
computer where the input data file resides.
3. In the File name field, type record.csv to display the path and file name in
the File field (for example, C:\IBM\InformationServer\Server\Projects\
tutorial\record.csv).
4. Click File → Save to save the job.

5. Click the Compile button to compile the job in the Designer client.


6. Click Tools → Run Director to open the DataStage Director. The Director opens
with the Survive job visible in the Director window with the Compiled
status.
7. Click Run.

Lesson checkpoint
You have set up the Survive job, renamed the links and stages, and configured the
source file, the target file, and the Survive stage.

In Lesson 4.2, you learned how to select simple rules, which are then applied to
selected columns. Each rule is then evaluated against all records in a group to
find the best record.

Module 4: Summary
In Module 4, you completed the last job in the IBM WebSphere QualityStage work
flow. In this module, you set up and configured the Survive job to select the best
record from the matched and duplicates name and address data that you created
in the Unduplicate Match stage.

In configuring the Survive stage, you selected a rule, included columns from the
source file, added a rule to each column and applied the data. After the Survive
stage processed the records to select the best record, the information was sent to
the output file.

Lessons learned

In completing Module 4, you learned about the following tasks and concepts:
v How to use the Survive stage to create the best candidate in a record
v How to apply simple rules to the column values

Chapter 8. WebSphere QualityStage Tutorial: summary
From the lessons in this tutorial, you learned how QualityStage can be used to
help an organization manage and maintain its data quality. It is imperative for
companies that their customer data be high quality; thus it needs to be up-to-date,
complete, accurate, and easy to use.

The tutorial presented a common business problem which was to verify customer
names and addresses, and showed the steps to take by using QualityStage jobs to
reconcile the various names that belonged to one household. The tutorial presented
four modules that covered the four jobs in the QualityStage work flow. These jobs
perform the following tasks:
v Investigating data to identify errors and validate the contents of fields in a data
file
v Conditioning data to ensure that the source data is internally consistent
v Matching data to identify all records in one file that correspond to similar
records in another file
v Identifying which records from the match data survive to create a best candidate
record

Lessons learned

By completing this tutorial, you learned about the following concepts and tasks:
v About the QualityStage work flow
v How to set up a QualityStage job
v How data created in one job is the source for the next job
v How to create quality data by using QualityStage

Product documentation
Documentation is provided in a variety of locations and formats, including in help
that is opened directly from the product interface, in a suite-wide information
center, and in PDF file books.

The information center is installed as a common service with IBM Information
Server. The information center contains help for most of the product interfaces, as
well as complete documentation for all product modules in the suite.

A subset of the product documentation is also available online from the product
documentation library at publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/index.jsp.

PDF file books are available through the IBM Information Server software installer
and the distribution media. A subset of the information center is also available
online and periodically refreshed at www.ibm.com/support/docview.wss?rs=14&uid=swg27008803.

You can also order IBM publications in hardcopy format online or through your
local IBM representative.

To order publications online, go to the IBM Publications Center at
www.ibm.com/shop/publications/order.

You can send your comments about documentation in the following ways:
v Online reader comment form: www.ibm.com/software/data/rcf/
v E-mail: [email protected]

Contacting IBM
You can contact IBM for customer support, software services, product information,
and general information. You can also provide feedback on products and
documentation.

Customer support

For customer support for IBM products and for product download information, go
to the support and downloads site at www.ibm.com/support/us/.

You can open a support request by going to the software support service request
site at www.ibm.com/software/support/probsub.html.

My IBM

You can manage links to IBM Web sites and information that meet your specific
technical support needs by creating an account on the My IBM site at
www.ibm.com/account/us/.

Software services

For information about software, IT, and business consulting services, go to the
solutions site at www.ibm.com/businesssolutions/us/en.

IBM Information Server support

For IBM Information Server support, go to
www.ibm.com/software/data/integration/support/info_server/.

General information

To find general information about IBM, go to www.ibm.com.

Product feedback

You can provide general product feedback through the Consumability Survey at
www.ibm.com/software/data/info/consumability-survey.

Documentation feedback

You can click the feedback link in any topic in the information center to comment
on the information center.

You can also send your comments about PDF file books, the information center, or
any other documentation in the following ways:
v Online reader comment form: www.ibm.com/software/data/rcf/
v E-mail: [email protected]

How to read syntax diagrams
The following rules apply to the syntax diagrams that are used in this information:
v Read the syntax diagrams from left to right, from top to bottom, following the
path of the line. The following conventions are used:
– The >>--- symbol indicates the beginning of a syntax diagram.
– The ---> symbol indicates that the syntax diagram is continued on the next
line.
– The >--- symbol indicates that a syntax diagram is continued from the
previous line.
– The --->< symbol indicates the end of a syntax diagram.
v Required items appear on the horizontal line (the main path).

>>-required_item-----------------------------------------------><

v Optional items appear below the main path.

>>-required_item--+---------------+----------------------------><
                  '-optional_item-'

If an optional item appears above the main path, that item has no effect on the
execution of the syntax element and is used only for readability.

   .-optional_item-.
>>-required_item-----------------------------------------------><

v If you can choose from two or more items, they appear vertically, in a stack.
If you must choose one of the items, one item of the stack appears on the main
path.

>>-required_item--+-required_choice1-+-------------------------><
                  '-required_choice2-'

If choosing one of the items is optional, the entire stack appears below the main
path.

>>-required_item--+------------------+-------------------------><
                  +-optional_choice1-+
                  '-optional_choice2-'

If one of the items is the default, it appears above the main path, and the
remaining choices are shown below.

                  .-default_choice---.
>>-required_item--+------------------+-------------------------><
                  +-optional_choice1-+
                  '-optional_choice2-'

v An arrow returning to the left, above the main line, indicates an item that can be
repeated.

                  .-----------------.
                  V                 |
>>-required_item----repeatable_item-+--------------------------><

If the repeat arrow contains a comma, you must separate repeated items with a
comma.

                  .-,---------------.
                  V                 |
>>-required_item----repeatable_item-+--------------------------><

A repeat arrow above a stack indicates that you can repeat the items in the
stack.
v Sometimes a diagram must be split into fragments. The syntax fragment is
shown separately from the main syntax diagram, but the contents of the
fragment should be read as if they are on the main path of the diagram.

>>-required_item--| fragment-name |----------------------------><

Fragment-name:

|--required_item--+---------------+----------------------------|
                  '-optional_item-'

v Keywords, and their minimum abbreviations if applicable, appear in uppercase.
They must be spelled exactly as shown.
v Variables appear in all lowercase italic letters (for example, column-name). They
represent user-supplied names or values.
v Separate keywords and parameters by at least one space if no intervening
punctuation is shown in the diagram.
v Enter punctuation marks, parentheses, arithmetic operators, and other symbols,
exactly as shown in the diagram.
v Footnotes are shown by a number in parentheses, for example (1).

Product accessibility
You can get information about the accessibility status of IBM products.

The IBM Information Server product modules and user interfaces are not fully
accessible. The installation program installs the following product modules and
components:
v IBM Information Server Business Glossary Anywhere
v IBM Information Server FastTrack
v IBM Metadata Workbench
v IBM WebSphere Business Glossary
v IBM WebSphere DataStage and QualityStage
v IBM WebSphere Information Analyzer
v IBM WebSphere Information Services Director

For more information about a product’s accessibility status, go to
http://www.ibm.com/able/product_accessibility/index.html.

Accessible documentation

Accessible documentation for IBM Information Server products is provided in an
information center. The information center presents the documentation in XHTML
1.0 format, which is viewable in most Web browsers. XHTML allows you to set
display preferences in your browser. It also allows you to use screen readers and
other assistive technologies to access the documentation.

Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user’s responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing


IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.

For license inquiries regarding double-byte character set (DBCS) information,
contact the IBM Intellectual Property Department in your country or send
inquiries, in writing, to:

IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan

The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply
to you.

This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. IBM may make improvements
and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those Web
sites. The materials at those Web sites are not part of the materials for this IBM
product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it
believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:

IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.

Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.

The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.

Any performance data contained herein was determined in a controlled
environment. Therefore, the results obtained in other operating environments may
vary significantly. Some measurements may have been made on development-level
systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been
estimated through extrapolation. Actual results may vary. Users of this document
should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of
those products, their published announcements or other publicly available sources.
IBM has not tested those products and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products.

All statements regarding IBM’s future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.

This information is for planning purposes only. The information herein is subject to
change before the products described become available.

This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to
IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating
platform for which the sample programs are written. These examples have not
been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
imply reliability, serviceability, or function of these programs.

Each copy or any portion of these sample programs or any derivative work, must
include a copyright notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp.
Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights
reserved.

If you are viewing this information softcopy, the photographs and color
illustrations may not appear.

Trademarks
IBM trademarks and certain non-IBM trademarks are marked on their first
occurrence in this information with the appropriate symbol.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
International Business Machines Corporation in the United States, other countries,
or both. If these and other IBM trademarked terms are marked on their first
occurrence in this information with a trademark symbol (® or ™), these symbols
indicate U.S. registered or common law trademarks owned by IBM at the time this
information was published. Such trademarks may also be registered or common
law trademarks in other countries. A current list of IBM trademarks is available on
the Web at ″Copyright and trademark information″ at www.ibm.com/legal/
copytrade.shtml.

The following terms are trademarks or registered trademarks of other companies:

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered
trademarks or trademarks of Adobe Systems Incorporated in the United States,
and/or other countries.

IT Infrastructure Library is a registered trademark of the Central Computer and
Telecommunications Agency, which is now part of the Office of Government
Commerce.

Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo,
Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or
registered trademarks of Intel Corporation or its subsidiaries in the United States
and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other
countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.

ITIL is a registered trademark and a registered community trademark of the Office
of Government Commerce, and is registered in the U.S. Patent and Trademark
Office.

UNIX is a registered trademark of The Open Group in the United States and other
countries.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the
United States, other countries, or both and is used under license therefrom.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the
United States, other countries, or both.

The United States Postal Service owns the following trademarks: CASS, CASS
Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service,
USPS and United States Postal Service. IBM Corporation is a non-exclusive DPV
and LACSLink licensee of the United States Postal Service.

Other company, product, or service names may be trademarks or service marks of
others.
