Lab3 - Cleansing Data
Lab 3
SQL Server Data Quality Services
December 2023
Objectives
At the end of this lab, you must deliver a report of your findings and observations. Take a step back
to understand the general steps for creating and using a knowledge base (KB) for data cleansing.
DQS Projects
DQS lets you create either a cleansing or a matching DQS project. Creating a DQS project is quite
simple; a wizard guides you through the individual steps. During cleansing and matching, you can also
see the profiling results for your data. DQS profiling provides you with two data quality dimensions:
completeness and accuracy. Based on data profiling information and defined notifications, you can also get
warnings when a threshold is met.
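To make the completeness dimension concrete, the following is a minimal T-SQL sketch, not DQS's own implementation. It assumes that a value counts as complete when it is neither NULL nor an empty string, and it uses a handful of hypothetical sample rows rather than the lab data.

```sql
-- Hypothetical sample rows; these values are NOT taken from the lab data.
WITH SampleRows (CustomerKey, EmailAddress) AS
(
    SELECT CustomerKey, EmailAddress
    FROM (VALUES
            (1, N'a@example.com'),
            (2, NULL),               -- missing value
            (3, N''),                -- empty string, treated as missing here
            (4, N'b@example.com')
         ) AS V (CustomerKey, EmailAddress)
)
SELECT COUNT(*) AS TotalRows,
       -- Count values that are neither NULL nor blank.
       SUM(CASE WHEN NULLIF(LTRIM(RTRIM(EmailAddress)), N'') IS NOT NULL
                THEN 1 ELSE 0 END) AS FilledRows,
       -- Assumed definition: completeness = filled values / all values.
       100.0 * SUM(CASE WHEN NULLIF(LTRIM(RTRIM(EmailAddress)), N'') IS NOT NULL
                        THEN 1 ELSE 0 END) / COUNT(*) AS CompletenessPct
FROM SampleRows;
```

Here two of the four sample values are filled in, so the query reports a completeness of 50 percent. Accuracy cannot be computed this simply, because it depends on the domain values and rules stored in the knowledge base.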
Once a project is created in DQS, you can perform some management activities on it:
• Opening an existing data quality project. You use the Open Data Quality Project button of the Data Quality
Client tool to display a grid of existing DQS projects.
• Unlocking a DQS project. In the grid of existing DQS projects, you right-click a project and select the
Unlock option. A project stays locked when someone started editing it and did not finish the edit.
• Renaming a DQS project. You can right-click a project and rename it.
• Deleting a DQS project. In the same grid of existing projects, you right-click a project and select
the Delete option.
SQL SERVER - DATA QUALITY SERVICES LAB
DQS uses knowledge bases for automatic, computer-assisted data cleansing. After the automatic process is
done, you can manually review and additionally edit the processed data. When you finish with editing, you
can export the cleansed data. In addition, you can use the results of a DQS project to improve a knowledge
base—for example, by adding new correct or corrected values to existing domain values in the knowledge
base.
A DQS knowledge base must exist before you can start a DQS cleansing or matching project. The
knowledge base must contain the knowledge about the type of data you are going to cleanse. For example,
if you are cleansing company names, the knowledge base you use should have high-quality data about
company names. In addition, a KB used for cleansing company names could have synonyms and
term-based relations defined. A DQS project uses a single KB; multiple projects can use the same KB.
In the previous labs, we created a knowledge base for customer data. We will now use this KB to cleanse our
data. For that purpose, we first need to create a sample of dirty data to illustrate how data cleansing works
in DQS.
1. In SSMS, open a new query window and create a view in the DQS_STAGING_DATA database. The view
CustomersDirty selects every tenth customer from the dbo.DimCustomer table, joined to the
dbo.DimGeography table in the AdventureWorksDW database, and adds two hand-crafted rows.
USE DQS_STAGING_DATA;
GO
-- The view combines a sample of existing customers with two hand-crafted rows.
CREATE VIEW dbo.CustomersDirty
AS
-- Every tenth customer, joined to its geography data
SELECT C.CustomerKey,
       C.FirstName + ' ' + C.LastName AS FullName,
       C.AddressLine1 AS StreetAddress,
       G.City,
       G.StateProvinceName AS StateProvince,
       G.EnglishCountryRegionName AS CountryRegion,
       C.EmailAddress,
       C.BirthDate,
       C.EnglishOccupation AS Occupation
FROM AdventureWorksDW.dbo.DimCustomer AS C
INNER JOIN AdventureWorksDW.dbo.DimGeography AS G
    ON C.GeographyKey = G.GeographyKey
WHERE C.CustomerKey % 10 = 0
UNION
-- First hand-crafted row (negative key so it cannot collide with real customers)
SELECT -11000,
       N'Jon Yang',
       N'3761 N. 14th St',
       N'Munich',
       N'Kingsland',
       N'Austria',
       N'jon24#adventure-works.com',
       '18900224',
       'Profesional'
UNION
-- Second hand-crafted row
SELECT -11100,
       N'Jacquelyn Suarez',
       N'7800 Corrinne Ct.',
       N'Muenchen',
       N'Queensland',
       N'Australia',
       N'[email protected]',
       '19680206',
       'Professional';
GO
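
Before moving on, it can help to confirm that the view returns data and to look at the two hand-crafted rows side by side. The queries below are a small sketch that assumes the view was created as dbo.CustomersDirty in DQS_STAGING_DATA; the column names are the aliases from the first SELECT of the view.

```sql
USE DQS_STAGING_DATA;
GO
-- How many rows does the sample contain?
SELECT COUNT(*) AS SampleRows
FROM dbo.CustomersDirty;

-- The two hand-crafted rows are the only ones with negative keys.
SELECT CustomerKey, FullName, StreetAddress, City, StateProvince,
       CountryRegion, EmailAddress, BirthDate, Occupation
FROM dbo.CustomersDirty
WHERE CustomerKey < 0
ORDER BY CustomerKey;
```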
Describe what is wrong with the data above and how DQS should react to it, based on how
you configured your KB in Lab2.
3. Open the Data Quality Client application if necessary, and connect to your DQS instance.
4. In the Data Quality Projects group, click the New Data Quality Project button.
5. Name your project and use the Customers knowledge base you created in Lab2. Make sure that the
Cleansing activity is selected. Click Next.
6. The Data Quality Project window opens with the first tab, the Map tab, active. Select SQL Server as the
data source, the DQS_STAGING_DATA database, and the CustomersDirty view in the Table/View drop-
down list.
7. In the Mappings area, click twice on the Add A Column Mapping button (the button with the small green
plus sign, above the mappings grid) to add two rows to the mappings grid. You need seven mappings,
and five are provided by default.
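8. Use the drop-down lists in the Source Column and Domain cells to map the following columns and
domains:
Source Column      Domain
BirthDate          BirthDate
StreetAddress      StreetAddress
City               City
StateProvince      State
CountryRegion      Country
EmailAddress       EmailAddress
Occupation         Occupation
Click Next.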
9. On the Cleanse tab, click the Start button. Wait until the computer-assisted cleansing is finished, and
then review the profiling results. Then click Next.
10. On the Manage And View Results tab, check the results domain by domain. Start with the BirthDate
domain. There should be one invalid value. Make sure that the BirthDate domain is selected in the left
pane, and click the Invalid tab in the right pane. Note the invalid value that was detected. You could now
enter a correct value in the Correct To cell of the grid of invalid values. Note that all valid values were
flagged as new. You can accept all values in a grid by clicking the Approve All Terms button.
11. Select the StreetAddress domain in the left pane. One value should be corrected. However, because only
the term covered by the term-based relation was corrected, not the whole value, it does not appear among
the corrected values. It should appear among the new values. Click the New tab in the right pane. Search
for the value 7800 Corrinne Ct. and note that it was corrected with 100 percent confidence to 7800 Corrinne
Court.
12. Select the State domain in the left pane. Click the New tab in the right pane. Note that one value
(Kingsland) was found as new. Its similarity to the intended value (Queensland) was too low for
DQS to suggest a correction, let alone apply one automatically. You could correct this value manually.
13. Select the City domain in the left pane. Two values should be corrected. Click the Corrected tab in the
right pane. Note that both synonyms (Munich and Muenchen) were corrected to the leading value, München,
and that the confidence for these two corrections is 100 percent. All other values already existed in the KB,
and therefore DQS marked them as correct.
14. Select the Country domain in the left pane. One value should be suggested. Click the Suggested tab in
the right pane. Note that DQS suggests replacing Austria with Australia with 70 percent confidence. You
can approve a single value by checking the Approve option in the grid. However, don’t approve it,
because you will export results together with DQS information later. Note that DQS found other countries
as correct.
15. Select the EmailAddress domain in the left pane. One value should be invalid. Click the Invalid tab in the
right pane. DQS tells you that the jon24#adventure-works.com address does not comply with the
EmailRegEx rule. Note that all other values are marked as new.
16. Select the Occupation domain in the left pane. Note that all values are new. Click the New tab in the right
pane. Note that the value Profesional is underlined with a red squiggly line. This is because you enabled
the spelling checker for the Occupation domain. Enter Professional in the Correct To field for the incorrect
row. Note that because you corrected the value manually, the confidence is set to 100 percent. Select
the Approve check box for this row. The row should disappear from the new values and appear among the
corrected values. Click the Corrected tab. Note the corrected value along with the reason. Click Next.
17. On the Export tab, look at the output data preview on the left side of the window.
18. Select the SQL Server destination type in the Export Cleansing Results pane on the right side of the
screen. Select the DQS_STAGING_DATA database. Name the table customerscleansingresult. Do not
add a schema to the table name; DQS will put the results in the dbo schema. Make sure that the
Standardize Output check box is selected and that the Data And Cleansing Info option is selected. Click
Export.
19. When the export is complete, click the OK button in the pop-up window. Then click Finish.
20. Switch back to SSMS. View the results of your data cleansing.
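
One way to view the exported results in SSMS is sketched below. The table name and the dbo schema come from step 18; the names of the columns that DQS generates are not listed in this lab, so the sketch first reads them from the catalog instead of assuming them.

```sql
USE DQS_STAGING_DATA;
GO
-- List the columns that DQS generated for the export table.
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = N'dbo'
  AND TABLE_NAME = N'customerscleansingresult'
ORDER BY ORDINAL_POSITION;

-- Browse the cleansed data together with the cleansing information.
SELECT *
FROM dbo.customerscleansingresult;

-- Assumption (verify against the column list above): with the Data And
-- Cleansing Info option, each mapped domain typically gets its own source,
-- output, reason, confidence, and status columns, for example Country_Source
-- and Country_Status. Filter on such columns only after confirming their names.
```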
Summarize the general steps to follow to cleanse data in DQS using a knowledge base, and illustrate
them using the exercise above and your own dirty customer data. Describe and comment on the DQS
data statuses throughout the steps and on the auto-generated suggestions for cleansing.