Data Staging
LotaDoors is a company that sells building materials to trade and public customers. All
transactions are recorded in their branch-based operational systems, a schema of which
follows.
Management have asked you to write a system that will report, for each quarter (three
months) and for each product group (timber, building materials, etc.), how the company's
sales quantities compare with the national market totals.
The data for the full market sector will be supplied to you (at a cost!) by a market
intelligence company. The information arrives as a CSV (Comma-Separated Values) file
attached to an email at the end of each quarter. For example, the attachment might look
like this:
Year 2002, Quarter 1

ProdGrp     Value
Timber      250,000
Building    485,200
Hardware     94,500
The data format in the CSV file will look like this:
Year,Quarter,ProdGrp,Value
2002,1,"Timber",250000
2002,1,"Building",485200
2002,1,"Hardware",94500
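As a first staging step, the attachment can be parsed into typed records. The sketch below uses Python's standard csv module against the sample rows above; the lower-case field names and the in-memory list are illustrative assumptions, not part of the required answer.

```python
import csv
import io

# Sample data exactly as in the quarterly attachment shown above.
sample = """Year,Quarter,ProdGrp,Value
2002,1,"Timber",250000
2002,1,"Building",485200
2002,1,"Hardware",94500
"""

# Parse each CSV record into a typed dictionary (an assumed staging shape).
rows = []
for rec in csv.DictReader(io.StringIO(sample)):
    rows.append({
        "year": int(rec["Year"]),
        "quarter": int(rec["Quarter"]),
        "prod_grp": rec["ProdGrp"],
        "value": int(rec["Value"]),
    })
```

In a real Staging Area these records would be written to a table rather than kept in a list, but the typing step (text to integers) is the same.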
Inspect the following star structure to see if it will give you the desired answer:
We will now consider the ETL process and design the Staging Area. The process needs
to capture data from the LotaDoors system, incorporate the market sector data, cleanse,
merge, transform and load the data; the result of all this will be the populated star.
Exercise
Task 1: Identify the original data sources. For each of the required data sources
(LotaDoors, market data), show the source and the column names that you need
to extract and process.
Task 2: For each original data source, create a named Staging Area table and show the
columns it contains. This is the start of a processing stream to transform that
table's data.
Task 3: For each table in the final star, create a named Star table and show the columns
it contains.
Task 4: For each table in the final star, create a named Staging Area table and show the
columns it contains.
Task 5: Assume that you are designing the Staging Area to be used for the first time, ie
there is no existing data in the Staging Area or final Star. For each processing
stream, identify validation tests that should be carried out on the data. For each
test identify possible data errors. Show new named tables and their columns,
one each to hold valid and erroneous data (assuming there can be errors)
resulting from the validation. Show where erroneous data is returned into the
main processing stream when it has been corrected.
There will be situations where processing streams merge and/or terminate
having satisfied all their processing needs. Show new named tables and their
columns for all such merging of streams.
All processing streams should either terminate or contribute to the population of
one or more of the final Staging Area tables (as created in Task 4).
Note: for space reasons you will have to omit many of the error tables and
corrective loops. It is important that you still recognise their role in the processing!
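One way to picture the validation step in Task 5 is as a routine that routes each record into either a valid table or an error table. The sketch below is a minimal illustration for the market-data stream; the reference list of product groups, the table names and the error messages are all assumptions for the example.

```python
# Assumed reference list of acceptable product groups.
VALID_GROUPS = {"Timber", "Building", "Hardware"}

def validate(record):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    if record.get("prod_grp") not in VALID_GROUPS:
        errors.append("unknown product group")
    if not isinstance(record.get("value"), int) or record["value"] < 0:
        errors.append("value must be a non-negative integer")
    if record.get("quarter") not in (1, 2, 3, 4):
        errors.append("quarter out of range")
    return errors

# Illustrative "valid" and "error" staging tables (in-memory stand-ins).
stg_market_valid, stg_market_error = [], []
for rec in [
    {"year": 2002, "quarter": 1, "prod_grp": "Timber", "value": 250000},
    {"year": 2002, "quarter": 5, "prod_grp": "Glass", "value": -10},
]:
    errs = validate(rec)
    if errs:
        # Held in the error table until corrected, then re-entered upstream.
        stg_market_error.append({**rec, "errors": errs})
    else:
        stg_market_valid.append(rec)
```

The corrective loop mentioned in the task corresponds to fixing a row in the error table and feeding it back through the same validation before it rejoins the main stream.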
Task 6: Identify the procedures related to and the sequence of loading the Star tables
from the final Staging Area tables. You should carry out some research into this
aspect.
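A common finding from that research is that dimension tables must be loaded (and surrogate keys assigned) before the fact table, so that every fact row can resolve its foreign keys. The sketch below illustrates that ordering only; the table and column names are assumptions.

```python
# Illustrative in-memory stand-ins for a product dimension and a sales fact table.
dim_product = {}   # natural key (product group name) -> surrogate key
fact_sales = []

def load_dimension(prod_grps):
    """Step 1: ensure every product group has a surrogate key."""
    for name in prod_grps:
        if name not in dim_product:
            dim_product[name] = len(dim_product) + 1

def load_facts(staged_rows):
    """Step 2: fact rows look up the surrogate key assigned in step 1."""
    for row in staged_rows:
        fact_sales.append({
            "product_key": dim_product[row["prod_grp"]],  # must already exist
            "year": row["year"],
            "quarter": row["quarter"],
            "value": row["value"],
        })

staged = [{"year": 2002, "quarter": 1, "prod_grp": "Timber", "value": 250000}]
load_dimension(r["prod_grp"] for r in staged)  # dimensions first
load_facts(staged)                             # then facts
```

Running the two steps in the opposite order would fail, which is exactly the sequencing constraint Task 6 asks you to document.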
Task 7: With reference to each of the validation tests identified in Task 5, what action
may be possible in the original systems to remove or reduce the incidence of
errors, and what corrective action should be taken with the current data (the
data transferred into the error table)?
Task 8: Identify how the Staging Area and ETL processes will need to change to process
Quarter 2 data, ie when data already exists within the Star.
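One concrete change for Quarter 2 is that the load must become incremental: before appending new fact rows, the ETL should check which (year, quarter) combinations already exist in the Star. The guard below is a minimal sketch of that idea; the table shape and the Quarter 2 figure are illustrative assumptions.

```python
# Star fact table already holding Quarter 1 data (illustrative).
fact_sales = [{"year": 2002, "quarter": 1, "prod_grp": "Timber", "value": 250000}]

def incremental_load(staged_rows):
    """Append only rows for quarters not already present in the star."""
    existing = {(f["year"], f["quarter"]) for f in fact_sales}
    loaded = 0
    for row in staged_rows:
        if (row["year"], row["quarter"]) in existing:
            continue  # skip duplicates instead of double-counting a quarter
        fact_sales.append(row)
        loaded += 1
    return loaded

q2_batch = [
    {"year": 2002, "quarter": 1, "prod_grp": "Timber", "value": 250000},  # re-sent duplicate
    {"year": 2002, "quarter": 2, "prod_grp": "Timber", "value": 268000},  # hypothetical Q2 figure
]
n = incremental_load(q2_batch)
```

Your answer to Task 8 should also consider what "already loaded" means for dimension tables, where existing product groups must be reused rather than re-inserted.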
Task 9: What are the implications to Staging Area and ETL processes of incorporating an
ODS?
Task 10: Now download and inspect the LotaMerge database. Identify the new fact table
that has been created and how the Data Mart has been modified to deal with this
new table. Also download the CSV files and look at their structure. There is one
for each quarter of 2002.
Below is the entity-relationship diagram for the modified LotaStar (LotaMerge) Data Mart.
The major point that we are illustrating in this Exercise is that data Extraction,
Transformation and Loading is a complex process. Some authors estimate that it
consumes 70% of the budget of a data warehousing project. Unfortunately, it has to be
done, even for small projects, and it is not a phase that can easily be shortened.