White Paper - Data Warehouse Documentation Roadmap

Data Management & Warehousing

WHITE PAPER

Data Warehouse
Documentation Roadmap
DAVID M WALKER
Version: 1.0
Date: 05/04/2007

Data Management & Warehousing


138 Finchampstead Road, Wokingham, Berkshire, RG41 2NU, United Kingdom
http://www.datamgmt.com

Table of Contents
Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
Considerations
Documentation as a tool
Which tools and products to use
    What about a Wiki?
Put your documentation on the Internet!
Document Short Names
Overview Diagram
The Templates
    1 Concept
    2 Requirements
    3 Architecture
    4 Data Models
    5 Analysis
    6 Design
    7 Build
    8 Test
    9 Implementation
    10 Project Management
    11 Miscellaneous
Summary
Appendices
    Appendix 1 Lifecycle of a bug
    Appendix 2 Project Quick Start Infrastructure
References
    Web resources
Copyright

2006 Data Management & Warehousing

Synopsis
All projects need documentation and many companies provide templates as part of a
methodology. This document describes the templates, tools and source documents used
by Data Management & Warehousing. It serves two purposes:

For projects using other methodologies or creating their own set of documents to
use as a checklist. This allows the project to ensure that the documentation
covers the essential areas for describing the data warehouse.

To demonstrate our approach to our clients by describing the templates and
deliverables that are produced.

Documentation, methodologies and templates are inherently both incomplete and
flexible. Projects may wish to add, change, remove or ignore any part of any document.
Some may also believe that aspects of one document would sit better in another. If this
is the case then users of this document and these templates are encouraged to change
them to fit their needs.
Data Management & Warehousing believes that the approach or methodology for
building a data warehouse should be a series of guides and checklists. This
ensures that small teams of relatively skilled staff developing the system can cover
all aspects of the project whilst remaining free to deal with the specific issues of
their environment and deliver exceptional solutions. The alternative is a rigid
methodology that ensures only that large teams of relatively unskilled staff can meet
a minimum standard.

Intended Audience
Reader                   Recommended Reading
Executive                Synopsis to Overview Diagram
Business Users           Synopsis to Overview Diagram
IT Management            Synopsis to Overview Diagram
IT Strategy              Synopsis to Overview Diagram
IT Project Management    Entire Document
IT Developers            Entire Document

About Data Management & Warehousing


Data Management & Warehousing is a specialist consultancy in data warehousing,
based in Wokingham, Berkshire in the United Kingdom. Founded in 1995 by David M
Walker, our consultants have worked for major corporations around the world including
the US, Europe, Africa and the Middle East. Our clients are invariably large organisations
with a pressing need for business intelligence. We have worked in many industry sectors
but have specialists in Telcos, manufacturing, retail, financial and transport as well as
technical expertise in many of the leading technologies.
For further information visit our website at: http://www.datamgmt.com

Introduction
A data warehouse programme will often run for many years and produce much
documentation. Data Management & Warehousing has identified three essential aspects
for documentation:

A roadmap that describes what documentation is required and how it fits
together.

Team members within the project who use the templates, create quality documents
and store them in the project repositories.

Easy access for people outside the project team to the documentation, including
publication or notification of changes, updates and new releases.

This document provides the roadmap and looks at some of the issues associated with
the distribution of information outside the project team. The processes and procedures
required to create and store the information in the first place are a matter of project
governance.[1]
The documents listed are the templates used by Data Management & Warehousing and
we believe that they cover all the areas necessary for a major programme of work.
Templates, however, are created to fulfil a need and should be adapted as required. By
combining this document, the project plan and suitable governance a project will have
developed a strong foundation for developing a successful data warehouse.

[1] Data Management & Warehousing have published a white paper on Data Warehouse
Governance which is available from the website at
http://www.datamgmt.com/index.php?module=article&view=78

Considerations
This document assumes that a data warehouse is a long-term investment by an
organisation and as such will form a programme of work. This programme will be broken
down into projects and where appropriate a project will have subsidiary phases.
The document also assumes that the project will maintain tight change control. Each
document should have:

A consistent naming convention.

A Version Number and Date.

A draft, review, publish process that will allow a document version to be signed
off.

A process that over time allows a document to have many signed off versions.

A configuration management tool that enforces good practice.

Programmes that do not achieve this will find that the documentation becomes both
contradictory and a burden in itself and this can become a risk factor in the success of
the overall programme.

Documentation as a tool
Every project acknowledges the need to document[2] itself. However, this ranges from lip
service and the production of some minimal notes to volumes of shelf-ware, paper that
sits unread for years on end because no one dares throw it away. Neither of these
outcomes is of any value.
Here are some guidelines for when and how to produce documentation:

Documents should only be produced when they serve a purpose.

Documents should only be maintained whilst the information needs to be current.

Documents should only be retained whilst they have value.

Documents should refer to other documents rather than duplicate information.

Documents should be under change/version control.

Documents should be succinct.

Poor grammar and bad writing are often signs of poor comprehension.[3]

Good documentation takes time.

Great expertise in a subject is not automatically a prerequisite for the creation of
good documentation.

Do not let working cultures that put too great a premium on knowing everything
dominate.

Therefore pull from this roadmap only what you need; do not produce everything just
because it is there.

[2] See also Agile Documentation: A Pattern Guide to Producing Lightweight Documents for
Software Projects (Wiley Software Patterns Series) by Andreas Rueping
[3] From Redhat Magazine: How to write really good documentation: Four Rules and an Axiom.
http://www.redhatmagazine.com/2007/01/30/how-to-write-really-good-documentationfourrules-and-an-axiom/

Which tools and products to use


The choice of tools and products to use for a given project is based largely on the
standards of the organisation; most are standard office productivity tools. The table
below lists some of the tools used in our example templates:

Type of Document            Example Template
Code Repository [4]         CVS
Data Cleaning tools [5]     -
Data Models [6]             -
Data Profiling [7]          -
Diagram                     Microsoft Visio
Document                    Microsoft Word
Document Distribution       Adobe Acrobat
Issue Log [8]               Bugzilla
Presentation                Microsoft PowerPoint
Project Plan [9]            Microsoft Project
System Testing [10]         -

What about a Wiki?


A Wiki is a web application that allows users to add and edit content in a
collaborative fashion. This is an ideal alternative for many of the
documents that are used in a data warehousing project and if the
organisation supports the use of a Wiki then it is often preferable to
create these templates in the Wiki rather than as separate documents and allow
widespread collaborative access to them. Throughout this document the Works With
Wikis logo has been added to those documents that Data Management &
Warehousing believe work well on a Wiki.
[4] A list of configuration management tools can be found at:
http://www.cmcrossroads.com/cgi-bin/cmwiki/view/CM/WebHome
[5] A list of data quality tools can be found at:
http://mediaproducts.gartner.com/reprints/dataflux/137738.html
[6] A list of data modelling tools can be found at:
http://www.databaseanswers.com/modelling_tools.htm
[7] A list of data quality tools can be found at:
http://mediaproducts.gartner.com/reprints/dataflux/137738.html (March 2006)
[8] A list of issue tracking tools can be found at:
http://www.testingfaqs.org/
[9] A list of project management tools can be found at:
http://www.startwright.com/project1.htm
[10] A list of testing tools can be found at:
http://www.testingfaqs.org/

Put your documentation on the Internet!


The screams at this recommendation could be heard from the second it was first written
but Data Management & Warehousing strongly recommend that you put as much, if not
all, of your documentation on the web. Most organisations building a large data
warehouse will have individuals on vendor or remote sites or support people who have to
respond out of hours and may not have everything immediately available. To this end
providing remote secure[11] access improves responsiveness for all involved and ensures
collective ownership of this information. If the web is not an option for your organisation
then at least consider publishing it on the corporate intranet.
An example solution would include a Wiki for dynamic documentation, a file repository
for distributed documents in Adobe Acrobat format, Bugzilla for issue tracking and
CVSWeb to allow users to view (but not edit) the code held in CVS. All of this software is
free and can be hosted on a single secure web server running Windows or Linux.[12]

Document Short Names


Some of the document titles have a three-letter acronym (e.g. KDD beside Key Design
Decisions or SSA beside Source Systems Analysis). This is because these documents
are often numbered and a short code allows them to be easily identified.

[11] Whilst we recommend putting it on the internet, access should, as with any web
application, be controlled and secure.
[12] See Appendix 2 Project Quick Start Infrastructure for a reference configuration.

Overview Diagram

[Figure 1 - Overview Documentation Roadmap Diagram. The legend identifies each box's
ID, tool and template, and whether a sample is available.]

The Templates
The templates are divided into eleven categories. Within each category the documents
are numbered sequentially. Some templates depend on others (indicated by the arrows
on the diagram), whilst others can be done at any time within the phase. Finally, category
10 documents exist across the project lifecycle, whilst category 11 templates are just
general ones that can be used as required.

1 Concept

The concept phase is about describing what the big idea is. The business may
have a concept and the IT team will be able to describe the major components and
concepts of a data warehouse.

1.1 Business Concepts for the Data Warehouse


In order to start the data warehouse project, a document is needed that describes the
conceptual model of the information required by the business. This document
describes subject areas and their broad relationships as well as key performance
indicators used by the business. This document is a useful introduction but once
read it is unlikely to be an ongoing reference source.

1.2 Overview Architecture for Enterprise Data Warehouses


The Overview Architecture for Enterprise Data Warehouses is a design pattern for
data warehousing to describe the basic concepts of the data warehouse. As such,
a project can just download the completed document from the Data Management
& Warehousing website.[13] Further terms used in this document relate to
components described in that document.

2 Requirements

The requirements gathering phase of any data warehouse is one of the most
difficult. The objective of these templates is to give breadth and depth to the
requirements. Breadth is the ability to ensure that all truly required information
is covered, whilst depth is the amount of detail that is specified in the
requirements to ensure that the developers have sufficient, unambiguous detail
with which to develop.
Requirements should have a programme-long life cycle. After the initial version of
the requirements is developed a project can start the build. However, the business
moves on, and therefore whilst the build phase is occurring it is important that new
versions of the requirements are also being developed. A project within a
programme of work should have a fixed version of the requirements; however, each
project may work with a different version of the requirements.

[13] This document is available at:
http://www.datamgmt.com/index.php?module=article&view=76

2.1 Data Warehouse Business Requirements (WBR)


The first template that Data Management & Warehousing use is called
the Data Warehouse Business Requirements and it details the soft requirements
for business information according to a number of subject areas of interest to the
business.
A business requirement is something of the form: "Provide the average and total
revenue for each product category by customer market segment for the last three
years." It is a requirement that is specified in business language and without
regard for the practicalities of delivering it. These requirements should be used to
get business users to underwrite the business benefit, i.e. "if I could answer all of
these questions then I would be able to increase margin by a given percentage for
a given product".

2.2 Data Warehouse Data Requirements (WDR)


The second document details the hard requirements for business information
from the data perspective. This document goes a step deeper into understanding
the requirements, but is still written from the business user's perspective.
This is the refinement of the business requirements in that the analysts can use
the business requirement to drive out the data required to answer the questions.
In the example above it is clear that both some part of the product hierarchy and
of the customer hierarchy are required as well as a time dimension and
information about revenue. It has also told the analyst that the minimum retention
period is three years.
Consequently, the analyst would start to build data requirements that may fulfil
many business requirements and to add additional attributes to help make sense
of the data. The data requirements lifecycle is similar to that of the business
requirements i.e. fixed for a project and variable over the lifespan of a
programme.

2.3 Data Warehouse Query Requirements (WQR)


The third document lists a number of potential queries to which the solution
should be able to provide answers. This is not an exhaustive list, but rather
represents the types of queries that are being asked by the business.
It is used to test the relationship between the data requirements, the data model
and the business requirements. A set of queries should be able to provide the
data required to answer a business requirement. At the same time the data must
be available as described in the data requirements and joined in such a way as to
be usable in the data model.
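
As an illustration of how a query requirement tests the chain from business requirement to data model, the sketch below expresses the example requirement from section 2.1 (average and total revenue per product category and customer market segment over the last three years) as a query. The table names, column names and sample rows are hypothetical and are not taken from any Data Management & Warehousing template; an in-memory SQLite database is used only so that the example is self-contained.

```python
import sqlite3

# Hypothetical miniature star schema; all table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, market_segment TEXT);
CREATE TABLE fact_sales   (product_id INTEGER, customer_id INTEGER,
                           sale_year INTEGER, revenue REAL);
INSERT INTO dim_product  VALUES (1, 'Handsets'), (2, 'Accessories');
INSERT INTO dim_customer VALUES (10, 'Consumer'), (20, 'Business');
INSERT INTO fact_sales   VALUES
    (1, 10, 2004, 100.0), (1, 10, 2005, 300.0),
    (1, 20, 2006, 500.0), (2, 10, 2006, 50.0),
    (2, 20, 2001, 999.0);  -- falls outside the three-year window
""")

def revenue_by_category_and_segment(conn, last_year, years=3):
    """Average and total revenue per product category and market segment."""
    first_year = last_year - years + 1
    return conn.execute("""
        SELECT p.category, c.market_segment,
               AVG(f.revenue) AS avg_revenue, SUM(f.revenue) AS total_revenue
        FROM fact_sales f
        JOIN dim_product  p ON p.product_id  = f.product_id
        JOIN dim_customer c ON c.customer_id = f.customer_id
        WHERE f.sale_year BETWEEN ? AND ?
        GROUP BY p.category, c.market_segment
        ORDER BY p.category, c.market_segment
    """, (first_year, last_year)).fetchall()

rows = revenue_by_category_and_segment(conn, last_year=2006)
```

If the query cannot be written against the data model, or needs an attribute that the data requirements do not list, then one of the three documents is incomplete.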

2.4 Data Warehouse Technical Requirements (WTR)


The fourth document details the functional and non-functional requirements that
are expected of the solution. Again, these requirements are stated from the
business perspective rather than the technical perspective. The document should
include topics such as:

The functionality required of the query tools.

The general retention requirements for data.

The performance characteristics.

The systems availability expectations.

2.5 Data Warehouse Interface Requirements (WIR)


The fifth and final requirements document details the requirements for interfaces
that feed from the data warehouse out to other systems. Often a data warehouse
will be required to deliver information to downstream systems that have existing
data interface specifications. These requirements have to be gathered to ensure
that as well as the user demands (derived from the Query Requirements) the
interface demands are also met.

2.6 Business Definitions Dictionary (BDD)


Throughout all of this process a number of business terms will be used. It is
important that a common dictionary is developed and kept so that there is a
common reference for words. This does not mean that there must be one
definition for each term but there should be a definition for how a term is used
within a given context. For example, to a customer support division a customer
might be those who have an active support contract, whilst to the sales team it
may be anyone who has ever bought a product. Both definitions are right within
their context.
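
One way to picture a dictionary that holds a definition per context, rather than a single definition per term, is a lookup keyed on both. The following is a minimal sketch only; the structure and the function names `define` and `contexts_for` are assumptions made for illustration, and the two entries paraphrase the customer example above.

```python
# (term, context) -> definition; both entries paraphrase the example in the text.
definitions = {
    ("customer", "customer support"): "Anyone who has an active support contract.",
    ("customer", "sales"): "Anyone who has ever bought a product.",
}

def define(term, context):
    """Return the definition of a term as used within one business context."""
    return definitions[(term.lower(), context.lower())]

def contexts_for(term):
    """List every context in which a term has been defined."""
    return sorted(ctx for (t, ctx) in definitions if t == term.lower())
```

Listing the contexts for a term makes it easy to spot words that carry more than one meaning and therefore need care in requirements and reports.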

3 Architecture

The architecture category contains a number of documents that describe how the
system should be built. These provide a blueprint to developers on how to approach
any particular problem by helping them select the appropriate tools, platforms and
configurations to both meet their need and conform to the overall strategy.

3.1 Technical Architecture


The technical architecture describes the technical components that will be used to
build the system. This will include the hardware, software and network
configuration, along with specific versions where appropriate and standards as to
which software product should be used for which job.[14]

[14] As vendors broaden out their products they start to overlap in functionality. Ensuring
that two different products are not used to build the solution should be covered in this
document.

3.2 Security Model


The security model should describe all the required roles/groups etc that will be
required for each component of the system. It needs to first set out the general
policies and then list explicit permissions for each component on the system.

3.3 Resilience Plan


The resilience plan[15] should describe how the system is made resilient. This
should include (as required) the need for redundant hardware and networks;
incremental, cumulative and full backups; restores of individual components or
entire systems; how to back out records individually, as a group or as entire sets; and
how disasters such as the loss of a data centre are managed.

3.4 Data Quality Plan (DQP)


The data quality plan should describe how data quality is managed. This will
include the principles of where data is cleansed (in the source, in the staging area, in
the data warehouse itself, etc.), how it is profiled, what type of cleansing is carried
out (e.g. rule based or heuristic[16]), what metrics are set and
monitored for data improvement, etc.
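
The difference between rule based and heuristic cleansing can be sketched as follows. This is an illustration under stated assumptions rather than a recommended implementation: the abbreviation rules, the reference address and the similarity cutoff are all invented, and approximate string matching from the Python standard library (`difflib`) stands in for the fuzzy methods mentioned in the footnote.

```python
import difflib
import re

# Rule-based cleansing: deterministic rewrites (hypothetical abbreviation rules).
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def cleanse_by_rule(address):
    """Lower-case, strip punctuation and expand known abbreviations."""
    words = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def cleanse_heuristically(address, reference_addresses, cutoff=0.85):
    """Return the closest known address, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        address, reference_addresses, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

# Hypothetical reference list of already-clean addresses.
reference = ["10 station road wokingham"]
ruled = cleanse_by_rule("10 Station Rd., Wokingham")
fuzzy = cleanse_heuristically("10 staton road wokingam", reference)
```

A rule always gives the same answer and is easy to audit, whereas the heuristic match depends on the cutoff chosen, so the data quality plan should record which approach applies to which data and how the thresholds were set.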

4 Data Models

Data models are (normally) graphical representations of the data that is required.
These are normally created in special software that can also generate the DDL
required to create the physical objects in the database.

4.1 Data Modelling Standards


This document describes the naming conventions of objects in the database, as
well as any particular modelling methods (e.g., a hierarchy must always be
modelled in a specific way and any exceptions noted along with a justification for
the difference). This document should describe the standards for all three data
models and the modelling techniques for logical and physical models.

4.2 Logical Model


The logical data model is a model that represents the true structure of data used
by the business, independent of software or hardware implementation constraints.
Normally the model is closely related to the information described in Data
Warehouse Business Requirements.

[15] This is sometimes called a Disaster Recovery Plan; however, as a name this does not
cover the full range of activities that are required.
[16] Heuristic data cleansing uses methods such as fuzzy logic, etc. to try to clean data. This is
most successful with data such as addresses where there is the opportunity for lots of human
error in the information.

4.3 Repository Data Model


The repository data model is a physical data model[17] of the main storage area
within a data warehouse. This model will reflect the logical data model in overall
structure but will have a number of compromises for the practical delivery of the
solution. It normally closely reflects the information described in the Data
Warehouse Data Requirements.

4.4 Data Mart Data Model(s)


The data mart data models are the physical models of the part of the system that
the user will query. These are often, though not necessarily, star schemas[18] and
closely reflect the Data Warehouse Query Requirements.
It can be seen from the definitions above that the data models are derived from the
requirements and that the combination of the six documents acts together to ensure
completeness. This is highlighted in the diagram below:

Figure 2 - The relationship between requirements and data models

[17] A physical data model is a representation of a data design which takes into account the
facilities and constraints of a given database management system. A complete physical data
model will include all the database artefacts required to create relationships between tables or
achieve performance goals.
Wikipedia: http://en.wikipedia.org/wiki/Physical_data_model
[18] A relational database schema that is used to represent multidimensional data. The data is
stored in a central fact table, with one or more tables holding information on each dimension.
Dimensions have levels and all levels are usually shown as columns in each dimension table.
OLAP Report: http://www.olapreport.com/glossary.htm

5 Analysis

The goal of the analysis phase is to identify the sources of the information required
to populate the physical data models. The main goal should be to populate the
repository data model as this is used as the source for all data in the data marts.
This is achieved in a number of steps:

5.1 Source Systems Analysis (SSA)


The source system analysis is a high-level analysis that gathers information about
available systems. Each system is a potential candidate as a source system and
generates its own analysis document. A system may be documented and
then rejected, e.g. because it is a secondary source that only contains information
created in another system, which can itself be used as the source. The document covers
hardware, software, network connectivity, availability and functional areas (e.g.
a CRM system containing customer data).

5.2 Data Profiling


Data profiling is a process whereby an existing source system is examined in
order to collect information and statistics about the data held. This allows sources
for the data warehouse to be identified, the metadata held about the system to be
validated and the data quality to be assessed.
There are many commercial tools available for this process; however, a simple
set of SQL scripts will often prove adequate. The SQL scripts allow an
experienced DBA with knowledge of the system to quickly write and iteratively
explore the data. Using a tool formalises the process but often performs
unnecessary analysis and requires significant additional infrastructure and tool
specific knowledge, slowing the process down. Results from any profiling should
be kept for future comparative analysis.
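
As a rough indication of what such a set of SQL scripts might collect, the sketch below gathers a row count and, for each column, the number of nulls, the number of distinct values and the minimum and maximum. It is an assumption about typical profiling statistics rather than a description of any particular script; the `customer` table and its contents are invented, and SQLite is used only so that the example runs without a database server.

```python
import sqlite3

def profile_table(conn, table):
    """Collect row count plus null, distinct, min and max statistics per column."""
    # Identifiers are interpolated directly, so pass only trusted table names.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    profile = {}
    for col in columns:
        nulls, distinct, low, high = conn.execute(
            f"SELECT SUM({col} IS NULL), COUNT(DISTINCT {col}), "
            f"MIN({col}), MAX({col}) FROM {table}"
        ).fetchone()
        profile[col] = {"rows": total, "nulls": nulls, "distinct": distinct,
                        "min": low, "max": high}
    return profile

# Hypothetical source table used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER, postcode TEXT);
INSERT INTO customer VALUES (1, 'RG41 2NU'), (2, NULL), (3, 'RG41 2NU');
""")
result = profile_table(conn, "customer")
```

Storing each run of `result` with a date gives the comparative history that the text recommends keeping.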

5.3 Source Entity Analysis (SEA)


The source entity analysis is the detailed documentation of the sources selected
because data profiling has validated these sources as being useful for the data
warehouse. This includes detailed information about every table and column,
including data types and data quality metrics. A source entity analysis will be
produced for each system that is to be used as a source.

5.4 Target Orientated Analysis (TOA)


The final document template of the analysis phase is the Target Orientated
Analysis document that is used to describe which sources will be used to populate
which target entities. This document is sometimes replaced by a source to target
mapping document.[19] This document is used by the developers in the design and
build of the ETL[20] code. A target-orientated analysis will be developed for each
subject area in the data warehouse.
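
The two viewpoints hold the same facts organised differently, which the following sketch tries to show. The mapping rows, system names and the function `sources_by_target` are hypothetical; the point is only that grouping by target entity answers the target-orientated question of which sources are needed to populate a given target.

```python
from collections import defaultdict

# Hypothetical source-to-target mappings: (source system, source table, target entity).
source_to_target = [
    ("CRM", "accounts", "customer"),
    ("Billing", "subscribers", "customer"),
    ("Billing", "invoices", "revenue"),
]

def sources_by_target(mappings):
    """Group mappings by target entity: which sources populate each target?"""
    by_target = defaultdict(list)
    for system, table, target in mappings:
        by_target[target].append(f"{system}.{table}")
    return dict(by_target)

toa = sources_by_target(source_to_target)
```

Here `toa["customer"]` lists both the CRM and the Billing source, which makes it obvious that the customer entity cannot be completed from one system alone.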
The output dependencies for the analysis can be described as follows:

Figure 3 - Analysis Dependencies

[19] Data Management & Warehousing prefer target orientated analysis, which asks the
question "Which sources do I need in order to populate this target table completely?", to the
source to target mapping method, which asks the question "Which target entities do I need to
put data into from this source?" This is because the thought process used in the first method
is geared towards the delivery of the information to the user rather than the extraction of the
data by the developer.
[20] ETL: Extract, Transform, Load. Code written to move data from the sources to the data
warehouse.

6 Design

The design phase concentrates on taking the analysis and creating a plan for the
code build.

6.1 ETL Execution Plan


The ETL Execution Plan is a document that explains from the high level down to
the low level how the ETL code will be put together. One of the most effective
ways of doing this is as a series of directed graphs.21 For example, there may be
a diagram that represents the overall flow. In this case each point would represent
a subject area. There are then a number of additional graphs, one for each
subject area, that represent the detailed flows within a subject area. This drilling
down is repeated until the lowest level ETL mappings are described or the
required level of detail is documented.
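A directed graph of this kind can also be held in a form that a program can check. The sketch below is one possible illustration using Python's standard graphlib module (Python 3.9 or later); the subject area names are hypothetical. It derives a valid run order from the dependencies and would raise an error if the plan contained a cycle.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical top-level flow: each subject area maps to the subject
# areas that must be loaded before it.
subject_area_flow = {
    "reference_data": set(),
    "customer": {"reference_data"},
    "product": {"reference_data"},
    "sales": {"customer", "product"},
}

def execution_order(graph):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(subject_area_flow)
print(order)
```

The same function could be applied to the lower-level graph for each subject area, mirroring the drill-down described above.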

6.2 Initial Capacity Plan


The initial capacity plan describes the sizes of the databases and database
objects required for the initial build. This should describe the sizing for a known
period (e.g. for 1 year) and a number of environments (Development, Test,
Production). A large amount of the information required for this can often be
derived from the data-modelling tool used.
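The arithmetic behind such a plan is simple enough to sketch. In the example below every figure (row volumes, average row size, the overhead multiplier and the proportion of production data held in each environment) is an assumption chosen for illustration, not a recommendation.

```python
def estimate_table_bytes(rows_per_day, avg_row_bytes, days, overhead=1.3):
    """Estimate storage for one table over a sizing period.

    overhead is an assumed multiplier covering indexes and free space.
    """
    return int(rows_per_day * avg_row_bytes * days * overhead)

# Hypothetical figures: 10,000 rows a day, 200 bytes a row, one year.
production = estimate_table_bytes(10_000, 200, 365)

# Assumed fraction of production volume held in each environment.
environment_scale = {"Development": 0.1, "Test": 0.5, "Production": 1.0}
plan = {env: int(production * scale) for env, scale in environment_scale.items()}

for env, size in plan.items():
    print(f"{env}: {size / 1_000_000:.0f} MB")
```

Summing the estimate over every table in the data model gives a first figure for each environment.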

6.3 Coding Standards


The coding standards document describes the naming conventions for all objects
that will be created, including but not limited to database objects such as table
and column names, ETL mapping names and script names. It will also describe, and
mandate or recommend, any specific coding standards and/or algorithms.
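Naming conventions are easier to enforce when they can be checked automatically. The following is a minimal sketch; the prefixes and patterns are hypothetical examples of what a standard might mandate, not conventions taken from this paper.

```python
import re

# Hypothetical conventions; a real standard would define its own patterns.
NAMING_RULES = {
    "table": re.compile(r"(stg|dim|fact)_[a-z][a-z0-9_]*"),
    "column": re.compile(r"[a-z][a-z0-9_]*"),
    "etl_mapping": re.compile(r"m_[a-z][a-z0-9_]*"),
}

def naming_violations(objects):
    """Return the (kind, name) pairs that break the naming rules."""
    return [
        (kind, name)
        for kind, name in objects
        if not NAMING_RULES[kind].fullmatch(name)
    ]

violations = naming_violations([
    ("table", "dim_customer"),
    ("table", "CustomerTemp"),
    ("column", "order_date"),
    ("etl_mapping", "load_sales"),
])
print(violations)  # [('table', 'CustomerTemp'), ('etl_mapping', 'load_sales')]
```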
The diagram below describes the output dependencies:

Figure 4 - Design Dependencies


21 This is a term taken from mathematics. Graph theory is the study of graphs,
mathematical structures used to model relations between objects in a given collection.
A "graph" in this context refers to a collection of mappings and their dependencies.
The most famous graph theory problem is known as The Seven Bridges of Königsberg and
was solved by Leonhard Euler. http://mathforum.org/isaac/problems/bridges1.html

7 Build

This white paper is a roadmap to the documentation that should be produced during a
data warehouse project. It should follow the structure of most projects but it is not a
substitute for a project plan.

7.1 Code Repository


A lot of the code will contain valuable documentation in the form of comments. It
is also vital that the history of changes to code is recorded. Therefore, an
important part of the documentation is the information held in the configuration
management tool.22

7.2 Data Cleansing Integration


The data profiling described above will have generated a number of rules that will
have to be implemented in order to maintain data quality. These rules will have to
be stored and integrated into the ETL. If a data-cleansing tool23 is being used then
these rules will be documented within the tool. Otherwise, the rules should be
explicitly documented for future reference.

8 Test
Testing software is operating the software under controlled conditions24, to:
1. Verify that it behaves as specified.
   Verification is the checking or testing of items, including software, for
   conformance and consistency by evaluating the results against pre-specified
   requirements.
2. Detect errors.
   Testing should intentionally attempt to make things go wrong to
   determine if things happen when they should not or things do not
   happen when they should.
   In this area it is important to test boundary conditions25, e.g. what
   happens with a percentage over 100% or less than 0%.
3. Validate that what has been specified is what the user actually
   wanted.
   Validation looks at the system correctness, i.e. it is the process of
   checking that what has been specified is what the user actually
   wanted.
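The percentage example can be turned into a small boundary test. The sketch below assumes a hypothetical validation rule, is_valid_percentage, and checks it at each limit and just outside it.

```python
def is_valid_percentage(value):
    """Accept only values in the closed range 0 to 100."""
    return 0 <= value <= 100

# Boundary cases: the limits themselves and the values just outside them.
boundary_cases = {
    -0.01: False,
    0: True,
    100: True,
    100.01: False,
}

for value, expected in boundary_cases.items():
    assert is_valid_percentage(value) == expected, value
print("all boundary cases behave as expected")
```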

22 A list of configuration management tools can be found at:
http://www.cmcrossroads.com/cgi-bin/cmwiki/view/CM/WebHome
23 A list of data quality tools can be found at:
http://mediaproducts.gartner.com/reprints/dataflux/137738.html (March 2006)
24 Taken from: http://members.tripod.com/~bazman/
25 A useful description of boundary values testing can be found at:
http://www.geocities.com/xtremetesting/BoundaryValues.html

Remember: The purpose of testing is verification, validation and error detection
in order to find problems, and the purpose of finding those problems is to get
them fixed.
Each type of test document below should have inclusions and exclusions, test
cycles, expected results, entrance and exit criteria, etc. If a tool26 is used much
of this can be automated.

8.1 Unit Testing


These are tests that are designed to validate that an individual unit of
development work (normally an ETL mapping, input screen or report) is
functioning as expected.

8.2 System Testing


These are tests that are designed to check that a suite of newly developed or
changed units work correctly together in the expected manner.

8.3 Integration Testing


These are tests that are designed to ensure that the suites of newly developed or
changed units work with other suites that are already deployed on the system and
do not damage the existing product environment.

8.4 Performance Testing


The final set of tests is designed to ensure the performance of the system. A
system that is recording 10,000 transactions a day will be inserting into an empty
table on the first day of operation and into a table of around 3.65M records after
a year. The performance characteristics of this work vary dramatically from one
database to the next. These tests must therefore examine the short-term and
long-term performance impacts of any given change.
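One way to examine this is to time the same batch of work against tables of different starting sizes. The harness below is a sketch only: it uses an in-memory SQLite database from the Python standard library, a hypothetical txn table and deliberately small volumes so that it runs quickly. A real test would use the target database and production-scale volumes.

```python
import sqlite3
import time

def time_inserts(existing_rows, new_rows=1_000):
    """Time a batch of inserts into a table already holding existing_rows rows.

    Returns (elapsed_seconds, final_row_count).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("CREATE INDEX txn_amount ON txn (amount)")
    conn.executemany(
        "INSERT INTO txn (amount) VALUES (?)",
        ((float(i),) for i in range(existing_rows)),
    )
    conn.commit()

    start = time.perf_counter()
    conn.executemany(
        "INSERT INTO txn (amount) VALUES (?)",
        ((float(i),) for i in range(new_rows)),
    )
    conn.commit()
    elapsed = time.perf_counter() - start

    final_count = conn.execute("SELECT COUNT(*) FROM txn").fetchone()[0]
    conn.close()
    return elapsed, final_count

# Compare an empty table with progressively larger starting volumes.
for existing in (0, 10_000, 50_000):
    elapsed, count = time_inserts(existing)
    print(f"{existing:>6} existing rows: {elapsed:.4f}s for 1,000 inserts ({count} total)")
```

Recording these timings for each release gives a baseline against which the long-term impact of a change can be judged.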

9 Implementation

After the development and testing are over the system has to be deployed into
production and left operating. Implementation is often neglected on project plans. It
requires considerable thought and time to document procedures that will be used for
many years to come.

9.1 Configuration Management Procedures


The configuration management procedures should cover all aspects of the
changes to the configuration from applying patches and new releases through to
system software upgrades.

26 A list of testing tools can be found at: http://www.testingfaqs.org/

Historical Data Migration Plan


When a data warehouse is deployed it is usual that some amount of historical
data is required. There should be a plan that identifies what data is required, how
far back in history it needs to go (one week, one year, etc.), how long this data will
take to load, whether it will be loaded before or after go-live, and what impact the
load will have on day-to-day operation whilst it is running.

9.2 Operations Guide


The operations guide is intended for those with responsibility for looking after the
system on a day-to-day basis. These will not be the developers who originally
created the system, and therefore a simple, clear guide needs to be created that sets
out what needs to be done routinely, what needs to be checked regularly and the
escalation procedures in case of exceptions and failures.

9.3 Capacity Plan


A document describing the Initial Capacity Plan will have already been produced.
However, various factors (e.g. sudden growth in sales, mergers and acquisitions,
new product lines, more people making use of the system or a new version of some
software component) can have a dramatic effect on the capacity requirements of the
system. As such, there should be a regularly updated capacity plan that monitors the
available disk, CPU and memory resources of the solution, ensures that sufficient
resource is available and ensures that any procurement of additional resource is done
in line with the company budgetary cycle.
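A simple projection shows how monitoring data can feed procurement decisions. The sketch below assumes linear growth and uses hypothetical figures for capacity, usage, growth and procurement lead time; real growth is rarely this regular.

```python
import math

def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Days before the disk is full, assuming linear growth."""
    if growth_gb_per_day <= 0:
        return math.inf
    return max(0.0, (capacity_gb - used_gb) / growth_gb_per_day)

def needs_procurement(capacity_gb, used_gb, growth_gb_per_day, lead_time_days):
    """True when the disk would fill before new capacity could arrive."""
    return days_until_full(capacity_gb, used_gb, growth_gb_per_day) <= lead_time_days

# Hypothetical figures: 200 GB disk, 150 GB used, growing 0.5 GB a day,
# with a 90-day procurement cycle.
print(days_until_full(200, 150, 0.5))        # 100.0
print(needs_procurement(200, 150, 0.5, 90))  # False
```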

9.4 Service Level Agreements (SLA)


A Service Level Agreement27 is a formal negotiated agreement between two
parties (e.g. the business users of the data warehouse and the IT department, or
the IT department and outsourced service providers). It documents the common
understanding about services, priorities, responsibilities, guarantees, etc. with the
main purpose being to agree on the level of service.
For example, it may specify the levels of availability, serviceability, performance,
operation or other attributes of the service like billing and even penalties in the
case of violation of the Service Level Agreement. A Service Level Agreement is
generally business oriented and does not go into much technical detail. Its
technical specifications are commonly described through a series of appendices28
known as Service Level Specifications (or SLS) that define the technical metrics
required.
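Availability is a typical example of a technical metric that a Service Level Specification might define. The sketch below shows the calculation under assumed figures; the 99.5% target and the four hours of downtime are hypothetical.

```python
def availability_percent(period_minutes, downtime_minutes):
    """Percentage of the period for which the service was available."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes

def meets_target(period_minutes, downtime_minutes, target_percent):
    """True when the achieved availability is at or above the target."""
    return availability_percent(period_minutes, downtime_minutes) >= target_percent

# Hypothetical month of 30 days with 4 hours of downtime and a 99.5% target.
minutes_in_month = 30 * 24 * 60
achieved = availability_percent(minutes_in_month, 240)
print(round(achieved, 2))                          # 99.44
print(meets_target(minutes_in_month, 240, 99.5))   # False
```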

9.5 Helpdesk Scripts


The helpdesk will need to be able to handle support calls. This is normally done
as a series of help desk scripts that provide the questions for the support desk
operators to ask the user. The operator can then either give the resolution or ask
subsequent questions (which are normally dependent on the result of previous
questions).
27 Definition in part from Wikipedia: http://en.wikipedia.org/wiki/Service_Level_Agreement
28 Depending on size the Service Level Specifications may become separate documents.

The helpdesk scripts are normally broken down into a number of categories such
as:

Support with using the front-end tools.

Server and Operational Issues.

Data Quality Issues including availability and currency of data.

Ad hoc enquiries.
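A helpdesk script of this kind is in effect a small decision tree, in which each answer selects the next question or a resolution. The sketch below illustrates the structure; the questions, answers and resolutions are hypothetical.

```python
# Hypothetical script: each node is either a question whose answers lead
# to another node, or a resolution for the operator to give.
SCRIPT = {
    "start": {
        "question": "Can you log in to the reporting tool?",
        "answers": {"no": "reset_password", "yes": "data_check"},
    },
    "data_check": {
        "question": "Is the report showing yesterday's data?",
        "answers": {"no": "escalate_load", "yes": "ad_hoc"},
    },
    "reset_password": {"resolution": "Reset the user's password."},
    "escalate_load": {"resolution": "Escalate to operations: overnight load may have failed."},
    "ad_hoc": {"resolution": "Log as an ad hoc enquiry."},
}

def run_script(script, answers, node="start"):
    """Follow the caller's answers through the script to a resolution."""
    answers = iter(answers)
    while "resolution" not in script[node]:
        node = script[node]["answers"][next(answers)]
    return script[node]["resolution"]

print(run_script(SCRIPT, ["yes", "no"]))
```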

9.6 Training Plan


The project will need to provide a training plan29. This is how users become
competent enough to use the system. It normally consists of the following steps:

Define the training goal:
The overall results or capabilities the user should attain.

Set the learning objectives:
What the user will be able to do as a result of the learning activities.

Learning methods and activities:
What the user will do in order to achieve the learning objectives.

Documentation and evidence of learning:
The tangible results produced by the user from the learning activities.

Evaluation:
Assessment and judgment on quality of evidence in order to conclude
whether the user has achieved the learning objectives or not.

Training plans should be created for all types of users and operators of the
system.

9.7 Operational Schedule


The Operational Schedule is the list of tasks that must be performed each hour,
day, week and month, etc. and any dependencies (e.g. must run after midnight,
must only run if a previous job is successful etc.). It is not only the ETL code but
also the backups and any maintenance windows that have to be included. These
are often implemented in Job Scheduling Tools30 that automate the process and
send alerts if anything fails.
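The dependency rule "must only run if a previous job is successful" can be sketched in a few lines. The example below is an illustration, not a substitute for a job scheduling tool; the job names are hypothetical and the lambdas stand in for real jobs.

```python
def run_schedule(jobs):
    """Run jobs in listed order, skipping any whose prerequisites did not succeed.

    jobs is a list of (name, prerequisites, action) tuples, where action
    returns True on success. Jobs must be listed after their prerequisites.
    """
    status = {}
    for name, prerequisites, action in jobs:
        if all(status.get(p) == "ok" for p in prerequisites):
            status[name] = "ok" if action() else "failed"
        else:
            status[name] = "skipped"
    return status

# Hypothetical nightly schedule.
nightly = [
    ("extract_sales", [], lambda: True),
    ("load_sales", ["extract_sales"], lambda: False),  # simulated failure
    ("build_reports", ["load_sales"], lambda: True),
    ("backup", [], lambda: True),
]
status = run_schedule(nightly)
print(status)
```

In this run the failed load causes the report build to be skipped, while the independent backup still runs.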

9.8 System Monitoring Plan


The system monitoring plan is the list of system components that are going to be
monitored, along with the thresholds at which warnings and errors are signalled. It
should also include the way in which each message is communicated (e.g.
audible or visible alert in a control room, SMS or text message, e-mail, etc.).

29 Adapted from: http://www.managementhelp.org/trng_dev/gen_plan.htm
30 See: http://www.jobschedulingtools.com/

Systems monitoring should also deal with heartbeat messages, i.e. messages
that tell you that the monitoring is still working. Monitoring information should be
retained so that it can be used to manage Service Level Agreements and to provide
information for Capacity Plans.
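Warning and error thresholds, together with a heartbeat check, can be sketched as follows. The threshold values and the permitted silence are hypothetical, and times are plain numbers of seconds to keep the example self-contained.

```python
def classify(value, warning, error):
    """Map a metric reading to OK, WARNING or ERROR using two thresholds."""
    if value >= error:
        return "ERROR"
    if value >= warning:
        return "WARNING"
    return "OK"

def heartbeat_missed(last_seen, now, max_silence_seconds):
    """True when the monitor has been silent for longer than allowed."""
    return (now - last_seen) > max_silence_seconds

# Hypothetical thresholds: warn at 80% disk used, error at 95%.
print(classify(72, warning=80, error=95))   # OK
print(classify(85, warning=80, error=95))   # WARNING
print(classify(97, warning=80, error=95))   # ERROR
print(heartbeat_missed(last_seen=1000, now=1400, max_silence_seconds=300))  # True
```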
The documents described in the implementation category interact as follows:

Figure 5 - Implementation Flow

10 Project Management
Up to now this document has described documents required for individual phases of
the project. There are a number of tools and templates required for the effective
governance31 of the programme or project. Project management should have
minimal impact on the process of development whilst ensuring that control over
resources, finances and scope is maintained. This category describes documents
used to control or assist in the management of a project.

10.1 Documentation Roadmap

The Documentation Roadmap is the document that describes all the documents
that should be produced for each of the phases of a project. The document you
are reading is an example of this document.

10.2 Project Plan

The project plan is the list of tasks and activities, with timescales, resources and
dependencies, that must be performed to deliver the solution. A project plan is
base-lined and regularly updated throughout the life of the project. It is important
that project plans have sufficient detail for short-term tasks without trying to
micro-manage them, whilst having larger objectives with less detailed activities for
the longer-term aspects of the project. The plan is then updated as sufficient detail
to plan later tasks becomes available.

31 White Paper - Data Warehouse Governance:
http://www.datamgmt.com/index.php?module=article&view=78

10.3 DRIVE Statements

A DRIVE statement is a short one-page template that helps a project manager assess
whether a project or work package should be undertaken. It looks at five aspects
in order to make the assessment:

Dependencies:
What is required before this work can start?

Risks & Issues:
What can go wrong with doing this and how will it affect the overall
business, this deliverable and/or other deliverables?

Imperative:
Why do we have to do this? What makes it so important?

Value:
What value will the business, team or overall project get from doing this?

Exploitation:
Once we have this solution how will we be able to take advantage of it?

10.4 SWOT Analysis

A SWOT32 analysis is often used in data warehouse projects as a way of
comparing different approaches to a problem. It does this by looking at the
following attributes of each approach:

Strengths

Weaknesses

Opportunities

Threats

10.5 MoSCoW Analysis

The MoSCoW33 analysis is a method of prioritising a list of requirements or
features of the system by breaking the list down into the following groups:

Must have in order to meet a minimum requirement.

Should have in order to get real value from the development.

Could have if there was available time or resources.

Would have if there were no limits on the development.

32 Further information at: http://en.wikipedia.org/wiki/SWOT_analysis
33 Further information at: http://en.wikipedia.org/wiki/MoSCoW_Method
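The MoSCoW groups described above lend themselves to a simple ordering of a requirements list. The sketch below is a minimal illustration; the requirements are hypothetical.

```python
MOSCOW_ORDER = ["Must", "Should", "Could", "Would"]

def prioritise(requirements):
    """Sort (description, group) pairs into MoSCoW order.

    The sort is stable, so the input order is kept within each group.
    """
    return sorted(requirements, key=lambda item: MOSCOW_ORDER.index(item[1]))

# Hypothetical requirements.
backlog = [
    ("Export reports to PDF", "Could"),
    ("Daily sales load", "Must"),
    ("Real-time dashboard", "Would"),
    ("Customer segmentation", "Should"),
]
ordered = prioritise(backlog)
print([description for description, _ in ordered])
```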

10.6 Change Requests (CR)

The change request is a critical component of any project and is vital to data
warehouse projects. At the outset of this document the requirements gathering
process was discussed; however, during the lifecycle of the project the
requirements (and other aspects of the project) will change. The change request
is the template that documents a change from the original requirement to what is
now required. Change requests can be accepted or rejected as appropriate and
should be encouraged as a way to prevent uncontrolled and un-scoped
development from occurring.

10.7 Risk Register

The risk register is a list of events that may happen. If the event occurs then it will
have some negative impact on the project in terms of cost, resource or time. This
is contrasted with an issue, which is something that has already happened and
therefore needs to be managed.
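Entries in a register of this kind are commonly ranked by combining how likely each risk is with how much damage it would do. The sketch below assumes three-level scales and multiplies the two levels to give a score; this scoring rule and the register entries are assumptions made for illustration and are not prescribed by this paper.

```python
LEVELS = {"Low": 1, "Medium": 2, "High": 3}

def risk_score(probability, impact):
    """Score a risk from 1 (coolest) to 9 (hottest)."""
    return LEVELS[probability] * LEVELS[impact]

# Hypothetical register entries: (description, probability, impact).
register = [
    ("Source system upgrade changes extract format", "Medium", "High"),
    ("Key developer leaves", "Low", "Medium"),
    ("Disk capacity exhausted during history load", "High", "High"),
]
hottest_first = sorted(register, key=lambda r: risk_score(r[1], r[2]), reverse=True)
print([description for description, _, _ in hottest_first])
```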

A risk can be described in two dimensions:

The first is the probability of it happening, which is a measure of how likely
it is that a risk will become an issue.

The second is the impact, a measure of the cost it will have in terms of
resource, time or scope.

The hotter the risk the more attention should be paid to it.

Figure 6 - Risk Assessment (a grid of Probability against Impact, each rated Low, Medium or High)

10.8 Issue Log (BUG)

The issue log is the active management of issues that have arisen. This is best
managed with an issue-tracking tool34 that supports the allocation of work to
resources and tracks the history of actions taken in response to an issue. Each
issue has a lifecycle that starts with its being reported and ends in resolution.35

34 A list of issue tracking tools can be found at: http://www.testingfaqs.org/
35 Bugzilla has one of the best descriptions of the lifecycle of an issue. This can be
found at: http://www.bugzilla.org/docs/3.0/html/lifecycle.html and is reproduced in
Appendix 1 of this document.

10.9 Key Design Decisions (KDD)

The key design decision is a template to record significant design decisions. It
records the issue, the chosen option, any rejected options and the rationale behind
the decision.
Examples of when to use a Key Design Decision might include the choice of a
specific tool for a specific function, the choice of data model style, etc. This is
important for long-term programmes and projects as some decisions are
questioned when reviewed at a later stage. This is often done without context and
justification for how the original decision was made and sometimes without the
original decision makers being available.
The template helps to stop project managers constantly returning to resolved
issues. The document should contain the justification for the decision and any
rejected opposing arguments.

11 Miscellaneous
The final category of this document describes some general-purpose documents that
a project will find useful. The direct impact of the data warehouse project will not
always be visible to business users, who may see it as a large budget line with little
benefit. It is therefore important to market the data warehouse to the wider business
audience.
Business users should understand how and when they are getting information from
the data warehouse rather than other sources and see the impact of data quality
initiatives, etc. Therefore, Data Management & Warehousing recommend that all
documents use a consistent set of templates and that where a set of templates are
used they are branded with the name of the project rather than any third party that is
contributing to the project.

11.1 General Purpose Document

A document template with a standard look and feel and the required categories, for
use with any project document.

11.2 General Purpose Presentation

This document is a presentation with a standard look and feel. This should be
used for all presentations that are given either inside or outside the team.

11.3 Meeting Agenda

A standard agenda template for meetings.

11.4 Memo

This document provides a standard memo format for anyone who is recording
formal aspects of the project outside the documentation roadmap.

Summary
Many data warehousing projects are both long running and poorly documented. This
does not mean that there is not a lot of documentation, just a lack of the right
documentation in the right place. It is the quality and availability of the documentation
that leads to an understanding of what is available and hence to the value and reputation
of the data warehouse itself.
This white paper has looked at a consistent set of documents developed over fifteen
years of project experience. It reflects a desire to develop the right amount of
documentation at the right time in the project lifecycle and to store it in the right
place. Doing
so means moving some documents held on project shared drives to web based media
and publishing documentation to a wider audience, whilst replacing some documents
with online tools. It is essential to the success of a data warehouse project that a culture
of open access is fostered and that the documentation is seen as the entry point to the
data warehouse.
Data Management & Warehousing has identified three aspects to essential
documentation:

A roadmap that describes what documentation is required and how it fits
together.

Team members within the project who use the templates, create quality documents
and store them in the project repositories.

Easy access for people outside the project team to the documentation, including
publication or notification of changes, updates and new releases.

This white paper has provided the documentation roadmap with both explanations and a
significant number of examples. It has also looked at some of the issues associated with
the distribution of information outside the project team. It has highlighted that the
processes and procedures required to create and store the information in the first place
are a matter of good project governance.
Data Management & Warehousing believe that the documents described here cover all
the areas necessary for a major programme of work. However, this is only a guide and a
set of templates and these should be adjusted to meet the needs of the programme. By
combining the documentation roadmap, the project plan and suitable governance a
project will have developed a strong foundation for the real work of developing a
successful data warehouse.

Appendices
Appendix 1 Lifecycle of a bug
The lifecycle of a bug (or issue) is taken from the Bugzilla36 documentation.

36 The original can be found at: http://www.bugzilla.org/docs/3.0/html/lifecycle.html

Appendix 2 Project Quick Start Infrastructure


Readers of this document and some of our other white papers37 will be concerned
about just how much effort is required to get a data warehouse project under way.
Data Management & Warehousing do not recommend products or individual vendors
unless specifically requested to do so for a particular project or need. However, there
is a need to suggest a basic infrastructure for organisations that do not have anything
in place. The basic infrastructure is neither exhaustive nor exclusive but a guideline
configuration. It is very cheap and sufficient to support a very large organisation for
several years:

A small server38
For example a dual core 2GHz CPU, 2GB memory, two (for mirroring) 200GB
disks.

Network connectivity
Normally two network cards, one for the LAN and the other for the internet, IP
Addresses and server names.

Remote backup capability
For example a location on the SAN where files can be backed up to every day or
more frequently if required.

Linux
A version such as Red Hat, SUSE, CentOS, Debian, etc.

Apache Web Server
To provide all web services.

Bugzilla
To provide issue tracking.

Samba
To provide a Microsoft compatible shared file system.

CVS
To provide the source code control.

CVSWeb
To provide a web interface to CVS.

Perl
Pre-requisite language for Bugzilla.

CPAN Bundle::Bugzilla
Pre-requisite language modules for Bugzilla.

A Content Management System (CMS)39 package
Data Management & Warehousing use phpWebsite but any will do.

37 Overview Architecture for Enterprise Data Warehouses and Data Warehouse Governance
are both available from http://www.datamgmt.com.
38 Many organisations will have a server being de-commissioned from some other project
that could be re-used for this; it does not have to be very powerful.
39 List available from http://www.cmsmatrix.org/

A Wiki
Data Management & Warehousing use the one included with phpWebsite but
any will do.

PHP
Pre-requisite language for phpWebsite.

MySQL
Pre-requisite database for Bugzilla and phpWebsite.

From a technical point of view this server can be built and the software downloaded,
installed, configured, secured and put on the internet and shared onto the LAN very
quickly. Normally an experienced Linux Systems Administrator could configure a
virtually maintenance free solution within a couple of days and as the software is all
free the only costs incurred will be for time and hardware. This is also the basic
configuration of the Data Management & Warehousing website.
In addition it is recommended that the following desktop software be provided:

Office product
e.g. Microsoft Office, Star Office, etc.

CVS Client
Data Management & Warehousing normally use WinCVS.

This list is for a server for the project governance and documentation of a data
warehouse project and does not include the development, test and production
environments for the data warehouse itself. Its simplicity, fast setup and low cost
are a demonstration of low impact governance of a data warehouse project.

References
The section below represents some useful resources for those considering building a
data warehouse solution.

Web resources

Organisation                     Website
Data Management & Warehousing    http://www.datamgmt.com
Configuration Management Wiki    http://www.cmcrossroads.com/
Data Quality Tools               http://mediaproducts.gartner.com/
Software Testing Tools           http://www.testingfaqs.org/
Job Scheduling Tools             http://www.jobschedulingtools.com/
Data Modelling Tools             http://www.databaseanswers.com/
Project Management Tools         http://www.startwright.com/
CMS Tools                        http://www.cmsmatrix.org/
Bugzilla                         http://www.bugzilla.org

Copyright
2007 Data Management & Warehousing. All rights reserved. Reproduction not
permitted without written authorisation. References to other companies and their
products use trademarks owned by the respective companies and are for reference
purposes only.
