White Paper - Data Warehouse Documentation Roadmap
DAVID M WALKER
Version: 1.0
Date: 05/04/2007
Table of Contents
Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
Considerations
Documentation as a tool
Which tools and products to use
    What about a Wiki?
Put your documentation on the Internet!
Document Short Names
Overview Diagram
The Templates
    1 Concept
    2 Requirements
    3 Architecture
    4 Data Models
    5 Analysis
    6 Design
    7 Build
    8 Test
    9 Implementation
    10 Project Management
    11 Miscellaneous
Summary
Appendices
    Appendix 1 Lifecycle of a bug
    Appendix 2 Project Quick Start Infrastructure
References
    Web resources
Copyright
Synopsis
All projects need documentation and many companies provide templates as part of a
methodology. This document describes the templates, tools and source documents used
by Data Management & Warehousing. It serves two purposes:
To serve as a checklist for projects that use other methodologies or create their
own set of documents. This allows the project to ensure that the documentation
covers the essential areas for describing the data warehouse.
To demonstrate our approach to our clients by describing the templates and
deliverables that are produced.
Intended Audience
Reader                   Recommended Reading
Executive                Synopsis to Overview Diagram
Business Users           Synopsis to Overview Diagram
IT Management            Synopsis to Overview Diagram
IT Strategy              Synopsis to Overview Diagram
IT Project Management    Entire Document
IT Developers            Entire Document
Introduction
A data warehouse programme will often run for many years and produce much
documentation. Data Management & Warehousing has identified three essential aspects
for documentation:
Team members within the project to use the templates, create quality documents
and store them to the project repositories.
Easy access for people outside the project team to the documentation including
publication or notification of changes, updates and new releases.
This document provides the roadmap and looks at some of the issues associated with
the distribution of information outside the project team. The processes and procedures
required to create and store the information in the first place are a matter of project
governance. [1]
The documents listed are the templates used by Data Management & Warehousing and
we believe that they cover all the areas necessary for a major programme of work.
Templates, however, are created to fulfil a need and should be adapted as required. By
combining this document, the project plan and suitable governance, a project will have
developed a strong foundation for developing a successful data warehouse.
[1] Data Management & Warehousing have published a white paper on Data Warehouse
Governance which is available from the website at
http://www.datamgmt.com/index.php?module=article&view=78
Considerations
This document assumes that a data warehouse is a long-term investment by an
organisation and as such will form a programme of work. This programme will be broken
down into projects and where appropriate a project will have subsidiary phases.
The document also assumes that the project will maintain tight change control. Each
document should have:
A draft, review, publish process that will allow a document version to be signed
off.
A process that over time allows a document to have many signed off versions.
Programmes that do not achieve this will find that the documentation becomes both
contradictory and a burden in itself and this can become a risk factor in the success of
the overall programme.
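The draft, review, publish process described above can be sketched as a small state machine. This is an illustration only: the class name `ControlledDocument`, the state names and the "1.0, 2.0" version numbering are assumptions made for the example and are not part of the Data Management & Warehousing templates.

```python
from dataclasses import dataclass, field

# Hypothetical states for the draft, review, publish process.
DRAFT, REVIEW, PUBLISHED = "draft", "review", "published"


@dataclass
class ControlledDocument:
    """A document that can accumulate many signed-off versions over time."""

    title: str
    state: str = DRAFT
    signed_off: list = field(default_factory=list)  # e.g. ["1.0", "2.0"]

    def submit_for_review(self) -> None:
        if self.state != DRAFT:
            raise ValueError(f"cannot review a document in state {self.state!r}")
        self.state = REVIEW

    def reject(self) -> None:
        """Send a document under review back to draft for rework."""
        if self.state != REVIEW:
            raise ValueError("only a document under review can be rejected")
        self.state = DRAFT

    def sign_off(self) -> str:
        """Publish the reviewed draft as the next signed-off version."""
        if self.state != REVIEW:
            raise ValueError("only a reviewed document can be signed off")
        version = f"{len(self.signed_off) + 1}.0"  # assumed numbering scheme
        self.signed_off.append(version)
        self.state = PUBLISHED
        return version

    def start_new_draft(self) -> None:
        """Open a new draft on top of the latest signed-off version."""
        if self.state != PUBLISHED:
            raise ValueError("a new draft starts from a published version")
        self.state = DRAFT


doc = ControlledDocument("Data Warehouse Business Requirements")
doc.submit_for_review()
first = doc.sign_off()
doc.start_new_draft()
doc.submit_for_review()
second = doc.sign_off()
```

Keeping the list of signed-off versions separate from the current state is one way to let a new draft progress without disturbing the version that has already been published.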
Documentation as a tool
Every project acknowledges the need to document [2] itself. However, this ranges from lip
service and the production of some minimal notes to volumes of shelf-ware, paper that
sits unread for years on end because no one dares throw it away. Neither of these
outcomes is of any value.
Here are some guidelines for when and how to produce documentation:
Poor grammar and bad writing are often signs of poor comprehension. [3]
Do not let working cultures that put too great a premium on knowing everything
dominate.
Therefore pull from this roadmap what you need; do not produce everything just because
it is there.
[2] See also Agile Documentation: A Pattern Guide to Producing Lightweight Documents for
Software Projects (Wiley Software Patterns Series) by Andreas Rueping.
[3] From Redhat Magazine: How to write really good documentation: Four Rules and an Axiom.
http://www.redhatmagazine.com/2007/01/30/how-to-write-really-good-documentationfourrules-and-an-axiom/
Type of Document         Example Template
Code Repository          CVS
Data Cleaning tools
Data Models
Data Profiling
Diagram                  Microsoft Visio
Document                 Microsoft Word
Document Distribution    Adobe Acrobat
Issue Log                Bugzilla
Presentation             Microsoft PowerPoint
Project Plan             Microsoft Project
System Testing
[11] Whilst we recommend putting it on the internet, access should, as with any web application,
be controlled and secure.
[12] See Appendix 2 Project Quick Start Infrastructure for a reference configuration.
Overview Diagram
[Figure: Overview Diagram. The diagram shows each document with its ID and name; the legend identifies tools, templates and documents for which a sample is available.]
The Templates
The templates are divided into eleven categories. Within each category the documents
are numbered sequentially. Some templates depend on others (indicated by the arrows
on the diagram), whilst others can be done at any time within the phase. Finally, category
10 documents exist across the project lifecycle, whilst category 11 templates are
general ones that can be used as required.
1 Concept
The concept phase is about describing what the big idea is. The business may
have a concept and the IT team will be able to describe the major components and
concepts of a data warehouse.
2 Requirements
The requirements gathering phase of any data warehouse is one of the most
difficult. The objective of these templates is to give breadth and depth to the
requirements. Breadth is the ability to ensure that all truly required information
would be covered, whilst depth is the amount of detail that is specified in the
requirements to ensure that the developers have sufficient, unambiguous, detail
with which to develop.
Requirements should have a programme-long life cycle. After the initial version of
the requirements is developed a project can start the build. However, the business
moves on, so whilst the build phase is under way it is important that new
versions of the requirements are also being developed. A project within a
programme of work should have a fixed version of the requirements; however, each
project may work with a different version of the requirements.
3 Architecture
The architecture category contains a number of documents that describe how the
system should be built. These provide a blueprint to developers on how to approach
any particular problem by helping them select the appropriate tools, platforms and
configurations to both meet their need and conform to the overall strategy.
[14] As vendors broaden out their products they start to overlap in functionality. Ensuring that
two different products are not used to build the solution should be covered in this document.
4 Data Models
Data models are (normally) graphical representations of the data that is required.
They are usually created in specialised modelling software that can also generate the
DDL required to create the physical objects in the database.
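To illustrate what generating DDL from a model involves, the sketch below renders a `CREATE TABLE` statement from a very small in-memory model. The classes `Column` and `Table`, the helper `to_ddl` and the table `customer_dim` are all invented for this example; real modelling tools emit dialect-specific DDL with far more options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str
    nullable: bool = True
    primary_key: bool = False


@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple


def to_ddl(table: Table) -> str:
    """Render a CREATE TABLE statement from the in-memory model."""
    lines = []
    for col in table.columns:
        null_clause = "" if col.nullable else " NOT NULL"
        lines.append(f"    {col.name} {col.sql_type}{null_clause}")
    keys = [col.name for col in table.columns if col.primary_key]
    if keys:
        lines.append(f"    PRIMARY KEY ({', '.join(keys)})")
    body = ",\n".join(lines)
    return f"CREATE TABLE {table.name} (\n{body}\n);"


# Hypothetical dimension table used only for illustration.
customer_dim = Table(
    name="customer_dim",
    columns=(
        Column("customer_key", "INTEGER", nullable=False, primary_key=True),
        Column("customer_name", "VARCHAR(100)", nullable=False),
        Column("postcode", "VARCHAR(10)"),
    ),
)

ddl = to_ddl(customer_dim)
print(ddl)
```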
[15] This is sometimes called a Disaster Recovery Plan; however, as a name this does not
cover the full range of activities that are required.
[16] Heuristic data cleansing uses methods such as fuzzy logic, etc. to try to clean data. This is
most successful with data such as addresses where there is the opportunity for lots of human
error in the information.
[17] A physical data model is a representation of a data design which takes into account the
facilities and constraints of a given database management system. A complete physical data
model will include all the database artefacts required to create relationships between tables or
achieve performance goals.
Wikipedia: http://en.wikipedia.org/wiki/Physical_data_model
[18] A relational database schema that is used to represent multidimensional data. The data is
stored in a central fact table, with one or more tables holding information on each dimension.
Dimensions have levels and all levels are usually shown as columns in each dimension table.
OLAP Report: http://www.olapreport.com/glossary.htm
5 Analysis
The goal of the analysis phase is to identify the sources of the information required
to populate the physical data models. The main goal should be to populate the
repository data model as this is used as the source for all data in the data marts.
This is achieved in a number of steps:
[19] Data Management & Warehousing prefer target orientated analysis, which asks the
question "Which sources do I need in order to populate this target table completely?", to the
source to target mapping method, which asks the question "Which target entities do I need to
put data into from this source?" This is because the thought process used in the first method
is geared towards the delivery of the information to the user rather than the extraction of the
data by the developer.
[20] ETL: Extract, Transform, Load. Code written to move data from the sources to the data
warehouse.
6 Design
The design phase concentrates on taking the analysis and creating a plan for the
code build.
Graph is a term taken from mathematics. Graph theory is the study of graphs, mathematical
structures used to model relations between objects in a given collection. A "graph" in this
context refers to a collection of mappings and their dependencies. The most famous graph
theory problem is known as The Seven Bridges of Königsberg and was solved by Leonhard
Euler. http://mathforum.org/isaac/problems/bridges1.html
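A graph of mappings and their dependencies, as described above, can be turned into an execution order with a topological sort. The sketch below uses Python's standard `graphlib` module. The mapping names are hypothetical, and the paper does not state that its ETL Execution Plan is derived this way; this is one possible approach.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL mappings: each key lists the mappings it depends on.
dependencies = {
    "load_customer": {"stage_customer"},
    "load_product": {"stage_product"},
    "load_sales": {"load_customer", "load_product", "stage_sales"},
    "build_sales_mart": {"load_sales"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

# Group mappings into waves: everything in one wave can run in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Grouping the mappings into waves, rather than producing a single flat order, shows which mappings have no dependency on each other and could therefore be scheduled together.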
7 Build
This white paper is a roadmap to the documentation that should be produced during a
data warehouse project. It should follow the structure of most projects but it is not a
substitute for a project plan.
8 Test
Testing software is operating the software under controlled conditions, to:
1. Verify that it behaves as specified.
   Verification is the checking or testing of items, including software, for
   conformance and consistency by evaluating the results against pre-specified
   requirements.
2. Detect errors.
   Testing should intentionally attempt to make things go wrong to
   determine if things happen when they should not or things do not
   happen when they should.
   In this area it is important to test boundary conditions, e.g. what
   happens with a percentage over 100% or less than 0%.
3. Validate that what has been specified is what the user actually
   wanted.
   Validation looks at the system correctness, i.e. it is the process of
   checking that what has been specified is what the user actually
   wanted.
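The boundary condition example above (a percentage over 100% or less than 0%) can be sketched as a small table of test cases. The function names `validate_percentage` and `is_accepted` are invented for the example, and treating exactly 0 and exactly 100 as valid is an assumption; a real specification would need to state this.

```python
def validate_percentage(value: float) -> float:
    """Accept values in the closed range 0 to 100 and reject anything else."""
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {value}")
    return value


def is_accepted(value: float) -> bool:
    """Report whether validate_percentage accepts the value."""
    try:
        validate_percentage(value)
        return True
    except ValueError:
        return False


# Boundary cases: values on, just inside and just outside each limit,
# paired with the outcome the (assumed) specification expects.
boundary_cases = {
    -0.01: False,
    0: True,
    0.01: True,
    99.99: True,
    100: True,
    100.01: False,
}

results = {value: is_accepted(value) for value in boundary_cases}
failures = [v for v, expected in boundary_cases.items() if results[v] != expected]
print("failures:", failures)
```

Listing the cases as data makes it easy to see that both sides of each limit are covered, which is the point of testing boundary conditions rather than only typical values.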
9 Implementation
After the development and testing are over the system has to be deployed into
production and left operating. Implementation is often neglected on project plans. It
requires considerable thought and time to document procedures that will be used for
many years to come.
The helpdesk scripts are normally broken down into a number of categories such
as:
Ad hoc enquiries.
Evaluation:
Assessment and judgment on quality of evidence in order to conclude
whether the user has achieved the learning objectives or not.
Training plans should be created for all types of users and operators of the
system.
Systems monitoring should also deal with heartbeat messages, i.e. messages
that tell you that the monitoring is still working. Monitoring information should be
retained so that it can be used to manage Service Level Agreements and to provide
information for Capacity Plans.
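A heartbeat check of the kind described above can be as simple as comparing the time of the last heartbeat with an agreed maximum silence. In the sketch below the five minute threshold and the function name `monitoring_is_alive` are assumptions for illustration; the appropriate interval depends on the monitoring tool and the Service Level Agreements in force.

```python
from datetime import datetime, timedelta

# Assumed policy: the monitor should report in at least every five minutes.
MAX_SILENCE = timedelta(minutes=5)


def monitoring_is_alive(last_heartbeat: datetime, now: datetime,
                        max_silence: timedelta = MAX_SILENCE) -> bool:
    """Return True if the most recent heartbeat is recent enough."""
    return now - last_heartbeat <= max_silence


# Fixed timestamps keep the example repeatable.
now = datetime(2007, 4, 5, 12, 0, 0)
recent = now - timedelta(minutes=2)
stale = now - timedelta(minutes=30)

print(monitoring_is_alive(recent, now))
print(monitoring_is_alive(stale, now))
```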
The documents described in the implementation category interact as follows:
10 Project Management
Up to now this document has described documents required for individual phases of
the project. There are a number of tools and templates required for the effective
governance of the programme or project. Project management should have
minimal impact on the process of development whilst ensuring that control over
resources, finances and scope is maintained. This category describes documents
used to control or assist in the management of a project.
10.1 Documentation Roadmap
The Documentation Roadmap is the document that describes all the documents
that should be produced for each of the phases of a project. The document you
are reading is an example of this document.
10.2 Project Plan
The project plan is the list of tasks and activities with timescales, resources and
dependencies that must be performed to deliver the solution. A project plan is
base-lined and regularly updated throughout the life of the project. It is important
that project plans have sufficient detail for short-term tasks without trying to
micro-manage them, whilst having larger objectives with less detailed activities for the
longer-term aspects of the project. The plan is then updated as sufficient detail to
plan later tasks becomes available.
10.3 DRIVE Statements
A DRIVE statement is a short one-page template that helps a project manager assess
whether a project or work package should be undertaken. It looks at five aspects
in order to make the assessment:
Dependencies:
What is required before this work can start?
Imperative:
Why do we have to do this? What makes it so important?
Value:
What value will the business, team or overall project get from doing this?
Exploitation:
Once we have this solution how will we be able to take advantage of it?
10.4 SWOT Analysis
Strengths
Weaknesses
Opportunities
Threats
10.5 MoSCoW Analysis
10.6 Change Request
The change request is a critical component of any project and is vital to data
warehouse projects. The requirements gathering process was discussed at the outset
of this document; however, during the lifecycle of the project the
requirements (and other aspects of the project) will change. The change request
is the template that documents a change from the original requirement to what is
now required. Change requests can be accepted or rejected as appropriate and
should be encouraged as a way to prevent uncontrolled and un-scoped
development from occurring.
10.7 Risk Register
The risk register is a list of events that may happen. If an event occurs then it will
have some negative impact on the project in terms of cost, resource or time. This
contrasts with an issue, which is something that has already happened and therefore
needs to be managed.
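A risk register is often ranked by combining the probability of each event with its impact. The sketch below scores each risk on Low, Medium and High scales. Multiplying the two weights is a common convention and an assumption here, not something this paper prescribes, and the register entries are invented for the example.

```python
from dataclasses import dataclass

# Assumed numeric weights for the Low / Medium / High scales.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


@dataclass(frozen=True)
class Risk:
    description: str
    probability: str  # "Low", "Medium" or "High"
    impact: str       # "Low", "Medium" or "High"

    @property
    def score(self) -> int:
        """Probability weight multiplied by impact weight (1 to 9)."""
        return LEVELS[self.probability] * LEVELS[self.impact]


# Hypothetical register entries, invented for the example.
register = [
    Risk("Source system owner withdraws access", "Low", "High"),
    Risk("Key developer leaves the project", "Medium", "Medium"),
    Risk("Data volumes exceed the capacity plan", "High", "High"),
]

# Review the register with the highest-scoring risks first.
ranked = sorted(register, key=lambda risk: risk.score, reverse=True)
for risk in ranked:
    print(f"{risk.score}  {risk.probability:<6} {risk.impact:<6} {risk.description}")
```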
[Figure: risk matrix with Probability (Low, Medium, High) on one axis and Impact (Low, Medium, High) on the other.]
10.8 Issue Log
The issue log supports the active management of issues that have arisen. This is best
managed with an issue-tracking tool that supports the allocation of work to
resources and tracks the history of actions taken in response to an issue. Each
issue has a lifecycle that starts with its being reported and ends in resolution.
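The idea of an issue lifecycle with a recorded history can be sketched as a table of permitted transitions. The state names below are a simplified, hypothetical lifecycle chosen for illustration; they are not Bugzilla's actual workflow, and an issue-tracking tool would normally provide this behaviour itself.

```python
# Hypothetical, simplified lifecycle; not Bugzilla's actual workflow.
TRANSITIONS = {
    "reported": {"assigned"},
    "assigned": {"resolved"},
    "resolved": {"closed", "reopened"},
    "reopened": {"assigned"},
    "closed": set(),
}


class Issue:
    """An issue that records every state change made to it."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.state = "reported"
        self.history = [("reported", "issue raised")]

    def move_to(self, new_state: str, note: str) -> None:
        """Apply a transition and record it, rejecting illegal moves."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        self.state = new_state
        self.history.append((new_state, note))


issue = Issue("Nightly load fails on duplicate customer keys")
issue.move_to("assigned", "allocated to the ETL developer")
issue.move_to("resolved", "de-duplication step added")
issue.move_to("closed", "fix verified by the reporter")
```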
10.9
11 Miscellaneous
The final category of this document describes some general-purpose documents that
a project will find useful. The direct impact of the data warehouse project will not
always be visible to business users, who may see it as a large budget line with little
benefit. It is therefore important to market the data warehouse to the wider business
audience.
Business users should understand how and when they are getting information from
the data warehouse rather than other sources and see the impact of data quality
initiatives, etc. Therefore, Data Management & Warehousing recommend that all
documents use a consistent set of templates and that where a set of templates are
used they are branded with the name of the project rather than any third party that is
contributing to the project.
11.1 General Purpose Document
A document template with a standard look and feel and the categories required for
any project document.
11.2 General Purpose Presentation
This document is a presentation template with a standard look and feel. It should be
used for all presentations that are given either inside or outside the team.
11.3 Meeting Agenda
11.4 Memo
This document provides a standard memo format for anyone who is recording
formal aspects of the project outside the documentation roadmap.
Summary
Many data warehousing projects are both long running and poorly documented. This
does not mean that there is not a lot of documentation, just a lack of the right
documentation in the right place. It is the quality and availability of the documentation
that leads to an understanding of what is available and hence to the value and reputation
of the data warehouse itself.
This white paper has looked at a consistent set of documents developed over fifteen
years of project experience. It reflects a desire to develop the right amount of
documentation at the right time in the project lifecycle and to store it in the right place. Doing
so means moving some documents held on project shared drives to web based media
and publishing documentation to a wider audience, whilst replacing some documents
with online tools. It is essential to the success of a data warehouse project that a culture
of open access is fostered and that the documentation is seen as the entry point to the
data warehouse.
Data Management & Warehousing has identified three aspects to essential
documentation:
Team members within the project to use the templates, create quality documents
and store them to the project repositories.
Easy access for people outside the project team to the documentation including
publication or notification of changes, updates and new releases.
This white paper has provided the documentation roadmap with both explanations and a
significant number of examples. It has also looked at some of the issues associated with
the distribution of information outside the project team. It has highlighted that the
processes and procedures required to create and store the information in the first place
are a matter of good project governance.
Data Management & Warehousing believe that the documents described here cover all
the areas necessary for a major programme of work. However, this is only a guide and a
set of templates and these should be adjusted to meet the needs of the programme. By
combining the documentation roadmap, the project plan and suitable governance a
project will have developed a strong foundation for the real work of developing a
successful data warehouse.
Appendices
Appendix 1 Lifecycle of a bug
The lifecycle of a bug (or issue) is taken from the Bugzilla documentation.
Appendix 2 Project Quick Start Infrastructure
A small server [38]
For example a dual core 2GHz CPU, 2GB memory, two (for mirroring) 200GB
disks.
Network connectivity
Normally two network cards, one for the LAN and the other for the internet, IP
addresses and server names.
Linux
A version such as Redhat, SuSe, CentOS, Debian, etc.
Bugzilla
To provide issue tracking.
Samba
To provide a Microsoft compatible shared file system.
CVS
To provide the source code control.
CVSWeb
To provide a web interface to CVS.
Perl
Pre-requisite language for Bugzilla.
CPAN Bundle::Bugzilla
Pre-requisite language modules for Bugzilla.
[37] Overview Architecture for Enterprise Data Warehouses and Data Warehouse Governance
are both available from http://www.datamgmt.com.
[38] Many organisations will have a server being de-commissioned from some other project that
could be re-used for this; it does not have to be very powerful.
[39] List available from http://www.cmsmatrix.org/
A Wiki
Data Management & Warehousing use the one included with phpWebsite but
any will do.
PHP
Pre-requisite language for phpWebsite.
MySQL
Pre-requisite database for Bugzilla and phpWebsite.
From a technical point of view this server can be built and the software downloaded,
installed, configured, secured and put on the internet and shared onto the LAN very
quickly. Normally an experienced Linux Systems Administrator could configure a
virtually maintenance-free solution within a couple of days, and as the software is all
free the only costs incurred will be for time and hardware. This is also the basic
configuration of the Data Management & Warehousing website.
In addition it is recommended that the following desktop software be provided:
Office product
e.g. Microsoft Office, Star Office, etc.
CVS Client
Data Management & Warehousing normally use WinCVS.
This list is for a server for the project governance and documentation of a data
warehouse project and does not include the development, test and production
environments for the data warehouse itself. Its simplicity, fast setup and low cost are
a demonstration of low impact governance of a data warehouse project.
References
The section below represents some useful resources for those considering building a
data warehouse solution.
Web resources
Organisation                      Website
Data Management & Warehousing     http://www.datamgmt.com
Configuration Management Wiki     http://www.cmcrossroads.com/
Data Quality Tools                http://mediaproducts.gartner.com/
Software Testing Tools            http://www.testingfaqs.org/
Job Scheduling Tools              http://www.jobschedulingtools.com/
Data Modelling Tools              http://www.databaseanswers.com/
Project Management Tools          http://www.startwright.com/
CMS Tools                         http://www.cmsmatrix.org/
Bugzilla                          http://www.bugzilla.org
Copyright
© 2007 Data Management & Warehousing. All rights reserved. Reproduction not
permitted without written authorisation. References to other companies and their
products use trademarks owned by the respective companies and are for reference
purposes only.