6
The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model is
appropriate for online transaction processing. A data warehouse, however, requires a concise,
subject-oriented schema that facilitates online data analysis. The most popular data model for a data warehouse
is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or
a fact constellation schema.
Learning Outcomes:
After successfully completing this lesson, students will be able to describe the schemas available for
the multidimensional model, explain the difference between a data warehouse and a data mart,
define a data warehouse using DMQL, and explain what a concept hierarchy is.
Lesson Outline:
What schemas are available for the multidimensional model?
What is a data mart?
How is a data warehouse defined using the Data Mining Query Language (DMQL)?
What is a concept hierarchy?
Notice that in the star schema, each dimension is represented by only one table, and each table contains
a set of attributes. For example, the location dimension table lists the attribute set location key, street,
city, province or state, country. This constraint may introduce some redundancy. For example, “Vancouver”
and “Victoria” are both cities in the Canadian province of British Columbia. Entries for such cities in the
location dimension table will create redundancy among the attributes province or state and country, that
is, (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada).
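The redundancy described above can be made visible with a small sketch. It builds a single location dimension table in an in-memory SQLite database and counts how often each (province or state, country) pair is stored. The street values are placeholders invented for the illustration, and the snake_case column names are an assumed spelling of the attributes named in the text.

```python
import sqlite3

# One denormalized dimension table, as in a star schema.
# Column names are an assumed snake_case spelling of the text's attributes.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE location (
           location_key INTEGER PRIMARY KEY,
           street TEXT,
           city TEXT,
           province_or_state TEXT,
           country TEXT)"""
)
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?, ?, ?)",
    [
        # Street values are made-up placeholders.
        (1, "1 Example St", "Vancouver", "British Columbia", "Canada"),
        (2, "2 Example St", "Victoria", "British Columbia", "Canada"),
        (3, "3 Example St", "Chicago", "Illinois", "USA"),
    ],
)

# Every (province_or_state, country) pair stored more than once is redundant.
repeated_pairs = conn.execute(
    """SELECT province_or_state, country, COUNT(*)
       FROM location
       GROUP BY province_or_state, country
       HAVING COUNT(*) > 1"""
).fetchall()
print(repeated_pairs)  # [('British Columbia', 'Canada', 2)]
```

The pair (British Columbia, Canada) is stored once per city, which is exactly the repetition that the snowflake schema removes by normalization.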
A snowflake schema for sales is given in Figure 6-2. The single dimension table for the item in the star
schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example,
the item dimension table now contains the attributes item key, item name, brand, type, and supplier key,
where supplier key is linked to the supplier dimension table, containing supplier key and supplier type
information. Similarly, the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to the city dimension.
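As a rough sketch of this normalization, the item dimension can be split into item and supplier tables linked by a supplier key. The table and column names follow the attributes listed above. The rows (item names, brands, supplier type) are invented for the illustration, and SQLite is only an assumed stand-in for whatever database system hosts the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Supplier details are stored once, in their own dimension table.
    CREATE TABLE supplier (
        supplier_key INTEGER PRIMARY KEY,
        supplier_type TEXT);

    -- The item table keeps only a key that links to the supplier table.
    CREATE TABLE item (
        item_key INTEGER PRIMARY KEY,
        item_name TEXT,
        brand TEXT,
        type TEXT,
        supplier_key INTEGER REFERENCES supplier(supplier_key));

    -- Made-up sample rows.
    INSERT INTO supplier VALUES (10, 'wholesale');
    INSERT INTO item VALUES (1, 'laptop', 'BrandA', 'computer', 10);
    INSERT INTO item VALUES (2, 'monitor', 'BrandB', 'display', 10);
    """
)

# Recovering the star-schema view of an item requires a join.
rows = conn.execute(
    """SELECT i.item_name, s.supplier_type
       FROM item AS i JOIN supplier AS s USING (supplier_key)
       ORDER BY i.item_key"""
).fetchall()
supplier_count = conn.execute("SELECT COUNT(*) FROM supplier").fetchone()[0]
print(rows, supplier_count)
```

Both items share one supplier row, so the supplier type is stored once. The price is the extra join needed at query time.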
A fact constellation schema consists of multiple fact tables and multiple dimensions. A fact constellation
schema is shown in Figure 6-3. This schema specifies two fact tables, sales and shipping. The sales table
definition is identical to that of the star schema (Figure 6-1). The shipping table has five dimensions, or
keys (item key, time key, shipper key, from location, and to location), and two measures (dollars cost and
units shipped). A fact constellation schema allows dimension tables to be shared between fact tables. For
example, the dimension tables for time, item, and location are shared between the sales and
shipping fact tables.
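The sharing of dimension tables can be sketched as plain data. Each fact table is modeled as a mapping from its key columns to the dimension table each key refers to. For sales, only the dimensions the text names as shared are listed, so this is an assumption and not the full sales definition. The mapping of from location and to location onto the location dimension is inferred from the statement that location is shared.

```python
# Key column -> dimension table it references.
# Assumption: the sales entry lists only the dimensions named as shared in the text.
sales_dimensions = {
    "time_key": "time",
    "item_key": "item",
    "location_key": "location",
}

shipping_dimensions = {
    "item_key": "item",
    "time_key": "time",
    "shipper_key": "shipper",
    "from_location": "location",  # both location keys point at one dimension table
    "to_location": "location",
}

shipping_measures = ["dollars_cost", "units_shipped"]

# Dimension tables referenced by both fact tables are the shared ones.
shared = sorted(set(sales_dimensions.values()) & set(shipping_dimensions.values()))
print(shared)  # ['item', 'location', 'time']
```

Note that shipping has five keys but only four distinct dimension tables, because location is referenced twice.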
This definition is similar to that of sales_star given above, except that, here, the item and location
dimension tables are normalized. For instance, the item dimension of the sales_star data cube has been
normalized in the sales snowflake cube into two dimension tables, item and supplier. Note that the
dimension definition for the supplier is specified within the definition for the item. Defining supplier in
this way implicitly creates a supplier key in the item dimension table definition. Similarly, the location
dimension of the sales_star data cube has been normalized in the sales snowflake cube into two
dimension tables, location and city. The dimension definition for the city is specified within the definition
of location. In this way, a city key is implicitly created in the location dimension table definition.
A define cube statement is used to define data cubes for sales and shipping, corresponding to the two fact
tables. Note that the time, item, and location dimensions of the sales cube are shared with the shipping
cube. This is indicated for the time dimension, for example, as follows. Under the define cube statement
for shipping, the statement “define dimension time as time in cube sales” is specified.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts. Consider a concept hierarchy for the dimension location. City values for location include
Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state
to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can, in turn, be mapped to the country to which they belong, such as Canada or
the USA. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level
concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
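These mappings can be sketched as two lookup tables plus a roll-up function that aggregates a measure at a chosen level of the hierarchy. The unit counts are invented for the illustration, and the function names generalize and roll_up are hypothetical.

```python
from collections import Counter

# Two mapping steps of the location hierarchy: city -> province/state -> country.
CITY_TO_REGION = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
REGION_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}


def generalize(city, level):
    """Map a city to its value at the requested hierarchy level."""
    if level == "city":
        return city
    region = CITY_TO_REGION[city]
    if level == "province_or_state":
        return region
    if level == "country":
        return REGION_TO_COUNTRY[region]
    raise ValueError(f"unknown level: {level}")


def roll_up(measures, level):
    """Aggregate a per-city measure at a more general level."""
    totals = Counter()
    for city, value in measures.items():
        totals[generalize(city, level)] += value
    return dict(totals)


# Made-up measure values, keyed by city.
units_sold = {"Vancouver": 5, "Toronto": 3, "New York": 4, "Chicago": 2}
print(roll_up(units_sold, "country"))  # {'Canada': 8, 'USA': 6}
```

Rolling up to the country level merges the four city values into two totals, which is the kind of generalization a concept hierarchy makes possible.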
The concept hierarchy described above is illustrated in Figure 6-4. Many concept hierarchies are implicit
within the database schema. For example, suppose that the dimension location is described by the
attributes number, street, city, province or state, zip code, and country. A total order relates these
attributes, forming a concept hierarchy such as “street < city < province or state < country”. This hierarchy
is shown in Figure 6-5.
Figure 6-5: Hierarchy for location
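A schema-level hierarchy such as “street < city < province or state < country” can be sketched as an ordered list of attribute names, from the most specific level to the most general. The helper names below are hypothetical and serve only to show how the total order can be queried.

```python
# Levels of the location hierarchy, ordered from most specific to most general.
LEVELS = ["street", "city", "province_or_state", "country"]


def is_lower_level(a, b):
    """Return True if attribute a is strictly more specific than attribute b."""
    return LEVELS.index(a) < LEVELS.index(b)


def more_general_levels(level):
    """Return every level that the given level can be rolled up to."""
    return LEVELS[LEVELS.index(level) + 1:]


print(is_lower_level("street", "country"))
print(more_general_levels("city"))
```

Because the order is total, any two levels are comparable, so a single list is enough to represent it.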