The document discusses schemas for the multidimensional model used in data warehousing, including star, snowflake, and fact constellation schemas, each with distinct characteristics and applications. It differentiates between data warehouses and data marts, explaining their scopes and typical schemas used. Additionally, it covers the Data Mining Query Language (DMQL) for defining data warehouses and concept hierarchies that map low-level concepts to higher-level ones.

6. Schemas for Multidimensional Model


Lesson Introduction

The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model is
appropriate for online transaction processing. A data warehouse, however, requires a concise,
subject-oriented schema that facilitates online data analysis. The most popular data model for a data
warehouse is the multidimensional model, which can take the form of a star schema, a snowflake schema,
or a fact constellation schema.

Learning Outcomes:
After successful completion of this lesson, students will be able to explain which schemas are
available for the multidimensional model, how a data warehouse differs from a data mart,
how to define a data warehouse using DMQL, and what a concept hierarchy is.

Lesson Outline:
• What schemas are available for the multidimensional model?
• What is a data mart?
• How is a data warehouse defined using the Data Mining Query Language (DMQL)?
• What is a concept hierarchy?

6.1 Star Schema


A star schema consists of one central fact table and a set of related dimension tables. It is the simplest
of the schemas available for multidimensional modeling. A star schema for sales is shown in Figure 6-1.
Sales are considered along four dimensions, namely, time, item, branch, and location. The schema contains
a central fact table for sales that includes keys to each of the four dimensions, along with two measures:
dollars sold and units sold.
Figure 6-1: Star Schema

Notice that in the star schema, each dimension is represented by only one table, and each table contains
a set of attributes. For example, the location dimension table lists the attribute set location key, street,
city, province or state, country. This constraint may introduce some redundancy. For example, “Vancouver”
and “Victoria” are both cities in the Canadian province of British Columbia. Entries for such cities in the
location dimension table will create redundancy among the attributes province or state and country, that
is, (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada).
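
The star layout and the redundancy it introduces can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under stated assumptions: the street values and sales figures are made up, and only the location dimension is built out (the time, item, and branch keys are left as plain integers without their own tables).

```python
import sqlite3

# In-memory database; column names follow the attributes listed in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
CREATE TABLE sales (              -- central fact table
    time_key INTEGER, item_key INTEGER, branch_key INTEGER,
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL, units_sold INTEGER
);
""")

# Hypothetical sample rows: both cities repeat the same province and country.
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", [
    (1, "1 Main St", "Vancouver", "British Columbia", "Canada"),
    (2, "2 Bay St", "Victoria", "British Columbia", "Canada"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 100.0, 2),
    (1, 2, 1, 2, 50.0, 1),
    (2, 1, 1, 1, 25.0, 1),
])

# One join from the fact table to the dimension is enough to roll up by province.
by_province = conn.execute("""
    SELECT l.province_or_state, SUM(s.dollars_sold), SUM(s.units_sold)
    FROM sales AS s
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY l.province_or_state
""").fetchall()

# Number of dimension rows that store the same (province, country) pair.
redundant_rows = conn.execute("""
    SELECT COUNT(*) FROM location
    WHERE province_or_state = 'British Columbia' AND country = 'Canada'
""").fetchone()[0]

print(by_province)
print(redundant_rows)
```

The single join is what makes the star schema convenient to query; the repeated province and country values are the price paid for keeping each dimension in one table.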

6.2 Snowflake Schema


Normalizing the dimension tables of a star schema produces a snowflake schema, so called because the
resulting schema graph resembles a snowflake. The significant difference between the two models is that
the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies.
Normalized dimension tables are easier to maintain and save storage space. However, this saving of
space is negligible in comparison to the typical magnitude of the fact table. Furthermore, the snowflake
structure can reduce the effectiveness of browsing, since more joins are needed to execute a query.

A snowflake schema for sales is given in Figure 6-2. The single dimension table for the item in the star
schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example,
the item dimension table now contains the attributes item key, item name, brand, type, and supplier key,
where supplier key is linked to the supplier dimension table, containing supplier key and supplier type
information. Similarly, the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to the city dimension.

Figure 6-2: Snowflake Schema
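
The extra join cost of the snowflake form can be seen in a small sketch, again using Python's sqlite3 module. The sample rows are invented, the fact table is reduced to the columns needed here, and the city table is assumed to carry a city name column alongside city_key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (               -- normalized out of the location dimension
    city_key INTEGER PRIMARY KEY,
    city TEXT, province_or_state TEXT, country TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city(city_key)
);
CREATE TABLE sales (              -- fact table, trimmed to what the query needs
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL
);
""")

# Hypothetical data: two street addresses share a single city row.
conn.execute("INSERT INTO city VALUES (1, 'Vancouver', 'British Columbia', 'Canada')")
conn.executemany("INSERT INTO location VALUES (?, ?, ?)",
                 [(1, "1 Main St", 1), (2, "9 Oak St", 1)])
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# Rolling up to the city level now takes two joins instead of one.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON l.location_key = s.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.city
""").fetchall()

city_rows = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]
location_rows = conn.execute("SELECT COUNT(*) FROM location").fetchone()[0]

print(sales_by_city)
print(city_rows, location_rows)
```

The province and country are stored once per city rather than once per location row, which is the redundancy saving described above, at the price of the second join.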


6.3 Fact Constellation

A fact constellation schema consists of multiple fact tables that share dimension tables. A fact
constellation schema is shown in Figure 6-3. This schema specifies two fact tables, sales and shipping.
The sales table definition is identical to that of the star schema (Figure 6-1). The shipping table has
five dimensions, or keys: item key, time key, shipper key, from location, and to location, and two
measures: dollars cost and units shipped. A fact constellation schema allows dimension tables to be
shared between fact tables. For example, the dimension tables for time, item, and location are shared
between the sales and shipping fact tables.

Figure 6-3: Fact Constellation Schema
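
As a rough sketch of what sharing a dimension means in practice, the Python snippet below (sqlite3, invented data, only the item dimension and the measures kept) relates sales and shipping through the one item table they both reference. Each fact table is aggregated on its own first, since joining two fact tables row by row would multiply rows and inflate the sums.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);  -- shared dimension
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    dollars_sold REAL, units_sold INTEGER
);
CREATE TABLE shipping (
    item_key INTEGER REFERENCES item(item_key),
    dollars_cost REAL, units_shipped INTEGER
);
""")

# Hypothetical rows for illustration only.
conn.executemany("INSERT INTO item VALUES (?, ?)", [(1, "laptop"), (2, "phone")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1000.0, 1), (1, 1200.0, 1), (2, 500.0, 1)])
conn.executemany("INSERT INTO shipping VALUES (?, ?, ?)",
                 [(1, 30.0, 2), (2, 10.0, 1)])

# Aggregate each fact table separately, then line the results up on the shared key.
sold_vs_shipping_cost = conn.execute("""
    SELECT i.item_name, s.sold, sh.cost
    FROM item AS i
    JOIN (SELECT item_key, SUM(dollars_sold) AS sold
          FROM sales GROUP BY item_key) AS s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(dollars_cost) AS cost
          FROM shipping GROUP BY item_key) AS sh ON sh.item_key = i.item_key
    ORDER BY i.item_key
""").fetchall()

print(sold_vs_shipping_cost)
```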

6.4 Data Warehouse vs. Data Mart


In data warehousing, there is a distinction between a data warehouse and a data mart. A data warehouse
collects information about subjects that span the entire organization, such as customers, items, sales,
assets, and personnel, and thus its scope is enterprise-wide. For data warehouses, the fact constellation
schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other
hand, is a departmental subset of the data warehouse that focuses on selected subjects, and thus its
scope is department-wide. For data marts, the star or snowflake schema is commonly used, since both are
geared toward modeling single subjects, although the star schema is more popular and efficient.
A summary of the comparison between a data warehouse and a data mart is given in Table 6-1.

Table 6-1: Data Warehouse vs Data Mart

6.5 Data Mining Query Language (DMQL)


Just as relational query languages like SQL can be used to specify relational queries, a data mining query
language can be used to specify data mining tasks. Data warehouses and data marts can be defined
using two language primitives, one for cube definition and one for dimension definition.

The cube definition statement has the following syntax:

define cube <cube_name> [<dimension_list>]: <measure_list>

The dimension definition statement has the following syntax:

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Star schema definition for Figure 6-1:

define cube sales_star [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)


The define cube statement defines a data cube called sales_star, which corresponds to the central sales
fact table. This command specifies the four dimensions of the cube, namely, time, item, branch, and
location, and the measures computed for it: dollars_sold and units_sold, together with the derived
average avg_sales. A define dimension statement is used to define each of the dimensions.
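
To make the measure definitions concrete, here is a minimal Python sketch of how sum, avg, and count(*) could be evaluated for each cell of such a cube. It is not how a DMQL engine works internally; the function name build_cube and the sample rows are hypothetical, and the dimension values stand in for the dimension keys.

```python
from collections import defaultdict

# Hypothetical base records: (time, item, branch, location, sales_in_dollars)
rows = [
    ("Q1", "laptop", "B1", "Vancouver", 1000.0),
    ("Q1", "laptop", "B1", "Vancouver", 1200.0),
    ("Q1", "phone", "B1", "Victoria", 500.0),
]

def build_cube(records):
    """Group records by their four dimension values and compute the measures."""
    cells = defaultdict(list)
    for time, item, branch, location, sales_in_dollars in records:
        cells[(time, item, branch, location)].append(sales_in_dollars)
    return {
        key: {
            "dollars_sold": sum(values),             # sum(sales_in_dollars)
            "avg_sales": sum(values) / len(values),  # avg(sales_in_dollars)
            "units_sold": len(values),               # count(*)
        }
        for key, values in cells.items()
    }

cube = build_cube(rows)
print(cube[("Q1", "laptop", "B1", "Vancouver")])
```

Each key of the resulting dictionary is one cell of the cube, identified by one value per dimension, and each value holds the measures for that cell.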

Snowflake schema definition for Figure 6-2:

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))

This definition is similar to that of sales_star given above, except that here the item and location
dimension tables are normalized. For instance, the item dimension of the sales_star data cube has been
normalized in the sales_snowflake cube into two dimension tables, item and supplier. Note that the
dimension definition for supplier is specified within the definition for item. Defining supplier in
this way implicitly creates a supplier key in the item dimension table definition. Similarly, the location
dimension of the sales_star data cube has been normalized in the sales_snowflake cube into two
dimension tables, location and city. The dimension definition for city is specified within the definition
of location. In this way, a city key is implicitly created in the location dimension table definition.

Fact constellation schema definition for Figure 6-3:

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollars_cost = sum(cost_in_dollars), units_shipped = count(*)

define dimension time as time in cube sales


define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

A define cube statement is used to define each of the two data cubes, sales and shipping, corresponding
to the two fact tables. Note that the time, item, and location dimensions of the sales cube are shared
with the shipping cube. For the time dimension, for example, this sharing is indicated by the statement
“define dimension time as time in cube sales”, which appears under the define cube statement for shipping.

6.6 Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts. Consider a concept hierarchy for the dimension location. City values for location include
Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state
to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can, in turn, be mapped to the country to which they belong, such as Canada or
the USA. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level
concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).

Figure 6-4: Concept hierarchy for the dimension location
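
The chain of mappings described above can be written down directly as lookup tables. The Python sketch below is one possible encoding, not a standard API: the dictionary and function names are hypothetical, and the per-city sales figures are invented purely to show a roll-up from cities to countries.

```python
# Each dictionary encodes one level of the hierarchy: city -> province/state -> country.
CITY_TO_PROVINCE_OR_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
PROVINCE_OR_STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

def generalize(city, level):
    """Map a city to the requested level of the location hierarchy."""
    if level == "city":
        return city
    province_or_state = CITY_TO_PROVINCE_OR_STATE[city]
    if level == "province_or_state":
        return province_or_state
    if level == "country":
        return PROVINCE_OR_STATE_TO_COUNTRY[province_or_state]
    raise ValueError(f"unknown level: {level}")

def roll_up(values_by_city, level):
    """Aggregate per-city values at a higher level of the hierarchy."""
    totals = {}
    for city, value in values_by_city.items():
        key = generalize(city, level)
        totals[key] = totals.get(key, 0) + value
    return totals

# Made-up unit counts per city.
units_by_city = {"Vancouver": 10, "Toronto": 20, "New York": 30, "Chicago": 40}
units_by_country = roll_up(units_by_city, "country")
print(units_by_country)
```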

The concept hierarchy described above is illustrated in Figure 6-4. Many concept hierarchies are implicit
within the database schema. For example, suppose that the dimension location is described by the
attributes number, street, city, province or state, zip code, and country. A total order relates these
attributes, forming a concept hierarchy such as “street < city < province or state < country”. This hierarchy
is shown in Figure 6-5.
Figure 6-5: Hierarchy for location
