Week 9 Data Warehouse Concepts
3.1 Multidimensional Model
The importance of data analysis has been steadily increasing since the early 1990s, as organizations in all sectors are required to improve their decision-making processes in order to maintain their competitive advantage. Traditional database systems like the ones studied in Chap. 2 do not satisfy the requirements of data analysis. They are designed and tuned to support the daily operations of an organization, and their primary concern is to ensure fast, concurrent access to data.
Fig. 3.1 A three-dimensional cube for sales data with dimensions Product, Time,
and Customer, and a measure Quantity
Consider the data cube in Fig. 3.1, based on a portion of the Northwind database. We can use this cube to analyze sales figures. The cube has three dimensions: Product, Time, and Customer. A dimension level represents the granularity, or level of detail, at which measures are represented for each dimension of the cube. In the example, sales figures are aggregated to the levels Category, Quarter, and City, respectively. Instances of a dimension are called members. For example, Seafood and Beverages are members of the Product dimension at the Category level. Dimensions also have associated attributes describing them. For example, the Product dimension could contain attributes such as ProductNumber and UnitPrice, which are not shown in the figure.
On the other hand, the cells of a data cube, or facts, have associated numeric values (we will see later that this is not always the case), called measures. These measures are used to quantitatively evaluate various aspects of the analysis at hand. For example, each number shown in a cell of the data cube in Fig. 3.1 represents a measure Quantity, indicating the number of units sold (in thousands) by category, quarter, and customer's city. A data cube typically contains several measures. For example, another measure, not shown in the figure, could be Amount, indicating the total sales amount.
A data cube may be sparse or dense depending on whether it has
measures associated with each combination of dimension values. In the case
of Fig. 3.1, this depends on whether all products are bought by all customers
during the period of time considered. For example, not all customers may have
ordered products of all categories during all quarters of the year. Actually, in
real-world applications, cubes are typically sparse.
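As an illustration (not part of the book's formalism), the following minimal Python sketch shows the usual way sparse cubes are handled: only nonempty cells are stored, here in a dictionary keyed by cell coordinates. All figures and member names below are hypothetical.

# A sparse cube as a dictionary mapping (category, quarter, city) coordinates
# to a Quantity measure; empty cells are simply absent.
sales = {
    ("Beverages", "Q1", "Paris"): 21,   # hypothetical Quantity values
    ("Seafood",   "Q1", "Lyon"):  12,
    ("Seafood",   "Q2", "Paris"):  5,
}

# An absent coordinate is an empty cell: no sales for that combination.
print(sales.get(("Condiments", "Q3", "Berlin")))  # -> None (empty cell)

# Density = materialized cells / all possible coordinate combinations.
categories, quarters, cities = 4, 4, 4
density = len(sales) / (categories * quarters * cities)
print(f"density = {density:.2%}")  # far below 100% -> a sparse cube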
3.1.1 Hierarchies
[Fig. 3.2: Hierarchies of the Product, Time, and Customer dimensions. Product: All → Category → Product. Time: All → Year → Semester → Quarter → Month → Day. Customer: All → Continent → Country → State → City → Customer.]
At the instance level, Fig. 3.3 shows an example of the Product dimension.¹ Each product at the lowest level of the hierarchy can be mapped to a corresponding category.

¹ Note that, as indicated by the ellipses, not all nodes of the hierarchy are shown.
[Fig. 3.3: Members of the Product dimension, with products Chai, Chang, ... under the Beverages category and Ikura, Konbu, ... under the Seafood category, all rolling up to the member all.]
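As an aside (a minimal sketch, not from the book), the hierarchy of Fig. 3.3 can be represented as a parent mapping, with a helper that climbs a member up a given number of levels:

# The Product -> Category -> all hierarchy as a parent mapping.
parent = {
    "Chai": "Beverages", "Chang": "Beverages",
    "Ikura": "Seafood",  "Konbu": "Seafood",
    "Beverages": "all",  "Seafood": "all",
}

def roll_up_member(member: str, levels: int = 1) -> str:
    """Map a member to its ancestor the given number of levels up."""
    for _ in range(levels):
        member = parent[member]
    return member

print(roll_up_member("Chai"))             # -> Beverages (Category level)
print(roll_up_member("Ikura", levels=2))  # -> all (top of the hierarchy)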
3.1.2 Measures
- Disjointness: the members at a level of a hierarchy must be disjoint; for example, a product cannot belong to two categories. If this were the case, each product's sales would be counted twice, once for each category.
- Completeness: all instances must be included in the hierarchy, and each instance must be related to one parent in the next level. For example, the instances of the Time hierarchy in Fig. 3.2 must contain all days in the period of interest, and each day must be assigned to a month. If this condition were not satisfied, the aggregation of the results would be incorrect, since there would be dates for which sales would not be counted.
- Correctness: this refers to the correct use of the aggregation functions. As explained next, measures can be of various types, and this determines the kind of aggregation function that can be applied to them.
According to the way in which they can be aggregated, measures can be classified as follows:
- Additive measures can be meaningfully summarized along all the dimensions using addition. These are the most common type of measures. For example, the measure Quantity in the cube of Fig. 3.1 is additive: it can be summarized when the hierarchies in the Product, Time, and Customer dimensions are traversed.
- Semiadditive measures can be meaningfully summarized using addition along some, but not all, dimensions. A typical example is inventory quantities, which cannot be meaningfully aggregated in the Time dimension, for instance, by adding the inventory quantities for two different quarters.
- Nonadditive measures cannot be meaningfully summarized using addition across any dimension. Typical examples are item price, cost per unit, and exchange rate.
Thus, in order to define a measure, it is necessary to determine the aggregation functions that will be used in the various dimensions. This is particularly important in the case of semiadditive and nonadditive measures. For example, a semiadditive measure representing inventory quantities can be aggregated by computing the average along the Time dimension and the sum along the other dimensions. Averaging can also be used for aggregating nonadditive measures such as item price or exchange rate. However, depending on the semantics of the application, other functions such as the minimum, maximum, or count could be used instead.
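The following minimal Python sketch (not from the book; all data hypothetical) makes this concrete for a semiadditive measure: inventory can be summed across warehouses but must be averaged, not summed, along Time.

inventory = {  # (warehouse, quarter) -> units on hand, hypothetical data
    ("W1", "Q1"): 100, ("W1", "Q2"): 80,
    ("W2", "Q1"): 50,  ("W2", "Q2"): 70,
}

# Meaningful: total stock per quarter (SUM over the Warehouse dimension).
for q in ("Q1", "Q2"):
    total = sum(v for (w, qq), v in inventory.items() if qq == q)
    print(q, total)  # Q1 150, Q2 150

# Meaningful: average stock per warehouse over time (AVG along Time).
for w in ("W1", "W2"):
    vals = [v for (ww, q), v in inventory.items() if ww == w]
    print(w, sum(vals) / len(vals))  # W1 90.0, W2 60.0

# NOT meaningful: summing W1's Q1 and Q2 stock (180) represents no real
# quantity, since the same goods are counted in both quarters.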
In order to allow users to interactively explore the cube data at different granularities, optimization techniques based on aggregate precomputation are used. To avoid computing the whole aggregation from scratch each time the data warehouse is queried, OLAP tools implement incremental aggregation mechanisms. However, incremental aggregation is not always possible, since this depends on the kind of aggregation function used. This leads to another classification of measures, which we explain next.
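A minimal Python sketch (our illustration, hypothetical data) of why the aggregation function matters: SUM can be maintained incrementally from new facts alone, while a holistic function such as MEDIAN must be recomputed from all the detail data.

import statistics

detail = [84, 89, 106, 84]   # quantities already loaded (hypothetical)
total = sum(detail)          # precomputed aggregate

new_facts = [93, 79]
total += sum(new_facts)      # incremental: no need to rescan `detail`
detail.extend(new_facts)
assert total == sum(detail)

# MEDIAN offers no such shortcut: it needs all detail values every time.
print(statistics.median(detail))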
3.2 OLAP Operations
Fig. 3.4 OLAP operations. (a) Original cube. (b) Roll-up to the Country level.
(c) Drill-down to the Month level. (d) Sort product by name. (e) Pivot
Fig. 3.4 (continued) (f) Slice on City='Paris'. (g) Dice on City='Paris' or 'Lyon' and Quarter='Q1' or 'Q2'. (h) Cube for 2011. (i) Drill-across operation. (j) Percentage change. (k) Total sales by quarter and city
Fig. 3.4 (continued) (l) Maximum sales by quarter and city. (m) Top two sales by
quarter and city. (n) Top 70% by city and category ordered by ascending quarter.
(o) Top 70% by city and category ordered by descending quantity. (p) Rank quarter
by category and city ordered by descending quantity
Fig. 3.4 (continued) (q) Three-month moving average. (r) Year-to-date computation. (s) Union of the original cube and another cube with data from Spain. (t) Difference of the original cube and the cube in (m)
Our user then notices that sales of the category Seafood in France are significantly higher in the first quarter compared to the other ones. Thus, she first takes the cube back to the City aggregation level and then applies a drill-down along the Time dimension to the Month level to find out whether this high value occurred during a particular month. In this way, she discovers that, for some reason yet unknown, sales in January soared both in Paris and in Lyon, as shown in Fig. 3.4c.
Now, the user wants to apply window operations to the cube in Fig. 3.4c in order to see how monthly sales behave. She starts by requesting a 3-month moving average to obtain the result in Fig. 3.4q. Then, she asks for the year-to-date computation, whose result is given in Fig. 3.4r.
Finally, the user wants to add to the original cube data from Spain, which are contained in another cube. She obtains this by performing a union of the two cubes, whose result is given in Fig. 3.4s. As another operation, she also wants to remove from the original cube all sales measures except the top two sales by quarter and city. For this, she performs the difference of the original cube in Fig. 3.4a and the cube in Fig. 3.4m, yielding the result in Fig. 3.4t.
The OLAP operations illustrated in Fig. 3.4 can be defined in a way analogous to the relational algebra operations introduced in Chap. 2.
The roll-up operation aggregates measures along a dimension hierarchy (using an aggregate function) to obtain measures at a coarser granularity. The syntax for the roll-up operation is:

ROLLUP(CubeName, (Dimension → Level)*, AggFunction(Measure)*)

A variant, ROLLUP*, also rolls up every dimension that is not specified to the All level. For example, the expression

ROLLUP*(Sales, Time → Quarter, SUM(Quantity))

performs a roll-up along the Time dimension to the Quarter level and the other dimensions (in this case Customer and Product) to the All level. On the other hand, if the dimensions are not specified, as in

ROLLUP*(Sales, SUM(Quantity))

all the dimensions of the cube will be rolled up to the All level, yielding a single cell containing the overall sum of the Quantity measure.
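A minimal Python sketch (our illustration, not the book's algebra) of what a roll-up does over the dictionary cube representation used earlier: each cell's coordinate is mapped to a coarser level and the groups are aggregated with SUM. The city-to-country mapping and the figures are hypothetical.

from collections import defaultdict

country = {"Paris": "France", "Lyon": "France",
           "Berlin": "Germany", "Köln": "Germany"}

sales = {("Seafood", "Q1", "Paris"): 5, ("Seafood", "Q1", "Lyon"): 12,
         ("Seafood", "Q1", "Berlin"): 7}

def rollup_customer_to_country(cube):
    out = defaultdict(int)
    for (cat, q, city), qty in cube.items():
        out[(cat, q, country[city])] += qty   # SUM(Quantity) per group
    return dict(out)

print(rollup_customer_to_country(sales))
# {('Seafood', 'Q1', 'France'): 17, ('Seafood', 'Q1', 'Germany'): 7}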
A roll-up may also compute new measures, for example, counting the number of products sold; in this case, a new measure ProdCount will be added to the cube. We will see below other ways to add measures to a cube.
In many real-world situations, hierarchies are recursive, that is, they contain a level that rolls up to itself. A typical example is a supervision hierarchy over employees. Such hierarchies are discussed in detail in Chap. 4. The particularity of such hierarchies is that the number of levels of the hierarchy is not fixed at the schema level, but depends on its members. The RECROLLUP operation is used to aggregate measures over recursive hierarchies by iteratively performing roll-ups over the hierarchy until the top level is reached. The syntax of this operation is as follows:

RECROLLUP(CubeName, Dimension → Level, Hierarchy, AggFct(Measure)*)
The sort operation returns a cube where the members of a dimension have been sorted. The syntax of the operation is as follows:

SORT(CubeName, Dimension, (Expression [ {ASC | DESC | BASC | BDESC} ])*)

For example, the expression

SORT(Sales, Product, ProductName)

sorts the members of the Product dimension in ascending order of their name, as shown in Fig. 3.4d. Here, ProductName is supposed to be an attribute of products. When the cube contains only one dimension, the members can be sorted based on its measures. For example, if SalesByQuarter is obtained from the original cube by aggregating sales by quarter for all cities and all categories, the following expression

SORT(SalesByQuarter, Time, Quantity DESC)

sorts the members of the Time dimension in descending order of the Quantity measure.
The pivot (or rotate) operation rotates the axes of a cube to provide an alternative presentation of the data. The syntax of the operation is as follows:

PIVOT(CubeName, (Dimension → Axis)*)

where the axes are specified as {X, Y, Z, X1, Y1, Z1, ...}. Thus, the example illustrated in Fig. 3.4e is expressed by:

PIVOT(Sales, Time → X, Customer → Y, Product → Z)
The slice operation removes a dimension from a cube, that is, a cube of n−1 dimensions is obtained from a cube of n dimensions. The syntax is:

SLICE(CubeName, Dimension, Level = Value)

where the dimension Dimension will be dropped by fixing a single value Value in the level Level. The other dimensions remain unchanged. The example illustrated in Fig. 3.4f is expressed by:

SLICE(Sales, Customer, City = 'Paris')

The slice operation assumes that the granularity of the cube is at the specified level of the dimension (in the example above, at the city level). Thus, a granularity change by means of a ROLLUP or DRILLDOWN operation is often needed prior to applying the slice operation.
The dice operation keeps the cells in a cube that satisfy a Boolean condition φ. The syntax for this operation is

DICE(CubeName, φ)

For example, the dice shown in Fig. 3.4g keeps the cells with City equal to 'Paris' or 'Lyon' and Quarter equal to 'Q1' or 'Q2'.
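A minimal Python sketch (our illustration, hypothetical data) contrasting the two operations over the dictionary cube: slice fixes one dimension to a single member and drops it, while dice filters cells by a condition and keeps all dimensions.

sales = {("Seafood", "Q1", "Paris"): 5, ("Seafood", "Q2", "Paris"): 8,
         ("Seafood", "Q1", "Lyon"): 12, ("Seafood", "Q3", "Berlin"): 7}

def slice_city(cube, city):
    # Result has one dimension fewer: (category, quarter) only.
    return {(cat, q): v for (cat, q, c), v in cube.items() if c == city}

def dice(cube, pred):
    # Result keeps the same three dimensions, filtered by the condition.
    return {coord: v for coord, v in cube.items() if pred(*coord)}

print(slice_city(sales, "Paris"))
print(dice(sales, lambda cat, q, c: c in ("Paris", "Lyon")
                                    and q in ("Q1", "Q2")))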
The rename operation returns a cube where the cube and/or its measures have been renamed; for example, the expression

Sales2011 ← RENAME(Sales, Quantity → Quantity2011)

renames the cube in Fig. 3.4a and its measure. The drill-across operation combines cells from two cubes that have the same schema, joining them on their common dimensions; for example, the sales cubes for 2011 and 2012 can be combined into a cube Sales2011-2012 holding the measures Quantity2011 and Quantity2012.
Notice that a renaming of the cube and the measure, as stated above, is necessary prior to applying the drill-across operation. Notice also that the resulting cube is named Sales2011-2012. On the other hand, if in the Sales cube of Fig. 3.4c we want to compare the sales of a month with those of the previous month, this can be expressed in two steps as follows:

Sales1 ← RENAME(Sales, Quantity → PrevMonthQuantity)
Result ← DRILLACROSS(Sales1, Sales, Sales1.Time.Month+1 = Sales.Time.Month)

In the first step, we create a temporary cube Sales1 by renaming the measure. In the second step, we perform the drill-across of the two cubes by combining a cell in Sales1 with the cell in Sales corresponding to the subsequent month. As already stated, the join condition above corresponds to an outer join. Notice that the Sales cube in Fig. 3.4a contains measures for a single year. Thus, in the result above, the cells corresponding to January and December will contain a null value in one of the two measures. As we will see in Sect. 4.4, when the cube contains measures for several years, the join condition must take into account that measures of January must be joined with those of December of the preceding year. Notice also that the cube has three dimensions and the join condition in the query above pertains to only one dimension. For the other dimensions, it is supposed that an outer equijoin is performed.
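A minimal Python sketch (our illustration, hypothetical data) of this drill-across: each cell is matched with the cell of the preceding month through an outer join, so January, which has no predecessor, keeps a None.

months = ["Jan", "Feb", "Mar"]
sales = {("Seafood", m, "Paris"): q for m, q in zip(months, [30, 18, 22])}

result = {}
for (cat, m, city), qty in sales.items():
    i = months.index(m)
    prev = months[i - 1] if i > 0 else None
    prev_qty = sales.get((cat, prev, city)) if prev else None  # outer join
    result[(cat, m, city)] = (qty, prev_qty)

print(result[("Seafood", "Jan", "Paris")])  # (30, None): no previous month
print(result[("Seafood", "Feb", "Paris")])  # (18, 30)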
The add measure operation adds new measures to the cube computed
from other measures or dimensions. The syntax for this operation is as follows:
ADDMEASURE(CubeName, (NewMeasure = Expression, [AggFct])* )
The drop measure operation removes one or several measures from a cube. Its syntax is as follows:

DROPMEASURE(CubeName, Measure*)

For example, given the result of the add measure above, the cube illustrated in Fig. 3.4j is expressed by:

DROPMEASURE(Sales2011-2012, Quantity2011, Quantity2012)
Usual aggregation operations are SUM, AVG, COUNT, MIN, and MAX.
In addition to these, we use extended versions of MIN and MAX, which
have an additional argument that is used to obtain the n minimum or
maximum values. Further, TOPPERCENT and BOTTOMPERCENT select
the members of a dimension that cumulatively account for x percent of a
measure. Analogously, RANK and DENSERANK are used to rank the members
of a dimension according to a measure. We show next examples of these
operations.
For example, the cube in Fig. 3.4a is at the Quarter and City levels. The
total sales by quarter and city can be obtained by
SUM(Sales, Quantity) BY Time, Customer
This will yield the two-dimensional cube in Fig. 3.4k. On the other hand, to
obtain the total sales by quarter, we can write
SUM(Sales, Quantity) BY Time
which returns a one-dimensional cube with values for each quarter. Notice
that in the query above, a roll-up along the Customer dimension up to the
All level is performed before applying the aggregation operation. Finally, to
obtain the overall sales, we can write
SUM(Sales, Quantity)
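A minimal Python sketch (our illustration, hypothetical data) of the SUM ... BY semantics: the dimensions not listed in the BY clause are rolled up to All before aggregating.

from collections import defaultdict

sales = {("Seafood", "Q1", "Paris"): 5, ("Beverages", "Q1", "Paris"): 9,
         ("Seafood", "Q2", "Lyon"): 12}

# SUM(Sales, Quantity) BY Time, Customer: Product is rolled up to All.
by_time_customer = defaultdict(int)
for (cat, q, city), qty in sales.items():   # cat is ignored -> All level
    by_time_customer[(q, city)] += qty
print(dict(by_time_customer))
# {('Q1', 'Paris'): 14, ('Q2', 'Lyon'): 12}

# SUM(Sales, Quantity): every dimension rolled up to All -> a single value.
print(sum(sales.values()))  # 26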
For example, when asking for the best-selling employee, we must not only compute the maximum sales amount but also identify the employee that performed best. Therefore, when applying an aggregation operation, the resulting cube will have different dimension members depending on the type of the aggregation function. For example, given the cube in Fig. 3.4a, the total overall quantity can be obtained by the expression
SUM(Sales, Quantity)
This will yield a single cell, whose coordinates for the three dimensions will
be all equal to all. On the other hand, when computing the overall maximum
quantity as follows
MAX(Sales, Quantity)
we will obtain the cell with value 47 and coordinates Q4, Condiments, and
Paris (we suppose that cells that are hidden in Fig. 3.4a contain a smaller
value for this measure). Similarly, the following expression
SUM(Sales, Quantity) BY Time, Customer
returns the total sales by quarter and customer, resulting in the cube given
in Fig. 3.4k. This cube has three dimensions, where the Product dimension
only contains the member all. On the other hand,
MAX(Sales, Quantity) BY Time, Customer
will yield the cube in Fig. 3.4l, where only the cells containing the maximum by time and customer will have values, while the other ones will be filled with null values. Similarly, the two maximum quantities by quarter and city, as shown in Fig. 3.4m, can be obtained as follows:
MAX(Sales, Quantity, 2) BY Time, Customer
Notice that in the example above, we requested the two maximum quantities
by time and customer. If in the cube there are two or more cells that tie for
the last place in the limited result set, then the number of cells in the result
could be greater than two. For example, this is the case in Fig. 3.4m for Berlin
and Q1, where there are three values in the result, that is, 33, 25, and 25.
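A minimal Python sketch (our illustration) of this tie behavior: every cell equal to the last qualifying value is kept, so a group may return more than n cells.

def top_n_with_ties(values, n=2):
    ranked = sorted(values, reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1]                 # value of the n-th cell
    return [v for v in ranked if v >= cutoff]

print(top_n_with_ties([33, 25, 25, 10]))   # [33, 25, 25]: tie for 2nd place
print(top_n_with_ties([47, 30, 12, 8]))    # [47, 30]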
To compute top or bottom percentages, the order of the cells must be specified. For example, to compute the top 70% of the measure quantity by city and category ordered by quarter, as shown in Fig. 3.4n, we can write

TOPPERCENT(Sales, Quantity, 70) BY City, Category ORDER BY Quarter ASC

The operation computes the running sum of the sales by city and category starting with the first quarter and continues until the target percentage is reached. In the example above, the sales by city and category for the first three quarters cover 70% of the sales. Similarly, the top 70% of the measure quantity by city and category ordered by quantity, as shown in Fig. 3.4o, can be obtained by
TOPPERCENT(Sales, Quantity, 70) BY City, Category ORDER BY Quantity DESC
The rank operation also requires the specification of the order of the cells. As an example, to rank quarters by category and city ordered by descending quantity, as shown in Fig. 3.4p, we can write

RANK(Sales, Time) BY Category, City ORDER BY Quantity DESC

The rank and the dense rank operations differ in the case of ties. The former assigns the same rank to ties, with the next ranking(s) skipped. For example, in Fig. 3.4p, there is a tie in the quarters for Seafood and Köln, where Q2 and Q4 are in the first rank and Q3 and Q1 are in the third and fourth ranks, respectively. If the dense rank is used, then Q3 and Q1 would be in the second and third ranks, respectively.
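A minimal Python sketch (our illustration, hypothetical quantities) contrasting the two ranking schemes on such a tie:

def rank(values):        # ties share a rank; following ranks are skipped
    s = sorted(values, reverse=True)
    return {v: s.index(v) + 1 for v in values}

def dense_rank(values):  # ties share a rank; no gaps afterward
    distinct = sorted(set(values), reverse=True)
    return {v: distinct.index(v) + 1 for v in values}

qty = [40, 40, 35, 30]            # quantities for Q2, Q4, Q3, Q1
print(rank(qty))                  # {40: 1, 35: 3, 30: 4}
print(dense_rank(qty))            # {40: 1, 35: 2, 30: 3}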
In the examples above, the new measure value in a cell is computed from the values of other measures in the same cell. However, we often need to compute measures where the value of a cell is obtained by aggregating the measures of several nearby cells. Examples include moving average and year-to-date computations. For this, we need to define a subcube that is associated with each cell and perform the aggregation over this subcube. These functions correspond to the window functions in SQL that will be described in Chap. 5. For example, given the cube in Fig. 3.4c, the 3-month moving average in Fig. 3.4q can be obtained by

ADDMEASURE(Sales, MovAvg = AVG(Quantity) OVER Time 2 CELLS PRECEDING)
Here, the moving average for January is equal to the measure in January,
since there are no previous cells. Analogously, the measure for February is
the average of the values of January and February. Finally, the average for
the remaining months is computed from the measure value of the current
month and the two preceding ones. Notice that in the window functions,
it is supposed that the members of the dimension over which the window is
constructed are already sorted. For this, a sort operation can be applied prior
to the application of the window aggregate function.
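A minimal Python sketch (our illustration, hypothetical data) of the "2 CELLS PRECEDING" window: each month's average is taken over the current value and up to two preceding ones, assuming the months are already sorted.

monthly = [30, 18, 22, 25]  # quantities for Jan..Apr

mov_avg = []
for i in range(len(monthly)):
    window = monthly[max(0, i - 2): i + 1]   # current cell + 2 preceding
    mov_avg.append(sum(window) / len(window))

print(mov_avg)  # [30.0, 24.0, 23.33..., 21.66...]: Jan is its own average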
Similarly, to compute the year-to-date sum in Fig. 3.4r, we can write
ADDMEASURE(Sales, YTDQuantity = SUM(Quantity) OVER
Time ALL CELLS PRECEDING)
Here, the window over which the aggregation function is applied contains
the current cell and all the previous ones, as indicated by ALL CELLS
PRECEDING.
The union operation merges two cubes that have the same schema but disjoint instances. For example, if SalesSpain is a cube having the same schema as our original cube but containing only the sales to Spanish customers, the cube in Fig. 3.4s is obtained by

UNION(Sales, SalesSpain)
The difference operation removes from a cube the cells that also appear in another cube with the same schema. For example, the difference of the original cube in Fig. 3.4a and the cube in Fig. 3.4m will result in the cube in Fig. 3.4t, which contains all sales measures except for the top two sales by quarter and city.
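With the dictionary cube representation, both operations reduce to simple set-style manipulations, as in the following minimal sketch (our illustration, hypothetical data):

sales = {("Seafood", "Q1", "Paris"): 5, ("Seafood", "Q1", "Lyon"): 12}
sales_spain = {("Seafood", "Q1", "Madrid"): 9}

# UNION: schemas match and instances are disjoint, so keys never collide.
union = {**sales, **sales_spain}

# DIFFERENCE: drop the cells that appear in the other cube,
# e.g., the top-sales cells of Fig. 3.4m.
top_cells = {("Seafood", "Q1", "Lyon"): 12}
difference = {k: v for k, v in sales.items() if k not in top_cells}

print(len(union), difference)   # 3 {('Seafood', 'Q1', 'Paris'): 5}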
Finally, the drill-through operation allows one to move from data at the
bottom level in a cube to data in the operational systems from which the cube
was derived. This could be used, for example, if one were trying to determine
the reason for outlier values in a data cube.
Table 3.1 summarizes the OLAP operations we have presented in this
section. In addition to the basic operations described above, OLAP tools
provide a great variety of mathematical, statistical, and financial operations
for computing ratios, variances, interest, depreciation, currency conversions,
etc.
3.3 Data Warehouses
Integrated means that data obtained from several operational and external systems have to be reconciled, solving problems such as synonyms (fields with different names but the same data), homonyms (fields with the same name but different meanings), multiplicity of occurrences of data, and many others. In operational databases these problems are typically solved in the design phase.
Nonvolatile means that durability of data is ensured by disallowing data modification and removal, thus expanding the scope of the data to a longer period of time than operational systems usually offer. A data warehouse gathers data encompassing several years, typically 5–10 years or beyond, while data in operational databases is often kept for only a short period of time, for example, from 2 to 6 months, as required for daily operations, and it may be overwritten when necessary.
Time varying indicates the possibility of retaining different values for the same information, as well as the time when changes to these values occurred. For example, a data warehouse in a bank might store information
Comparison between operational databases and data warehouses (the line numbers are referenced in the discussion below):

     Aspect                        Operational databases             Data warehouses
 1.  User type                     Operators, office employees       Managers, executives
 2.  Usage                         Predictable, repetitive           Ad hoc, nonstructured
 3.  Data content                  Current, detailed data            Historical, summarized data
 4.  Data organization             According to operational needs    According to analysis needs
 5.  Data structures               Optimized for small transactions  Optimized for complex queries
 6.  Access frequency              High                              From medium to low
 7.  Access type                   Read, insert, update, delete      Read, append only
 8.  Number of records per access  Few                               Many
 9.  Response time                 Short                             Can be long
10.  Concurrency level             High                              Low
11.  Lock utilization              Needed                            Not needed
12.  Update frequency              High                              None
13.  Data redundancy               Low (normalized tables)           High (denormalized tables)
14.  Data modeling                 UML, ER model                     Multidimensional model
OLAP systems are not so frequently accessed as OLTP systems. For example, a system handling purchase orders is frequently accessed, while performing analysis of orders may not be that frequent. Also, data warehouse records are usually accessed in read mode (lines 5–8). From the above, it follows that OLTP systems usually have a short query response time, provided the appropriate indexing structures are defined, while complex OLAP queries can take a longer time to complete (line 9).

OLTP systems normally have a high number of concurrent accesses and therefore require locking or other concurrency management mechanisms to ensure safe transaction processing (lines 10–11). On the other hand, OLAP systems are read only, and therefore queries can be submitted and computed concurrently, with no locking or complex transaction processing requirements. Further, the number of concurrent users in an OLAP system is usually low. Finally, OLTP systems are constantly being updated online through transactional applications, while OLAP systems are updated off-line periodically.

This leads to different modeling choices. OLTP systems are modeled using UML or some variation of the ER model studied in Chap. 2. Such models lead to a highly normalized schema, adequate for databases that support frequent transactions, to guarantee consistency and reduce redundancy. OLAP designers use the multidimensional model, which, at the logical level (as we will see in Chap. 5), leads in general to a denormalized database schema, with a high level of redundancy, which favors query processing (lines 12–14).
3.4 Data Warehouse Architecture
We are now ready to present a general data warehouse architecture that will be used throughout the book. This architecture, depicted in Fig. 3.5, consists of several tiers:
- The back-end tier is composed of extraction, transformation, and loading (ETL) tools, used to feed data into the data warehouse from operational databases and other data sources, which can be internal or external to the organization, and a data staging area, which is an intermediate database where all the data integration and transformation processes are run prior to the loading of the data into the data warehouse.
- The data warehouse tier is composed of an enterprise data warehouse and/or several data marts and a metadata repository storing information about the data warehouse and its contents.
- The OLAP tier is composed of an OLAP server, which provides a multidimensional view of the data, regardless of the actual way in which data are stored in the underlying system.
- The front-end tier is used for data analysis and visualization. It contains client tools such as OLAP tools, reporting tools, statistical tools, and data mining tools.
We now describe in detail the various components of the above architecture.
3.4.1 Back-End Tier
In the back-end tier, the process commonly known as extraction, transformation, and loading (ETL) is performed. As the name indicates, it is a three-step process (a minimal code sketch of the three steps follows the description of the loading step below):
- Extraction gathers data from multiple, heterogeneous data sources. These sources may be operational databases but may also be files in various formats; they may be internal to the organization or external to it. In order to solve interoperability problems, data are extracted whenever possible using application programming interfaces (APIs) such as ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity).
- Transformation modifies the data from the format of the data sources to the warehouse format. This includes several aspects: cleaning, which removes errors and inconsistencies in the data and converts it into a standardized format; integration, which reconciles data from different data sources, both at the schema and at the data level; and aggregation, which summarizes the data obtained from data sources according to the level of detail, or granularity, of the data warehouse.
[Fig. 3.5: Typical data warehouse architecture. Internal sources, external sources, and operational databases feed the ETL process and data staging area in the back-end tier; the data warehouse tier holds the enterprise data warehouse, data marts, and metadata; the OLAP tier contains the OLAP server; and the front-end tier contains OLAP, reporting, statistical, and data mining tools.]
- Loading feeds the data warehouse with the transformed data. This also includes refreshing the data warehouse, that is, propagating updates from the data sources to the data warehouse at a specified frequency in order to provide up-to-date data for the decision-making process. Depending on organizational policies, the refresh frequency may vary from monthly to several times a day or even near to real time.
ETL processes usually require a data staging area, that is, a database in which the data extracted from the sources undergoes successive modifications to eventually be ready to be loaded into the data warehouse. Such a database is usually called an operational data store.
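The sketch announced above: a toy Python rendering (our illustration; all names and data are hypothetical, and real ETL tools read from databases or files rather than in-memory lists) of the three ETL steps.

from collections import defaultdict

def extract():
    # Stand-in for reading from operational sources via ODBC/JDBC or files.
    return [{"city": " paris ", "qty": "5"}, {"city": "Lyon", "qty": "12"}]

def transform(rows):
    # Cleaning: standardize formats and types.
    cleaned = [{"city": r["city"].strip().title(), "qty": int(r["qty"])}
               for r in rows]
    # Aggregation: summarize to the warehouse granularity (here, by city).
    agg = defaultdict(int)
    for r in cleaned:
        agg[r["city"]] += r["qty"]
    return agg

def load(agg, warehouse):
    warehouse.update(agg)   # feed the data warehouse with transformed data

dw = {}
load(transform(extract()), dw)
print(dw)  # {'Paris': 5, 'Lyon': 12}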
3.4.2 Data Warehouse Tier

The data warehouse tier in Fig. 3.5 depicts an enterprise data warehouse and several data marts. As we have explained, while an enterprise data warehouse is centralized and encompasses an entire organization, a data mart is a specialized data warehouse targeted toward a particular functional area or user group in an organization.
3.4.3 OLAP Tier
3.4.4 Front-End Tier
The front-end tier in Fig. 3.5 contains client tools that allow users to exploit the contents of the data warehouse. Typical client tools include the following:
- OLAP tools allow interactive exploration and manipulation of the warehouse data. They facilitate the formulation of complex queries that may involve large amounts of data. These queries are called ad hoc queries, since the system has no prior knowledge about them.
- Reporting tools enable the production, delivery, and management of reports, which can be paper-based reports or interactive, web-based reports. Reports use predefined queries, that is, queries asking for specific information in a specific format that are performed on a regular basis. Modern reporting techniques include key performance indicators and dashboards.
- Statistical tools are used to analyze and visualize the cube data using statistical methods.
- Data mining tools allow users to analyze data in order to discover valuable knowledge such as patterns and trends; they also allow predictions to be made on the basis of current data.
In Chap. 9, we show some of the tools used to exploit the data warehouse,
like data mining tools, key performance indicators, and dashboards.
3.4.5 Variations of the Architecture
several data marts that were independently created need to be integrated into a data warehouse for the entire enterprise.

In some other situations, an OLAP server does not exist and/or the client tools directly access the data warehouse. This is indicated by the arrow connecting the data warehouse tier to the front-end tier. This situation is illustrated in Chap. 6, where the same queries for the Northwind case study are expressed both in MDX (targeting the OLAP server) and in SQL (targeting the data warehouse). An extreme situation is where there is neither a data warehouse nor an OLAP server. This is called a virtual data warehouse, which defines a set of views over operational databases that are materialized for efficient access. The arrow connecting the data sources to the front-end tier depicts this situation. Although a virtual data warehouse is easy to build, it does not provide a real data warehouse solution, since it does not contain historical data, does not contain centralized metadata, and does not have the ability to clean and transform the data. Furthermore, a virtual data warehouse can severely impact the performance of operational databases.

Finally, a data staging area may not be needed when the data in the source systems conforms very closely to the data in the warehouse. This situation typically arises when there is one data source (or only a few) having high-quality data. However, this is rarely the case in real-world situations.
3.5 Data Warehouse Design

Like in operational databases (studied in Sect. 2.1), there are two major methods for the design of data warehouses and data marts. In the top-down approach, the requirements of users at different organizational levels are merged before the design process starts, and one schema for the entire data warehouse is built, from which data marts can be obtained. In the bottom-up approach, a schema is built for each data mart, according to the requirements of the users of each business area. The data mart schemas produced are then merged into a global warehouse schema. The choice between the top-down and the bottom-up approach depends on many factors that will be studied in Chap. 10 of this book.
[Figure: the phases in data warehouse design: requirements specification, conceptual design, logical design, and physical design.]
3.6 Business Intelligence Tools
Nowadays, the offer in business intelligence tools is quite large. The major database providers, such as Microsoft, Oracle, IBM, and Teradata, have their own suite of business intelligence tools. Other popular tools include SAP, MicroStrategy, and Targit. In addition to these commercial tools, there are also open-source tools, of which Pentaho is the most popular one. In this book, we have chosen two representative suites of tools for illustrating the topics presented: Microsoft's SQL Server tools and Pentaho Business Analytics. In this section, we briefly describe these tools, while the bibliographic notes section at the end of this chapter provides references to other well-known business intelligence tools.
3.6.1 Overview of Microsoft SQL Server Tools
Further, the data volumes supported by the tabular mode are smaller than those of the multidimensional mode in Analysis Services. From the query language perspective, each of these modes has an associated query language, MDX and DAX (Data Analysis Expressions), respectively. Finally, from the data access perspective, the multidimensional mode supports data access in MOLAP (multidimensional OLAP), ROLAP (relational OLAP), or HOLAP (hybrid OLAP), which will be described in Chap. 5. On the other hand, the tabular mode accesses data through xVelocity, an in-memory, column-oriented database engine with compression capabilities. We will cover such databases in Chap. 13. The tabular mode also allows the data to be retrieved directly from relational data sources.

In this book, we cover only the multidimensional mode of BISM as well as the MDX language.
3.6.2 Overview of Pentaho Business Analytics
In addition, several design tools are provided, which are described next:
- Pentaho Schema Workbench provides a graphical interface for designing OLAP cubes for Mondrian. The schema created is stored as an XML file on disk.
- Pentaho Aggregation Designer operates on Mondrian XML schema files and the database with the underlying tables described by the schema to generate precalculated, aggregated answers that speed up analysis work and MDX queries executed against Mondrian.
- Pentaho Metadata Editor is a tool that simplifies the creation of reports and allows users to build metadata domains and relational data models. It acts as an abstraction layer over the underlying data sources.
3.7 Summary
3.8 Bibliographic Notes
[72, 103]. More details on these concepts are given in Chap. 5, where we also give further references.

There is not yet a standard definition of the OLAP operations, in the way that the relational algebra operations are defined for the relational model. Many different algebras for OLAP have been proposed in the literature, each one defining different sets of operations. A comparison of these OLAP algebras is given in [181], where the authors advocate the need for a reference algebra for OLAP. The definition of the operations we presented in this chapter was inspired by [32].
There are many books that describe the various business intelligence tools. We next give some references for commercial and open-source tools. For SQL Server, the series of books devoted to Analysis Services [79], Integration Services [105], and Reporting Services [209] covers these components extensively. The business intelligence tools from Oracle are covered in [175, 218], while those of IBM are covered in [147, 225]. SAP BusinessObjects is presented in [81, 83], while MicroStrategy is covered in [50, 139]. For Pentaho, the book [18] gives an overall description of the various components of the Pentaho Business Intelligence Suite, while Mondrian is covered in the book [10], Kettle in [26, 179], Reporting in [57], and Weka in [228]. The book [157] is devoted to big data analytics using Pentaho. On the academic side, a survey of open-source tools for business intelligence is given in [199].
3.9 Review Questions
3.9 Describe the various components of a typical data warehouse architecture. Identify variants of this architecture and specify in what situations they are used.
3.10 Briefly describe the multidimensional model implemented in Analysis Services.
3.10 Exercises