Automated Selection of Materialized Views and Indexes For SQL Databases
materialized views, at a fraction of the enumeration cost (Section 4). We introduce two key techniques that form the basis of a scalable approach for candidate materialized view selection. First, we show how to identify interesting sets of tables such that we need to consider materialized views only over such sets of tables. Next, we present a view merging technique that identifies candidate materialized views that while not optimal for any single query, can be beneficial to multiple queries in the workload. The techniques presented in this paper are designed to be robust for handling the generality of SQL as well as other pragmatic issues arising in index and materialized view selection. These techniques have enabled us to build an industry-strength physical database design tool that can determine an appropriate set of indexes, materialized views (and indexes on materialized views) for a given database and workload consisting of SQL queries and updates. This tool is now part of Microsoft SQL Server 2000's upcoming release. The extensive experimental results in this paper (Section 6) demonstrate the value of our proposed techniques. This work was done as part of the AutoAdmin [1] research project at Microsoft, which explores novel techniques to make databases self-tuning.

2. Architecture for Index and Materialized View Selection

An architectural overview of our approach to index and materialized view selection is shown in Figure 1. We assume that we are given a representative workload for which we need to recommend indexes and materialized views. One way to obtain such a workload is to use the logging capability of modern database systems to capture a trace of queries and updates faced by the system. Alternatively, customer or organization specific benchmarks may be used. As in our previous work on index selection [4], the key components of the architecture are: syntactic structure selection, candidate selection, configuration enumeration, and configuration simulation and cost estimation.

Given a workload, the first step is to identify syntactically relevant indexes, materialized views and indexes on materialized views that can potentially be used to answer the query. For example, consider a query Q: SELECT Sum(Sales) FROM Sales_Data WHERE City = 'Seattle'. For the query Q, the following materialized views (among others) are syntactically relevant:
v1: SELECT Sum(Sales) FROM Sales_Data WHERE City = 'Seattle'.
v2: SELECT City, Sum(Sales) FROM Sales_Data GROUP BY City.
v3: SELECT City, Product, Sum(Sales) FROM Sales_Data GROUP BY City, Product.
Optionally, we can consider additional indexes on the columns of the materialized view. Like indexes on base tables, indexes on materialized views can be single-column or multi-column, clustered or non-clustered, with the restriction that a given materialized view can have at most one clustered index on it. In this paper, we focus on the class of single-block materialized views consisting of selection, join, grouping and aggregation. The workload however, may consist of arbitrary SQL statements. In this paper, we do not consider materialized views that can be exploited using back-joins by the optimizer.

As mentioned in the introduction, searching the space of all syntactically relevant indexes and materialized views for a workload is infeasible in practice, particularly when the workload is large or complex. Therefore, it is crucial to eliminate spurious indexes and materialized views from consideration early, thereby focusing the search on a smaller, and interesting subset. The candidate selection module is responsible for identifying a set of traditional indexes, materialized views and indexes on materialized views for the given workload that are worthy of further exploration. Efficient selection of candidate materialized views is a key contribution of our work. For the purposes of this paper, we assume that candidate indexes have already been picked. For details on how candidate indexes may be chosen for a workload, we refer the reader to [4].

[Figure 1. Architecture of Index and Materialized View Selection Tool. Components shown: Workload; Syntactic structure selection; Candidate Index Selection; Candidate Materialized View Selection; Configuration Enumeration; Configuration Simulation and Cost Estimation Module (Microsoft SQL Server); Final Recommendation.]

Once we have chosen a set of candidate indexes and candidate materialized views, we need to search among these structures to determine the ideal physical design, henceforth called a configuration. In our context, a configuration will consist of a set of traditional indexes, materialized views and indexes on materialized views. In this paper we will not discuss issues related to selection of indexes on materialized views due to lack of space. Despite the remarkable pruning achieved by the candidate selection module, searching through this space in a naïve fashion by enumerating all subsets of structures is infeasible. We adopt the same greedy algorithm for configuration enumeration as was used in [4]: Greedy(m,k). This algorithm returns a configuration consisting of a total of k indexes and materialized views. It first picks an optimal configuration of size up to m (≤ k) by exhaustively enumerating all configurations of size up to m. It then picks the remaining (k-m) structures greedily. As will be shown in Section 6.2.4, this algorithm works well even when the set of candidates contains
materialized views in addition to indexes. An important characteristic of our approach is that configuration enumeration is over the joint space of indexes and materialized views.

The configurations considered by the configuration enumeration module are compared for quality by taking into account the expected impact of the proposed configurations on the sum of the cost of queries in the workload. The configuration simulation and cost estimation module is responsible for providing this support. We have extended Microsoft SQL Server to simulate the presence of indexes and materialized views that do not exist (referred to as "what-if" indexes and materialized views) to the query optimizer, and have also extended the optimizer costing module, so that given a query Q and a configuration C, the cost of Q when the physical design is the configuration C may be computed. A detailed discussion of simulation of what-if structures is beyond the scope of this paper (see [5]). Finally, we note that index and materialized view maintenance costs are accounted for in our approach by the inclusion of update, insert and delete statements in the workload.

3. Related Work

Recently, there have been several papers on selection of materialized views in the OLAP/Data Cube context [9,10,11,12,18]. These papers assume that the set of candidate materialized views is identical to the set of syntactically relevant materialized views for the workload¹. As argued earlier, such a technique is not scalable for reasonably large SQL workloads since the space of syntactically relevant materialized views is very large. The focus of the above papers is almost exclusively on the configuration enumeration problem. In principle, their proposed enumeration schemes may be adopted in our architecture by simply substituting Greedy(m,k). Thus, we view the work presented in the above papers to be complementary to the work presented in this paper. Although one of the papers [11] studies the interaction of materialized views with indexes on materialized views, none of the papers considers the interaction between selection of indexes on base tables and selection of materialized views. Thus, they implicitly assume that indexes are either already picked, or will be picked after selection of materialized views. As will be shown in this paper, both these alternatives severely impact the quality of the solution.

¹ Typically, these are aggregation views over subsets of dimensions. For each subset of dimensions, multiple aggregate views are possible in the presence of dimension hierarchy.

The work by Baralis et al. [3] is also set in the context of OLAP/Data Cube and does not consider traditional indexes on base tables. For a given workload, they consider materialized views that exactly match queries in the workload, as well as a set of additional views that can leverage commonality among queries in the workload. Our technique for exploiting commonality among queries in the workload for candidate materialized view selection (Section 4.3) is different. Further, our techniques can also deal with arbitrary SQL workloads and materialized views with selection.

In the context of SQL databases and workloads, the work by [22] picks materialized views by examining the plan information of queries. However, since the plan is an artifact of the existing physical design, such an approach can lead to sub-optimal recommendations. The paper also suggests an alternative of examining all possible query plans of a query. However, the latter technique is not scalable for even moderately sized workloads.

There is a substantial body of work in the area of index selection that describes how to pick a good set of indexes for a given workload [4,8,16]. More recently, other commercial systems have also added support for automatically picking indexes [14,20]. The architecture adopted in our scheme is in the spirit of [4]. However, as noted above, the candidate materialized view selection as well as the comparison of alternative strategies to pick indexes on base tables along with materialized views, constitute novel and important contributions of this paper. Rozen [15] presents a framework for choosing a physical design consisting of various "feature sets" including indexes and materialized views. The space of materialized views considered in Rozen's thesis is restricted to single-table aggregation views with GROUP BY, whereas we allow materialized views to consist of join, selection, grouping and aggregation operators.

Some commercial systems (e.g., Redbrick/Informix [16] and Oracle 8i [14]) provide tools to tune the selection of materialized views for a workload. As with the body of the work referenced above, these tools exclusively recommend materialized views. In contrast, we present an integrated tool that can recommend indexes on base tables as well as materialized views (and indexes on them) by weighing the impact of both on the performance of the workload. Finally, our paper is concerned with selection of materialized views but not with techniques to rewrite queries in the presence of materialized views.

4. Candidate Materialized View Selection

Considering all syntactically relevant materialized views for a workload in the configuration enumeration phase (see Figure 1) is not scalable since it would explode the space of configurations that must be searched. The space of syntactically relevant materialized views for a query (and hence a workload) is very large, since in principle, a materialized view can be proposed on any subset of tables in the query. Furthermore, even for a given table-subset (a table-subset is a subset of tables referenced in a query in the workload), there is an explosion in the space of materialized views arising from selection conditions and group by columns in the query. If there are m selection conditions in the query on a table-subset T, then materialized views containing any subset of these selection conditions are syntactically relevant. Therefore, the goal of candidate materialized view selection is to quickly eliminate materialized views that are syntactically relevant for one or more queries in the workload but are never used in answering any query, so that they do not enter the configuration enumeration phase.

We observe that the obvious approach of selecting one candidate materialized view per query that exactly matches each query in the workload does not work since in many database systems the language of materialized views may not match the language of queries. For example, nested sub-queries can appear in the query but may not be part of the materialized view language.
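To make the growth described at the start of this section concrete, the following minimal sketch counts one candidate view per pair of (non-empty table-subset, subset of the selection conditions on those tables). The function name count_relevant_views and the sample predicates are illustrative assumptions and not part of the tool described here. The count ignores GROUP BY variants, so it understates the true size of the space.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a tuple."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def count_relevant_views(tables, selections_by_table):
    """Count (table-subset, selection-subset) pairs for one query.

    Assumption: a view on a table-subset may keep any subset of the m
    selection conditions that apply to tables in that subset, giving
    2**m choices per table-subset.
    """
    total = 0
    for table_subset in nonempty_subsets(tables):
        m = sum(len(selections_by_table.get(t, [])) for t in table_subset)
        total += 2 ** m
    return total


# Hypothetical query over four TPC-H tables with three selection conditions.
tables = ["lineitem", "orders", "nation", "region"]
selections = {
    "lineitem": ["l_shipdate BETWEEN d1 AND d2", "l_quantity < 24"],
    "orders": ["o_orderdate < d3"],
}
print(count_relevant_views(tables, selections))  # prints 59
```

Even this small query yields 59 candidates before grouping columns are considered, which illustrates why pruning must happen before configuration enumeration.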
Moreover, in storage-constrained environments, ignoring commonality across queries in the workload can result in sub-optimal quality. This problem is even more severe in large workloads. The following simplified example of Q1 from the TPC-H benchmark illustrates this point:

Example 1. Consider a workload consisting of 1000 queries of the form: SELECT l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem WHERE l_shipdate BETWEEN <Date1> and <Date2> GROUP BY l_returnflag, l_linestatus. Assume that each of the 1000 queries has different constants for <Date1> and <Date2>. Then, rather than recommending 1000 materialized views, the following materialized view that can service all 1000 queries may be more attractive for the entire workload: SELECT l_shipdate, l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY l_shipdate, l_returnflag, l_linestatus.

A second observation that influences our approach to candidate materialized view selection is that there are certain table-subsets such that, even if we were to propose materialized views on those subsets, it would only lead to a small reduction in cost for the entire workload. This can happen either because the table-subsets occur infrequently in the workload or because they occur only in inexpensive queries.

Example 2. Consider a workload of 100 queries whose total cost is 10,000 units. Let T be a table-subset that occurs in 25 queries whose combined cost is 50 units. Then even if we considered all syntactically relevant materialized views on T, the maximum possible benefit of those materialized views for the workload is 0.5%.

Furthermore, even among table-subsets that occur frequently or occur in expensive queries, not all table-subsets are likely to be equally useful.

Example 3. Consider the TPC-H 1GB database and the workload specified in the benchmark. There are several queries in which the tables lineitem, orders, nation, and region co-occur. However, it is likely that materialized views proposed on the table-subset {lineitem, orders} are more useful than materialized views proposed on {nation, region}. This is because the tables lineitem and orders have 6 million and 1.5 million rows respectively, but tables nation and region are very small (25 and 5 rows respectively). Hence, the benefit of pre-computing the portion of the queries involving {nation, region} is insignificant compared to the benefit of pre-computing the portion of the query involving {lineitem, orders}.

Based on these observations, we approach the task of candidate materialized view selection using three steps: (1) From the large space of all possible table-subsets for the workload, we arrive at a smaller set of interesting table-subsets (Section 4.1). (2) Based on these interesting table-subsets, we propose a set of materialized views for each query in the workload, and from this set we select a configuration that is best for that query. This step uses a cost-based analysis for selecting the best configuration for a query (Section 4.2). (3) Starting with the views selected in (2), we generate an additional set of "merged" materialized views in a controlled manner such that the merged materialized views can service multiple queries in the workload (Section 4.3). The new set of merged materialized views, along with the materialized views selected in (2), is the set of candidate materialized views that enters configuration enumeration. We now present the details of each of these steps.

4.1. Finding Interesting Table-Subsets

Our goal is to find "interesting" table-subsets from among all possible table-subsets for the workload, and restrict the space of materialized views considered to only those table-subsets. Intuitively, a table-subset T is interesting if materializing one or more views on T has the potential to reduce the cost of the workload significantly, i.e., above a given threshold. Thus, the first step is to define a metric that captures the relative importance of a table-subset.

Consider the following metric: TS-Cost(T) = total cost² of all queries in the workload (for the current database) where table-subset T occurs. The above metric, while simple, is not a good measure of relative importance of a table-subset. For example, in the context of Example 3, if all queries in the workload referenced the tables lineitem, orders, nation, and region together, then using the TS-Cost(T) metric, the table-subset T1 = {lineitem, orders} would have the same importance as the table-subset T2 = {nation, region} even though a materialized view on T1 is likely to be much more useful than a materialized view on T2. Therefore, we propose the following metric that better captures the relative importance of a table-subset:

TS-Weight(T) = Σ_i Cost(Q_i) × (sum of sizes of tables in T) / (sum of sizes of all tables referenced in Q_i),

where the summation is only over queries in the workload where T occurs. Observe that TS-Weight is a simple function that can discriminate between table-subsets even if they occur in exactly the same queries in the workload. A complete evaluation of this and alternative functions, and their relationship to cost estimation by the query optimizer is part of our ongoing work.

² The cost of a query (or update statement) Q, denoted by Cost(Q), can be obtained from the configuration simulation and cost estimation module shown in Figure 1.

1. Let S1 = {T | T is a table-subset of size 1 satisfying TS-Cost(T) ≥ C}; i = 1
2. While i < MAX-TABLES and |Si| > 0
3.    i = i + 1; Si = {}
4.    Let G = {T | T is a table-subset of size i, and ∃ s ∈ Si-1 such that s ⊂ T}
5.    For each T ∈ G
         If TS-Cost(T) ≥ C Then Si = Si ∪ {T}
6.    End For
7. End While
8. S = S1 ∪ S2 ∪ … ∪ SMAX-TABLES
9. R = {T | T ∈ S and TS-Weight(T) ≥ C}
10. Return R

Figure 2. Algorithm for finding interesting table-subsets in the workload.

Although TS-Weight(T) is a reasonable metric for relative importance of a table-subset, there does not appear to be an obvious efficient algorithm for finding all table-subsets whose TS-Weight exceeds a given threshold. In contrast, the TS-Cost metric has the property of "monotonicity" since for table-subsets T1, T2, T1 ⊆ T2 ⇒ TS-Cost(T1) ≥ TS-Cost(T2). This is because in all queries where T2 occurs, T1 (and likewise all other subsets of T2)
also occur. This monotonicity property of TS-Cost allows us to leverage efficient algorithms proposed for identifying frequent itemsets, e.g., as in [2], to identify all table-subsets whose TS-Cost exceeds the specified threshold. Fortunately, it is also the case that if TS-Weight(T) ≥ C (for any threshold C), then TS-Cost(T) ≥ C. Therefore, our algorithm (shown in Figure 2) for identifying interesting table-subsets by the TS-Weight metric has two steps: (a) Prune table-subsets not satisfying the given threshold using the TS-Cost metric. (b) Prune the table-subsets retained in (a) that do not satisfy the given threshold using the TS-Weight metric. We note that the efficiency gained by the algorithm is due to reduced CPU and memory costs by not having to enumerate all table-subsets.

In Figure 2, we define the size of a table-subset T to be the number of tables in T. MAX-TABLES is the maximum number of tables referenced in any query in the workload. A lower threshold C leads to a larger space being considered and vice versa. Based on experiments on various databases and workloads, we found that using C = 10% of the total workload cost had a negligible negative impact on the solution compared to the case when there is no cut off (C = 0), but was significantly faster (see Section 6.2.1 for details).

4.2. Exploiting the Query Optimizer to Prune Syntactically Relevant Materialized Views

The algorithm for identifying interesting table-subsets presented in Section 4.1 significantly reduces the number of syntactically relevant materialized views that must be considered for a workload. Nonetheless, many of these views may still not be useful for answering any query in the workload. This is because the decision of whether or not a materialized view is useful in answering a query is made by the query optimizer using cost estimation. Therefore, our goal is to prevent syntactically relevant materialized views that are not used in answering any query from being considered during configuration enumeration. We achieve this goal using the algorithm shown in Figure 3, which is based on the intuition that if a materialized view is not part of the best solution for even a single query in the workload, then it is unlikely to be part of the best solution for the entire workload. This approach is similar to the one used in [4] for selecting candidate indexes. For a given query Q, and a set S of materialized views (and indexes on them) proposed for Q, Step 4 of our algorithm assumes the existence of the function Find-Best-Configuration(Q, S) that returns the best configuration for Q from S. Find-Best-Configuration has the property that the choice of the best configuration for a query is cost based, i.e., it is the configuration that the optimizer estimates as having the lowest cost for Q. Any suitable search method can be used in this function, e.g., the Greedy(m,k) algorithm described in Section 2. We also note that in the presence of updates or storage constraints, we may need to pick more than one configuration for a query (e.g., the n best configurations) in Step 4 to maintain quality, at the expense of increased running time during configuration enumeration [4].

Next we discuss the issue of which syntactically relevant materialized views should be proposed for a query Qi in Step 3. Observe that among the interesting table-subsets that occur in Qi, it is not sufficient to propose materialized views only on the table-subset that exactly matches the tables referenced in Qi. One reason for this is that the language of views may not match the language of queries, e.g., the query may contain a nested sub-query whereas the view cannot. Also, for complex queries the query optimizer performs algebraic transformations of the query to find a better execution plan. In such cases, determining which of the interesting table-subsets to consider requires analysis of the structure of the query as well as knowledge of the transformations considered by the query optimizer. We also note that due to the pruning of table-subsets in the previous step (Section 4.1), the table-subset that exactly matches the tables referenced in the query may not even be deemed interesting. In such cases, it again becomes important to consider smaller interesting table-subsets that occur in Qi. Fortunately, due to the effective pruning achieved by the algorithm for finding interesting table-subsets (Figure 2), we are able to take the simple approach of proposing syntactically relevant materialized views for a query Qi on all interesting table-subsets that occur in Qi.

1. M = {} /* M is the set of materialized views that are useful for at least one query in the workload W */
2. For i = 1 to |W|
3.    Let Si = Set of materialized views proposed for query Qi.
4.    C = Find-Best-Configuration(Qi, Si)
5.    M = M ∪ C;
6. End For
7. Return M

Figure 3. Cost-based pruning of syntactically relevant materialized views.

For each such interesting table-subset T, we propose (in Step 3): (1) A "pure-join"³ materialized view on T containing join and selection conditions in Qi on tables in T. (2) If Qi has grouping columns, then a materialized view similar to (1) but also containing GROUP BY columns and aggregate expressions from Qi on tables in T. It is also possible to propose additional materialized views on a table-subset that include only a subset of the selection conditions in the query on tables in T, since such views may also apply to other queries in the workload. However, in our approach, this aspect of exploiting commonality across queries in the workload is handled via view merging (Section 4.3). For each materialized view proposed, we also propose a set of clustered and non-clustered indexes on the materialized view. We omit the details of this discussion due to lack of space. Our experiments (see Section 6.2.3) show that the above algorithm is not only efficient, but also dramatically reduces the number of materialized views that need to be considered in configuration enumeration (Figure 1).

³ In principle, such a "pure-join" materialized view can also be generated via view merging (Section 4.3). We omit this discussion due to lack of space.

4.3. View Merging

We observe that if the materialized views that enter configuration enumeration are limited to the ones selected
by the algorithm presented in Section 4.2, then we can get sub-optimal recommendations for the workload when storage is constrained (see Example 1). This observation suggests that we need to consider the space of materialized views that, although not optimal for any individual query, are useful for multiple queries, and therefore may be optimal for the workload. However, proposing such a set of syntactically relevant materialized views by analyzing multiple queries at once could lead to an explosion in the number of merged materialized views proposed. Instead, our approach is based on the observation that M, the set of materialized views returned by the algorithm in Figure 3 (Section 4.2), contains materialized views that were selected on a cost basis and are therefore sure (or very likely) to be used by the query optimizer. This set M is therefore a good starting point for generating additional "merged" materialized views that are derived by exploiting commonality among views in M. The newly generated set of merged views, along with M, are our candidate materialized views. Our approach is significantly more scalable than the alternative of generating merged views starting from all syntactically relevant materialized views.

An important issue in view merging is characterizing the space of merged views to be explored. In our approach, we have decided to explore this space using a sequence of pair-wise merges. Thus, the two key issues that must be addressed are: (1) determining the criteria that govern when and how a given pair of views is merged (Section 4.3.1), and (2) enumerating the space of possible merged views (Section 4.3.2). Architecturally, our approach for view merging is similar to the one adopted in our prior work on index merging [7]. However, the algorithm for merging a pair of views needs to recognize the fact that views (unlike indexes) are multi-table structures that may contain selections, grouping and aggregation. These differences significantly influence the way in which a given pair of views is merged.

4.3.1. Merging a Pair of Views

Our goal when merging a given pair of views, referred to as the parent views, is to generate a new view, called the merged view, which has the following two properties. First, all queries that can be answered using either of the parent views should be answerable using the merged view. Second, the cost of answering these queries using the merged view should not be significantly higher than the cost of answering the queries using views in M (the set obtained using the algorithm in Figure 3). Our algorithm for merging a pair of views, called MergeViewPair, is shown in Figure 4. Intuitively, the algorithm achieves the first property by structurally modifying the parent views as little as possible when generating the merged view, i.e., by retaining the common aspects of the parent views and generalizing only their differences. For simplicity, we present the algorithm for SPJ views with grouping and aggregation, where the selection conditions are conjunctions of simple predicates. The algorithm can be generalized to handle complex selection conditions as well as account for differences in constants between conditions on the same column.

Note that a merged view v may be derived starting from views in M through a sequence of pair-wise merges. We define Parent-Closure(v) as the set of views in M from which v is derived. The goal of Step 4 in the MergeViewPair algorithm is to achieve the second property mentioned above by preventing a merged view from being generated if it is much larger than the views in Parent-Closure(v). Precisely characterizing the factors that determine the value of the size increase threshold (x) requires further work. In our implementation on Microsoft SQL Server, we have found that setting x between 1 and 2 works well over a variety of databases and workloads.

1. Let v1 and v2 be a pair of materialized views that reference the same tables and the same join conditions.
2. Let s11, …, s1m be the selection conditions that occur in v1 but not in v2. Let s21, …, s2n be the selection conditions that occur in v2 but not in v1.
3. Let v12 be the view obtained by (a) taking the union of the projection columns of v1 and v2, (b) taking the union of the GROUP BY columns of v1 and v2, (c) pushing the columns of s11, …, s1m and s21, …, s2n into the GROUP BY clause of v12, and (d) including selection conditions common to v1 and v2.
4. If (|v12| > Min Size(Parent-Closure(v1) ∪ Parent-Closure(v2)) * x) Then Return Null.
5. Return v12.

Figure 4. MergeViewPair algorithm

We note that in Step 4, MergeViewPair requires estimating the size of a materialized view. One way to achieve this is to obtain an estimate of the view size from the query optimizer. The accuracy of such estimation depends on the availability of an appropriate set of statistics for query optimization [6]. Alternatively, less expensive heuristic techniques have been proposed in [19] for more restricted multidimensional scenarios.

4.3.2. Algorithm for generating merged views

1. R = M
2. While (|R| > 1)
3.    Let M' = The set of merged views obtained by calling MergeViewPair on each pair of views in R.
4.    If M' = {} Return (R – M)
5.    R = R ∪ M'
6.    For each view v ∈ M', remove both parents of v from R
7. End While
8. Return (R – M).

Figure 5. Algorithm for generating a set of merged views from a given set of views M

Our algorithm for generating a set of merged views from a given set of views is shown in Figure 5. As mentioned earlier, we invoke this algorithm with the set of materialized views M (obtained using the algorithm in Figure 3). We comment on several properties of the algorithm. First, note that it is possible for a merged materialized view generated in Step 3 to be merged again in a subsequent iteration of the outer loop (Steps 2-7). This allows more than two views in M to be combined into one merged view even though the merging is done
pair-wise. Second, although the number of new merged views explored by this algorithm can be
exponential in the size of M in the worst case, we observe that far fewer merged materialized
views are explored in practice (see Section 6.2.3) because of the checks built into Step 4 of
MergeViewPair (Figure 4). Third, the set of merged views returned by the algorithm does not
depend on the exact sequence in which views are merged (we omit the proof due to lack of
space). Furthermore, the algorithm is guaranteed to explore all merged views that can be
generated using any sequence of merges using MergeViewPair starting with the views in M. The
algorithm in Figure 5 has been presented in its current form for simplicity of exposition.
Note, however, that if views v1 and v2 cannot be merged to form v12, then no other merged view
derived from v1 and v2 is possible, e.g., v123 is not possible. We can leverage this
observation to increase efficiency by using techniques for finding frequent itemsets, e.g., as
in [2]. Finally, we can ensure that merged views generated by this algorithm are actually
useful in answering queries in the workload by performing a cost-based pruning using the query
optimizer (similar to the algorithm in Figure 3).

5. Trading Choices of Indexes and Materialized Views

Previous work in physical database design has considered the problems of index selection and
materialized view selection in isolation. However, indexes and materialized views are
fundamentally similar: both are redundant structures that speed up query execution, compete
for the same resource (storage), and incur maintenance overhead in the presence of updates.
Not surprisingly, indexes and materialized views can interact with one another, i.e., the
presence of an index can make a materialized view more attractive and vice versa. Therefore,
as described in Section 2, our approach is to consider joint enumeration of the space of
candidate indexes and materialized views. In this section, we compare our approach to
alternative approaches and quantify the benefit of joint enumeration.
There are two alternatives to our approach of jointly enumerating the space of indexes and
materialized views. One alternative is to pick materialized views first, and then select
indexes for the workload given the materialized views picked earlier (we denote this
alternative by MVFIRST). The second alternative reverses the above order and picks indexes
first, followed by materialized views (INDFIRST). We have implemented these alternatives on
Microsoft SQL Server 2000, and conducted extensive experiments to compare their quality and
efficiency. These experiments (see Section 6.2.5) support the hypothesis that joint
enumeration results in significantly better quality solutions than the two alternatives, and
also show the scalability of our approach.

5.1. Selecting one feature set followed by the other

In both MVFIRST and INDFIRST, if the global storage bound is S, then we need to determine a
fraction f (0 ≤ f ≤ 1), such that a storage constraint of f*S is applied to the selection of
the first feature set. After selecting the first feature set, all the remaining storage can be
used when picking the second feature set. This raises the issue of how to determine the
fraction f of the total storage bound to be allocated to the first feature set. In practice,
the “optimal” fraction f depends on several attributes of the workload, including the amount of
updates and the complexity of queries, as well as the absolute value of the total storage
bound. In our empirical evaluation of MVFIRST and INDFIRST we found that the optimal value of
f changes from one workload to the next. Furthermore, even at this optimal data point, the
quality of the solution is inferior in most cases compared to our approach (JOINTSEL).
Another problem relevant to both INDFIRST and MVFIRST is that recommendations become redundant
if the feature selected second is better for a query than the feature selected first. This
happens because the feature set selected first is fixed and cannot be revisited subsequently.
A further drawback of MVFIRST is that selecting materialized views first can adversely affect
the quality of the candidate indexes picked. This is because for a given query, the best
materialized view is likely to be more beneficial than the best index since a materialized
view can pre-compute (parts of) the query (via aggregations, grouping, joins etc.). Therefore,
when materialized views are chosen first, they are likely to preclude selection of potentially
useful candidate indexes for the workload.

5.2. Joint Enumeration

The two attractions of joint enumeration of candidate indexes and materialized views are:
(a) a graceful adjustment to storage bounds, and (b) the ability to consider interactions
between candidate indexes and candidate materialized views, which is not possible in the other
approaches. For example, consider a query Q for which indexes I1, I2 and materialized view v
are candidates. Assume that I1 alone reduces the cost of Q by 25 units and I2 reduces the cost
by 30 units, but I1 and v together reduce the cost by 100 units. Then, using INDFIRST, I2
would eliminate I1 when indexes are picked, and we would not be able to get the optimal
recommendation {I1, v}. We use the Greedy(m,k) algorithm for enumeration, which allows us to
treat indexes, materialized views and indexes on materialized views on the same footing. We
demonstrate the quality and scalability of this algorithm for index and materialized view
selection in Section 6.2.4.

6. Experiments

We have implemented the algorithms presented in this paper on Microsoft SQL Server 2000. In
the first set of experiments, we evaluate the quality and running time of our algorithm for
selecting candidate materialized views (Section 4). We demonstrate that: (1) Our algorithm for
identifying interesting table-subsets for a workload (Section 4.1) does not eliminate useful
materialized views, while substantially reducing the number of materialized views that need to
be proposed. (2) The application of our view merging algorithm (Section 4.3) significantly
improves quality of the recommendation, especially when storage is at a premium.
Our second set of experiments is related to the architectural issues in this paper. We show
that: (1) Our candidate selection module (Figure 1) significantly reduces the running time
compared to an exhaustive scheme that does not use this module, while maintaining
high quality recommendations. (2) Our configuration enumeration module Greedy(m,k) gives
results comparable to an exhaustive algorithm that enumerates over all subsets of candidates,
and runs significantly faster. (3) Our approach for joint enumeration over the space of
indexes and materialized views (JOINTSEL) gives significantly better solutions than MVFIRST or
INDFIRST.

6.1. Experimental Setup

The experiments were run on two Dell Precision 610 machines with 550 MHz CPU and 256 MB RAM.
The databases used for our tests were stored on an internal 16.9 GB hard drive.
Databases: The algorithms presented in this paper have been extensively tested on several real
and synthetic databases as part of the shipping process of the tuning wizard for Microsoft SQL
Server 2000. However, due to lack of space and the intrinsic difficulty of comparing our
algorithms with “optimal” algorithms on large workloads, we limit our experiments to
relatively small workloads on the TPC-H [20] 1GB database as well as one real-world database
used within Microsoft to track the sales of products by the company. Therefore, the
experiments presented should be interpreted as illustrative rather than exhaustive empirical
validation.

Name             #queries                    Remarks
TPCH-22          22                          TPC-H benchmark
TPCH-UPD25       25                          25% update statements
TPCH-UPD75       25                          75% update statements
WKLD-4TBL        100                         Max 4-table queries
WKLD-8TBL        100                         Max 8-table queries
WKLD-VM          50                          Real-world workload
WKLD-SCALE(n)    n = 25, 50, 75, 100, 125    Workloads of increasing size

summarized in Table 1. We created the synthetic workloads using a program that can generate
Select, Insert, Delete and Update statements. The queries generated by this program are
limited to Select, Project, Join queries with Group By and Aggregation. Nested sub-queries
connected via an EXISTS clause can also be generated. In all experiments we use the cost of
the workload for the recommended configuration as a measure of the quality of that
configuration.

6.2. Experimental Results

6.2.1. Evaluation of algorithm for identifying interesting table-subsets

In this experiment, we evaluate the reduction in number of syntactically relevant materialized
views proposed by our algorithm (see Section 4.1) and its impact on quality compared to an
approach that exhaustively proposes all syntactically relevant materialized views. We carry
out this comparison for three workloads: TPCH-22 (the original benchmark), WKLD-4TBL, and
WKLD-8TBL (see Table 1). We used a threshold of C=10%. Figure 6 shows that across all three
workloads, our algorithm achieves significant pruning of the space of syntactically relevant
materialized views. Furthermore, as seen in Figure 7, we see a small drop in quality. This
experiment shows that our pruning is effective and yet does not miss out on important
table-subsets.

[Bar chart. x-axis: Workload (TPCH-22, WKLD-4TBL, WKLD-8TBL); y-axis: Reduction in number of
materialized views]
Figure 6. Reduction in syntactically relevant materialized views proposed compared to
Exhaustive

[Bar chart. x-axis: Workload (TPCH-22, WKLD-4TBL, WKLD-8TBL); y-axis: Drop in quality]
Figure 7. Comparison of quality of our algorithm to Exhaustive.

[Chart. x-axis: Storage bound (MB), 2200 to 3200]
Figure 8. Quality vs. storage bound with and without view merging.

6.2.2. Evaluation of view merging algorithm

Next, we illustrate the importance of view merging (Section 4.3) using workload WKLD-VM (see
Table 1), which consists of 50 real-world queries (SPJ with grouping and aggregation). We
compare two versions of our algorithm, with and without our view merging module. Figure 8
shows the improvement in quality of the solution as the total storage bound is varied from
2.2 GB to 3.2 GB. We see that at low storage constraints the version with view merging
significantly outperforms the version without view merging. As expected, when the
storage bound is increased, the two versions converge to the same solution. For the above
workload the number of additional merged views proposed was about 19%, and the increase in
running time due to view merging was about 9%. Finally, we note that yet another positive
aspect of view merging is that it produces more compact recommendations (i.e., having fewer
materialized views).

Workload                     Ratio of        % improv. in quality   % improv. in quality
                             running time    Without                With
TPC-H queries Q1, Q2, Q3     64              98.1%                  97.6%
TPC-H queries Q4, Q5         13              93.6%                  93.6%
TPC-H queries Q6, Q7, Q8     31              73.4%                  73.4%
TPC-H queries Q9, Q10, Q11   14              66.6%                  60.1%

Table 2. Comparison of schemes with and without the candidate selection module.

6.2.3. Evaluation of Candidate Selection

Table 2 compares the running time and quality of our approach to an exhaustive approach in
which the candidate selection step (Section 4) is omitted, i.e., all syntactically relevant
materialized views and indexes are considered in the configuration enumeration. In both cases,
we use Greedy(m,k) as the algorithm for

architectures MVFIRST and INDFIRST (Section 5.1). We study the quality of these alternatives
when they are not subject to any storage constraint (i.e., storage = ∞). Table 4 shows that
even with no storage constraint the quality of solution using MVFIRST is significantly worse
than the quality of JOINTSEL, particularly in the presence of updates in the workload. This
confirms our intuition that picking materialized views first adversely affects the subsequent
selection of indexes (see Section 5.1) even in a query-only workload (TPCH-22). In the
presence of updates, the solution of MVFIRST degenerates rapidly (TPCH-UPD25) compared to
JOINTSEL. We therefore drop this alternative from further experiments. We note that the
quality of INDFIRST is comparable to JOINTSEL on TPCH-22 when storage is not an issue. In the
presence of updates (TPCH-UPD25), however, the INDFIRST recommendations are inferior compared
to JOINTSEL.

[Chart: Number of candidate materialized views vs. workload size. y-axis: Number of candidate
materialized views, 0 to 200]
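The interaction example of Section 5.2 (indexes I1, I2 and materialized view v for a query Q)
can be replayed in a few lines of Python to show why a staged search such as INDFIRST can miss
the configuration that a joint search finds. This is only a sketch under stated assumptions:
the benefits of {I1}, {I2} and {I1, v} are taken from the text, every other benefit value is
invented so the code can run, a limit on the number of structures stands in for the storage
bound, and greedy_m_k is a simplified reading of Greedy(m,k) (an exhaustive seed of at most m
structures, then greedy growth up to k structures) rather than the implementation used in the
tool.

```python
from itertools import combinations

# Cost reduction for query Q under each configuration.
# Only {I1}, {I2} and {I1, v} come from the Section 5.2 example;
# all other values are assumptions made for this illustration.
BENEFIT = {
    frozenset(): 0,
    frozenset({"I1"}): 25,              # from the text
    frozenset({"I2"}): 30,              # from the text
    frozenset({"v"}): 20,               # assumed
    frozenset({"I1", "I2"}): 35,        # assumed
    frozenset({"I1", "v"}): 100,        # from the text
    frozenset({"I2", "v"}): 45,         # assumed
    frozenset({"I1", "I2", "v"}): 100,  # assumed
}


def benefit(config):
    """Look up the (hypothetical) benefit of a set of structures."""
    return BENEFIT[frozenset(config)]


def best_subset(candidates, max_size, base=frozenset()):
    """Exhaustively pick up to max_size candidates to add to base."""
    best = frozenset(base)
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(candidates), size):
            trial = frozenset(base) | frozenset(combo)
            if benefit(trial) > benefit(best):
                best = trial
    return best


def greedy_m_k(candidates, m, k):
    """Simplified Greedy(m,k): exhaustive seed of size <= m, then greedy up to k."""
    config = best_subset(candidates, m)
    while len(config) < k:
        remaining = sorted(set(candidates) - config)
        if not remaining:
            break
        nxt = max(remaining, key=lambda s: benefit(config | {s}))
        if benefit(config | {nxt}) <= benefit(config):
            break  # no remaining structure improves the configuration
        config = config | {nxt}
    return config


def indfirst(indexes, views, n_indexes, n_views):
    """Staged search: fix the best indexes first, then add views on top."""
    picked = best_subset(indexes, n_indexes)
    return best_subset(views, n_views, base=picked)


if __name__ == "__main__":
    joint = greedy_m_k({"I1", "I2", "v"}, m=2, k=2)
    staged = indfirst({"I1", "I2"}, {"v"}, n_indexes=1, n_views=1)
    print("joint :", sorted(joint), benefit(joint))
    print("staged:", sorted(staged), benefit(staged))
```

With these assumed numbers the joint search returns {I1, v} with benefit 100, while the staged
search commits to I2 first and ends at {I2, v} with benefit 45. Running greedy_m_k with m=1
also ends at {I2, v}, which suggests why an exhaustive seed larger than one structure matters
when structures interact.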
fraction whereas for s=0.5, f=0.50 is the right fraction. For a given database and a workload,
the optimal storage partitioning varies with the storage constraint. Finally, we study the
behavior of INDFIRST vs. JOINTSEL for three workloads and a fixed total storage, as the
fraction of storage allotted to indexes (f) is varied. Figure 11 shows that the best
allocation fraction is different for each workload, e.g., f = 0.25 is best for TPC-H and
TPCH-UPD25 but f = 0.50 is optimal for TPCH-UPD75. For a given database and a storage space,
the “right” partition varies with the workload. In contrast, we see the consistently high
quality of JOINTSEL across various workloads. We also note that the running times of JOINTSEL
and INDFIRST are comparable to one another (within approximately 10% of each other for the
workloads we experimented with). For example, for the data point where additional storage
allowed = 100%, for the TPCH-22 workload, JOINTSEL is slightly faster than INDFIRST (f=0.50)
by about 4% whereas for the TPCH-UPD25 workload, INDFIRST is faster by about 6%.

[Chart: Drop in quality of INDFIRST compared to JOINTSEL with varying storage (TPC-H
workload). y-axis: Drop in quality]

[Chart. Series: TPCH-22, TPCH-UPD25, TPCH-UPD75; x-axis: Storage allotted for indexes as
fraction of total storage bound (f=0.25, f=0.5, f=0.75); y-axis: Drop in quality]
Figure 11. Quality of INDFIRST vs. JOINTSEL with varying storage partitioning (f).

7. Conclusion

The architecture and novel algorithms presented in this paper are the foundation of a robust
physical database design tool for Microsoft SQL Server 2000 that can recommend both indexes
and materialized views. In a recent paper, Kotidis et al. [13] present a technique for OLAP
databases to dynamically determine which materialized views should be maintained. Extending
this paradigm to SQL workloads is a significantly more complex problem, but is worth
exploring. Another challenging task is developing a theoretical framework and appropriate
abstractions for physical database design that is able to capture complexities of the physical
design problem, and thus enables us to compare properties of alternative algorithms. Finally,
note that indexes and materialized views are only a part of the physical design space. In the
context of the AutoAdmin project [1], we continue to pursue our long-term goal of a complete
physical design tool for SQL databases.

8. Acknowledgments

We thank Gautam Das for his help in analyzing the main algorithms presented in this paper. We
thank the Microsoft SQL Server team for their help in providing the necessary server-side
support for our implementation.

9. References

1. AutoAdmin project, Microsoft Research. http://www.research.microsoft.com/dmx/AutoAdmin
2. Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules in Large Databases.
   VLDB 1994.
3. Baralis E., Paraboschi S., Teniente E. Materialized View Selection in a Multidimensional
   Database. VLDB 1997.
4. Chaudhuri S., Narasayya V. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL
   Server. VLDB 1997.
5. Chaudhuri S., Narasayya V. AutoAdmin “What-If” Index Analysis Utility. ACM SIGMOD 1998.
   Implementing Data Cubes Efficiently. ACM SIGMOD 1996.
13. Kotidis Y., Roussopoulos N. DynaMat: A Dynamic View Management System for Data Warehouses.
    ACM SIGMOD 1999.
14. http://www.oracle.com/
15. Rozen S. Automating Physical Database Design: An Extensible Approach. Ph.D. Dissertation,
    New York University, 1993.
16. http://www.informix.com/informix/solutions/dw/redbrick/vista/
17. Rozen S., Shasha D. A Framework for Automating Physical Database Design. VLDB 1991.
18. Shukla A., Deshpande P.M., Naughton J.F. Materialized View Selection for Multidimensional
    Datasets. VLDB 1998.
19. Shukla A., Deshpande P.M., Naughton J.F., Ramaswamy K. Storage Estimation for
    Multidimensional Aggregates in the Presence of Hierarchies. VLDB 1996.
20. TPC Benchmark H (Decision Support) Revision 1.1.0. http://www.tpc.org/
21. Valentin G., Zuliani M., Zilio D., Lohman G., Skelley A. DB2 Advisor: An Optimizer Smart
    Enough to Recommend Its Own Indexes. ICDE 2000.
22. Yang J., Karlapalem K., Li Q. Algorithms For Materialized View Design in Data Warehousing
    Environment. VLDB 1997.