Multi-Query Optimization and Applications

Doctor of Philosophy

by

Prasan Roy

Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
2000
Approval Sheet

The thesis entitled Multi-Query Optimization and Applications by Prasan Roy is approved for the degree of Doctor of Philosophy.

Examiners:

Supervisor:

Chairman:

Date:
Place:
Abstract
Complex queries are becoming commonplace with the growing use of decision support systems. These complex queries often have a lot of common sub-expressions, either within a single query, or across multiple such queries. The focus of this work is to speed up query execution by exploiting these common subexpressions.

Given a set of queries in a batch, multi-query optimization aims at exploiting common subexpressions among these queries to reduce evaluation cost. Multi-query optimization has hitherto been viewed as impractical, since earlier algorithms were exhaustive, and explore a doubly exponential search space. We present novel heuristics for multi-query optimization, and demonstrate that optimization using these heuristics provides significant benefits over traditional optimization, at a very acceptable overhead in optimization time.

In online environments, where the queries are posed as part of an ongoing stream instead of in a batch, individual query response times can be greatly improved by caching final/intermediate results of previous queries, and using them to answer later queries. An automatic caching system that makes intelligent decisions on what results to cache would be an important step towards knobs-free operation of a database system. We describe an automatic query caching system called Exchequer which is closely coupled with the optimizer to ensure that the caching system and the optimizer make mutually consistent decisions, and experimentally illustrate the benefits of this approach.

Further, because the presence of views enhances query performance, materialized views are increasingly being supported by commercial database/data warehouse systems. Whenever the data warehouse is updated, the materialized views must also be updated. We show how to find an efficient plan for maintenance of a set of views, by exploiting common subexpressions between different view maintenance expressions. These common subexpressions may be materialized temporarily during view maintenance. Our algorithms also choose additional subexpressions/indices to be materialized permanently (and maintained along with other materialized views), to speed up view maintenance. In addition to faster view maintenance, our algorithms can also be used to efficiently select materialized views to speed up query workloads.
Contents
1 Introduction
    1.1 Problem Overview and Motivation
        1.1.1 Transient Materialization
        1.1.2 Dynamic Materialization
        1.1.3 Permanent Materialization
    1.2 Summary of Contributions
        1.2.1 Multi-Query Optimization
        1.2.2 Query Result Caching
        1.2.3 Materialized View Selection and Maintenance
    1.3 …

2 …
    2.1 Background
    2.2 Design of a Cost-based Query Optimizer
        2.2.1 Overview
        2.2.2 Logical Plan Space
        2.2.3 Physical Plan Space
        2.2.4 The Search Algorithm
        2.2.5 Differences from the Original Volcano Optimizer
    2.3 Summary

3 Multi-Query Optimization
    3.1 Setting Up The Search Space
    3.2 Reuse Based Multi-Query Optimization Algorithms
        3.2.1 Optimization in Presence of Materialized Views
        3.2.2 The Volcano-SH Algorithm
        3.2.3 The Volcano-RU Algorithm
    3.3 The Greedy Algorithm
        3.3.1 Sharability
        3.3.2 Incremental Cost Update
        3.3.3 The Monotonicity Heuristic
    3.4 Handling Physical Properties
    3.5 Extensions
        3.5.1 Selection of Temporary Indices
        3.5.2 Nested Queries
    3.6 Performance Study
        3.6.1 Basic Experiments
        3.6.2 Scaleup Analysis
        3.6.3 Effect of Optimizations
        3.6.4 Discussion
    3.7 …
    3.8 …

4 Query Result Caching
    4.1 Cache-Aware Query Optimization
        4.1.1 Consolidated DAG
        4.1.2 Query DAG Generation and Query/Cached Result Matching
        4.1.3 Volcano Extensions for Cache-Aware Optimization
    4.2 …
    4.3 …
    4.4 Differences from Prior Work
    4.5 Experimental Evaluation of the Algorithms
        4.5.1 Test Query Sequences
        4.5.2 Metric
        4.5.3 List of Algorithms Compared
        4.5.4 Experimental Results
    4.6 Extensions
    4.7 Summary

5 Materialized View Selection and Maintenance
    5.1 Related Work
    5.2 Overview of Our Approach
    5.3 Setting up the Maintenance Plan Space
        5.3.1 System Model
        5.3.2 Propagation-Based Differential Generation for Incremental View Maintenance
        5.3.3 Incorporating Incremental Plans in the Query DAG Representation
    5.4 Maintenance Cost Computation
    5.5 Transient/Permanent Materialized View Selection
        5.5.1 The Basic Greedy Algorithm
        5.5.2 Optimizations
        5.5.3 Extensions
    5.6 Performance Study
        5.6.1 Performance Model
        5.6.2 Performance Results
    5.7 …

A …
    A.1 List of Queries Used in Section 3.6
    A.2 List of View Definitions Used in Section 5.6

B List of Logical Transformations

C Operator Cost Estimates
List of Figures
1.1 Example illustrating benefits of sharing computation
1.2 Example illustrating the benefit of caching intermediate results
1.3 Example view maintenance plan. Merge refreshes a view given its delta.
2.1 Overview of Cost-based Transformational Query Optimization
2.2 Logical Query DAG for A ⋈ B ⋈ C. Commutativity not shown; every join node has another join node with inputs exchanged, below the same equivalence node.
2.3 Logical Plan Space Generation for A ⋈ B ⋈ C
2.4 Algorithm for Logical Query DAG Generation
2.5 Physical Query DAG for A ⋈ B
2.6 Algorithm for Physical Query DAG Generation
2.7 The Search Algorithm
3.1 The Volcano-SH Algorithm
3.2 The Volcano-RU Algorithm
3.3 The Greedy Algorithm
3.4 Incremental Cost Update
3.5 Example Showing Cost Propagation through Physical Equivalence Nodes
3.6 Optimization of Stand-alone TPCD Queries
3.7 Execution of Stand-alone TPCD Queries on MS SQL Server
3.8 Optimization of Batched TPCD Queries
3.9 Optimization of Scaleup Queries
3.10 Complexity of the Greedy Heuristic
4.1 Architecture of the Exchequer System
4.2 (a) CDAG for A ⋈ C ⋈ D and A ⋈ C ⋈ E (b) Unexpanded A ⋈ B ⋈ C inserted into CDAG (c) A ⋈ B ⋈ C expanded into CDAG
4.3 The Greedy Algorithm for Cache Management
4.4 Distribution of distinct intermediate results generated during the processing of the CubePoints and CubeSlices workloads
4.5 Performance on 900 Query CubePoints/Zipf-0.5 Workload
4.6 Performance on 900 Query CubePoints/Zipf-2.0 Workload
4.7 Performance on 900 Query CubeSlices/Zipf-0.5 Workload
4.8 Performance on 900 Query CubeSlices/Zipf-2.0 Workload
5.1 The Greedy Algorithm for Selecting Views for Transient/Permanent Materialization
5.2 Effect of Transient and Permanent Materialization
5.3 Effect of Adaptive Maintenance Policy Selection
5.4 Scalability analysis on increasing number of views
Chapter 1 Introduction
Complex queries are becoming commonplace, especially due to the advent of automatic tools that help analyze information from large data warehouses. These complex queries often have several subexpressions in common since (i) they make extensive use of views, which are referred to multiple times in the query, and (ii) many of them are correlated nested queries in which parts of the inner subquery may not depend on the outer query variables, thus forming a common subexpression for repeated invocations of the inner query.
Figure 1.1: Example illustrating benefits of sharing computation

Example 1.1.1 Consider a batch of two queries, (A ⋈ B ⋈ C) and (B ⋈ C ⋈ D). A traditional system will execute each of these queries independently using the individual best plans as suggested by the query optimizer; let these best plans be as shown in Figure 1.1(a). The base relations A, B, C and D each have a scan cost of 10 units. Each of the joins has a cost of 100 units, giving a total execution cost of 460 units. On the other hand, in the plan shown in Figure 1.1(b), the intermediate result B ⋈ C is computed once and materialized at a cost of 10. Then, it is scanned back twice, the first time to join with A in order to compute A ⋈ B ⋈ C, and the second time to join with D in order to compute B ⋈ C ⋈ D, at a cost of 10 per scan. Each of these joins has a cost of 100 units. The total cost of this consolidated plan is thus 370 units, which is about 20% less than the cost of the traditional plan of Figure 1.1(a), demonstrating the benefit of sharing computation during query processing.

The expression B ⋈ C above is a common subexpression across the two queries, and its result is materialized transiently, only for the duration of the batch. We address the problem of finding the cheapest execution plan for a batch of queries, exploiting transiently materialized common subexpressions; this is termed multi-query optimization. Section 1.2.1 provides further details of our work on multi-query optimization.

Multi-query optimization is an important practical problem. For instance, SQL-3 stored procedures may invoke several queries, which can be executed as a batch. Further, data analysis/reporting often requires a batch of queries to be executed. Recent work on using relational databases for storing XML data has found that queries on XML data, written in a language such as XML-QL and containing regular path expressions, are translated into a batch of relational queries; these queries have a large amount of overlap and can benefit significantly from multi-query optimization.
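The arithmetic of Example 1.1.1 can be checked mechanically. The following Python sketch is ours, not the thesis's; the cost constants are taken directly from the example:

SCAN = 10    # cost of scanning a base relation (Example 1.1.1)
JOIN = 100   # cost of each join
MAT = 10     # cost of materializing an intermediate result to disk
REREAD = 10  # cost of scanning a materialized result back in

# Traditional plans: A join (B join C) and D join (B join C), independently.
unshared = 2 * (3 * SCAN + 2 * JOIN)        # 2 * 230 = 460

# Consolidated plan: compute B join C once, materialize it, reuse it twice.
shared = (4 * SCAN      # scan A, B, C, D once each
          + JOIN        # compute B join C once
          + MAT         # materialize it
          + 2 * REREAD  # scan it back once per query
          + 2 * JOIN)   # join with A, and with D
assert (unshared, shared) == (460, 370)
print(f"unshared={unshared}, shared={shared}")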
Figure 1.2: Example illustrating the benefit of caching intermediate results
Example 1.1.2 Consider again the two queries (A ⋈ B ⋈ C) and (B ⋈ C ⋈ D) of Example 1.1.1, this time occurring one after another as a part of a workload sequence. As earlier, the queries are on the base relations A, B, C and D, each having a scan cost of 10. The execution of the two queries, when caching is not supported, costs 230 for each query as shown in Figure 1.2(a), totaling 460. Contrast this with the execution of the queries as shown in Figure 1.2(b). In this case, during the execution of the first query, the intermediate result B ⋈ C is cached to the disk, at a cost of 10, and reused at a cost of 10 per query; the total execution cost for the two queries is now 370. This illustrates the benefit of caching and reusing intermediate results.
We use the term query result caching to mean caching of final and/or intermediate results of queries. Query result caching differs from multi-query optimization in that at the moment a given query is being executed, later queries in the workload sequence are not known. The main issue in query result caching is thus to dynamically determine the utility of a result, so as to figure out when to admit it into the cache and when to dispose of it in favor of another result. Further details of our work on query result caching appear in Section 1.2.2.
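As a rough illustration of this dynamic utility reasoning, consider the following Python sketch. It is our own simplification, not Exchequer's actual policy; the fields and the savings-per-page metric are assumptions:

from dataclasses import dataclass

@dataclass
class Result:
    name: str
    size: int          # cache pages occupied if admitted
    compute_cost: int  # cost to recompute the result from scratch
    reuse_cost: int    # cost to read the result back from the cache
    est_uses: float    # estimated future uses; for intermediate results
                       # this depends on what else is in the cache

def benefit_density(r: Result) -> float:
    # Estimated savings per page of cache space if r is kept cached.
    return r.est_uses * (r.compute_cost - r.reuse_cost) / r.size

def should_replace(incoming: Result, victim: Result) -> bool:
    # Admit `incoming` by evicting `victim` only if utility increases.
    return benefit_density(incoming) > benefit_density(victim)

bc = Result("BC", size=1, compute_cost=120, reuse_cost=10, est_uses=2.0)
old = Result("old", size=1, compute_cost=40, reuse_cost=10, est_uses=1.0)
print(should_replace(bc, old))  # True: BC saves more per page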
Figure 1.3: Example view maintenance plan. Merge refreshes a view given its delta.

These choices are interdependent, and need to be taken in an interleaved manner. This is termed materialized view selection and maintenance. Further details of our approach for materialized view selection and maintenance are presented in Section 1.2.3.

Example 1.1.3 Suppose we have three materialized views defined over a set of base relations, of which only a few are updated; the remaining relations are not changed. This reflects reality in data warehouses, where only a few of the relations are updated. (However, our techniques do not have any restrictions on what is updated, or what is the form of the updates.) If the maintenance plans of the three views are chosen independently, the best view maintenance plan (incremental or recomputation) for each would be chosen, without any sharing of computation. In contrast, as an illustration of the kind of plans our optimization methods are able to generate, Figure 1.3 shows a maintenance plan for the views that exploits sharing of computation. Here, some of the views are refreshed incrementally, using the deltas of the updated relations, while the others are recomputed. A common subexpression of the views is materialized transiently, and is disposed of as soon as the views are refreshed; keeping it permanently materialized would be expensive, since it would itself have to be maintained on every update. The transient result serves double duty: its differential is used in the incremental maintenance plans, while its full result is then used to recompute the remaining views.
1. The search space for multi-query optimization is doubly exponential in the size of the queries, and exhaustive strategies are therefore impractical; as a result, multi-query optimization was hitherto considered too expensive to be useful. We show how to make multi-query optimization practical, by developing novel heuristic algorithms. Further, our algorithms can be easily extended to perform multi-query optimization on nested queries as well as on multiple invocations of parameterized queries (with different parameter values). Our algorithms also take into account sharing of computation based on subsumption; an example of such sharing is computing the selection σ(A<5)(E) from the result of the broader selection σ(A<10)(E). Our algorithms are independent of the data model and the cost model, and are extensible to new operators.

2. In addition to choosing what intermediate expression results to materialize and reuse, our optimization algorithms also choose physical properties, such as sort order, for the materialized results. By modelling the presence of an index as a physical property, our algorithms also handle the choice of what (temporary) indices to create on materialized results/database relations.

We believe that in addition to our technical contributions, another of our contributions lies in showing how to engineer a practical multi-query optimization system, one which can smoothly integrate extensions, such as indexes and nested queries, allowing them to work together seamlessly.
performance is to allocate a limited-size area on the disk to be used as a cache for results computed by previous queries. The contents of the cache may be utilized to speed up the execution of subsequent queries. We use the term query caching to mean caching of final and/or intermediate results of queries.

Most existing decision support systems support static view selection: select a set of views a priori, and keep them permanently on disk. The selection is based on either (a) the intuition of the system administrator, or (b) the recommendation of advisor wizards, as supported by Microsoft SQL-Server [1], based on a workload history. The advantage of query caching over static view selection is that it can cater to changing workloads: the data access patterns of the queries cannot be expected to be static, and to answer all types of queries efficiently, we need to dynamically change the cache contents.

In Chapter 4, we present the techniques needed (a) for intelligently and automatically managing the cache contents, given the cache size constraints, as queries arrive, and (b) for performing query optimization exploiting the cache contents, so as to minimize the overall response time for all the queries. The contributions of this work are:

1. We show how to handle the caching of intermediate as well as final results of queries. Intermediate results, in particular, require careful handling since caching decisions are typically made based on usage rates, and usage rates of intermediate results are dependent on what else is in the cache. Techniques for caching intermediate results were proposed in [10], but they are based only on usage rates and would be biased against results that are currently not in the cache. Our caching algorithms use sophisticated techniques for deciding what to cache, taking into account what other results are cached. Moreover, we show how to consider caching indices constructed on the fly in the same way as we consider caching of intermediate results.

2. We show how to enable the optimizer to take into consideration the use of cached results and indices, piggybacked on the optimization step with negligible overhead. All prior cache-aware optimization algorithms have a separate cached-result matching step.

3. Our algorithms are extensible to new operations, unlike much of the prior work on caching. Moreover, prior work has mainly concentrated on cube queries; while cube queries are important, general purpose decision support systems must support more general queries as well. Our algorithms can handle any SQL query, including nested queries. To the best of our knowledge, no other caching technique is capable of handling caching of intermediate results for such a general class of queries.

4. We have implemented the proposed techniques and present a performance study that clearly demonstrates the benefits of our approach. Our study shows that intelligent, workload-adaptive intermediate query result caching can be done fast enough to be practical, and leads to significant overall savings.

In this work, we confine our attention only to the issue of efficient query processing, ignoring updates. Data warehouses are an example of an application where the cache replacement algorithm can ignore updates, since updates happen only periodically (once a day or even once a week).
1. We show how to exploit transient materialization of common subexpressions to reduce the cost of view maintenance plans. Sharing of subexpressions occurs when multiple views are being maintained, since related views may share subexpressions, and as a result the maintenance expressions may also be shared. Furthermore, sharing can occur even within the plan for maintaining a single view, if the view has common subexpressions within itself. The shared expressions could include differential expressions, as well as full expressions which are being recomputed.

2. We show how to efficiently choose additional expressions for permanent materialization to speed up maintenance of the given views. Just as the presence of views allows queries to be evaluated more efficiently, the maintenance of the given permanently materialized views can be made more efficient by the presence of additional permanently materialized views [45, 44]. That is, given a set of materialized views to be maintained, we choose additional views to materialize in order to minimize the overall view maintenance costs. The expressions chosen for permanent materialization may be used in only one view maintenance plan, or may be shared between different views' maintenance plans.

3. We show how to determine the optimal maintenance plan for each individual view, given the choice of results for transient/permanent materialization. Maintenance of a materialized view can either be done incrementally or by recomputation. Incremental view maintenance involves computing the differential (delta) of a materialized view, given the deltas of the base relations that are used to define the view, and merging it with the old value of the view. However, incremental view maintenance may not always be the best way to maintain a materialized view; when the deltas are large, the view may be best maintained by recomputing it from the updated base relations (a sketch of this policy choice appears after this list).
Our techniques determine the maintenance policy, incremental or recomputation, for each view in the given set such that the overall combination has the minimum cost.

4. We show how to make the above three choices in an integrated manner to minimize the overall cost. It is important to point out that the above three choices are highly interdependent, and must be taken in such a way that the overall cost of maintaining a set of views is minimized. Specifically, the choice of additional views must be done in conjunction with selecting the plans for maintaining the views, as discussed above. For instance, a plan that seems quite inefficient could become the best plan if some intermediate result of the plan is chosen to be materialized and maintained. We propose a framework that cleanly integrates the choice of additional views to be transiently or permanently materialized, the choice of whether each of the given set of (user-specified) views must be maintained incrementally or by recomputation, and the choice of view maintenance plans.

5. We have implemented all our algorithms, and present a performance study, using queries from the TPC-D benchmark, showing the practical benefits of our techniques.
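Regarding the incremental-versus-recomputation choice of contribution 3, the following Python sketch illustrates the policy decision. The cost model here is invented for illustration, and is far cruder than the one developed in Chapter 5:

def maintenance_costs(recompute_cost: float, delta_fraction: float,
                      incr_cost_per_full_delta: float,
                      merge_cost: float) -> dict:
    # Toy model: incremental cost is assumed to scale with delta size.
    return {
        "incremental": delta_fraction * incr_cost_per_full_delta + merge_cost,
        "recomputation": recompute_cost,
    }

def choose_policy(costs: dict) -> str:
    return min(costs, key=costs.get)

# Small deltas favour incremental maintenance ...
print(choose_policy(maintenance_costs(1000, 0.01, 5000, 20)))  # incremental
# ... large deltas favour recomputation, as noted in contribution 3.
print(choose_policy(maintenance_costs(1000, 0.50, 5000, 20)))  # recomputation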
2.1 Background
In this section, we provide a broad overview of the main issues involved in traditional query optimization, and mention some of the representative work in the area. This discussion will be kept very brief; for the details we point to the comprehensive, very readable survey by Chaudhuri [7].

Traditionally, the core applications of database systems have been online transaction processing (OLTP) environments like banking, sales, etc. The queries in such an environment are simple, involving a small number of relations, say three to five. For such simple queries, the investment in sophisticated optimization usually did not pay off in performance gains. As such, only join-order optimization, and that too in a constrained search space, was deemed effective enough. The seminal paper by Selinger et al. [51] presented a dynamic programming algorithm for searching the space of left-linear join-order plans for an optimal plan. The ideas presented in this paper formed the basis of most
optimization research and commercial development until a few years ago. However, with the growing importance of online analytical processing (OLAP) environments, which routinely involve expensive queries, more sophisticated query optimization techniques have become crucial. In order to be effective in such demanding environments, the optimizers need to look at less constrained search spaces without losing much in efficiency. They need to adapt to new operators, new implementations of these operators and their cost models, changes in cost estimation techniques, etc. This calls for extensibility in the optimizer architecture. These requirements led to the current generation of query optimizers, of which two representative optimizers are Starburst [40] and Volcano [23]. While the IBM DB2 optimizer [20] is based on Starburst, the Microsoft SQL-Server optimizer [22] is based on Volcano. The main difference between the approaches taken by the two is the manner in which alternative plans are generated. Starburst generates the plans bottom-up; that is, best plans for all expressions on k relations are computed before expressions on more than k relations are considered. On the other hand, Volcano generates the plans top-down; that is, it computes the best plans for only those expressions on k relations which are included in some expression on greater than k relations being expanded.

The need for effective optimization of large, complex queries has brought focus to the intimately related problem of statistics and cost estimation. This is because the cost-based decisions of an optimizer can only be as reliable as its estimates of the cost of the generated plans. A plan is composed of operators (e.g. select, join, sort). The cost of an operator is a function of the statistical summary of its input relations, which includes the size of the relation and, for each relevant attribute, the number of distinct values of the attribute, the distribution of these attribute values in terms of a histogram, etc. While the accuracy of these statistics is crucial (the plan cost estimate may be sensitive to these statistics), the maintenance of these statistics may be very time consuming. The problem of efficiently maintaining reasonably accurate statistics has received much attention in the literature; for the details, we refer to the paper by Poosala et al. [41].

Even if we have perfect information about the input relations, modeling the cost of the operators could still be very difficult. This is because a reasonable cost model must take into account the effect of, for example, the buffering of the relations in the database cache, access patterns of the inputs, the memory available for the operator's execution, etc. Moreover, usually the plans execute in a pipeline; that is, multiple operators may execute simultaneously. Given the system's bounded resources, like CPU and main memory, the execution of these operators may interfere, affecting the execution cost of the plan. There has been much research on cost modeling; an authoritative, very comprehensive survey by Graefe [21] provides the details of the prior work in this area.
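To illustrate the top-down, memoized style of plan generation attributed to Volcano above, here is a small Python sketch. It is ours, with a deliberately toy cost model (scan cost 10, flat join cost 100, echoing Example 1.1.1); a real optimizer would use the cardinality estimates discussed above:

from functools import lru_cache
from itertools import combinations

SCAN_COST = 10
JOIN_COST = 100

def proper_subsets(rels):
    for k in range(1, len(rels)):
        for combo in combinations(sorted(rels), k):
            yield frozenset(combo)

@lru_cache(maxsize=None)
def best_plan(rels: frozenset):
    # Top-down with memoization: a subset of relations is optimized only
    # when some larger expression demands it, and the result is cached.
    if len(rels) == 1:
        return (next(iter(rels)), SCAN_COST)
    best = None
    for left in proper_subsets(rels):
        right = rels - left
        lplan, lcost = best_plan(left)
        rplan, rcost = best_plan(right)
        cost = lcost + rcost + JOIN_COST
        if best is None or cost < best[1]:
            best = ((lplan, rplan), cost)
    return best

print(best_plan(frozenset("ABC")))  # a cheapest plan and its cost (230)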
2.2.1 Overview
Figure 2.1 gives an overview of the optimizer. Given the input query, the optimizer works in three distinct steps:
Figure 2.1: Overview of Cost-based Transformational Query Optimization

1. Generate all the semantically equivalent rewritings of the input query. In Figure 2.1, the rewritings Q1, Q2, ..., Qn of the input query Q are created by applying transformations on different parts of the query; a transformation gives an alternative, semantically equivalent way to compute the given part. For example, consider the query (A ⋈ B) ⋈ C. By join associativity, it is semantically equivalent to A ⋈ (B ⋈ C), giving A ⋈ (B ⋈ C) as a rewriting.
An issue here is how to manage the application of the transformations so as to guarantee that all rewritings of the query possible using the given set of transformations are generated, in as efficient a way as possible. For even moderately complex queries, the number of possible rewritings can be very large. So, another issue is how to efficiently generate and compactly represent the set of rewritings. This step is explained in Section 2.2.2.

2. Generate the set of executable plans for each rewriting generated in the first step. Each rewriting generated in the first step serves as a template that defines the order in which the logical operations (selects, joins, aggregates) are to be performed; how these operations are to be executed is not fixed. This step generates the possible alternative execution plans for the rewriting. For example, the rewriting A ⋈ (B ⋈ C) specifies that A is to be joined with the result of joining B with C. Suppose the join implementations supported are nested-loops join, merge join and hash join. Then, each of the two joins can be performed using any of these three implementations, giving nine possible executions of the given rewriting. In Figure 2.1, P11, P12, ... are the execution plans for the rewriting Q1. The issue here, again, is how to efficiently generate the plans and also how to compactly store the enormous space of query plans. This step is explained in Section 2.2.3.

3. Search the plan space generated in the second step for the best plan. Given the cost estimates for the different algorithms that implement the logical operations, the cost of each execution plan is estimated. The goal of this step is to find the plan with the minimum cost. Since the size of the search space is enormous for most queries, the core issue here is how to perform the search efficiently. The Volcano search algorithm is based on top-down dynamic programming (memoization) coupled with branch-and-bound. Details of the search algorithm appear in Section 2.2.4.

For clarity of understanding, we take the approach of executing one step fully before moving to the next in the rest of this chapter. This is the approach that will be extended in the later chapters. However, this may not be the case in practice; in particular, the original Volcano algorithm does not follow this execution order; Volcano's approach is discussed in Section 2.2.5. In order to emphasize the template-instance relationship between the rewritings and the execution plans, we hereafter refer to them as logical plans and physical plans respectively.
The complexity of the logical plan generation step, described below, depends on the given set of transformations; an unfortunate choice of transformations can lead to the generation of the same logical plan multiple times along different paths. Pellenkoft et al. [39] present a set of transformations that avoids this redundancy. The complete list of logical transformations used in our optimizer is given in Appendix B.

Logical Query DAG Representation

A Logical Query DAG (LQDAG) is a directed acyclic graph whose nodes can be divided into equivalence nodes and operation nodes; the equivalence nodes have only operation nodes as children, and operation nodes have only equivalence nodes as children.
Figure 2.2: Logical Query DAG for A ⋈ B ⋈ C. Commutativity not shown; every join node has another join node with inputs exchanged, below the same equivalence node.

An operation node in the LQDAG corresponds to an algebraic operation, such as join (⋈), select (σ), etc. It represents the expression defined by the operation and its inputs. An equivalence node in the LQDAG represents the equivalence class of logical expressions (rewritings) that generate the same result set, each expression being defined by a child operation node of the equivalence node, and its inputs. An important property of the LQDAG is that no two equivalence nodes correspond to the same result set. The algorithm for expansion of an input query into its LQDAG is presented later in this section.

Figure 2.2 shows the LQDAG for the query A ⋈ B ⋈ C. Note that there is exactly one equivalence node for every subset of {A, B, C}; the node represents all ways of computing the joins of the relations in that subset. Though the LQDAG in this example represents only a single query A ⋈ B ⋈ C, in general an LQDAG can represent multiple queries in a consolidated manner.
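A minimal Python sketch of these two node types may help fix ideas; it is our own illustration (names invented), anticipating the hash table described later in this section to guarantee that no expression is ever represented twice:

class OperationNode:
    # A logical operation (e.g. a join) over input equivalence nodes.
    def __init__(self, op: str, inputs: tuple):
        self.op, self.inputs = op, inputs
    def key(self):
        # An expression is identified by its root operator and the
        # identities of its input equivalence nodes.
        return (self.op, tuple(id(e) for e in self.inputs))

class EquivalenceNode:
    # All logically equivalent expressions computing one result set.
    def __init__(self):
        self.children = []   # alternative OperationNodes

class LQDAG:
    def __init__(self):
        self.expr_table = {}  # expression key -> its equivalence node

    def add_expr(self, op: str, inputs: tuple, equiv=None):
        node = OperationNode(op, inputs)
        found = self.expr_table.get(node.key())
        if found is not None:
            return found          # duplicate expression: reuse the node
        equiv = equiv or EquivalenceNode()
        equiv.children.append(node)
        self.expr_table[node.key()] = equiv
        return equiv

dag = LQDAG()
a, b = EquivalenceNode(), EquivalenceNode()
ab = dag.add_expr("join", (a, b))
assert dag.add_expr("join", (a, b)) is ab  # same expression, same node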
Logical Plan Space Generation

The given query tree is initially represented directly in the LQDAG formulation. For example, the query tree of Figure 2.3(a) for the query (A ⋈ B) ⋈ C is represented in the LQDAG formulation as shown in Figure 2.3(b). The equivalence nodes are shown as boxes, while the operation nodes are shown as circles. The initial LQDAG is then expanded by applying all possible transformations on every node of the initial LQDAG representing the given query. In the example, suppose the only transformations possible are join associativity and commutativity. Then the plans A ⋈ (B ⋈ C) and (A ⋈ C) ⋈ B, as well as several plans equivalent to these modulo commutativity, can be obtained by transformations on the initial LQDAG of Figure 2.3(b). These are represented in the LQDAG shown in Figure 2.3(c).

Figure 2.3: Logical Plan Space Generation for A ⋈ B ⋈ C

Procedure ExpandDAG, presented in Figure 2.4, expands the input query's LQDAG (as in Figure 2.3(b)) to include all possible logical plans for the query (as in Figure 2.3(c)) in one pass, that is, without revisiting any node. The procedure applies the transformations to the nodes in a bottom-up topological manner; that is, all the inputs of a node are fully expanded before the node is expanded. In the process, new subexpressions are generated. Some of these subexpressions may be equivalent to expressions already in the LQDAG. Further, subexpressions of the query may be equivalent to each other, even if syntactically different. For example, suppose the query contains two subexpressions that are logically equivalent but syntactically different (e.g., (A ⋈ B) ⋈ C and A ⋈ (B ⋈ C)). Before the second subexpression is expanded, the Query DAG would contain two different equivalence nodes representing the two subexpressions. Whenever it is found that applying a transformation to an expression in one equivalence node leads to an expression in the other equivalence node (in the above example, after applying join associativity), the two equivalence nodes are deduced as representing the same result and unified, that is, replaced by a single equivalence node. The unification of the two equivalence nodes may cause the unification of their ancestors.
For example, if the query also had the subexpressions ((A ⋈ B) ⋈ C) ⋈ D and (A ⋈ (B ⋈ C)) ⋈ D, the unification above will cause the equivalence nodes containing these two subexpressions to be unified as well. Thus, the unification has a cascading effect up the LQDAG.

In order to efficiently check the presence of a logical expression in the LQDAG, a hash table is used. Recall that an expression is identified by a logical operator (called the root operator) and its input equivalence nodes; for example, the expression A ⋈ B is identified by the join operator and its two input equivalence nodes, corresponding to A and B. As such, the hash value of an expression is computed as a function of the type-id of the root operator and the ids of its input equivalence nodes.
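Continuing the toy LQDAG sketch given earlier in this section (equally our own, with simplified bookkeeping), unification and its cascading effect might look as follows; the parents map, from an equivalence node to the (parent equivalence node, operation node) pairs that use it as an input, is an assumed auxiliary structure:

def unify(dag, e1, e2, parents):
    # Merge equivalence node e2 into e1, then re-key the parent
    # expressions that used e2; a key collision there means two parent
    # equivalence nodes now represent the same result, so they are
    # unified in turn -- the cascading effect described above.
    if e1 is e2:
        return
    e1.children.extend(e2.children)
    for key, equiv in list(dag.expr_table.items()):
        if equiv is e2:               # redirect stale entries to e1
            dag.expr_table[key] = e1
    for p_equiv, op in parents.get(e2, []):
        op.inputs = tuple(e1 if x is e2 else x for x in op.inputs)
        existing = dag.expr_table.get(op.key())
        if existing is None:
            dag.expr_table[op.key()] = p_equiv
        elif existing is not p_equiv:
            unify(dag, existing, p_equiv, parents)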
A logical plan space generation algorithm is called complete iff it acts on the initial LQDAG for a query and expands it into an LQDAG containing all the logical plans possible for the query using the given set of transformations. We end this description with a proof of completeness of ExpandDAG.

Theorem 2.2.1 ExpandDAG is complete.

Proof: Let D be the LQDAG that ExpandDAG, by applying the given set of transformations as shown in the pseudocode in Figure 2.4, generates from the initial LQDAG of the query. Consider a hypothetical exhaustive procedure Complete that starts from the same initial LQDAG and repeatedly applies the given transformations, in some order, until no new plan can be generated. Suppose, for the sake of contradiction, that ExpandDAG is not complete; then some plan generated by Complete is not contained in D. Let p be the first such plan generated during the execution of Complete, say by the application of a transformation T to a subplan q of a plan generated earlier, resulting in the subplan p' = T(q) of p. Clearly, the plan q, generated by Complete before p, is contained in D by the choice of p; let e be the equivalence node in D below which q appears. Since p is the first plan not contained in D, (a) the subplan p' cannot be present below e in D; otherwise, p would be present in D as well, which is a contradiction due to the choice of p. Next, we contradict (a). When ExpandDAG visits e, it applies all the available transformations, including T, to the plans below e. Because ExpandDAG visits nodes in a bottom-up topological manner, neither e nor any of its descendants is visited later during the expansion, so q is present below e when e is visited, and applying T to q generates p' below e. Hence p' is present below e in D, contradicting (a) and completing the proof.
A logical plan specifies the logical operations performed and the order in which the relations are joined. It does not specify the actual execution in terms of the algorithms used for the different operators; for example, a join can be either a nested-loops join, a merge-join, an indexed nested-loops join or a hash-join. As such, the cost for these plans is undefined. Further, the logical plan does not take the physical properties of the results, like sort order on some attribute, into account, since results with different physical properties are logically equivalent. However, the physical properties are important since (a) they affect the execution costs of the algorithms (e.g., the merge join does not need to sort its input if it is already sorted on the appropriate attribute), and (b) they need to be taken into account when specified in the query using the ORDER BY clause. In this section, we give the details of how the physical plan space for a query is generated.

Since the physical plan space is very large, a compact representation is needed. We start with a description of the representation used in our implementation, called the Physical Query DAG. This representation is a refinement of the Logical Query DAG (LQDAG) representation for the logical plan space described in Section 2.2.2. This is followed by a description of the algorithm to generate the physical plan space in the Physical Query DAG representation, given the LQDAG for the input query.
Physical Query DAG Representation

The Physical Query DAG (PQDAG) is a refinement of the LQDAG. Given an equivalence node e in the LQDAG, and a physical property p required on the result of e, there exists an equivalence node in the PQDAG representing the set of physical plans for computing the result of e with exactly the physical property p. A physical plan in this set is identified by a child operation node of the equivalence node (called the physical plan's root operation node), and its input equivalence nodes. For contrast, we hereafter term the equivalence nodes in the LQDAG logical equivalence nodes and the equivalence nodes in the PQDAG physical equivalence nodes. Similarly, we hereafter term the operation nodes in the physical plans physical operation nodes, to disambiguate them from the logical operation nodes in the logical plans.

Figure 2.5: Physical Query DAG for A ⋈ B
The physical operation nodes can either be (a) algorithms for computing the logical operations (e.g., the algorithm merge join for the logical join operation), or (b) enforcers that enforce the required physical property (e.g., the enforcer sort to enforce the physical property sort-order on an unsorted result). Figure 2.5 illustrates the PQDAG for A ⋈ B. The outer boxes denote the logical equivalence nodes, labelled alongside with the corresponding relational expressions. The solid boxes within are the corresponding physical equivalence nodes for the respective physical properties stated alongside: one representing plans that compute A ⋈ B with no sort order, and the other representing plans to compute A ⋈ B sorted on the join attribute. The circles denote the physical operators: those within the dotted boxes are the enforcers (sort operations), while those within the dashed box are the algorithms (nested loops join and merge join) corresponding to the logical join operator as shown.
A physical equivalence node s is said to subsume a physical equivalence node t iff any plan that computes s can be used as a plan that computes t; this defines a partial order on the set of physical equivalence nodes corresponding to a given logical equivalence node. While finding the best plan for a physical equivalence node t (see Section 2.2.4), the procedure FindBestPlan not only looks at the plans below t, but also at plans below physical equivalence nodes that subsume t, and returns the overall cheapest plan. To save on expensive physical property comparisons during the search, the physical equivalence nodes corresponding to the same logical equivalence node are explicitly structured into a DAG representing the partial order. Furthering the terminology, we say that the physical equivalence node s strictly subsumes the physical equivalence node t iff s subsumes t but t does not subsume s; and that s immediately subsumes t iff s strictly subsumes t but there does not exist another distinct node r such that s strictly subsumes r and r strictly subsumes t.
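With sort orders as the physical property, for instance, subsumption reduces to a prefix test on the ordering attributes. The following Python sketch (our formulation, not code from the thesis) spells out the three predicates just defined:

def subsumes(s: tuple, t: tuple) -> bool:
    # A result sorted on s serves wherever sort order t is required,
    # provided t is a prefix of s; e.g. (A, B) subsumes (A,) and ().
    return s[:len(t)] == t

def strictly_subsumes(s: tuple, t: tuple) -> bool:
    return subsumes(s, t) and not subsumes(t, s)

def immediately_subsumes(s: tuple, t: tuple, nodes) -> bool:
    return (strictly_subsumes(s, t) and
            not any(strictly_subsumes(s, r) and strictly_subsumes(r, t)
                    for r in nodes if r != s and r != t))

orders = [(), ("A",), ("A", "B")]
assert immediately_subsumes(("A", "B"), ("A",), orders)
assert not immediately_subsumes(("A", "B"), (), orders)  # ("A",) intervenes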
Physical Plan Space Generation

The PQDAG for the input query is generated from its LQDAG using Procedure PhysDAGGen, listed in Figure 2.6. Given a subgoal (e, p), where e is an equivalence node in the LQDAG and p the desired physical property, PhysDAGGen creates a physical equivalence node corresponding to (e, p) if it does not exist already, and then populates it with the physical plans that compute e with the physical property p. Depending on the root operation node being an algorithm or an enforcer, the corresponding physical plan is called an algorithm plan or an enforcer plan respectively.

An algorithm plan is generated by taking a logical plan for e as a template and instantiating it as follows. The algorithm that forms the root of the physical plan implements the logical operation at the root of the logical plan, generating the result with the physical property p. The inputs of the algorithm are the physical equivalence nodes returned by recursive invocations of PhysDAGGen on the respective input equivalence nodes of the logical plan, with the physical properties required by the algorithm. For each enforcer that enforces the physical property p, an enforcer plan is generated with the enforcer as its root; the input of the enforcer is the physical equivalence node returned by a recursive invocation of PhysDAGGen on the same equivalence node with no required physical property.
Procedure PhysDAGGen
Input: e, an equivalence node in the Logical Query DAG
       p, the desired physical property
Output: the equivalence node in the Physical Query DAG for e with physical
        property p, populated with the corresponding plans
Begin
    if an equivalence node exists for e with property p
        return it
    create an equivalence node e' for e with property p
    for every operation node o below e
        for every algorithm a for o that guarantees property p
            create an algorithm node for a under e'
            for each input i of a
                let e_i be the i-th input of o
                let p_i be the physical property required from input i by algorithm a
                set input i of a = PhysDAGGen(e_i, p_i)
    for every enforcer f that generates property p
        create an enforcer node for f under e'
        set the input of f = PhysDAGGen(e, none)  /* "none" denotes no physical property requirement */
    return e'
End

Figure 2.6: Algorithm for Physical Query DAG Generation
In the PQDAG of Figure 2.5, the logical equivalence node for A ⋈ B is refined into the two physical equivalence nodes: one for no physical property and the other for the sort order on the join attribute. The logical join instantiated as nested loops join forms the root of the algorithm plan for the former. For the latter, the same logical join instantiated as merge join forms the root of the algorithm plan, while the sort operator forms the root of the enforcer plan. From the PQDAG shown, it is apparent that the nested loops join requires no physical property on its input relations A and B, while the merge join requires its input relations A and B sorted on the respective join attributes. The entire PQDAG is generated by invoking PhysDAGGen on the root of the input query's LQDAG, with the desired physical properties of the query.
Procedure FindBestPlan
Input: e, a physical equivalence node in the PQDAG
Output: the best plan for e
Begin
    bestEnfPlan = FindBestEnfPlan(e)
    bestAlgPlan = FindBestAlgPlan(e)
    return the cheaper of bestEnfPlan and bestAlgPlan
End

Procedure FindBestEnfPlan
Input: e, a physical equivalence node in the PQDAG
Output: the best enforcer plan for e
Begin
    if best enforcer plan for e is present  /* memoized */
        return best enforcer plan for e
    bestEnfPlan = dummy plan with infinite cost
    for each enforcer child f of e
        planCost = cost of f
        for each input equivalence node e_i of f
            inpBestPlan = FindBestAlgPlan(e_i)
            planCost = planCost + cost of inpBestPlan
        if planCost < cost of bestEnfPlan
            bestEnfPlan = plan rooted at f
    memoize bestEnfPlan as best enforcer plan for e
    return bestEnfPlan
End

Procedure FindBestAlgPlan
Input: e, a physical equivalence node in the PQDAG
Output: the best algorithm plan for e
Begin
    if best algorithm plan for e is present  /* memoized */
        return best algorithm plan for e
    bestAlgPlan = dummy plan with infinite cost
    for each algorithm child a of e
        planCost = cost of a
        for each input equivalence node e_i of a
            inpBestPlan = FindBestPlan(e_i)
            planCost = planCost + cost of inpBestPlan
        if planCost < cost of bestAlgPlan
            bestAlgPlan = plan rooted at a
    for each equivalence node e' that immediately subsumes e
        subsBestAlgPlan = FindBestAlgPlan(e')
        if cost of subsBestAlgPlan < cost of bestAlgPlan
            bestAlgPlan = subsBestAlgPlan
    memoize bestAlgPlan as best algorithm plan for e
    return bestAlgPlan
End

Figure 2.7: The Search Algorithm
sort-cum-index enforcer. The space of enforcer plans generated using the resulting enforcer set contains the best enforcer plan.

Procedure FindBestPlan, shown in Figure 2.7, finds the best plan for an equivalence node in the PQDAG. FindBestPlan calls the procedures FindBestEnfPlan and FindBestAlgPlan, which respectively find the best enforcer plan and the best algorithm plan for the node, and returns the cheaper of the two plans. FindBestEnfPlan looks at each enforcer child of the node, and constructs the best plan for that enforcer by taking the best algorithm plan for its input physical equivalence node. The cheapest of these plans is the best enforcer plan for the node. FindBestAlgPlan looks at each algorithm child of the node, and builds the best plan for that algorithm by taking the best plan for each of its input physical equivalence nodes, determined by recursive invocations of FindBestPlan. Further, it looks at the best plan for each immediately subsuming node (see Section 2.2.3), determined recursively. The cheapest of all these plans is the best algorithm plan for the node.

Observe that subsuming physical equivalence nodes are considered only while searching for the best algorithm plan (in FindBestAlgPlan) and not while searching for the best enforcer plan (in FindBestEnfPlan). This is because an enforcer plan for the subsuming physical equivalence node has a cost at least as much as the best enforcer plan for the subsumed physical equivalence node.
Branch-and-Bound Pruning. The search procedure takes an additional parameter, the cost limit, which specifies an upper limit on the cost of the plans to be considered. The cost limit for the root equivalence node is initially infinity. When a plan for a physical equivalence node with cost less than the current cost limit is found, its cost becomes the new cost limit for the future search of the best plan for that node. The cost limit is propagated down the DAG during the search, and helps prune the search space as follows. Consider the invocation of FindBestPlan on a physical equivalence node. In the call to FindBestEnfPlan, the cost limit for the input of an enforcer is the cost limit for the node minus the cost of the enforcer. On invoking FindBestPlan on the i-th input of an algorithm node child, the cost limit for the plan for the i-th input is the cost limit for the node minus the sum of the costs of the best plans for earlier inputs, as well as the local cost of computing the algorithm. The recursive plan generation occurs only as long as the cost limit is positive; when the cost limit becomes non-positive, the current plan is pruned. If all the plans for a node are pruned for the given cost limit, then the cost limit is a lower bound on the cost of the best plan for the node; this lower bound is used to prune later invocations on the node with higher cost limits. Branch-and-bound pruning is not shown in the pseudocode for FindBestPlan in Figure 2.7, for the sake of simplicity.
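A hedged Python sketch of this cost-limit propagation, for algorithm plans only (our simplification; the actual procedure also handles enforcers and memoizes the lower bounds just described):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Algorithm:
    local_cost: float
    inputs: "List[Equiv]" = field(default_factory=list)

@dataclass
class Equiv:
    algorithms: "List[Algorithm]" = field(default_factory=list)

@dataclass
class Plan:
    root: Algorithm
    inputs: "List[Plan]"
    cost: float

def best_alg_plan(equiv: Equiv, cost_limit: float) -> Optional[Plan]:
    # The limit shrinks as input costs accumulate; once exhausted,
    # the remaining alternatives cannot win and are pruned.
    best, best_cost = None, float("inf")
    for alg in equiv.algorithms:
        remaining = min(cost_limit, best_cost) - alg.local_cost
        plan_cost, inputs = alg.local_cost, []
        for inp in alg.inputs:
            if remaining <= 0:
                break                     # prune: limit exhausted
            sub = best_alg_plan(inp, remaining)
            if sub is None:
                break                     # input itself fully pruned
            inputs.append(sub)
            plan_cost += sub.cost
            remaining -= sub.cost
        else:                             # all inputs planned in budget
            if plan_cost < best_cost:
                best, best_cost = Plan(alg, inputs, plan_cost), plan_cost
    return best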
Separation of Logical/Physical Plan Space Generation and Search

Our approach in this chapter has been to assume that the three steps of (1) LQDAG generation, (2) PQDAG generation, and (3) search for the best plan are executed one after another, independently. In other words, the optimization task goes breadth-first on the graph of Figure 2.1: given the input query, first all the rewritings are generated, then all their execution plans are generated, and finally the best execution plan is identified and returned. This may not be the case in reality, where these three steps may interleave. For example, on the other extreme, the optimizer may choose to go depth-first on the graph of Figure 2.1. First one rewriting is generated, then its corresponding execution plans are generated and the best plan so far identified. Then, the next rewriting is generated, followed by its corresponding execution plans, and the best plan so far is updated if a better plan is seen. This repeats for all the successive rewritings, and finally the overall best plan is returned. This is essentially the Volcano algorithm, as described in [23]. This approach may be advantageous when the complete
space of plans is too big to fit in memory, since here the rewritings and the plans that have already been found to be suboptimal can be discarded before the end of the algorithm.
Unification of Equivalent Subexpressions

The original Volcano algorithm does not generate the unified LQDAG as explained in Section 2.2.2. Instead, the generated LQDAG may have multiple logical equivalence nodes representing the same logical expression. For example, consider a query that contains the subexpression A ⋈ B twice. Volcano does not consider the two occurrences of A (or of B) as referring to the same relation; similarly for the two occurrences of A ⋈ B. Each occurrence is considered a distinct relation; effectively, the second occurrence is treated as A' ⋈ B', where A' and B' are clones of A and B respectively. This does not alter the search space, since during execution the two accesses to A (or B) are going to be independent anyway. However, by doing so, it fails to recognise the two occurrences of A ⋈ B as the same expression. In our version of Volcano, since the equivalent subexpressions are unified (see Section 2.2.2), the subexpression is going to be optimized only once and the best plan reused for both of its occurrences. In general, the common subexpression may be rather complex, and unification may reduce the optimization effort significantly.
Separation of the Enforcer and Algorithm Plan Spaces

Our version of Volcano memoizes the best algorithm plan as well as the best enforcer plan for each physical equivalence node. On the other hand, Volcano stores only the overall best plan. While searching for the best plan for, say, A ⋈ B sorted on some attribute, Volcano explores the enforcer plan with the sort operation taking the unsorted A ⋈ B result as input. In order to determine the best plan for this input node, in the naive case, it visits the equivalence nodes that subsume the same. In particular, it explores the equivalence node for the sort order as well, landing back where it had started, and thus gets into an infinite recursion. Volcano tries to avoid this by passing down an extra parameter, the excluding physical property, to the search function. In the above example, the excluding physical property is the sort order, and it helps the recursive call that determines the best plan for the unsorted result figure out that the equivalence node with the sort order should not be explored while looking for the best plan.
However, this approach has its own problems. The best plan thus found for the equivalence node with no sort order is subject to the exclusion of the said physical property, and may not be its overall best plan; in particular, the merge-join plan present below the equivalence node for the sort order may be the overall best plan for the unsorted result, but has not been considered above. Thus, at each equivalence node, the optimizer needs to memoize the best plan for each excluded physical property apart from the overall best plan, which amounts to a significant amount of book-keeping.

Our version obviates the above problem, as discussed earlier in Section 2.2.4, by observing that one need only consider algorithm plans as input to an enforcer while looking for the best enforcer plan. While searching for the best plan for A ⋈ B sorted on some attribute, the only enforcer plan considered consists of the sort operation over the best algorithm plan for the unsorted result. In general, it can be seen that in Figure 2.7, none of FindBestPlan, FindBestEnfPlan and FindBestAlgPlan is ever invoked more than once on the same equivalence node, thus proving that the recursion always terminates.
2.3 Summary
In this chapter, we first gave a brief overview of the issues in traditional query optimization, and pointed out the important research and development work in this area. We then gave a detailed description of the design of our version of the Volcano query optimizer, which provides the basic framework for the work presented in this thesis. Later chapters of this thesis modify this basic optimizer, enabling it to perform multi-query optimization, query result cache management, and materialized view selection and maintenance respectively.

For the sake of simplicity, the later chapters restrict their discussion to the logical plan space. The term Query DAG hereafter refers to the Logical Query DAG, unless explicitly stated otherwise; the descriptions, however, can be easily extended in terms of the physical plan space.
Chapter 3 Multi-Query Optimization

Example 3.0.1 Consider the two queries (R ⋈ S) ⋈ P and (R ⋈ S) ⋈ Q. Suppose the best plans for the queries, considered individually, are (R ⋈ P) ⋈ S and (R ⋈ Q) ⋈ S respectively; these plans do not have any common sub-expressions, hence they cannot share computation. However, if we choose the alternative plan (R ⋈ S) ⋈ P for the first query and (R ⋈ S) ⋈ Q for the second, then R ⋈ S is a common sub-expression and can be computed once and used in both queries. The alternative with sharing of R ⋈ S may be the globally optimal choice. On the other hand, blindly using a common sub-expression may not always lead to a globally optimal strategy. For example, there may be cases where the cost of joining the expression R ⋈ S with P is very large compared to the cost of the plan (R ⋈ P) ⋈ S; in such cases it may make no sense to reuse R ⋈ S even if it were available.

(Joint work with S. Seshadri, S. Sudarshan and Siddhesh Bhobe. Parts of this chapter appeared in SIGMOD 2000 [47].)
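The trade-off in Example 3.0.1 is easy to quantify with made-up numbers; the following Python fragment is purely illustrative (none of these costs come from the thesis):

# Hypothetical plan costs for the two queries of Example 3.0.1.
local_q1 = 100   # (R join P) join S : locally optimal for query 1
local_q2 = 100   # (R join Q) join S : locally optimal for query 2
alt_q1 = 110     # (R join S) join P : worse alone, but shares R join S
alt_q2 = 110     # (R join S) join Q
rs_cost, rs_mat, rs_reuse = 60, 5, 5  # compute/materialize/reuse R join S

independent = local_q1 + local_q2                    # 200
shared = (rs_cost + rs_mat                           # compute R join S once
          + (alt_q1 - rs_cost) + (alt_q2 - rs_cost)  # remaining join work
          + 2 * rs_reuse)                            # read it back twice
print(independent, shared)  # 200 vs 175: sharing wins here, but need not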
Example 3.0.1 illustrates that the job of a multi-query optimizer, over and above that of an ordinary query optimizer, is to (i) recognize the possibilities of shared computation, and (ii) modify the optimizer search strategy to explicitly account for shared computation and find a globally optimal plan. While there has been work on multi-query optimization in the past ([54, 56, 53, 13, 38]), prior work has concentrated primarily on exhaustive algorithms. Other work has concentrated on finding common subexpressions as a post-phase to query optimization [18, 59], but this gives limited scope for cost improvement. The search space for multi-query optimization is doubly exponential in the size of the queries, and exhaustive strategies are therefore impractical; as a result, multi-query optimization was hitherto considered too expensive to be useful. We show how to make multi-query optimization practical, by developing novel heuristic algorithms, and presenting a performance study that demonstrates their practical benefits.

We have decomposed our approach into two distinct tasks: (i) recognize possibilities of shared computation (thus essentially setting up the search space by identifying common subexpressions), and (ii) modify the optimizer search strategy to explicitly account for shared computation and find a globally optimal plan. Both of the above tasks are important and crucial for a multi-query optimizer, but are orthogonal. In other words, the details of the search strategy do not depend on how aggressively we identify common sub-expressions (though the efficacy of the approach does).

The rest of this chapter is structured as follows. We describe how to set up the search space for multi-query optimization in Section 3.1. Next, we present three heuristics for finding the globally optimal plan. Two of the heuristics we present, Volcano-SH and Volcano-RU, are lightweight modifications of the Volcano optimization algorithm, and are described in Section 3.2. The third heuristic is a greedy strategy which iteratively picks the subexpression that gives the maximum benefit (reduction in cost) if it is materialized and reused; this strategy is covered in Section 3.3. Our extensions to create indexes on intermediate relations and to handle nested queries are discussed in Section 3.5. We describe the results of our performance study in Section 3.6. Section 3.7 discusses related work, and Section 3.8 concludes the chapter.
As an example, suppose the two subexpressions $\sigma_{A<5}(E)$ and $\sigma_{A<10}(E)$ appear in the set of queries being optimized. The result of $\sigma_{A<5}(E)$ can be obtained from the result of $\sigma_{A<10}(E)$ by an additional selection, so we add a derivation of the former from the latter by introducing a new operation node $\sigma_{A<5}$ above the equivalence node for $\sigma_{A<10}(E)$. Similarly, given the selections $\sigma_{A=5}(E)$ and $\sigma_{A=10}(E)$, we add a new node $\sigma_{A=5 \vee A=10}(E)$, and derive both selections from it; the new node represents the sharing of accesses between the two selections. In general, given a number of selections on an expression $E$, we create a single new node representing the disjunction of all the selection conditions. Similar derivations also help with aggregations. For example, if we have $\gamma_{dno;\,sum(sal)}(E)$ and $\gamma_{age;\,sum(sal)}(E)$, we add derivations of both from $\gamma_{dno,age;\,sum(sal)}(E)$ by further groupbys on $dno$ and $age$.
The use of subsumption derivations has been proposed earlier [45, 54, 59]. Integrating such options into the Query DAG, as we do, clearly separates the space of alternative plans (represented by the Query DAG) from the optimization algorithms; thereby, it simplifies our optimization algorithms, allowing them to avoid dealing explicitly with such derivations.

Physical Properties. Our search algorithms can be easily understood on the Logical Query DAG representation (without physical properties), although they actually work on Physical Query DAGs (ref. Section 2.2.3). For brevity, therefore, we do not explicitly consider physical properties further.
The only change from the algorithm presented in Chapter 2 is as follows. When computing the cost of an operation node $o$, if an input equivalence node $e$ is materialized (i.e., $e$ is in the set $M$ of nodes chosen for materialization), the minimum of the computation cost $cost(e)$ and the reuse cost $reusecost(e)$ is used when computing $cost(o)$. Thus, we use the following expression instead:

$cost(o)$ = cost of executing $o$ + $\sum_{e_i \in children(o)} C(e_i)$

where

$C(e_i) = cost(e_i)$ if $e_i \notin M$, and
$C(e_i) = \min(cost(e_i), reusecost(e_i))$ if $e_i \in M$.

Let $numuses(e)$ denote the number of times the result of node $e$ is used in the plan, and $matcost(e)$ the cost of materializing (writing out) that result. We decide to materialize $e$ if

$cost(e) + matcost(e) + reusecost(e) \times (numuses(e) - 1) < cost(e) \times numuses(e)$

The left hand side of this inequality gives the cost of materializing the result when first computed, and using the materialized result thereafter; the right hand side gives the cost of the alternative wherein the result is not materialized but is recomputed on every use. The above test can be simplified to

$reusecost(e) + matcost(e)/(numuses(e) - 1) < cost(e)$    (3.1)
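The test is simple enough to state directly in code. The following is a minimal sketch, assuming the four quantities above are available as plain numbers; the function name and signature are illustrative, not part of the optimizer's actual interface.

# Hypothetical helper illustrating the materialization test of Volcano-SH;
# cost, matcost, reusecost and numuses are the quantities defined above.
def worth_materializing(cost, matcost, reusecost, numuses):
    if numuses <= 1:
        return False  # a result used only once never pays for materialization
    # materialize: compute once, write it out, read it back on later uses
    materialize = cost + matcost + reusecost * (numuses - 1)
    # do not materialize: recompute the result on every one of its uses
    recompute = cost * numuses
    return materialize < recompute
    # equivalently: reusecost + matcost / (numuses - 1) < cost   (test 3.1)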
There is a complication in the above approach: the cost formulae depend on how many times a node is used, and this in turn depends on which other nodes have been materialized. For instance, suppose node $e$ is used by node $e_1$, and node $e_1$ itself is used twice in the plan. If $e_1$ is materialized, it is computed only once, and $e$ is used only once through $e_1$; if $e_1$ is not materialized, $e$ is used twice through $e_1$. In general, $numuses(e)$ depends on which ancestors of $e$ in the Volcano best plan are materialized, and $cost(e)$ depends on which descendants have been materialized. Specifically, $numuses(e)$ can be computed recursively based on the number of uses of the parents of $e$: $numuses(root) = 1$, while for all other nodes, $numuses(e) = \sum_{p \in parents(e)} U(p)$, where $U(p) = numuses(p)$ if $p$ is not materialized, and $U(p) = 1$ if $p$ is materialized. Thus, computing $numuses(e)$ requires us to know the materialization status of the parents of $e$; on the other hand, as we have seen earlier, $cost(e)$ depends on what descendants have been materialized.

A naive exhaustive strategy to decide what nodes in the Volcano best plan to materialize is to consider each subset of the nodes in the best plan, and compute the cost of the best plan given that all nodes in this subset are materialized at their first computation; the subset giving the minimum cost is selected for actual materialization. Unfortunately, this strategy is exponential in the number of nodes in the Volcano best plan, and is therefore very expensive; we require cheaper heuristics. To avoid enumerating all sets as above, the Volcano-SH algorithm, which is shown in Figure 3.1, traverses the tree bottom-up. As each equivalence node $e$ is encountered in the traversal, Volcano-SH decides whether or not to materialize $e$; when making the materialization decision for a node, the materialization decisions for all its descendants are already known. When Volcano-SH is examining a node $e$, let $numuses^-(e)$ be an underestimate of the number of uses of $e$. Such an underestimate can be obtained by simply counting the number of ancestors of $e$ in the Volcano best plan. We use this underestimate in our cost formulae, to make the materialization decisions described next.
Procedure VOLCANO-SH
Input: Consolidated Volcano best plan $P$ for the virtual root of the DAG
Output: Set of nodes to materialize $M$, and the corresponding best plan
Global variable: $M$, the set of nodes chosen to be materialized
Begin
    Perform prepass on $P$ to introduce subsumption derivations
    Let $M$ = {}; call COMPUTEMATSET($r$), where $r$ is the root equivalence node of $P$
    Undo all subsumption derivations on $P$ where the subsumption node is not chosen to be materialized
    return ($M$, $P$)
End

Procedure COMPUTEMATSET
Input: $e$, equivalence node
Output: Cost of computing $e$
Global variable: $M$, the set of nodes chosen to be materialized
Begin
    If $cost(e)$ is already memoized, return $cost(e)$
    Let operation node $o$ be the child of $e$ in $P$
    For each input equivalence node $e_i$ of $o$
        Let $c_i$ = COMPUTEMATSET($e_i$)    // returns computation cost of $e_i$
        If $e_i$ is materialized, let $c_i = \min(reusecost(e_i), c_i)$
    Compute $cost(e)$ = cost of operation $o$ + $\sum_i c_i$
    If ($reusecost(e) + matcost(e)/(numuses^-(e) - 1) < cost(e)$)
        If $e$ was not introduced by a subsumption derivation
            add $e$ to $M$    // Decide to materialize $e$
        else if $matcost(e)$ is less than the savings to the parents of $e$ due to introducing materialized $e$
            add $e$ to $M$    // Decide to materialize $e$
    Memoize and return $cost(e)$
End

Figure 3.1: The Volcano-SH Algorithm
Based on the above, Volcano-SH makes the decision on materialization as follows: node $e$ is materialized if

$reusecost(e) + matcost(e)/(numuses^-(e) - 1) < cost(e)$

that is, test (3.1) is applied with $numuses^-(e)$ in place of $numuses(e)$. Using this lower bound guarantees that if we decide to materialize a node, the materialization will result in cost savings.

The final step of Volcano-SH is to factor in the cost of computing and materializing all the nodes that were chosen to be materialized. Thus, to the cost of the pseudoroot computed as above, we add the cost of computing and materializing each node in $M$, i.e., $\sum_{m \in M} (cost(m) + matcost(m))$.
Let us now return to the first step of Volcano-SH. Note that the basic Volcano optimization algorithm will not exploit subsumption derivations, such as deriving $\sigma_{A<5}(E)$ by using $\sigma_{A<5}(\sigma_{A<10}(E))$, since the cost of the latter will be more than that of the former, and thus the derivation will not be locally optimal. To consider such plans, we perform a prepass, checking for subsumption amongst nodes in the plan produced by the basic Volcano optimization algorithm. If a subsumption derivation is applicable, we replace the original derivation by the subsumption derivation. At the end of Volcano-SH, if the shared subexpression is not chosen to be materialized, we replace the subsumption derivation by the original expression. In the above example, in the prepass we replace $\sigma_{A<5}(E)$ by $\sigma_{A<5}(\sigma_{A<10}(E))$; if $\sigma_{A<10}(E)$ is not chosen to be materialized, we replace $\sigma_{A<5}(\sigma_{A<10}(E))$ back by $\sigma_{A<5}(E)$.
The algorithm of [59] also finds best plans and then chooses which shared subexpressions to materialize. Unlike Volcano-SH, however, it does not factor earlier materialization choices into the cost of computation.
In the individually optimal plans of two queries, there may be no common sub-expressions, and hence, if each query is optimized in isolation, no sharing is possible.
Procedure VOLCANO-RU
Input: Expanded DAG on queries $Q_1, \ldots, Q_k$ (including subsumption derivations)
Output: Set of nodes to materialize $M$, and the corresponding best plan $P$
Begin
    $R$ = {}    // Set of potentially materialized nodes
    For each equivalence node $e$, set $numuses(e) = 0$
    For $i$ = 1 to $k$
        Compute $P_i$, the best plan for $Q_i$, using Volcano, assuming nodes in $R$ are materialized
        For every equivalence node $e$ in $P_i$, set $numuses(e) = numuses(e) + 1$
            If ($e$ would pay for materialization if used once more)    // Worth materializing if used once more
                add $e$ to the set $R$
    Combine $P_1, \ldots, P_k$ to get a single DAG-structured plan $P$
    (M, P) = VOLCANO-SH($P$)    // Volcano-SH makes the final materialization decision
End

Figure 3.2: The Volcano-RU Algorithm
However, if an expression that is already used in the best plan for the first query can be shared, then, once the sharing is taken into account, a plan using that expression may be found to be the best for the second query, even though it is not locally optimal.
The intuition behind the Volcano-RU algorithm is therefore as follows. Given a batch of queries, Volcano-RU optimizes them in sequence, keeping track of what plans have already been chosen for earlier queries, and considering the possibility of reusing parts of those plans. The resultant plan depends on the ordering chosen for the queries; we return to this issue after discussing the Volcano-RU algorithm.

The pseudocode for the Volcano-RU algorithm is shown in Figure 3.2. Let $Q_1, Q_2, \ldots, Q_k$ be the queries to be optimized together (and thus under the same pseudo-root of the DAG). The Volcano-RU algorithm optimizes them in the sequence $Q_1, Q_2, \ldots, Q_k$. After optimizing each query $Q_i$, we note the equivalence nodes in the DAG that are part of the best plan $P_i$ as candidates for potential reuse later. We maintain counts of the number of uses of these nodes. We also check if each node would be worth materializing if it were used one more time; if so, we add the node to $R$, and when optimizing the next query, we will assume it to be available materialized. Thus, in our example earlier in this section, after finding the best plan for the first query, we note its shared sub-expression as worth materializing, and assume
it to be materialized when optimizing the second query.

After optimizing all the individual queries, the second phase of Volcano-RU executes Volcano-SH on the overall best plan found as above, to further detect and exploit common subexpressions. This step is essential since the earlier phase of Volcano-RU does not consider the possibility of sharing common subexpressions within a single query: equivalence nodes are added to $R$ only after optimizing an entire query. Adding a node to $R$ in our algorithm does not imply it will get reused and therefore materialized. Instead, Volcano-SH makes the final decision on what nodes to materialize. The difference from directly applying Volcano-SH to the result of Volcano optimization is that the plan that is given to Volcano-SH has been chosen taking sharing of parts of earlier queries into account, unlike the Volcano plan.

A related implementation issue is in the caching of best plans in the DAG. When optimizing query $Q_i$, we cache best plans in the nodes of the DAG that are descendants of $Q_i$. If, when optimizing a later query, we find a cached best plan for a node that is not in $P_i$ (the plan chosen for query $Q_i$) for some earlier $i$, we must recompute the best plan for the node; for, the set of potentially materialized nodes $R$ may have changed, leading to a different best plan. Therefore, we note with each cached best plan which query was being optimized when the plan was computed, and recompute the plan as required above.

Note that the result of Volcano-RU depends on the order in which the queries are considered. In our implementation we consider the queries in the order in which they are given, as well as in the reverse of that order, and pick the cheaper of the two resultant plans. Note that the DAG is still constructed only once, so the extra cost of considering the two orders is relatively quite small. Considering further (possibly random) orderings is possible, but the optimization time would increase further.
3.3 The Greedy Algorithm

The two algorithms presented above choose a best plan first and then decide on materialization. In contrast, the greedy algorithm picks a set of nodes to be materialized and then finds the optimal plan given that the nodes in this set are materialized. This is then repeated on different sets of nodes to find the best (or a good) set of nodes to be materialized.
Before coming to the greedy algorithm, we present some definitions, and an exhaustive algorithm. As before, we shall assume there is a virtual root node for the DAG; this node has as input a no-op logical operator whose inputs are the queries $Q_1, \ldots, Q_k$, and we denote the root by $Q$. For a set of nodes $S$, let $bestcost(Q, S)$ denote the cost of the optimal plan for $Q$ given that the nodes in $S$ are to be materialized (this cost includes the cost of computing and materializing the nodes in $S$). As described for the Volcano-SH algorithm, the basic Volcano optimization algorithm, with an appropriate definition of the cost for the nodes in $S$, can be used to find $bestcost(Q, S)$.

To motivate our greedy heuristic, we first describe a simple exhaustive algorithm. The exhaustive algorithm iterates over each subset $S$ of the set of nodes in the DAG, and chooses the subset with the minimum value for $bestcost(Q, S)$ as the set of nodes materialized in the globally optimal plan for $Q$.
It is easy to see that the exhaustive algorithm is doubly exponential in the size of the initial Query DAG and is therefore impractical. In Figure 3.3 we outline a greedy heuristic that attempts to approximate the optimal set $X$ by constructing it one node at a time. The algorithm iteratively picks nodes to materialize: at each iteration, the node that gives the maximum reduction in the cost if it is materialized is chosen to be added to $X$. (A brief sketch of this loop in code appears after the list of optimizations below.)

The greedy algorithm as described above can be very expensive, due to the large number of nodes in the candidate set and the large number of times the function $bestcost$ is called. We now present three important and novel optimizations to the greedy algorithm which make it efficient and practical.

1. The first optimization is based on the observation that the nodes materialized in the globally optimal plan are obviously a subset of the ones that are shared in some plan for the query.
Figure 3.3: The Greedy Algorithm

Therefore, it is sufficient to initialize the candidate set in Figure 3.3 with the nodes that are shared in some
plan for the query. We call such nodes sharable nodes. For instance, in the expanded Query DAG for the queries of Example 3.0.1, the node for the common sub-expression is sharable, while the nodes used by only one of the queries are not. We present an efficient algorithm for finding sharable nodes in Section 3.3.1.

2. The second optimization is based on the observation that there are many calls to $bestcost$ at line L1 of Figure 3.3, with different parameters. A simple option is to process each call independent of other calls. However, observe that the symmetric difference³ in the sets passed as parameters to successive calls to $bestcost$ is very small: successive calls take parameters of the form $bestcost(Q, X \cup \{x\})$, where only $x$ varies. It makes sense for a call to leverage the work done by a previous call. We describe a novel incremental cost update algorithm, in Section 3.3.2, that maintains the state of the optimization across calls to $bestcost$.
3. The third optimization, which we call the monotonicity heuristic, avoids having to invoke $bestcost(Q, X \cup \{x\})$ for every candidate node $x$ in each iteration; it is described later in this section.

3. The symmetric difference of two sets consists of the elements that are in one of the two but not in both; formally, the symmetric difference of sets $A$ and $B$ is $(A - B) \cup (B - A)$, where $-$ denotes set difference.
3.3.1 Sharability
In this subsection, we outline how to detect whether an equivalence node can be shared in some plan. The plan tree of a plan is the tree obtained from the DAG-structured plan by replicating all shared nodes of the plan, to completely eliminate sharing. The degree of sharing of a logical equivalence node in an evaluation plan is the number of times it occurs in the plan tree of the plan. The degree of sharing of a logical equivalence node in an expanded DAG is the maximum of its degree of sharing amongst all evaluation plans represented by the DAG. A logical equivalence node is sharable if its degree of sharing in the expanded DAG is greater than one.

We now present a simple algorithm to compute the degree of sharing of each node, and thereby detect whether a node is sharable. The sub-DAG of a node $v$ consists of the nodes below $v$, along with the edges between these nodes that are in the original DAG. For each node $v$ of the DAG, and each equivalence node $e$ in the sub-DAG rooted at $v$, let $deg(v, e)$ represent the degree of sharing of $e$ in the sub-DAG rooted at $v$. Clearly, for all equivalence nodes $e$, $deg(e, e)$ is 1. For a given node $v$, all other values of $deg(v, \cdot)$ can be computed given the values of $deg(c, \cdot)$ for all children $c$ of $v$, as follows:

if $v$ is an operation node, $deg(v, e) = \sum_{c \in children(v)} deg(c, e)$; and
if $v$ is an equivalence node, $deg(v, e) = \max_{c \in children(v)} deg(c, e)$.

The degree of sharing of an equivalence node $e$ in the overall DAG is given by $deg(r, e)$, where $r$ is the root of the DAG. Space can be minimized in the above by computing the rows $deg(v, \cdot)$ in a bottom-up traversal, and discarding the row of a node once all its parents have been processed. In a reasonable implementation of the above algorithm, the time taken to compute the row for a node $v$ is proportional to the number of children of $v$ times the number of equivalence nodes with a nonzero entry in the row. However, the rows are typically fairly sparse, since the DAG is typically short and fat: as the number of queries grows, the height of the DAG may not increase, but it becomes wider. Thus, $deg(v, e)$ is zero for most pairs, making this sharability computation algorithm fairly efficient in practice.
In fact, for the queries we considered in our performance study (Section 3.6), the computation took at most a few tens of milliseconds.
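The computation can be sketched compactly. The dictionary-based DAG encoding below (a children map plus an is_equivalence predicate) is an assumption made for illustration, not the prototype's actual data structures.

def degree_of_sharing(root, children, is_equivalence):
    """Return a map e -> deg(root, e) for equivalence nodes e under root."""
    memo = {}
    def deg(v):
        if v in memo:
            return memo[v]
        row = {}
        for c in children.get(v, ()):
            for e, k in deg(c).items():
                if is_equivalence(v):
                    row[e] = max(row.get(e, 0), k)  # alternative plans: max
                else:
                    row[e] = row.get(e, 0) + k      # operation inputs: sum
        if is_equivalence(v):
            row[v] = 1                              # deg(e, e) = 1
        memo[v] = row
        return row
    return deg(root)

# A node e is sharable exactly when degree_of_sharing(root, ...)[e] > 1.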
3.3.2 Incremental Cost Update

The sets with which $bestcost$ is called successively at line L1 of Figure 3.3 are closely related, with their (symmetric) difference being very small. For, line L1 finds the node with the maximum benefit, which is implemented by calling $bestcost(Q, X \cup \{x\})$ for different values of $x$; thus, the second parameter to $bestcost$ changes only by dropping one node and adding another. We now present an incremental cost update algorithm that exploits the results of earlier cost computations to incrementally compute the new plan.

Figure 3.4 outlines our incremental cost update algorithm. Let $X_{old}$ be the set of nodes shared at a given point of time, i.e., in the previous call to $bestcost$. The incremental cost update algorithm maintains the cost of computing every equivalence node, given that all nodes in $X_{old}$ are shared, and no other node is shared. Let $X_{new}$ be the new set of nodes that are shared, i.e., in the next call to $bestcost$. The algorithm starts from the nodes that have changed in going from $X_{old}$ to $X_{new}$ (i.e., the nodes in $X_{old} - X_{new}$ and in $X_{new} - X_{old}$) and propagates the change in cost for these nodes upwards to all their parents; these in turn propagate any changes in cost to their parents if their own cost changed, and so on, until there is no further change in cost. Finally, to get the total cost, we add the cost of computing and materializing all the nodes in $X_{new}$.

If we perform this propagation in an arbitrary order, then in the worst case we could propagate the change in cost through a node multiple times (for example, once from a node and once more from an ancestor of that node). To avoid such repeated propagation, cost propagation uses topological numbers for the nodes of the DAG. During DAG generation, the DAG
Procedure UPDATECOST
Input: $X_{old}$, previous set of shared nodes, with the corresponding best plan costs; $X_{new}$, new set of shared nodes
Output: Best plan cost corresponding to $X_{new}$
Begin
    // PropHeap is a priority heap (initially empty), in which
    // equivalence nodes are ordered by their topological sort number
    add the nodes in $(X_{old} - X_{new}) \cup (X_{new} - X_{old})$ to PropHeap
    while (PropHeap is not empty)
        $e$ = equivalence node with minimum topological sort number in PropHeap
        Remove $e$ from PropHeap
        oldCost = old value of $cost(e)$
        $cost(e)$ = $\min \{ cost(o) : o$ a child operation node of $e \}$
        if ($cost(e) \neq$ oldCost) or ($e \in X_{old} - X_{new}$) or ($e \in X_{new} - X_{old}$)
            for every parent operation node $o$ of $e$
                $cost(o)$ = cost of executing $o$ + $\sum_{e_i \in children(o)} C(e_i)$,
                    where $C(e_i) = cost(e_i)$ if $e_i \notin X_{new}$, and $\min(reusecost(e_i), cost(e_i))$ if $e_i \in X_{new}$
                add $o$'s parent equivalence node to PropHeap, if not already present
    TotalCost = $cost(root) + \sum_{e \in X_{new}} (cost(e) + matcost(e))$
End

Figure 3.4: The Incremental Cost Update Algorithm
is sorted topologically, such that a descendant always comes before an ancestor in the sort order, and the nodes are numbered in this order. As shown in Figure 3.4, cost propagation is performed in the topological number ordering using PropHeap, a priority heap built on the topological numbers. The heap is used to efficiently find the node with the minimum topological sort number at each step. In our implementation, we additionally take care of physical property subsumption; details of how to perform incremental cost update on Physical Query DAGs with physical property subsumption are given in the appendix of this chapter.
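The core of the propagation loop can be sketched as follows. The DAG encoding (cost and parents dictionaries, unique topological numbers, and a recompute callback that re-minimizes over a node's operation plans) is hypothetical, chosen only for brevity.

import heapq

def propagate(changed, cost, parents, topo, recompute):
    """Propagate cost changes upwards in topological order."""
    heap = [(topo[n], id(n), n) for n in changed]   # changed should be a set
    heapq.heapify(heap)
    queued = set(changed)
    while heap:
        _, _, node = heapq.heappop(heap)   # smallest topological number first,
        queued.discard(node)               # so each node is processed once
        old = cost[node]
        cost[node] = recompute(node)       # min over the node's operation plans
        if cost[node] != old or node in changed:
            for parent in parents[node]:   # ancestors have larger numbers and
                if parent not in queued:   # are therefore processed later
                    heapq.heappush(heap, (topo[parent], id(parent), parent))
                    queued.add(parent)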
3.3.3 The Monotonicity Heuristic

Invoking the cost update function once per candidate node in every iteration can still be expensive in many instances. We now outline how to determine the node with the smallest value of $bestcost(Q, X \cup \{x\})$ more efficiently, using the monotonicity heuristic. Let us define $benefit(x, X)$ as $bestcost(Q, X) - bestcost(Q, X \cup \{x\})$. Notice that minimizing $bestcost(Q, X \cup \{x\})$ in line L1 corresponds to maximizing the benefit as defined here. Intuitively, the benefit of a node is monotonic if it never increases as more nodes get materialized; more formally, the benefit is monotonic if $benefit(x, Y) \leq benefit(x, X)$ whenever $X \subseteq Y$.

Suppose the benefit is monotonic. We then maintain a heap of candidate nodes ordered on upper bounds of their benefits.⁴ The initial upper bound on the benefit of a node is based on the notion of the maximum degree of sharing of the node (which we described earlier): the initial upper bound is just the cost of evaluating the node (without any materializations) times the maximum degree of sharing. The heap is now used to efficiently find the node with the maximum benefit: the node with the maximum upper bound is picked, its current benefit is recomputed, and the heap is reordered. If the node remains at the top, it is deleted from the heap and added to $X$. Assuming the monotonicity property holds, the other values in the heap are upper bounds, and therefore the node above is indeed the node with the maximum real benefit.

4. This cost heap is not to be confused with the heap on topological numbering used earlier.

If the monotonicity property does not hold, the node with the maximum current benefit may not be at the top of the heap, and we may pick a node other than the one with the greatest benefit. Our experiments in Section 3.6 demonstrate that the above procedure greatly speeds up the greedy algorithm; further, for all the queries we experimented with, the results were exactly the same even if the monotonicity heuristic was not used.
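The selection step itself can be sketched with a standard max-heap of stale upper bounds. Here current_benefit stands in for an (incremental) bestcost evaluation; both the function and the dictionary of initial bounds are assumptions of this sketch.

import heapq

def pick_most_beneficial(upper_bounds, current_benefit):
    """Return (node, benefit) with maximum benefit, or None if none helps."""
    heap = [(-ub, id(n), n) for n, ub in upper_bounds.items()]
    heapq.heapify(heap)
    while heap:
        _, _, node = heapq.heappop(heap)
        benefit = current_benefit(node)           # refresh the stale bound
        if heap and -heap[0][0] > benefit:
            # another node's upper bound may still beat this one: re-insert
            heapq.heappush(heap, (-benefit, id(node), node))
        elif benefit > 0:
            return node, benefit   # under monotonicity, provably the maximum
        else:
            return None            # no remaining node has positive benefit
    return None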
Let $cost(E)$ denote the cost of the best plan for computing the result of a physical equivalence node $E$; if $E$ is materialized, then this cost includes the cost of materializing the result. The first step is to compute the best algorithm plan for each physical equivalence node: we compute the costs of the corresponding algorithm plans and pick the cheapest one. An example scenario is shown in Figure 3.5(a), where $E_1$, $E_2$ and $E_3$ are nodes representing the same logical equivalence node with different physical properties; among these, $E_1$ and $E_3$ are materialized.

Besides its algorithm plans, a node also has enforcer plans: the result of the node can be obtained from another node of the group by enforcing the required physical property (for example, by sorting), and, further, if a node is materialized, its result can be obtained by simply reading the materialized result; in either case, the cost of obtaining the source node's result must be added. An obvious approach is to first compute the best enforcer plan for each node, compare the enforcer plans, and select the cheapest one; comparing the best enforcer plan with the best algorithm plan for the node determined earlier will then give the node's best plan. We illustrate this approach by an example. Consider again the scenario of Figure 3.5(a). $E_1$ also has two enforcer plans: the first computes $E_3$'s result and then derives $E_1$'s result from it, while the second, cheaper one derives $E_1$'s result from $E_3$'s result read as materialized. Comparing the costs, the second enforcer plan is chosen as the best plan for computing $E_1$. Similarly, $E_3$'s best algorithm plan is compared against its two enforcer plans, which derive its result from $E_1$ and from the materialized result of $E_1$ respectively; breaking the tie among the two enforcer plans arbitrarily, the second enforcer plan is chosen as the best plan. Thus, the best plan for $E_1$ reads the materialized result of $E_3$, while the best plan for $E_3$ reads the materialized result of $E_1$: the two plans are mutually dependent, and neither result would ever actually get computed.

The above example shows that while the approach described above works for unmaterialized nodes, it may not work for materialized nodes. We now give the details of our approach for handling materialized nodes.
Figure 3.5: Example Showing Cost Propagation through Physical Equivalence Nodes
Figure 3.5(b) shows the result of the above transformation on our running example. Next, for each materialized node, we find the shortest path to it, assuming none of the materialized nodes in the group are materialized; this shortest path represents the cheapest way of computing the node's result from scratch. To keep track of this, we add an edge to each materialized node from a virtual source node, representing the shortest path found as above; the edge is weighted by the sum of the weights of the edges in the shortest path.

Now, we consider the subgraph induced by the virtual source node and the materialized nodes. Each edge into a materialized node from the virtual source corresponds to computing the node's result from scratch along the shortest path, and materializing it; otherwise, if the edge is from some other materialized node, it corresponds to reading that node's materialized result, deriving the required result from it, and materializing it. We need to pick a set of edges, one into each materialized node, without generating any cycles, such that the sum of the costs on the edges (the total cost) is minimized. This corresponds to a minimum-cost directed spanning tree of the graph, which can be computed efficiently using Edmonds' algorithm [17]. This spanning tree gives us, after expanding out any edges included in the tree into the corresponding paths, the best plan for each materialized node, taking into consideration the other materialized nodes. For our running example, Figure 3.5(c) shows the subgraph induced by the materialized nodes $E_1$ and $E_3$, and the resulting spanning tree is shown in Figure 3.5(e); it determines the best plan for each materialized node, with the other materialized nodes available as materialized.
Note that the solution is heuristic to the extent that some of the materialized nodes may not be needed in the overall best plan, and if they were eliminated, some other minimum spanning tree may have resulted. However, we do not know the set of nodes that will actually get used. Hence, we conservatively assume that all of them may be used, and compute the spanning tree across all the materialized nodes.
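For intuition, the following brute-force sketch enumerates one incoming edge per materialized node and rejects cyclic choices; it is feasible only because a group typically has very few materialized nodes, and Edmonds' algorithm replaces it in general. The graph encoding (an incoming map whose sources include the token "scratch" for the shortest from-scratch path) is an assumption of the sketch.

from itertools import product

def min_cost_edge_choice(nodes, incoming):
    """incoming[v]: {source: cost}, source = another node or "scratch"."""
    nodes = list(nodes)
    best, best_cost = None, float("inf")
    for pairs in product(*[[(v, s) for s in incoming[v]] for v in nodes]):
        parent = dict(pairs)               # chosen incoming edge per node
        def reaches_scratch(v):
            seen = set()
            while v != "scratch":          # follow parents; a revisit = cycle
                if v in seen:
                    return False
                seen.add(v)
                v = parent[v]
            return True
        if all(reaches_scratch(v) for v in nodes):
            total = sum(incoming[v][parent[v]] for v in nodes)
            if total < best_cost:
                best, best_cost = parent, total
    return best, best_cost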
3.5 Extensions
In this section, we briefly outline extensions to (i) incorporate the creation and use of temporary indices, (ii) optimize nested queries to exploit common sub-expressions, and (iii) optimize multiple invocations of parameterized queries.
relational operations [43]. In correlated evaluation, the nested query is repeatedly invoked with different values for the correlation variables. Consider the following query.

Query: select * from a, b, c
       where a.x = b.x and b.y = c.y and
             a.cost = (select min(a1.cost)
                       from a as a1, b as b1
                       where a1.x = b1.x and b1.y = c.y)

One option for optimizing the correlated evaluation of this query is to materialize the join of a1 and b1, sharing it with the outer level query and across the nested query invocations. An index on the attribute b1.y of the materialized result is required for efficient access to it in the nested query, since there is a selection on b1.y from the correlation variable. If the best plan for the outer level query uses the join order $(a \bowtie b) \bowtie c$, materializing and sharing $a \bowtie b$ may provide the best plan.
In general, parts of the nested query that do not depend on the value of the correlation variables can potentially be shared across invocations [43]. We now show how to extend our algorithms to consider such reuse across multiple invocations of a nested query. The key intuition is that when a nested query is invoked many times, the benefits due to materialization must be multiplied by the number of times it is invoked; results that depend on correlation variables, however, must not be considered for materialization. The nested query invariant optimization techniques of [43] then fall out as a special case of ours. The inner subquery forms part of a predicate of some select or join operation of an outer query. This predicate has a pointer to an equivalence node that forms the root of the Query DAG for the inner subquery. Common results between the Query DAGs of the inner subquery and the outer query are unified. Thus, unlike optimizers that perform block-at-a-time optimization, we can share optimization effort between the outer and the inner subquery. In the Query DAG for the inner subquery, the predicate for a select or a join operation node can contain a reference to a correlation variable from the outer query. Let us call such a node a referencer node. Clearly, the result of an expression that contains a referencer node varies across different calls to the subquery (depending on the value of the correlation variable) and therefore cannot be materialized and shared across calls with different parameter values. Hence, we tag
the equivalence node under which a referencer node occurs, as well as all its ancestor nodes in the inner subquery's Query DAG, as non-materializable. Such tagging can be performed efficiently while the inner subquery's Query DAG is being constructed. The cost of the inner subquery is the product of (a) the cost of the best plan in the inner Query DAG, and (b) an estimate of the number of times the inner subquery is invoked. After the above constructions, the rest of our optimization algorithms are used unchanged, except that they do not consider materializing nodes tagged as non-materializable. An important point to note here is that the above construction allows us to share computation not only across multiple invocations of the inner subquery, but also between the inner subquery and the outer query (see Example 3.0.1). Extensions that allow memoization of the results of the different invocations of the inner subquery (or even intermediate results of these invocations), along with the corresponding correlation variable values, are possible. These will reduce the number of times the inner subquery is evaluated [51]. Such optimizations are independent of the optimizations we present, and can be used in conjunction. Note that if the inner subquery's results are memoized, the inner subquery is invoked only as many times as there are distinct parameter values.
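The tagging itself is a simple upward closure over the Query DAG. In the sketch below, the parents map and the set of referencer nodes are illustrative assumptions about how the DAG is encoded.

def tag_non_materializable(referencer_nodes, parents):
    """Tag every referencer node and all of its ancestors."""
    tagged, stack = set(), list(referencer_nodes)
    while stack:
        node = stack.pop()
        if node in tagged:
            continue
        tagged.add(node)              # the result varies with the correlation
        stack.extend(parents[node])   # variable, and so do all its ancestors
    return tagged                     # these nodes are never materialized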
Parameterized Queries. Our techniques can also be used to optimize multiple invocations of parameterized queries. Parameterized queries are queries that take parameter values, which are used in selection predicates; stored procedures are a common example. Parts of the query may be invariant across invocations, just as in nested queries, and these can be exploited by multi-query optimization. Although there has been much work on optimizing parameterized queries (e.g., [19]), to the best of our knowledge all the work in this area aims at finding the best way of executing an individual instance, not at multi-query optimization across multiple executions.
The basic optimizer implementation took approximately 17,000 lines of code; common MQO code took 1,000 lines; Volcano-SH and Volcano-RU took around 500 lines each; and Greedy took about 1,500 lines. The optimizer transformation rule set is listed in Appendix B. Implementation algorithms included sort-based aggregation, merge join, nested loops join, indexed join, indexed select and relation scan. The cost estimation formulae for these operators appear in Appendix C. Our implementation incorporates all the techniques discussed in this chapter, including the handling of physical properties (sort order and presence of indices) on base and intermediate relations, unification and subsumption during DAG generation, and the sharability algorithm for the greedy heuristic. The block size was taken as 4KB, and our cost functions assume 6MB is available to each operator during execution (we also conducted experiments with larger memory sizes, up to 128MB, with similar results). Standard techniques were used for estimating costs, using statistics about the relations. The cost estimates contain an I/O component and a CPU component, with a seek time of 10 msec, transfer time of 2 msec/block for read and 4 msec/block for write, and a CPU cost of 0.2 msec/block of data processed. We assume that intermediate results are pipelined to the next input, using an iterator model as in Volcano; they are saved to disk only if the result is to be materialized for sharing. The materialization cost is the cost of writing out the result sequentially.

The tests were performed on a single-processor 233 MHz Pentium-II machine with 64 MB memory, running Linux. Optimization times are measured as CPU time (user+system).
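As a worked instance of these constants, the sketch below computes the sequential I/O cost of reading or materializing a result; it is illustrative arithmetic only, not the optimizer's actual cost code.

SEEK_MS, READ_MS, WRITE_MS, CPU_MS = 10.0, 2.0, 4.0, 0.2  # per the text

def sequential_cost_ms(blocks, write=False):
    """Cost of sequentially reading (or writing) `blocks` 4KB blocks."""
    transfer = (WRITE_MS if write else READ_MS) * blocks
    return SEEK_MS + transfer + CPU_MS * blocks

# Materializing a 1000-block intermediate result:
#   10 + 4*1000 + 0.2*1000 = 4210 ms; re-reading it: 10 + 2000 + 200 = 2210 ms.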
Figure 3.6: Optimization of Stand-alone TPCD Queries

Experiment 1 (Stand-Alone TPCD): The workload for the first experiment consisted of four queries based on the TPCD benchmark [60]. The queries are listed in Appendix A. We used the TPCD database at scale 1 (i.e., 1 GB total size), with a clustered index on the primary key of each base relation. The results are discussed below and plotted in Figure 3.6.

TPCD query Q2 has a large nested query, and repeated invocations of the nested query in a correlated evaluation could benefit from reusing some of the intermediate results. For this query, though Volcano-SH and Volcano-RU do not lead to any improvement over the plan of estimated cost 126 secs. returned by Volcano, Greedy results in a plan with a significantly reduced cost estimate of 79 secs. Decorrelation is an alternative to correlated evaluation, and Q2-D is a (manually) decorrelated version of Q2 (due to decorrelation, Q2-D is actually a batch of queries). Multi-query optimization also gives substantial gains on the decorrelated query Q2-D, resulting in a plan with an estimated cost of 46 secs., since decorrelation results in common subexpressions. Clearly, the best plan here is multi-query optimization coupled with decorrelation. Observe also that the cost of Q2 (without decorrelation) with Greedy is much less than with Volcano, and is less than even the cost of Q2-D with plain Volcano; this result indicates that multi-query optimization can be very useful in other queries where decorrelation is not possible. To test this, we ran our optimizer on a variant of Q2 where the in clause is changed to a not in clause, which prevents decorrelation from being introduced without introducing
new internal operators such as the anti-semijoin [43]. We also replaced the correlated predicate by a modified one. For the modified query, Volcano gave a plan with an estimated cost of 62927 secs., while Greedy was able to arrive at a plan with an estimated cost of 7331 secs., an improvement by almost a factor of 9. We next considered the TPCD queries Q11 and Q15, both of which have common subexpressions, and hence make a case for multi-query optimization.⁵

Figure 3.7: Execution of Stand-alone TPCD Queries on MS SQL Server

For Q11, each of our three
algorithms led to a plan of approximately half the cost of that returned by Volcano. Greedy arrives at similar improvements for Q15 also, but Volcano-SH and Volcano-RU do not lead to any appreciable benefit for this query. Overall, Volcano-SH and Volcano-RU take the same time and space as Volcano. Greedy takes more time than the others for all the queries. In terms of relative time taken, Greedy needed a maximum of about 5 times as much time as Volcano, but took a maximum of just over 2 seconds, which is very small compared to its benefits. The total space required by Greedy ranged from 1.5 to 2.5 times that of the other algorithms, and again the absolute values were quite small (up to just over 130KB).

Results on Microsoft SQL-Server 6.5: To study the benefits of multi-query optimization on a real database, we tested its effect on
5. As mentioned earlier, we use the term multi-query optimization to mean optimization that exploits common subexpressions, whether within a single query or across the queries in a batch.
the queries mentioned above, executed on Microsoft SQL Server 6.5, running on Windows NT, on a 333 MHz Pentium-II machine with 64MB memory. We used the TPCD database at scale 1 for the tests. To do so, we encoded the plans generated by Greedy into SQL. We modeled sharing decisions by creating temporary relations, then populating, using and deleting them. Where indicated by Greedy, we created indexes on these temporary relations. We could not encode the exact evaluation plan in SQL, since SQL-Server does its own optimization. We measured the total elapsed time for executing all these steps. The results are shown in Figure 3.7.

For query Q2, the time taken reduced from 513 secs. to 415 secs. Here, SQL-Server performed decorrelation on the original Q2 as well as on the result of multi-query optimization. Thus, the numbers do not match our cost estimates, but clearly multi-query optimization was useful here. The reduction for the decorrelated version Q2-D was from 345 secs. to 262 secs.; thus the best plan for Q2 overall, even on SQL-Server, was using multi-query optimization as per Greedy on a decorrelated query. The query Q11 speeded up by just under 50%, from 808 secs. to 424 secs., and Q15 from 63 secs. to 42 secs., using the plans with sharing generated by Greedy.

The results indicate that multi-query optimization gives significant time improvements on a real system. It is important to note that the measured benefits are underestimates of the potential benefits, for the following reasons. (a) Due to the encoding of sharing in SQL, temporary relations had to be stored and re-read even for the first use. If sharing were incorporated within the evaluation engine, the first (non-index) use could be pipelined, reducing the cost further. (b) The operator set of SQL-Server 6.5 seems to be rather restricted, and does not seem to support sort-merge join; for all the queries we submitted, it only used (index) nested-loops joins. Our optimizer at times indicated that it was worthwhile to materialize a relation in a sorted order so that it could be cheaply used by a merge join or an aggregation over it, which we could not encode in SQL/SQL-Server. In other words, if multi-query optimization were properly integrated into the system, the benefits are likely to be significantly larger, and more consistent with the benefits according to our cost estimates.
Figure 3.8: Optimization of Batched TPCD Queries

Experiment 2 (Batched TPCD Queries): In the second experiment, the workload models a system where several TPCD queries are executed as a batch. The workload consists of subsequences of the queries Q3, Q5, Q7, Q9 and Q10 from TPCD; none of these queries has any common subexpressions within itself. These queries are listed in Appendix A. Each query was repeated twice with different selection constants. The composite query BQi consists of the first i of the above queries, and we used the composite queries BQ1 to BQ5 in our experiments. As in Experiment 1, we used the TPCD database at scale 1 and assumed that there are clustered indices on the primary keys of the database relations. Note that although a query is repeated with two different values of a selection constant, we found that the selection operation generally ends up at the bottom of the best Volcano plan tree, and the two best plan trees may not have common subexpressions.

The results on the above workload are shown in Figure 3.8. Across the workload, Volcano-SH and Volcano-RU achieve only up to about 14% improvement over Volcano with respect to the cost of the returned plan, while incurring negligible overheads. There was no difference between Volcano-SH and Volcano-RU on these queries, implying the choice of plans for earlier queries did not change the local best plans for later queries. Greedy performs better, achieving up to 56% improvement over Volcano, and is uniformly better than the other two algorithms.

As expected, Volcano-SH and Volcano-RU have essentially the same execution time and space requirements as Volcano. Greedy takes about 15 seconds on the largest query in the set,
BQ5, while Volcano takes slightly more than 1 second on the same. However, the estimated cost saving on BQ5 is 260 seconds, which is clearly much more than the extra optimization time of 14 secs. Thus the extra time spent on Greedy is well spent. Similarly, the space requirements of Greedy were greater by about a factor of three to four over Volcano, but the absolute difference for BQ5 was only 60KB. The benefits of Greedy, therefore, clearly outweigh the cost.
Experiment 3 (Scaleup): The third experiment tests how the algorithms scale with the number of queries. Each component query was a pair of chain queries on five consecutive relations, with equality join conditions between adjacent relations. One of the queries in the pair had a selection $\sigma_{a < v_1}$, while the other had a selection $\sigma_{a < v_2}$, where $v_1$ and $v_2$ are arbitrary values with $v_1 < v_2$. To measure scaleup, we use the composite queries CQ1 to CQ5, where CQi consists of the first i component query pairs. Thus, CQ5 uses 22 base relations, and has 144 join predicates and 36 select predicates. The sizes of the 22 base relations varied from 20000 to 40000 tuples (assigned randomly), with 25 tuples per block. No index was assumed on the base relations.
The cost of the plan and the optimization time for the above workload are shown in Figure 3.9. The relative benefits of the algorithms remain similar to those in the earlier workloads, except that Volcano-RU now gives somewhat better plans than Volcano-SH. Greedy continues to be the best, although it is relatively more expensive. The optimization time for Volcano, Volcano-SH and Volcano-RU increases linearly. The increase in optimization time for Greedy is also practically linear, although it has a very small super-linear component. But even for the largest query, CQ5 (with 22 relations, 144 join predicates and 36 select predicates), the time taken was only 35 seconds. The size of the DAG increases linearly for this sequence of queries. From the
above, we can conclude that Greedy is scalable to quite large query batch sizes.

To better understand the complexity of the Greedy heuristic on the scaleup workload, in addition to the optimization time we measured the total number of times cost propagation occurs across equivalence nodes, and the total number of times cost recomputation is initiated. The result is plotted in Figure 3.10. Note that in addition to the size of the DAG, the number of sharable nodes also increases linearly across queries CQ1 to CQ5.

Figure 3.10: Complexity of the Greedy Heuristic

Greedy was considered expensive by [57] because of its worst-case complexity, which can be as much as a high-order polynomial in $m$, where $m$ is the number of edges in the DAG. However, for multi-query optimization, the DAG tends to be wide rather than tall: as we add queries, the DAG gets wider, but its height does not increase, since the height is defined by the individual queries.
As a result, the number of times costs are propagated across equivalence nodes and the number of times cost recomputation is initiated both increase almost linearly with the number of queries. The observed complexity is thus much less than the worst-case complexity. The number of times costs are propagated across equivalence nodes is almost constant per cost recomputation. This is because the number of nodes of the DAG affected by a single materialization does not vary much with the number of queries, which is exploited by the incremental cost recomputation. The height of the DAG remains constant (since the number of relations per query is fixed, which is a reasonable assumption).
Overall, the optimizations described in Section 3.3 give the Greedy algorithm an order of magnitude improvement in its performance, and are critical for it to be of practical use.
3.6.4 Discussion
To check the effect of memory size on our results, we ran all the above experiments increasing the memory available to the operators from 6MB to 32MB, and further to 128MB. We found that the cost estimates for the plans decreased slightly, but the relative gains (i.e., the cost ratio with respect to Volcano) essentially remained the same throughout for the different heuristics.

We stress that while the cost of optimization is independent of the database size, the execution cost of a query, and hence the benefit due to optimization, depends upon the size of the underlying data. Correspondingly, the benefit-to-cost ratio of our algorithms increases markedly with the size of the data. To illustrate this fact, we ran the batched TPCD query BQ5 (considered in Experiment 2) on the TPCD database at scale 100 (total size 100GB). Volcano returned a plan with an estimated cost of 106897 seconds, while Greedy obtained a plan with a cost estimate of 73143 seconds, an improvement of 33754 seconds. The extra time spent during optimization is 14 seconds, as before, which is negligible relative to the gain.

While the benefits of using MQO show up on query workloads with common subexpressions, a relevant issue is the performance on workloads with rare or nonexistent overlaps. If it is known a priori that the workload is not going to benefit from MQO, then we can set a flag in our optimizer that bypasses the MQO-related algorithms described in this chapter, reducing it to plain Volcano. To study the overheads of our algorithms in a case with no sharing, we took the TPCD queries Q3, Q5, Q7, Q9 and Q10, renamed the relations to remove all overlaps between the queries, and created a batch consisting of the queries with relations renamed. The overheads of Volcano-SH and Volcano-RU are negligible, as discussed earlier. Basic Volcano optimization took 650 msec, while the Greedy algorithm took 820 msec. Thus the overhead was around 25%, but note that the absolute numbers are very small. With no overlap, the sharability detection algorithm finds no node sharable, causing the Greedy algorithm to terminate immediately (returning the same plan as Volcano). Thus, the overhead in Greedy is due to (a) the expansion of the entire DAG, and (b) the execution of the sharability detection algorithm. Of this overhead, cause (a) is predominant, and
the sharability computation was quite cheap on queries with no sharing.

In our experiments, Volcano-RU was better than Volcano-SH only in a few cases, but since their run times are similar, Volcano-RU is preferable. There exist cases where Volcano-RU finds plans as good as Greedy in much less time and using much less space; but, on the other hand, in the above experiments we saw many cases where the additional investment of time and space in Greedy pays off, and we get substantial improvements in the plan. To summarize, for very low cost queries, which take only a few seconds, one may want to use Volcano-RU, which does a quick-and-dirty job, especially if the query is also syntactically complex. For more expensive queries, as well as canned queries that are optimized rarely but executed frequently over large databases, it clearly makes sense to use Greedy.
3.7 Related Work

Earlier work on multi-query optimization models queries in terms of tasks, where tasks may be shared between the plans for different queries. These approaches do not exploit the hierarchical nature of query optimization problems, where tasks have subtasks. Finally, these solutions are not integrated with an optimizer. The work in [59] considers sharing only amongst the best plans of each query; this is similar to Volcano-SH, and as we have seen, this often does not yield the best sharing.

The problem of materialized view/index selection [45, 44, 63, 9, 34, 26] is related to the multi-query optimization problem. The issue of materialized view/index selection for the special case of aggregates/data-cubes is considered in [29, 27] and implemented in Redbrick Vista [11]. The view selection problem can be viewed as finding the best set of sub-expressions to materialize, given a workload consisting of both queries and updates. The multi-query optimization problem differs from the above since it assumes the absence of updates, but it must keep in mind the cost of computing the shared expressions, whereas the view selection problem concentrates on the cost of keeping the shared expressions up-to-date. It is also interesting to note that multi-
query optimization is needed for finding the best way of propagating updates on base relations to materialized views [44]. Several of the algorithms presented for the view selection problem ([29, 27, 26]) are similar in spirit to our greedy algorithm, but none of them describes how to efficiently implement the greedy heuristic. Our major contribution here lies in making the greedy heuristic practical through our optimizations of its implementation. We show how to integrate the heuristic with the optimizer, allowing incremental recomputation of benefits, which was not considered in any of the earlier work, and our sharability and monotonicity optimizations also result in great savings. The lack of an efficient implementation could be one reason for the authors of [57] to claim that the greedy algorithm can be quite inefficient for selecting views to materialize for cube queries. Another reason is that, for multi-query optimization of normal SQL queries (modeled by our TPC-D based benchmarks), the DAG is short and fat, whereas the DAGs for complicated cube queries tend to be taller. Our performance study (Section 3.6) indicates the greedy heuristic is quite efficient, thanks to our optimizations.

Another related area is that of caching of query results. Whereas multi-query optimization can optimize a batch of queries given together, caching takes a sequence of queries over time, deciding what to materialize and keep in the cache as each query is processed. Related work in caching includes [10, 64, 33]. The work in [64, 33] considers only queries that can be expressed as a single multi-dimensional expression. The work in [10] addresses the issue of management of a cache of previous results, but considers only select-project-join (SPJ) queries. We consider a more general class of queries. Our multi-query optimization algorithms implement query optimization in the presence of materialized/cached views as a subroutine. By virtue of working on a general DAG structure, our techniques are extensible, unlike the solutions of [8] and [10]. The problem of detecting whether an expression can be used to compute another has also been studied in [35, 62, 52]; however, these works do not address the problem of choosing what to materialize, or the problem of finding the best query plans in a cost-based fashion.

Recently, [43] considers the problem of detecting invariant parts of a nested subquery, and teaching the optimizer to choose a plan that keeps the invariant part as large as possible. Perform-
ing multi-query optimization on nested queries automatically solves the problem they address. Our algorithms have been described in the context of a Volcano-like optimizer; at least two commercial database systems, from Microsoft and Tandem, use Volcano-based optimizers. However, our algorithms can also be modified to be added on top of existing System-R style bottom-up optimizers; the main change would be in the way the DAG is represented and constructed.
3.8 Summary
We have described three novel heuristic search algorithms, Volcano-SH, Volcano-RU and Greedy, for multi-query optimization, and presented a number of techniques to greatly speed up the greedy algorithm. Our algorithms are based on the AND/OR Query DAG representation of queries, and can thereby be easily extended to handle new operators. Our algorithms also handle index selection and nested queries in a very natural manner. We also developed extensions to the DAG generation algorithm to detect all common subexpressions and include subsumption derivations. Our implementation demonstrated that the algorithms can be added to an existing optimizer with a reasonably small amount of effort.

Our performance study, using queries based on the TPC-D benchmark, demonstrates that multi-query optimization is practical and gives significant benefits at a reasonable cost. The benefits of multi-query optimization were also demonstrated on a real database system. The greedy strategy uniformly gave the best plans across all our benchmarks, and is best for most queries; Volcano-RU, which is cheaper, may be appropriate for inexpensive queries. Our multi-query optimization algorithms were partially prototyped on Microsoft SQL Server in the summer of 1999, and are currently being evaluated by Microsoft for possible inclusion in SQL Server. In conclusion, we believe we have laid the groundwork for the practical use of multi-query optimization, and that multi-query optimization will form a critical part of all query optimizers in the future.
Figure 4.1: Architecture of the Exchequer System

The access patterns of the queries cannot be expected to be static, and to answer all types of queries efficiently, we need to dynamically change the cache contents. The techniques needed (a) for intelligently and automatically managing the cache contents, given the cache size constraints, as queries arrive, and (b) for performing query optimization exploiting the cache contents, so as to minimize the overall response time over all the queries, form the crux of this work. These techniques form a part of the Exchequer query caching system.

The architecture of the Exchequer system is portrayed in Figure 4.1. Query results are cached in a fixed-size disk area, called the result cache. Thus the caching of a result incurs an overhead of writing the result to disk. If the cached result is to be indexed, the caching overhead includes the index creation overhead. A use of a cached result corresponds to index probes if it is indexed, and to a full scan otherwise. Our techniques also apply to main-memory caching, as well as to hybrid two-level (disk cum main-memory) caching; these variants are discussed in Section 4.6.

The cache manager and the optimizer are tightly integrated: (a) the optimizer optimizes an incoming query based on the current cache state, and (b) the cache manager decides which results to cache and which cached results to evict based on the workload (which depends on the sequence of queries in the past).

We assume that the workload presents queries in an ordered sequence, and that only one query is
processed at a time. Extending to concurrent optimization and execution, wherein new queries arrive and are to be optimized and executed while a previous query is being optimized and executed, is a topic of future study. In particular, we assume that the cache contents do not change between the optimization and the execution of a query. The results are cached without any projections, to maximize the number of queries that can benefit from a cached result; extensions to avoid caching very large attributes are possible.

In addition to the above functionality, a caching system should also support invalidation or refresh of cached results in the face of updates to the underlying database. In this chapter, however, we confine our attention to the issue of efficient query processing, ignoring updates. Data warehouses are an example of an application where the cache replacement algorithm can ignore updates, since updates happen only periodically (once a day or even once a week).
The rest of this chapter is organized as follows. Section 4.1 describes how an incoming query is optimized in the presence of the cached results. In order to perform workload-adaptive caching, it is essential to dynamically maintain a characterization of the current workload; how Exchequer achieves this is discussed in Section 4.2. Next, Section 4.3 outlines Exchequer's cache management algorithm. Differences of this work from earlier related work are covered in detail in Section 4.4. Results of the experimental evaluation of the proposed algorithms are discussed in Section 4.5. The chapter is summarized in Section 4.7.
Section 4.1.1 describes the Consolidated DAG, an auxiliary Query DAG (ref. Section 2.2.2) that is used to keep track of the queries in the workload as well as the cache contents. In Section 4.1.2, we outline how a Query DAG for the query is generated and melded with the Consolidated DAG; as we shall show, this takes care of cached result matching and of expressing the query in terms of these cached results. Next, in Section 4.1.3, we describe Exchequer's variant of the Volcano query optimization algorithm that uses this Query DAG to find the best plan for the query in the presence of the cached results.
Figure 4.2: The Consolidated DAG (CDAG) for the queries $A \bowtie C \bowtie D$ and $A \bowtie C \bowtie E$, with the result $A \bowtie C$ cached: (a) initially, (b) after the addition of the incoming query $A \bowtie B \bowtie C$, and (c) after the expansion of the query into its Query DAG.
When the query $A \bowtie B \bowtie C$ arrives, its initial unexpanded representation is created and added to the CDAG, as shown in Figure 4.2(b). The next step is the expansion of this query tree into the Query DAG for the query, shown in Figure 4.2(c). This is achieved by applying all possible transformations on every equivalence node of the query tree. In our example, we assume that the only transformations applied are join associativity and commutativity. (To avoid clutter, the figure does not show the results of applying commutativity on the respective expressions.) In the process, when the expression $(A \bowtie C) \bowtie B$ is generated, the new expression $A \bowtie C$ is found to already exist in the CDAG. It turns out that the equivalence node for $A \bowtie C$ is marked as present in the cache (see Figure 4.2(c)); the optimizer can therefore consider reusing the cached result while optimizing the new query.
Exchequer also detects and handles subsumption derivations. For example, suppose the two subexpressions $\sigma_{A<5}(E)$ and $\sigma_{A<10}(E)$ appear in the CDAG; the result of the former can be computed from the result of the latter by an additional selection. To represent this possibility, we add an extra operation node $\sigma_{A<5}$ between the equivalence nodes for $\sigma_{A<5}(E)$ and $\sigma_{A<10}(E)$ in the Query DAG. Similarly, given the selections $\sigma_{A=5}(E)$ and $\sigma_{A=10}(E)$ on an expression $E$, we create a single new equivalence node representing the disjunction of all the selection conditions, $\sigma_{A=5 \vee A=10}(E)$, and add derivations of the two selections from it. Similar derivations also help with aggregations. For example, if we have $\gamma_{dno;\,sum(sal)}(E)$ and $\gamma_{age;\,sum(sal)}(E)$, we add derivations of both from $\gamma_{dno,age;\,sum(sal)}(E)$ by further groupbys on $dno$ and $age$.
Subsumption derivations are important because (a) they allow reuse of cached results even when a cached result does not exactly match a subexpression of the query but can be used to compute the same; and, dually, (b) they make explicit the different ways in which a result may be used, which is important for determining the benefit of caching the result while making the dynamic caching decisions, as explained in Section 4.3. Volcano neither performs unification nor introduces subsumption derivations; these extensions were proposed as a part of our earlier work on multi-query optimization (Chapter 3). The novelty here is to show how this Query DAG framework can be used to perform matching of queries and cached results during optimization with negligible overhead on the optimizer. In the following section, we discuss how the Query DAG for the new query, generated as explained in this section, is used to generate the best plan for the query in a cache-aware manner.
The main extension to Volcano for Exchequer involves considering the possible use of cached results while determining the minimum-cost plan for a query. To find the cost of a node given a set of equivalence nodes whose results are present in the cache, we use the Volcano cost formulae stated above, with the following change. For an equivalence node $e$ whose result is present in the cache, let $reusecost(e)$ denote the cost of reusing the cached result. When computing the cost of an operation node $o$, if an input equivalence node $e_i$ is cached, the minimum of $cost(e_i)$ and $reusecost(e_i)$ is used. Thus, we use the following expression instead:

$cost(o)$ = cost of executing $o$ + $\sum_{e_i \in children(o)} C(e_i)$

where $C(e_i) = cost(e_i)$ if $e_i$ is not in the cache, and $C(e_i) = \min(cost(e_i), reusecost(e_i))$ if $e_i$ is in the cache.

Thus, the extended optimizer computes best plans for the query in the presence of cached results.
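The change amounts to one extra min in the cost recursion. Below is a minimal sketch under an assumed DAG encoding (ops, inputs, op_cost, reusecost, cached); leaf equivalence nodes are assumed hashable and to have a scan operation with no inputs.

from functools import lru_cache

def plan_cost(root, ops, inputs, op_cost, reusecost, cached):
    @lru_cache(maxsize=None)
    def cost(e):
        # cheapest way to compute e: best among its alternative operations
        return min(op_cost(o) + sum(use(i) for i in inputs(o)) for o in ops(e))

    def use(e):
        # a cached result is either reused or recomputed, whichever is cheaper
        return min(cost(e), reusecost(e)) if e in cached else cost(e)

    return use(root)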
We model the future workload at this point as a sequence of queries picked from some fixed set according to some fixed probability distribution. Thus, in this model, the set of queries and probability distribution together fully characterize the workload at this point; however, neither of these are known, and they need to be predicted. These predictions need to be dynamic, and must be continuously updated to keep track of the changing workload as time progresses. Our predictions for the future are entirely based on the past. As such, we predict the set of future queries as the set of queries present in the CDAG at the given point in time. We denote this set by Q_k, where k is the number of queries seen so far. Further, let the estimate of the probability distribution at this point be denoted by P_k. We assume the presence of (a) an arbitrary non-empty initial set of queries, Q_0, and (b) an arbitrary initial probability distribution, P_0, on Q_0. In the discussion below, we show how Q_k and P_k are updated to Q_{k+1} and P_{k+1} respectively on the arrival of the query q_{k+1}. When q_{k+1} arrives, unification, as described in Section 4.1.2, enables us to determine whether or not q_{k+1} ∈ Q_k. If q_{k+1} ∈ Q_k, the CDAG remains unchanged and Q_{k+1} = Q_k; otherwise q_{k+1} is inserted into the CDAG³ and Q_{k+1} = Q_k ∪ {q_{k+1}}. The probability estimate is updated using exponential smoothing:⁴

P_{k+1}(q) = α · I(q, q_{k+1}) + (1 - α) · P_k(q)    if q ∈ Q_k
P_{k+1}(q) = α                                       if q = q_{k+1} and q ∉ Q_k

where I(q, q_{k+1}) is 1 if q = q_{k+1}, and 0 otherwise. The smoothing factor α denotes the bias of the estimator in favour of the recent queries. The exponential smoothing estimator was chosen because of its simplicity and low overhead. The probability estimates need to be maintained dynamically as the workload progresses. An option is to compute this estimate on the arrival of each successive query using the equations above for each query in the current CDAG. This is clearly not viable due to the large number of queries involved. In practice, therefore, these estimates are maintained lazily and computed only when accessed.
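The lazy scheme can be implemented with a per-query timestamp, as in the C++ sketch below; the class and member names are hypothetical. On access, the stored estimate is decayed by (1 - α) once per elapsed step; on arrival of the query, the same decay is applied before the α term is added, which is equivalent to applying the recurrence at every step.

#include <cmath>
#include <string>
#include <unordered_map>

class WorkloadEstimator {
    struct Entry { double p = 0.0; long last = 0; };
    std::unordered_map<std::string, Entry> est_;  // keyed by a canonical query string
    double alpha_;    // smoothing factor: bias towards recent queries
    long   step_ = 0; // number of queries seen so far
public:
    explicit WorkloadEstimator(double alpha) : alpha_(alpha) {}

    // Called when the next query in the stream arrives.
    void observe(const std::string& q) {
        ++step_;
        Entry& e = est_[q];
        // Decay for the steps since the last update, then apply the alpha term.
        e.p = alpha_ + std::pow(1.0 - alpha_, double(step_ - e.last)) * e.p;
        e.last = step_;
    }

    // Current estimate P_k(q), computed lazily on demand.
    double probability(const std::string& q) const {
        auto it = est_.find(q);
        if (it == est_.end()) return 0.0;
        const Entry& e = it->second;
        return std::pow(1.0 - alpha_, double(step_ - e.last)) * e.p;
    }
};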
Consider an arbitrary query q in the workload. The algorithm outlined in this section attempts to determine the cache contents that minimize the expected cost of the workload that follows q.

³ Possibly replacing some other queries due to space constraints. This case is not considered in the presented scheme for the sake of simplicity; it is a trivial extension to the same.
⁴ It can be verified that P_{k+1} is a valid probability distribution if P_k is one.
As outlined in Section 4.2, the future workload at the point of execution of q is characterized by (a) the set of queries Q and (b) the probability distribution P on Q. Let M be the set of results present in the cache when a query q' arrives as a part of the predicted workload; q' is then optimized using the results in M as explained in Section 4.1. The expected execution cost of the best plans chosen by the optimizer is given by

E(M) = Σ_{q' ∈ Q} P(q') × cost(q' | M)

where cost(q' | M) is the cost of the best plan for q' given the cache contents M. However, since Q contains a large number of queries, computation of the above sum is expensive. Thus, we identify a representative set R, a subset of Q containing the queries that are most likely to occur as the next query (as suggested by the distribution P on Q), and compute the sum with respect to R instead. Workloads typically exhibit locality of reference; therefore, restricting the sum with respect to the most probable queries should give a reasonable approximation of the actual expected cost. We thus compute the approximation

Ê(M) = Σ_{q' ∈ R} P(q') × cost(q' | M)

The algorithm described below, thus, chooses the set M that minimizes Ê(M); Exchequer's execution engine reconfigures the cache accordingly during the execution of q. Given a set of results M already chosen for caching by the algorithm, and a result r ∉ M, benefit(r, M), the benefit of additionally caching node r, is defined as the decrease in Ê (the payoff), minus the cost of caching r, if it is not already present in the cache (the investment). Formally:

benefit(r, M) = (Ê(M) - Ê(M ∪ {r})) - matcost(r)

where matcost(r) is the cost of caching the new result r, which involves writing r to the disk (matcost(r) is taken as zero if r is already present in the cache). The benefit measured as above is conservative since it does not amortize matcost(r) over multiple uses; computing a tighter measure of benefit is nontrivial since it is difficult to compute apriori how many times the result is going to be used between its admission into the cache and its eviction.
Procedure GREEDY
Input: C, the set of candidate results for caching
Output: M, the set of results to be cached
Begin
    M = ∅
    while (C ≠ ∅)
        Among the results in C,
L1:     Pick the result r with the maximum benefit(r, M)/size(r)
            /* i.e., maximum benefit per unit space */
        if (benefit(r, M) ≤ 0 or size(M ∪ {r}) > cache size)
            break; /* No further benefits to be had, stop */
        M = M ∪ {r}; C = C - {r}
    return M
End

Figure 4.3: The Greedy Algorithm for Cache Management
Figure 4.3 outlines an algorithm, hereafter called Greedy, that takes as input a candidate set of results, C, and heuristically selects (for caching) the subset M of C that minimizes Ê(M) overall, under the cache space constraint. The algorithm weighs the benefits of caching the intermediate results that are computed during the execution of the best plan of q against the benefit of retaining results that are already in the cache. As such, the candidate set C contains:
1. The final and intermediate results in the best plan of q, and
2. The set of results that was selected as having the maximum benefit by the preceding invocation of the algorithm (this set is present in the cache).
Greedy works iteratively as follows. Starting with M empty, in each iteration, the algorithm greedily selects the node that, if cached, gives the maximum benefit per unit space among the results in C, and moves it from C to M. The algorithm terminates when C becomes empty, the benefit becomes zero/negative, or the size of the nodes in M exceeds the cache size, whichever is earlier.
The final value of M is the set of results to be placed in the cache, and is returned as the output of the algorithm. After M has been computed by Greedy, the best plan of q is executed. Two variants of the Exchequer algorithm are possible depending upon what is cached during the execution; the variants are evaluated in Section 4.5. The core selection loop of Greedy is sketched below.
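For concreteness, the following C++ fragment sketches the selection loop of Figure 4.3; Result, and the benefit and size callbacks, are hypothetical stand-ins for the optimizer's equivalence nodes and the benefit computation of Section 4.3, and a production implementation would use the incremental optimizations described next.

#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct Result;  // a candidate result (an equivalence node)

std::vector<const Result*> greedyCacheSelect(
    std::vector<const Result*> C, double cacheSize,
    const std::function<double(const Result*,
                               const std::vector<const Result*>&)>& benefit,
    const std::function<double(const Result*)>& size) {
    std::vector<const Result*> M;  // results chosen for caching so far
    double used = 0.0;
    while (!C.empty()) {
        // Line L1 of Figure 4.3: pick the maximum benefit per unit space.
        std::size_t best = 0;
        double bestRatio = -std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < C.size(); ++i) {
            double ratio = benefit(C[i], M) / size(C[i]);
            if (ratio > bestRatio) { bestRatio = ratio; best = i; }
        }
        const Result* r = C[best];
        // Stop on zero/negative benefit, or if the result no longer fits.
        if (benefit(r, M) <= 0.0 || used + size(r) > cacheSize) break;
        M.push_back(r);
        used += size(r);
        C.erase(C.begin() + static_cast<std::ptrdiff_t>(best));
    }
    return M;  // the set of results to be placed in the cache
}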
Optimizations of Greedy Algorithm: Two important optimizations to the greedy algorithm, originally proposed in the context of multi-query optimization (Chapter 3), can be adapted for the purpose of selecting the cachable nodes efficiently:
1. Since there are many calls to benefit (and thereby to Ê) at line L1 of Figure 4.3, with different parameters, a simple option is to process each call independent of the other calls. Our optimization is instead to incrementally update the costs, maintaining the state of the Query DAG (which includes previously computed best plans for the equivalence nodes) across calls. Details can be found in Chapter 3.
2. The second optimization is based on the assumption that the benefit of caching a result does not increase as more results are selected for caching. Under this assumption, there is no need to recompute the benefit of a result r if the new benefit of some result r' is higher than the previously computed benefit of r. It is clearly preferable to cache r' at this stage, rather than r, since under the above assumption, the benefit of r could not have increased since it was last computed.
Furthermore, in earlier work that considers general queries (e.g. WatchMan [49]), the cached results are matched syntactically. Our work carries out semantic matching of cached results during cache-aware query optimization. It is important to contrast the caching problem with the materialized view/index selection problem, where the cache contents do not vary and the query workload is known fully apriori (e.g., see [44, 34, 26] for general views, [29, 27, 57] for data cubes, and [9] for index selection). Techniques for materialized view/index selection use sophisticated ways of deciding what to materialize, where the computation of the benefit of materializing a view takes into account what other views are materialized. The major disadvantage of static cache contents is that they cannot cater to changing workloads: the data access patterns of the queries cannot be expected to be static, and to answer all types of queries efficiently, we need to dynamically change the cache contents. Moreover, the cost of materializing the selected views is ignored. Another related area is multi-query optimization (MQO), where (e.g., in the work presented in Chapter 3) the optimizer takes into account the cost of temporarily materializing the selected views, but still makes a static decision on what to materialize based on a fixed set of queries. Still, as we saw in Section 4.3, dynamic cache management can benefit from some of the techniques developed for the efficient implementation of MQO. In particular, the Greedy algorithm presented in Section 4.3 is derived from the Greedy algorithm used in our earlier work on MQO (Chapter 3). However, that algorithm was concerned with minimizing the total one-time execution cost of the queries in a given batch, with no restriction on the storage space. The Greedy algorithm presented in Section 4.3, on the other hand, is concerned with minimizing the cost of an infinite workload, where each query can occur multiple times, under fixed constraints on the storage space for cached results. This leads to a very different notion of the benefit of sharing a result. Apart from this, a major design issue in this work is to make Greedy suitable for online operation, as is apparent from our discussion in Section 4.3. Recently, there has been some interest in caching in the context of LDAP queries [31]; these queries are simple in nature and involve only multi-attribute selects on a single table. The caching algorithm proposed in [31] performs complete reorganization of the cache contents (called revolution)
whenever the estimated benefit of the cached data drops below a dynamically estimated value. In between revolutions, the cache contents undergo incremental modifications (called evolution). Exchequer performs only evolution; our experiences with performing revolutions as well are presented in Section 4.6.
Our algorithms were implemented as extensions of the multi-query optimization code (Chapter 3) that we have integrated into our Volcano-based query optimizer. The basic optimizer took approx. 17,000 lines of C++ code, with the caching code taking about 3,000 lines. The block size was taken as 4KB and our cost functions assume 6MB is available to each operator during execution (we also conducted experiments with memory sizes up to 128 MB, with similar results). Standard techniques were used for estimating costs, using statistics about relations. The cost estimates contain an I/O component and a CPU component, with a seek time of 10 msec, a transfer time of 2 msec/block for read and 4 msec/block for write, and a CPU cost of 0.2 msec/block of data processed. We assume that intermediate results are pipelined to the next input, using an iterator model as in Volcano. Caching a result has the cost of writing out the result sequentially to the disk. The tests were performed on a Sun workstation with an UltraSparc 10 333MHz processor and 256MB RAM, running Solaris 2.7.
The join-list enforces equality between attributes of the order fact table and primary keys of the dimension tables. We pick suppkey, partkey, custkey, month, year as the set of groupby attributes G. An additional attribute from each of PART, SUPPLIER and CUSTOMER was picked to form the list of select attributes S. The groupby-list was generated by picking a subset of G at random. The select-list, i.e. the predicates for the selects, was generated by selecting attributes at random from G and S and creating equality or inequality predicates on these attributes using random values picked from the respective domains. The select predicates involving attributes in S define different cubes. Thus, in effect, the workload models simultaneous analysis of a large number of distinct cubes. A query is thus defined uniquely by the pair (select-list, groupby-list). Even though our algorithms can handle a more general class of queries, the above class of cube queries was chosen so that we can have a fair comparison with DynaMat [33] and Watchman2 [50]. There are two independent criteria based on which the pair (select-list, groupby-list) was generated.
1. The kind of predicates comprising the select-list. Accordingly, we classify the workloads as:
CubePoints: Predicates are restricted to equalities, or
CubeSlices: Predicates are a random mix of equalities and inequalities.
Figure 4.4 gives the distribution of the distinct intermediate results computed during the processing of the CubePoints and CubeSlices workloads. Since each predicate in CubePoints is a highly selective equality, the size of most intermediate results is small, at most 10% of the database size. On the other hand, since CubeSlices contains inequalities as well, a number of larger intermediate results, with size up to 40% of the database size, are also present.

[Figure 4.4: Distribution of distinct intermediate results generated during the processing of the CubePoints and CubeSlices workloads. The plot shows the number of results against the result size (as a percentage of the DB size, from 0% to 128%).]

2. The distribution from which the attributes and values are picked up in order to form the groupby-list and the predicates in the select-list. We consider a moderately skewed and a highly skewed workload, based on the Zipfian distribution:
Zipf-0.5: Uses the Zipfian distribution with parameter 0.5. This workload is moderately skewed.
Zipf-2.0: Uses the Zipfian distribution with parameter 2.0. This workload is highly skewed.
The distribution additionally rotates after every interval of 128 queries, i.e. the most frequent subset of groupbys becomes the least frequent, and all the rest shift up one position.
Thus, within each block of 128 queries, some groupby combinations and selection constants are more likely to occur than others. Based on the four combinations that result from the above criteria, the following four workloads are considered in the experiments:
CubePoints/Zipf-0.5: a moderately skewed workload of CubePoints,
CubePoints/Zipf-2.0: a highly skewed workload of CubePoints,
CubeSlices/Zipf-0.5: a moderately skewed workload of CubeSlices, and
CubeSlices/Zipf-2.0: a highly skewed workload of CubeSlices.
4.5.2 Metric
The metric used to compare the goodness of caching algorithms is the total response time of a set of queries. We report the total response time for a sequence of 900 queries that enter the system after a sequence of 100 queries warm up the cache. This total response time is as estimated by the optimizer and hence denoted as estimated cost in the experimental results presented in Section 4.5.4. These estimates are the same as used in Section 3.6 and as demonstrated there, are a close approximation to the real execution costs on Microsoft SQL-Server 6.5.
In the experimental study in Section 4.5.4, we evaluate these variants against each other as well as against the following prior approaches.
LCS/LRU: This approach uses the caching policy found to be the best in ADMS [10],
namely replacing the result occupying the largest cache space (LCS), picking the least recently used (LRU) result in case of a tie. The incoming query is optimized taking the cache contents into account. The final as well as intermediate results in the best plan are considered for admission into the cache based on LCS.
DynaMat: We simulate DynaMat [33] by considering only the top-level query results
(in order to be fair to DynaMat, our benchmark queries were chosen to have either no selection or only single-value selections). The original DynaMat performs matching of cube slices using R-trees on the dimension space. In our implementation, query matching is performed semantically, using our unification algorithm, rather than syntactically. We use our algorithms to optimize the query taking into account the current cache contents; this covers the subsumption dependency relationships explicitly maintained in [33]. The replacement metric is computed as: (number-of-accesses × cost-of-computation)/(query-result-size), where the number of accesses is counted over the entire history (observed so far).
WatchMan: Watchman [49] also considers caching only the top level query results. The
original Watchman does syntactic matching of queries, with semantic matching left for future work. We improve on that by considering semantic matching. The difference between our implementation of DynaMat and WatchMan is in the replacement metric: instead of considering the number of accesses as in the DynaMat implementation, our WatchMan implementation considers the rate of use over a window of the last five accesses for each query,
where the cost of computation is with respect to the current cache contents. The original algorithms did not consider subsumption dependencies between the queries; our implementation considers aggregation subsumption among the cube queries considered. Given the enhancements mentioned above, our implementations of the above algorithms are slightly more sophisticated than the originally proposed versions. It is important to investigate the promise that dynamic materialized view selection holds over static materialized view selection. In order to do so, we consider our version of a static view selection wizard as follows:
Static: We use Exchequer/NoFullCache on the first 100 queries in the workload, with the representative set consisting of all queries so far. After the 100th query, the cache contents are fixed and never changed for the duration of the remaining workload. The cost of computing the materialized views is not added to the execution cost of the workload. In order to evaluate the absolute benefits and competitiveness of the algorithms considered, we also consider the following baseline approaches:
NoCache: Queries are run assuming that there is no cache. This gives an upper bound on
the running time of any well-behaved caching algorithm.
InfCache: The purpose of this simulation is to give a lower bound on the running time of
any caching algorithm. We assume an infinite cache and do not include the materialization cost. Each new result is computed and cached the first time it occurs, and reused whenever it occurs later.
[Figure 4.5: Performance on 900 Query CubePoints/Zipf-0.5 Workload. Estimated cost (seconds) versus cache size (% of DB size) for the nine algorithms.]

2. Merit of dynamic intermediate result caching over static result caching, for moderately and highly skewed workloads.
3. Merit of the cost-benefit based approach over simpler policies like LCS/LRU.
4. Merit of keeping the cache full by caching additional results in case the results selected by greedy do not fill up the entire cache (as in Exchequer/FullCache) over caching only the results selected by greedy (as in Exchequer/NoFullCache).
5. Whether the overheads incurred by Exchequer/FullCache are acceptable.
We experiment with different cache sizes, corresponding to roughly 0%, 8%, 32%, 64% and 128% of the total database size of approximately 40 MB. For each of these cache sizes, the set of 9 algorithms mentioned in Section 4.5.3 (viz. NoCache, DynaMat, LCS/LRU, WatchMan, Exchequer/FinalRes, Exchequer/FullCache, Exchequer/NoFullCache, Static and InfCache) was executed on the four workloads listed in Section 4.5.1. The results for the CubePoints/Zipf-0.5 and CubePoints/Zipf-2.0 workloads are shown in Figure 4.5 and Figure 4.6 respectively, while the results for CubeSlices/Zipf-0.5 and CubeSlices/Zipf-2.0 are shown in Figure 4.7 and Figure 4.8 respectively.
[Figure 4.6: Performance on 900 Query CubePoints/Zipf-2.0 Workload. Estimated cost (seconds) versus cache size (% of DB size).]
[Figure 4.7: Performance on 900 Query CubeSlices/Zipf-0.5 Workload. Estimated cost (seconds) versus cache size (% of DB size).]
[Figure 4.8: Performance on 900 Query CubeSlices/Zipf-2.0 Workload. Estimated cost (seconds) versus cache size (% of DB size).]
Effect of Intermediate Result Caching. For all the four workloads, DynaMat, WatchMan and Exchequer/FinalRes, which cache only the full query results, perform very poorly. This is because though there is a large amount of overlap among the queries in each workload, there is hardly any repetition of the same query. In fact, because of the select predicates involving the set S (ref. Section 4.5.1), the subsumption possibilities among the results (that can be exploited by these algorithms) are minimal. The importance of intermediate result caching can be gauged by the fact that even Static, which maintains a fixed set of intermediate results, consistently performs far better than these algorithms. This is because the intermediate results cached by Static, though fixed, can be used by a greater number of queries in the workload. This clearly demonstrates the substantial improvement in performance that can be achieved using intermediate result caching.
Effect of Dynamic Caching. We now compare the performance of Static with that of the algorithms which dynamically maintain the cached results, viz. LCS/LRU, Exchequer/NoFullCache and Exchequer/FullCache.
Recall that Static builds up the cache contents using the query distribution of the first 100 queries, and keeps them fixed for the duration of the remaining 900 queries. However, each of the workloads changes its skew after every 128 queries, making the caching decisions of Static mostly ineffective. Naturally, therefore, we find that these dynamic intermediate result caching algorithms consistently perform much better than Static for all the workloads considered, with the sole exception of CubeSlices/Zipf-0.5. In the case of CubeSlices/Zipf-0.5, Static performs better than LCS/LRU for the whole range of cache sizes considered. This is because the CubeSlices/Zipf-0.5 workload contains large intermediate results with high benefit due to subsumption. While Static caches these results, LCS/LRU does not, because of its bias against larger results. Surprisingly, for small cache sizes on the CubeSlices/Zipf-0.5 workload, even Exchequer/NoFullCache and Exchequer/FullCache perform worse than Static. This is because for small cache sizes, these large cached results lead to significant overheads due to their repeated materialization and disposal in the dynamic algorithms, and the fixed caching approach of Static holds an advantage. However, for larger cache sizes, Exchequer/NoFullCache and Exchequer/FullCache are able to keep these larger results in the cache longer, leading to the sharp gain in performance over Static. Thus, overall, we conclude that dynamic intermediate result caching can lead to large improvements over static caching. For consistent behaviour, however, it is important that the intermediate result caching policy be intelligent, taking into account the cost versus benefit of caching the results, unlike LCS/LRU. This is further discussed next.
Need for Cost-Benefit Based Algorithms. We now compare the sophisticated approach of Exchequer/FullCache with the much simpler approach of LCS/LRU. We find that while Exchequer/FullCache performs very well for all the four workloads, the relative performance of LCS/LRU varies from very good to poor (even worse than Static), depending markedly upon the distribution of the intermediate results (ref. Figure 4.4) and the skew of the workload. On the CubePoints workloads (both Zipf-0.5 and Zipf-2.0), LCS/LRU performs extremely well; in fact its performance is close to that of Exchequer/FullCache for this workload. This
is because the size of the intermediate results in these workloads is small; moreover, because the predicates are exclusively equalities, subsumption plays little role and therefore larger results have small benefit given the space they occupy. Thus, on these workloads, the LCS/LRU strategy of preferentially caching smaller results pays off well, and the advantage due to the occasional high-benefit larger results cached by Exchequer/FullCache is not much. Thus, for workloads having small intermediate results and low subsumption opportunities, the benefits offered by the more sophisticated Exchequer/FullCache over the much simpler LCS/LRU are modest. On the CubeSlices workloads, however, Exchequer/FullCache performs much better than LCS/LRU. This is because, due to subsumption, the larger results have a higher benefit, but LCS/LRU preferentially maintains smaller results in the cache. LCS/LRU works on the assumption that smaller intermediate results have high benefit. In the cases when this assumption is satisfied, LCS/LRU performs almost as well as Exchequer/FullCache. However, in case this assumption does not hold and larger intermediate results have greater benefit, LCS/LRU does not perform well. Exchequer/FullCache explicitly takes into account the costs and benefits of intermediate results while making the caching decisions and, unlike LCS/LRU, does not rely on an ad-hoc rule. This makes it much less sensitive to the size of intermediate results, and it performs much better than the other, earlier algorithms on all the four workloads. Thus, at the cost of the extra sophistication, Exchequer/FullCache gives a performance that is not only better, but is much more stable than that given by the simpler LCS/LRU.
The two variants of the basic Exchequer algorithm, Exchequer/NoFullCache and Exchequer/FullCache, differ in the decision about whether or not to make extra investments by caching additional results in the part of the cache that may remain unfilled after all the results selected by Greedy are cached; this extra space is managed using LCS/LRU. Exchequer/FullCache makes this investment expecting to benefit in the future due to having more results in the cache. On the other hand, Exchequer/NoFullCache is more conservative and does not make this investment. Our results show that Exchequer/FullCache benefits significantly in performance over
Exchequer/NoFullCache by making use of the extra cache space. There are instances when the investment does not pay off, as in the case of CubePoints/Zipf-0.5 for the cache size of 128%, and the performance actually deteriorates. But this occasional loss is negligible as compared to the benefits obtained, as can be seen by comparing the graphs of Exchequer/FullCache and Exchequer/NoFullCache for all the four workloads. It may be argued that since Exchequer/NoFullCache selects results for caching after carefully weighing their benefits against their costs, the extra benefit due to caching additional results should be minimal. However, the accuracy of these benefits depends on how accurately the past workload estimates the future workload (ref. Section 4.2). In the face of sudden changes in the workload skew (recall that each of our workloads changes skew after a block of 128 queries), the estimate may be inaccurate for a certain transient period. During this period, therefore, the benefit may not be accurate. Caching additional results reduces the impact of such occasional inaccuracies, and makes the caching policy more stable.
To evaluate the overheads of the Exchequer algorithm, we determined the space taken by the CDAG during the execution of the algorithm; recall that the CDAG includes the best plans for the 10 queries in the representative set, the expanded DAG for the current query, and the best plans for the results currently in the cache. For the run of Exchequer/FullCache on the CubeSlices/Zipf-2.0 workload, the maximum size of the CDAG was approximately 23 MB of memory, and was independent of the cache size. The time taken by Exchequer/FullCache depends on the cache size, since the Greedy algorithm (ref. Section 4.3) chooses results only till their size does not exceed the cache size. The table below shows the average optimization times and estimated execution costs per query for Exchequer/FullCache on the 900 query CubeSlices/Zipf-2.0 workload for different cache sizes; the corresponding numbers for the other workloads are similar.
Cache Size (% of DB Size):               0%      8%     32%     64%    128%
Avg. Optimization Time/Query (secs):    0.16    1.01    1.18    1.22    1.05
Avg. Estimated Cost/Query (secs):      16.95   10.92    8.26    7.00    6.45
The average optimization time per query is an order of magnitude less than the execution cost of the workload (the ratio can be expected to be even smaller on datasets larger than TPC-D at scale factor 0.1), thus showing that the optimization of queries and cache management in Exchequer has negligible overhead.
4.6 Extensions
We have developed several extensions of our techniques, which we outline below.

We implemented a version of the Exchequer algorithm with periodic reorganization, which is similar to revolution [31]. This involved invoking Greedy with the candidate set containing all the results in the best plan of each query in the representative set. However, for reasonably complex queries involving joins this leads to a large candidate set, and thus the reorganization step is very expensive. In many cases, this led to poor gains at a high cost. Therefore, we abandoned this strategy.

The Exchequer system described in this chapter supports only disk caching. However, the techniques described can be extended to main-memory caching and hybrid (disk cum main-memory) caching. A main-memory caching system contains a fixed-size area in memory allocated as the cache. The modification is restricted to the cost model: there is no I/O overhead for caching results or for using them; the techniques as presented in this chapter remain unchanged. A hybrid caching system contains (a) a fixed-size area in memory allocated as the main-memory cache, as well as (b) a fixed-size area on disk allocated as the disk cache. We modify the Greedy algorithm to work in two phases, as sketched below: the first phase fills up the main-memory cache, while the second phase fills up the disk cache, choosing results from those that remain in the candidate set after the first phase is over. The two phases are identical in all respects, except that results in the first phase are chosen using the main-memory based cost model (no I/O overhead for caching or use of cached results), while the results in the second phase are chosen using the disk based cost model (the same as considered in this chapter).
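The two-phase variant can be sketched as follows, reusing the greedy selection routine; CostModel, greedyCacheSelect and the budget parameters are hypothetical names, and the sketch assumes the greedy routine accepts the cost model as a knob.

#include <algorithm>
#include <vector>

struct Result;
enum class CostModel { MainMemory, Disk };
// The greedy selection of Figure 4.3, parameterized by the cost model.
std::vector<const Result*> greedyCacheSelect(std::vector<const Result*> C,
                                             double budget, CostModel model);

void hybridSelect(std::vector<const Result*> C,
                  double memBudget, double diskBudget,
                  std::vector<const Result*>& memSet,
                  std::vector<const Result*>& diskSet) {
    // Phase 1: fill the main-memory cache (no I/O cost for caching or reuse).
    memSet = greedyCacheSelect(C, memBudget, CostModel::MainMemory);
    // Remove the phase-1 picks from the candidate set.
    for (const Result* r : memSet)
        C.erase(std::remove(C.begin(), C.end(), r), C.end());
    // Phase 2: fill the disk cache with the remaining candidates.
    diskSet = greedyCacheSelect(C, diskBudget, CostModel::Disk);
}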
4.7 Summary
In this chapter we have presented new techniques for query result caching, which can help speed up query processing in data warehouses. The novel features incorporated in our Exchequer system include optimization-aware cache maintenance and the use of a cache-aware optimizer. In contrast, in existing work, the module that makes the cost-benefit decisions is part of the cache manager and works independently of the optimizer, which essentially reconsiders these decisions while finding the best plan for a query. Whereas existing approaches are either restricted to cube (slice/point) queries, or cache just the query results, our work presents a data-model independent framework and algorithm. Our experimental results attest to the efficacy of our cache management techniques.
Consider a materialized view defined over a set of base relations, and a set of tuples to be inserted into one of those relations. The view can be brought up to date either by recomputing it in its entirety, or by computing the differentials of the view with respect to the update and merging them into the stored result, one of which may be substantially cheaper to compute. Further, in some cases the view may be best maintained by recomputing it, rather than by finding the differentials as above. Our work addresses the problem of optimizing the maintenance of a set of materialized views. If there are multiple materialized views, as is common, significant opportunities exist for sharing computation between the maintenance of different views. Specifically, common subexpressions between the view maintenance expressions can reduce maintenance costs greatly. Whether or not there are multiple materialized views, significant benefits can be had in many cases by materializing extra views or indices, whose presence can decrease maintenance costs significantly. The choice of what to materialize permanently depends on the choice of view maintenance plans, and vice versa. The choices of the two must therefore be closely coupled to get the best overall maintenance plans.
Contributions. We present techniques for finding efficient maintenance plans for a set of materialized views. Specifically, the contributions are as follows.
1. We show how to exploit transient materialization of common subexpressions to reduce the cost of view maintenance plans. Sharing of subexpressions occurs when multiple views are being maintained, since related views may share subexpressions, and as a result the maintenance expressions may also be shared. Furthermore, sharing can occur even within the plan for maintaining a single view if the view has common subexpressions within itself. The shared expressions could include differential expressions, as well as full expressions which are being recomputed. Here, transient materialization means that these results are materialized during the evaluation of the maintenance plan and disposed of on its completion.
2. We show how to efficiently choose additional expressions for permanent materialization to speed up maintenance of the given views.
Transiently materialized views exist only during the evaluation of the maintenance plan and are discarded after maintenance of the given views; such transient views themselves need not be maintained. On the other hand, the permanent views are materialized a priori, so there is no (re)computation cost; however, there is a maintenance cost, and a storage cost (which is long term, in that it persists beyond the view maintenance period) due to the permanently materialized views.
The choice of additional views must be done in conjunction with selecting the plans
for maintaining the views, as discussed above. For instance, a plan that seems quite inefficient could become the best plan if some intermediate result of the plan is chosen to be materialized and maintained.
We propose a framework that cleanly integrates the choice of additional views to be transiently or permanently materialized, the choice of whether each of the given set of (user-specified) views must be maintained incrementally or by recomputation, and the choice of view maintenance plans.
5. We have implemented all our algorithms, and present a performance study, using queries from the TPC-D benchmark, showing the practical benefits of our techniques.
Our contributions go beyond the existing state of the art in several ways:
1. Earlier work on selecting views for materialization addresses either transient view selection (for multi-query optimization, but not for view maintenance) without considering permanent view selection, or permanent view selection without considering transient view selection. Neither approach is integrated with the choice of view maintenance plans. To the best of our knowledge, ours is the first work that addresses the above aspects simultaneously, taking into account the intricate interdependence of the decisions. Making the decisions separately may lead to a non-optimal choice. See Section 5.1 for more details of related work. Moreover, as far as we know, the problem of automatically selecting the optimum maintenance policy for a materialized view in the presence of other materialized views has not been addressed previously.
2. Earlier work on transient materialization (done in the context of multi-query optimization) is not coupled with view maintenance. While those algorithms can be used directly on view maintenance expressions to decide on transient view materialization, using them naively would lead to very poor performance. We show how to integrate view maintenance choices into an optimizer in a way that leads to very good performance.
3. We have shown the practicality of our work by implementing all our algorithms and presenting a performance study illustrating the benefits to be had by using our techniques. Earlier work does not cover efficient techniques for the implementation of materialized view selection algorithms. Moreover, our implementation is built on top of an existing state-of-the-art query optimizer, showing the practicality of using our techniques on existing database systems. Our performance study, detailed in Section 5.6, shows that significant benefits, often by factors of 2 or more, can be obtained using our techniques.
Although the focus of our work is to speed up view maintenance, and we assume an initial set of views has been chosen to be materialized, our algorithms can also be used to choose extra materialized views to speed up a workload containing queries and updates.
Chapter Organization. Related work is outlined in Section 5.1. Section 5.2 gives an overview of the techniques presented in this chapter. Section 5.3 describes our system model, and how the search space of the maintenance plans is set up. Section 5.4 shows how to compute the optimal maintenance cost for a given set of permanently materialized views, and a given set of views to be transiently materialized during the maintenance. Section 5.5 describes a heuristic that uses this cost calculation to determine the set of views to be transiently or permanently materialized so as to minimize the overall maintenance cost. Section 5.6 outlines the results of a performance study, and Section 5.7 presents a summary of the chapter.
View Maintenance
Among the earliest work on the maintenance of materialized view expressions was that of Blakeley et al. [3]. More recent work in this area includes [24, 12, 37, 36] and [48]. Gupta and Mumick [25] provide a survey of view maintenance techniques. Vista [61] describes how to extend the Volcano query optimizer to compute the best maintenance plan, but does not consider the materialization of expressions, whether transient or permanent. [42] and [61] propose optimizations that exploit knowledge of foreign key dependencies to detect that certain join results involving differentials will be empty. Such optimizations are orthogonal and complementary to our work.
Transiently Materialized View Selection (Multi-Query Optimization). Blakeley et al. [3] and Ross et al. [44] noted that the computation of the expression differentials has the potential for benefiting from multi-query optimization. In the past, multi-query optimization was viewed as too expensive for practical use; as a result, they did not go beyond stating that multi-query optimization could be useful for view maintenance. Early work on multi-query optimization includes [54, 56, 53]. More recently, [59] and [47] (Chapter 3 of this thesis) considered how to perform multi-query optimization by selecting subexpressions for transient materialization, and showed that multi-query optimization is practical and can give significant performance benefits at acceptable cost. However, none of the work on multi-query optimization considers updates or view maintenance, which is the focus of this chapter. Using these techniques naively on differential maintenance expressions would be very expensive, since incremental maintenance expressions can
be very large. We utilize the optimizations proposed in Chapter 3, but significant extensions are required to take update costs into account, and to efficiently optimize view maintenance expressions.
Permanently Materialized View Selection. There has been much work on the selection of views
to be materialized. One notable early work in this area was by Roussopoulos [45]. Ross et al. [44] considered the selection of extra materialized views to optimize the maintenance of other materialized views/assertions, and mention some heuristics. Labio et al. [34] provide further heuristics. The problem of materialized view selection for data cubes has seen much work, such as [29], which proposes a greedy heuristic for the problem. Gupta [26] and Gupta and Mumick [28] extend some of these ideas to a wider class of queries. The major differences between our work and the above work on materialized view selection can be summarized as follows:
1. Earlier work in this area has not addressed the optimization of view maintenance plans in the presence of other materialized views. Earlier work simply assumes that the cost of view maintenance for a given set of materialized views can be computed, without providing any details.
2. Earlier work does not consider how to exploit common subexpressions by temporarily materializing them, because of its focus on permanent materialization. In particular, common subexpressions involving differential relations cannot be permanently materialized.
3. Earlier work does not cover efficient techniques for the implementation of materialized view selection algorithms, and their integration into state-of-the-art query optimizers. Showing how to do the above is amongst our important contributions.
We extend the Query DAG representation (ref. Chapter 2), which represents just the space of recomputation plans, to include the space of incremental plans as well. This new extension uses propagation-based differential generation, which propagates the effect of one delta relation at a time in a predefined order. Our approach has a lower space cost of optimization as compared to using incremental view maintenance expressions, and is easier to implement. Propagation-based differential generation is explained in Section 5.3.2, and the extended Query DAG generation is explained in Section 5.3.3.
2. Choosing the Policy for Maintenance and Computing the Cost of Maintenance. We show how to compute the minimum overall maintenance cost of the given set of permanently materialized views, given a fixed set of additional views to be transiently materialized. In addition to computing the cost, the proposed technique generates the best consolidated maintenance plan for the given set of permanently materialized views. The maintenance plan chosen for each materialized view can be incremental or recomputation, based on costs. Maintenance cost computation is explained in Section 5.4.
3. Transient/Permanent Materialized View Selection. Finally, we address the problem of determining the respective sets of transient and permanently materialized views that minimize the overall cost. Our technique uses, as a subroutine, the previously mentioned technique for computing the best maintenance policy given fixed sets of permanently and temporarily materialized views. The costs of materialization of transiently materialized views and maintenance of permanently materialized views are taken into account by this step. We propose a greedy heuristic that iteratively picks views in order of benefit, where benefit is defined as the decrease in the overall maintenance cost if this view is transiently or permanently materialized.
Updates to the database relations are collected into delta relations, which are made available to the view refresh mechanism; for each relation R, there are delta relations denoting the sets of tuples inserted into and deleted from the relation R. The maintenance expressions in our examples assume that the old value of the relation is available, but we can use maintenance expressions based on the new values of the relations in case the updates have already been performed on the base relations.
We assume that the given set of materialized views is refreshed at times chosen by users, which are typically regular intervals. For optimization purposes, we need estimates of the sizes of these delta relations. In production environments, the rates of changes are usually stable across
refresh periods, and these rates can be used to make decisions on what relations to materialize permanently. We will assume that the average insert and delete sizes for each relation are provided as percentages of the full relation size. The insert and delete percentages can be different for different relations. Other statistics, such as number of new distinct values for attributes (in each refresh interval), if available, can also be used to improve the cost estimates of the optimizer.
Consider an expression E = E1 ⋈ E2, and an update to a base relation R. If R is used in neither E1 nor E2, the differential of E with respect to the update is empty. If R is used only in E1, the differential of E is δE1 ⋈ E2, where δE1 denotes the differential of E1; the case where R is used only in E2 is symmetric. If R is used in both E1 and E2, the differential of E is a union of several join expressions involving the differentials and the full results of E1 and E2. The process of computing differentials starts at the bottom, and proceeds upwards, so when we compute the differential of an expression, the differentials of its inputs are already available. The full results are computed when required, if they are not available already (materialized views and base relations are available already). Extending the above technique to operations other than join is straightforward, using standard techniques for computing the differentials of operations, such as those in [3]; see [25] for a survey of view maintenance techniques.
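The case analysis above can be sketched as plan-construction code. The following C++ fragment is illustrative only: Expr and the join/unionAll/uses/delta/full helpers are hypothetical stand-ins for the optimizer's operator constructors, and the both-children case is shown for inserts.

struct Expr;
Expr* join(Expr* left, Expr* right);
Expr* unionAll(Expr* a, Expr* b);
bool  uses(Expr* e, int rel);    // does e mention base relation `rel`?
Expr* delta(Expr* e, int rel);   // differential of e w.r.t. the update on rel
Expr* full(Expr* e);             // the full (pre-update) result of e

Expr* deltaJoin(Expr* e1, Expr* e2, int rel) {
    bool in1 = uses(e1, rel), in2 = uses(e2, rel);
    if (!in1 && !in2) return nullptr;                    // differential is empty
    if (in1 && !in2)  return join(delta(e1, rel), full(e2));
    if (!in1 && in2)  return join(full(e1), delta(e2, rel));
    // rel used in both children: a union of several joins (for inserts).
    return unionAll(unionAll(join(delta(e1, rel), full(e2)),
                             join(full(e1), delta(e2, rel))),
                    join(delta(e1, rel), delta(e2, rel)));
}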
It may appear that computing the change to a view by propagating the differential of one relation at a time restricts the space of plans; in fact, for an update to a relation R, the search space will include plans where every intermediate result includes the differential of R. To illustrate, for a view R ⋈ S ⋈ T with inserts δR into R, the plans (δR ⋈ S) ⋈ T and (δR ⋈ T) ⋈ S would both be among the plans considered, and the cheapest plan is selected. Similarly, if we wish to compute the differential of the view when tuples are inserted into S or into T, analogous alternatives exist for propagating those differentials. Our optimizer's search space includes all of the alternatives for computing the differentials of the view, including the above two, and the cheapest one is chosen for propagating the differential of each base relation. Propagating differentials of only one type (inserts or deletes) to one relation at a time simplifies the choice of a separate plan for each differential propagation. It is straightforward to extend the techniques to permit propagation of inserts and deletes to a single relation together, to reduce the number of different expressions computed. We assume that the updates to the base relations are propagated one relation at a time. After each one is propagated, the base relation is itself updated, and the computed differentials are applied to all incrementally maintained materialized views.¹ We leave unspecified the order in which the base relations are considered. The order is not expected to have a significant effect when the deltas of all the relations are small percentages of the relation sizes: the relation statistics then do not change greatly due to the updates, and thus the costs of the plans should not be affected greatly by the order. For large deltas, our experimental results show that recomputation of the view is generally preferable to incremental maintenance, so the order of incremental propagation is not relevant.
¹ The differentials must be logically applied. The database system can give such a logical view, yet postpone physically applying the updates. By postponing physical application, multiple updates can be gathered and executed at once, reducing disk access costs.
An alternative approach for computing differentials is to generate the entire differential expression, and optimize it (see, e.g. [24]). However, the resultant expression can be very large: exponential in the size of the view expression. For instance, consider the view R ⋈ S ⋈ T with inserts δR, δS and δT on all three relations. The differential in the result of the view can be computed as:

(δR ⋈ S ⋈ T) ∪ (R ⋈ δS ⋈ T) ∪ (R ⋈ S ⋈ δT) ∪ (δR ⋈ δS ⋈ T) ∪ (δR ⋈ S ⋈ δT) ∪ (R ⋈ δS ⋈ δT) ∪ (δR ⋈ δS ⋈ δT)

There are many common subexpressions in the above expression, and the above expression could be simplified by factoring, to get:

(δR ⋈ S ⋈ T) ∪ ((R ∪ δR) ⋈ δS ⋈ T) ∪ ((R ∪ δR) ⋈ (S ∪ δS) ⋈ δT)

This simplified expression is equivalent in effect to our technique for propagating differentials. Creating differential expressions (whether in the unsimplified or in the simplified form) is difficult with more complex expressions containing operations other than join (see, e.g. [24]). Moreover, the size of the unsimplified expression is exponential in the number of relations. Optimizing such large expressions can be quite expensive, since query optimization is exponential in the size of the expression. In contrast, the process of propagating differentials can be expressed purely in terms of how to compute the differentials for individual operations, given the differentials of their inputs. As a result it is also easy to extend the technique to new operations.
For each equivalence node e in the Query DAG, we introduce additional equivalence nodes corresponding to the differentials of e with respect to the updates on the base relations, where the node for the i-th insert and the node for the i-th delete (for i = 1, …, n) correspond to the differentials of e with respect to those updates respectively. For example, with updates on R and S, the equivalence node for R ⋈ S is refined, with respect to the inserts and deletes on R and S, into four additional equivalence nodes.
Each differential equivalence node has child operation nodes that compute the differential of the expression with respect to the corresponding base relation update. In the example above, consider the equivalence node for the differential of R ⋈ S with respect to the inserts on R: it has a child operation node which is a join operation, and the children of this operation node are the equivalence nodes representing the inserts on R and the full result S. Similarly, the node for the differential with respect to the inserts on S has as its child an operation node which is a join operation, and the children of this operation node are the equivalence nodes for R and the inserts on S. The other nodes are similar in structure.² As can be seen from the above example, the children of a differential equivalence node can be full results as well as differentials. The rationale of this construction was given in Section 5.3.2. As also mentioned in that section, the approach is easily extended to other operations. The equivalence node for an expression represents the full result; but this result varies as successive differentials are merged with it. For cost computation purposes, the system keeps an array of logical properties with each equivalence node, where the first entry is the list of logical properties (such as schema and estimated statistics) of the old result, and the i-th subsequent entry, for i = 1, …, n, is the list of logical properties of the result after it has been merged with the differentials of the first i base relation updates.
Space-Efficient Implementation. It might seem that by including all the differential expressions for each equivalence node, we have increased the size of the Query DAG by a large factor. However, our implementation reduces the cost by piggybacking the differential equivalence and operation nodes on the equivalence and operation nodes in the original Query DAG. These implementation details are explained next; however, for ease of explanation, in the rest of the chapter we stick to the above logical description. For space efficiency, the equivalence nodes for each differential are not created separately in our implementation. Instead, each equivalence node stores an array whose i-th entry logically represents the i-th differential equivalence node, and contains: (a) the logical properties of the differential result, and (b) the best plan for computing it. If the expression does not depend on a relation, or if there is no corresponding update, then the logical properties and best plan ((a) and (b) above) for the corresponding differential are not stored.
² The structure is a little more complicated when a relation is used in both children of a join node, requiring a union of several join operations. The details are straightforward and we omit them for simplicity.
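The piggybacking can be pictured with a structure along the following lines; this C++ sketch is illustrative only, and the type and field names are our assumptions rather than the actual optimizer code.

#include <vector>

struct LogicalProps;   // schema, estimated statistics, etc.
struct Plan;

// One slot per base-relation update; an empty slot (null pointers) means the
// node does not depend on that relation, or there is no corresponding update.
struct DiffSlot {
    LogicalProps* props = nullptr;   // logical properties of the differential
    Plan*         best  = nullptr;   // best plan for computing the differential
};

struct EquivNode {
    LogicalProps* fullProps = nullptr;  // properties of the (old) full result
    Plan*         fullBest  = nullptr;  // best plan for the full result
    std::vector<DiffSlot> diffs;        // piggybacked differential nodes
};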
Maintenance by recomputation corresponds to recomputing the entire result of the node after all updates have been made on the base relations. Only nodes corresponding to entire results (as well as the given views) are considered for materialization; this is because the differentials are only used during view maintenance. The computation cost of the equivalence node e, denoted cost(e), is computed as follows, where children(e) is the set of child operation nodes of e:

cost(e) = min_{o ∈ children(e)} cost(o)    if children(e) ≠ ∅
cost(e) = 0                                if children(e) = ∅ (i.e., e is a relation)

In terms of forming the execution plan, the above equation represents the choice of the operation node with the minimum cost in order to compute the expression corresponding to the equivalence node e. The computation cost of an operation node o, denoted cost(o), is:

cost(o) = cost of executing o + Σ_{e ∈ children(o)} C(e)

where children(o) is the set of input equivalence nodes of o, and where C(e) is the cost of using the result of e: the cost of reusing the stored result if e is materialized, and cost(e) otherwise.
During transient materialization, the view is computed and materialized on the disk for the duration of the maintenance processing. Thus, the cost of transiently materializing a view v, denoted by transcost(v), is:

transcost(v) = cost(v) + matcost(v)

where matcost(v) is the cost of materializing the view v (on disk, assuming materialized views are stored on disk). Incremental maintenance involves computing the differential of v with respect to each base relation update and merging it into the stored result; the cost of computing the i-th differential is cost(δ_i(v)). Let mergecost(δ_i, v) denote the cost of merging the differential corresponding to δ_i with the view v after the differentials corresponding to δ_1, …, δ_{i-1} have already been merged. Then, the cost of incrementally maintaining v, denoted inccost(v), is:

inccost(v) = Σ_{i=1..n} ( cost(δ_i(v)) + mergecost(δ_i, v) )
On the other hand, maintenance by recomputation involves computing the view and materializing it, replacing the old value. The recomputation maintenance cost, denoted by recompcost(v), is:

recompcost(v) = cost(v) + matcost(v)

where matcost(v), as before, is the cost of materializing the view. Notice that recompcost(v) is the same as transcost(v), the cost of transiently materializing v derived above. As such, we do not consider materializing a view permanently and maintaining it using recomputation, unless it was already specified as permanently materialized. For, if recomputation is the cheapest way of maintaining a view, we may as well materialize it transiently: keeping it permanently would not help the next round of view maintenance. Thus, the cost of maintaining the permanently materialized view v, denoted by maintcost(v), is:

maintcost(v) = min(inccost(v), recompcost(v))   if v is one of the views given as materialized in the system
maintcost(v) = inccost(v)                       otherwise
For the views given as materialized in the system, the choice corresponds to selecting the refresh mode: incremental refresh or recomputation. Thus, the total cost incurred in maintaining the materialized views in the set P, given that the views in the set T are transiently materialized, is:

totalcost(P, T) = Σ_{v ∈ P} maintcost(v) + Σ_{v ∈ T} transcost(v)        (5.1)

Given the set of views already materialized in the system, we need to determine the set P of views to be permanently materialized and the set T of views to be transiently materialized, such that totalcost(P, T) is minimized. Since an exhaustive search over all possible sets is impractical, we propose a heuristic greedy algorithm to determine P and T. As mentioned earlier, the optimizer performs a depth-first traversal of the Query DAG structure, applying these formulae at each node, to find the overall cost.
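Equation (5.1) translates directly into code. The following C++ sketch assumes hypothetical maintCost and transCost helpers implementing the formulas derived above.

#include <vector>

struct View;
double maintCost(const View* v);   // maintenance cost of a permanent view
double transCost(const View* v);   // transient materialization cost

// totalcost(P, T) of Equation (5.1)
double totalCost(const std::vector<const View*>& P,
                 const std::vector<const View*>& T) {
    double c = 0.0;
    for (const View* v : P) c += maintCost(v);  // permanently materialized views
    for (const View* v : T) c += transCost(v);  // transiently materialized views
    return c;
}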
Procedure GREEDY
Input: M, the set of equivalence nodes for the initial materialized views
       C, the set of candidate equivalence nodes for materialization
Output: P, set of equivalence nodes to be materialized permanently
        T, set of equivalence nodes to be materialized transiently
Begin
    P = M; T = ∅
    while (C ≠ ∅)
L1:     Pick the node x ∈ C with the highest benefit(x, P, T)
        if (benefit(x, P, T) ≤ 0)
            break; /* No further benefits to be had, stop */
        if (x is a full result and maintcost(x) < transcost(x))
            P = P ∪ {x}
        else
            T = T ∪ {x}
        C = C - {x}
    return (P, T)
End

Figure 5.1: The Greedy Algorithm for Selecting Views for Transient/Permanent Materialization

Using Equation (5.1), the benefit of materializing a node x, given the current sets P and T, is computed as benefit(x, P, T) = totalcost(P, T) - totalcost(P', T'), where P' and T' are the sets obtained by adding x to P or to T as described above.
Figure 5.1 outlines a greedy algorithm that iteratively picks nodes to be materialized. The procedure takes as input the set of candidates (equivalence nodes, and their differentials) for
materialization, and returns the sets P and T of equivalence nodes to be materialized permanently and transiently, respectively. P is initialized to M, the set of equivalence nodes for the initial materialized views, while T is initialized as empty. At each iteration, the equivalence node with the maximum benefit is selected for materialization. If the selected node is a full result, then it is added to either P or T based on whether maintaining it or transiently materializing it would be cheaper; if it is a differential, it is transiently materialized. Naively, the candidate set would contain all the nodes in the Query DAG (full results as well as differentials). In Section 5.5.2, we consider approaches to reduce the candidate set.
5.5.2 Optimizations
Three important optimizations to the greedy algorithm for multi-query optimization were presented in Chapter 3. While the monotonicity optimization applies unchanged, the incremental cost update and the sharability computation need to be extended to handle differentials, as follows.
1. The incremental cost update algorithm presented in Chapter 3 maintains the state of the Query DAG (which includes previously computed best plans for the equivalence nodes) across calls, and may even avoid visiting many of the ancestors of a node whose cost has been modified due to materialization or unmaterialization. We modify the incremental cost update algorithm to handle differentials as follows. (a) If the full result of a node is materialized, we update not only the cost of computing the full result of each ancestor node, but also the costs of the differentials of each ancestor node, since the full result may be used in any of the differentials. Propagation up from an ancestor node can be stopped if there is no change in the cost of computing the full result or any of the differentials. (b) If the differential of a node with respect to a given update is materialized, we update only the differentials of its ancestors with respect to the same update. Propagation can similarly be stopped early when the costs do not change.
2. It is wasteful to transiently materialize nodes unless they are used multiple times during the refresh. An algorithm for computing the sharability of nodes was proposed in Chapter 3; it detects equivalence nodes that can potentially be used multiple times in a single plan. We consider differential results for transient materialization only if the corresponding full result is detected to be sharable. The sharability optimization cannot be applied to full results in our context, since a full result may be worth materializing permanently even if it is used in only one query; thus all full results are candidates for materialization. We also observed that when it is worth transiently materializing the differential of an expression with respect to the update of a particular base relation, it is often worth transiently materializing the differentials with respect to updates of the other base relations as well. To reduce the cost of the greedy algorithm, we consider all differentials of an expression (with respect to different base relation updates) as a single unit of materialization. The number of candidates considered by the greedy algorithm reduces greatly as a result, reducing its execution time significantly.
5.5.3 Extensions
The algorithms we have outlined can be extended in several ways. One direction is to deal with limited space for storing materialized results. To deal with this problem, we can modify the greedy algorithm to prioritize results in order of benefit per unit space (obtained by dividing the benefit by the size of the result). If the space available for permanent and transient materialized results is separate, we can modify the algorithm to continue considering results for permanent (resp. transient) materialization even after the space for transient (resp. permanent) materialization is exhausted. Another direction of extension would be to select materialized views in order to speed up a
workload of queries. The greedy algorithm can be modified for this task as follows: the candidates would be the final/intermediate results of the queries, and the benefits to the queries would be included when computing benefits. In fact, many of the approaches proposed earlier for selecting materialized views use such a greedy approach, and our implementation techniques provide an efficient way to implement these algorithms. Longer term future work would include dealing with large sets of queries efficiently.
Set of Views Workload. A set of 10 views, 5 with aggregates and 5 without, on a total of
8 distinct relations. There is some amount of overlap across these views, but most of the views have selections that are not present in other views, limiting the amount of overlap.
Single Views Workload. The same views as above, but each optimized and executed separately, and we show the sum of the view maintenance times. Since the views are optimized separately, as if they were on separate copies of the database, sharing between views cannot be exploited. The materialized views are shown in Appendix A.2. The purpose of choosing a simple workload in addition to the complex workload is to show that our methods are very effective not only for big sets of overlapping complex views, where one might argue that simple multi-query optimization may be as effective, but also for singleton views without common subexpressions, where a
technique based exclusively on multi-query optimization would be useless. The performance measure is estimated maintenance cost. The cost model used takes into account the number of seeks, the amount of data read, the amount of data written, and the CPU time for in-memory processing. While we would have liked to give actual run times on a real database, we do not currently have a query execution engine which we can extend to perform differential view maintenance. We are working on the translation of the plans into SQL queries that can be run on any SQL database. However, the results would not be as good as if we had fine-grained control, since the translation will split queries into small pieces whose results are stored on disk and then used, resulting in decreased pipelining benefits. Our cost model is fairly sophisticated, and we have verified its accuracy by comparing its estimates with numbers obtained by running queries on commercial database systems. We found close agreement (within around 10 percent) on most queries, which indicates that the numbers obtained in our performance study are fairly accurate. We provide performance numbers for different percentages of updates to the database relations; we assume that all relations are updated by the same percentage. In our notation, a 10% update to a relation consists of inserting 10% as many tuples as are currently in the relation. We assume a TPC-D database at a scale factor of 0.1, that is, the relations occupy a total of 100 MB. The buffer size is set at 8000 blocks, each of size 4KB, for a total of 32 MB, although we also ran some tests at a much smaller buffer size of 1000 blocks. However, the numbers are not greatly affected by the buffer size, and in fact smaller buffer sizes can be expected to benefit more from the sharing of common subexpressions. The tests were run on an UltraSparc 10, with 256 MB of memory.
3. Establish that our methods are indeed practical by showing that the overheads of our optimization-based techniques are reasonable, and that our methods scale with respect to increasing number of views (Section 5.6.2).

[Figure 5.2: Effect of Transient and Permanent Materialization — estimated maintenance cost versus update percentage (0% to 100%), for the Single Views workload (left) and the Set of Views workload (right).]
Effect of Transient and Permanent Materialization

We executed the following variations of our algorithm:
Transient and Permanent. Both transient and permanent materialization of additional results is allowed. This corresponds to the techniques proposed in this chapter.
In all the cases, the maintenance policy of each of the views is decided based on whether recomputation or incremental computation is cheaper, given the constraints in each case as above. The results for the single view workload and the set of views workload are reported in Figure 5.2.

For the single-view workload, transient materialization is not useful if the view maintenance plan used is recomputation; but when incremental computation is used, full results can potentially be shared between the differentials for updates to different base relations. Indeed, we found several such instances at low update percentages. At higher update percentages we found fewer such occurrences, and using only transient materialization did not offer much benefit. However, permanent materialization of intermediate results reduces the overall maintenance cost by up to 50% for smaller update percentages (the smallest update percentage we considered was 1%). These results clearly illustrate the efficacy of the methods proposed in this chapter over and above multi-query optimization (Chapter 3).

The set of views workload has a significant amount of overlap among the constituent views. Thus the substantial reduction, as high as 48%, in the overall maintenance cost due to transient materialization alone is as expected. Permanent materialization has a significant impact in this case also, and further reduces the maintenance cost by up to another 17%, for a total reduction of up to 65%.

Recall from our discussion in Section 5.4 that all additional permanently materialized nodes are always maintained incrementally: if recomputation-based maintenance of these views were cheaper than incremental maintenance, they would have been chosen for transient rather than permanent materialization. Now, the cost of incremental maintenance increases with the size of the updates; for larger updates, recomputation of a permanently materialized view is a better alternative than incremental maintenance, so a smaller fraction of views is permanently materialized. These two facts together account for the slightly decreasing advantage of transient cum permanent materialization over only transient materialization as update percentages increase, as is clear from the convergence of the respective plots in Figure 5.2 for either workload.
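The per-view policy decision above reduces to a straightforward cost comparison. A minimal sketch, assuming hypothetical estimators recomputation_cost(view) and incremental_cost(view, updates) supplied by the optimizer (both already accounting for any shared materialized results):

    def choose_maintenance_policy(view, updates, recomputation_cost, incremental_cost):
        # Pick the cheaper maintenance policy for a single view.
        recompute = recomputation_cost(view)
        incremental = incremental_cost(view, updates)
        if incremental < recompute:
            return "incremental", incremental
        return "recomputation", recompute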
Comparing across the two workloads reveals an interesting result: the cost of maintenance without selecting additional materialized views is less for the set of views than for the single views workload, even though the two involve the same view definitions. The reason is that in the case of the set of views, the maintenance of a view can exploit the presence of the other materialized views, even without selecting additional materialized views. Our optimizer indeed takes such plans into consideration even when it does not select additional materialized views.
We also executed tests on an Only Permanent variant of our algorithms, where permanent materialization is allowed but transient materialization of additional views is disallowed. This corresponds to using only permanent materialized view selection to optimize view maintenance. However, since views for which recomputation is cheaper than incremental maintenance can still be permanently materialized, the only difference from the Transient and Permanent case is that differential results cannot be shared.
For the single view benchmark there is no possibility of sharing differential results, since each query can have only one occurrence of any expression involving a particular differential. For the set of views benchmark, we found that the benefits of materializing differentials were relatively low. Full results are more expensive to compute, and since they can be used with differentials for all relations not used in their definition, they are also shared to a greater degree. As a result, full results were preferentially chosen for materialization; differential results were rarely chosen, and even when chosen gave only small benefits. Thus, in this case too, the plots for Only Permanent were almost identical to the plots for Transient and Permanent. To avoid clutter, we omitted the plots for Only Permanent from our graphs.
To summarize this section: to the best of our knowledge, ours is the first study that demonstrates quantitatively the benefits of materializing extra views (transiently or permanently) to speed up view maintenance in a general setting. Earlier work on selection of materialized views, as far as we are aware, has not presented any performance results except in the limited context of data cubes or star schemas [11].
[Figure 5.3: Effect of Adaptive Maintenance Policy Selection — estimated maintenance cost versus update percentage (0% to 100%), for the Single Views workload (left) and the Set of Views workload (right).]

Effect of Adaptive Maintenance Policy Selection

In current database systems, the user needs to specify the maintenance policy (incremental or recomputation) for a materialized view when defining it. In this section, we show that such an a priori fixed specification may not be a good idea, and make a case for choosing the maintenance policy of a view adaptively. We explored the following variants of our algorithm:
Forced Incremental. All the permanent materialized views, including the views given initially as well as the views picked additionally by greedy, are forced to be maintained incrementally.
Forced Recomputation. Incremental maintenance is disallowed and all the permanent materialized views are forced to be recomputed.
In all the cases, additional transient and permanent materialized views were chosen by executing greedy as described earlier in the chapter. The results of executing the above variants on each of our workloads are plotted in Figure 5.3.

The graphs show that incremental maintenance may be much more expensive than recomputation; the incremental maintenance cost increases sharply for medium to large update percentages, beyond 30% for the single-view workload and beyond 20% for the multi-view workload. In both workloads, the adaptive technique performs better than both forced incremental and forced recomputation; this extra improvement, up to 34% for the single-view workload, is due to its ability to adaptively choose incremental maintenance for some of the initial as well as additionally materialized views, and recomputation for the others, always maintaining a mix that leads to the lowest overall maintenance cost. However, the difference between adaptive and forced recomputation for either workload decreases slightly with increasing update percentage; this is because for large update percentages incremental maintenance is expensive, and hence every view is recomputed.

These observations clearly show that blindly favoring incremental maintenance over recomputation may not be a good idea (this conclusion is similar to the findings of Vista [61]), and make a case for adaptively choosing the maintenance policy for each view, as done by our algorithms. It is also important to note that the ability to mix different maintenance policies for different subparts of the maintenance plan, even for a single view, is novel to our techniques, and not supported by [61].
Overheads and Scalability Analysis

To see how well our algorithms scale with increasing numbers of views, we used the following synthetic workload: a set of relations, each with attributes denoting part id, subpart id and number. Over these relations, we defined a sequence of 10 views V1 to V10: each view Vi was a star query on four relations, with one of the relations joined with each of the other three. We then grouped these views into 10 sets S1 to S10, where the set Si consisted of the views V1, ..., Vi.
[Figure 5.4: Scalability analysis on increasing number of views — memory requirements (left) and optimization time (right) of our algorithm versus the number of views (0 to 10), for the 4-relation star views.]

For each set Si we measured (a) the memory requirements of our algorithm and (b) the time taken by our algorithm, and report both in Figure 5.4. The figure shows that the memory consumption of our algorithm increases practically linearly with the number of views in the set. The reason is that the memory usage goes mainly into maintaining the Query DAG, and for our view set, the increase in the size of the Query DAG is constant per additional view added to the DAG (with a fixed number of base relations). The memory requirement for the view set S10, containing 10 views on a total of 22 relations, is only about 3.2 MB.

Further, addition of a new view from our view set to the Query DAG increases the breadth of the DAG, not its height (we think this is the expected case in reality: most views are expected to be of similar size, with only partial mutual overlap). Since the height remains constant, the time taken per incremental cost update (ref. Section 5.5.2) remains constant. However, the number of these incremental cost updates increases quadratically with the size of the Query DAG, as observed in Chapter 3. This accounts for the quadratic increase in the time spent by our algorithm with increasing number of views, as shown in Figure 5.4. However, despite the quadratic growth, the time spent on the 22-relation, 10-view set was less than a couple of minutes. This is very reasonable for an algorithm that needs to be executed only occasionally, and which provides savings of the order of thousands of seconds on each view refresh.
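The constant-height argument can be seen from a sketch of cost-update propagation (Python). The node fields cost and parents and the method recompute_cost_from_children() are hypothetical stand-ins for the actual Query DAG structures of Section 5.5.2:

    def propagate_cost_update(node):
        # Propagate a changed cost from a node towards the DAG roots.
        # The work per call is bounded by the number of ancestors, i.e. by
        # the height of the Query DAG; if adding views only widens the DAG,
        # the work per update stays constant.
        frontier = [node]
        while frontier:
            n = frontier.pop()
            new_cost = n.recompute_cost_from_children()
            if new_cost != n.cost:       # stop early where nothing changed
                n.cost = new_cost
                frontier.extend(n.parents)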
Thus, we conclude that the memory requirements of our algorithm are reasonable and scale well with increasing number of views. The time taken shows quadratic growth, but this growth is slow enough to keep the algorithm practical for large view sets, especially since the cumulative reduction in maintenance cost across multiple maintenance passes far outweighs the time spent, only once, in executing the algorithm that makes the reduction possible.

Finally, we tested the effect of our optimization of treating all the deltas of an expression as a single unit of materialization instead of considering them separately. We found that this reduced the time taken for greedy optimization by about 30 percent, yet made no difference to the plans generated. However, neither alternative found any significant benefit in materializing delta results, whether as a single unit or separately, for reasons outlined earlier when discussing the effect of Only Permanent. Optimization time can therefore be saved by not considering any deltas as candidates for materialization; we found this reduces optimization times by a further factor of two from those reported in our experiments.
5.7 Summary
The problem of finding the best way to maintain a given set of materialized views is an important practical problem, especially in data warehouses/data marts, where maintenance windows are shrinking. We have presented solutions that exploit commonality between different tasks in view maintenance to minimize the cost of maintenance. Our techniques have been implemented on an existing optimizer, and we have conducted a performance study of their benefits. As shown by the results in Section 5.6, our techniques can generate significant speedups in view maintenance, at an acceptable increase in optimization cost. We therefore believe that our techniques provide a timely and effective solution to a very important practical problem.
We have shown how to exploit common subexpressions to speed up view maintenance, and how to select additional views for materialization to minimize the overall cost of maintenance. These techniques, which are extensions of the core techniques developed in the context of multi-query optimization in Chapter 3, can generate significant speedups in view maintenance at an acceptable increase in optimization cost. Our algorithms are based on the AND/OR Query DAG representation of queries, making them easily extensible to handle new transformations, operators and implementations. Our algorithms also handle index selection and nested queries in a very natural manner. We also developed extensions to the Query DAG generation algorithm proposed for Volcano [23] to detect all common subexpressions and include subsumption derivations. Further, our algorithms are easy to implement on a Volcano-type query optimizer (e.g., the Cascades optimizer of Microsoft SQL Server [22] and the optimizer of the Tandem ServerWare SQL product [6]), requiring the addition of only a few thousand lines of code.
Future Work

Our current work on multi-query optimization (Chapter 3) does not take space constraints into account. While changing our techniques to respect a constraint on the total size of all materialized results is straightforward (use benefit per unit size instead of benefit in the greedy algorithm, as in the case of Query Result Caching), it would be too pessimistic, because it is seldom the case that all the materialized results are in use at the same time. It should instead be possible to schedule the execution such that the first use of a materialized result R2, the point at which R2 gets materialized, follows the last use of another result R1, the point at which R1 can be disposed of; the same disk space can then be used for both R1 and R2. Determining such plans requires an
interleaving of query optimization and scheduling, and promises to be an interesting problem to explore.

Moreover, during query execution, pipelining can be generalized to incorporate multiple consumers (multiple parts of the query that share an intermediate result) without materialization; e.g., the Redbrick data warehouse product allows a scan of a base relation to be shared by multiple consumers. In this thesis, we have assumed that sharing always results in materialization; Dalvi
et al. [14] have extended this work to incorporate shared pipelines. Another follow-up work, by Hulgeri et al. [30], incorporates into our work the issue of allocating memory to individual operators executing in a pipeline. Furthermore, the materialization cost can be eliminated or reduced in some cases by piggybacking the materialization on the actions of an operator that uses the expression. For instance, if an expression is the input to a sort, it can be materialized by simply saving the runs generated during sorting, at no extra cost.

In query result caching, we can compactly represent large workloads by exploiting the fact that many queries (or parts of queries) in a large workload are likely to be identical except for the values of selection constants. We can unify such selections and replace them by a parameterized selection, thereby collapsing many selections into a single parameterized selection that is invoked as many times as the number of selections it replaced. Also, when we run short of cache space, instead of discarding a stored result in its entirety, it should be possible to (a) replace it by a summarization, or (b) discard only parts of the result. We can implement the latter by partitioning selection nodes into smaller selects and replacing the original select by a union of these selects. Two issues in introducing such partitioned nodes are: (a) what partition should we choose? and (b) if the top-level operator is not a select, we can still choose an attribute to partition on, but which one should it be?

An important direction of future work is to take updates into account in query result caching, thus integrating the techniques developed in Chapter 4 and Chapter 5. We need to develop techniques for (a) taking update frequencies into account when deciding whether to cache a particular result, and (b) deciding when and whether to discard or refresh cached results. We could refresh cached results eagerly as updates happen, or update them lazily, when they are accessed. Another aspect of the integration could be to take the query workload into account, apart from the materialized views, in order to determine what additional views to materialize. Finally, Query DAG generation can be extended to include query splitting [15] as well.
For example, a query that subsumes an already materialized result can sometimes be evaluated by splitting it in the Query DAG into the materialized part and a separately computed remainder, and taking the union of the two. However, plans of this form introduce nodes that do not satisfy the assumptions our Query DAG representation currently makes, and we are currently working on extending Query DAG generation to support such split plans.
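A hedged illustration of such a split, with predicates invented purely for the example: if $\sigma_{A<5}(R)$ is cached and a query asks for $\sigma_{A<10}(R)$, then

    \sigma_{A<10}(R) \;=\; \sigma_{A<5}(R) \;\cup\; \sigma_{5 \le A < 10}(R)

so the cached result answers part of the query, and only the remainder $\sigma_{5 \le A < 10}(R)$ must be computed afresh.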
Q3
SELECT O_SELKEY
FROM CUSTOMER, ORDERS, LINEITEM
WHERE C_SELKEY = 1
Q5
SELECT MAX(O_SELKEY)
FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION
WHERE C_CUSTKEY = O_CUSTKEY
  AND O_ORDERKEY = L_ORDERKEY
  AND L_SUPPKEY = S_SUPPKEY
  AND C_NATIONKEY = S_NATIONKEY
  AND S_NATIONKEY = N_NATIONKEY
  AND N_REGIONKEY = R_REGIONKEY
  AND R_REGIONKEY = 1
  AND O_SELKEY < 5
GROUP BY N_NATIONKEY;
Q7
SELECT S_SUPPKEY
FROM SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION, NATION1
WHERE S_SUPPKEY = L_SUPPKEY
  AND O_ORDERKEY = L_ORDERKEY
  AND C_CUSTKEY = O_CUSTKEY
  AND S_NATIONKEY = NATION.N_NATIONKEY
  AND C_NATIONKEY = NATION1.N1_NATIONKEY
  AND ((NATION.N_NATIONKEY = 1 AND NATION1.N1_NATIONKEY = 2)
    OR (NATION.N_NATIONKEY = 2 AND NATION1.N1_NATIONKEY = 1))
  AND L_SELKEY > 16;
Q8
SELECT P_PARTKEY
FROM PART, SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION, NATION1, REGION
WHERE P_PARTKEY = L_PARTKEY
  AND S_SUPPKEY = L_SUPPKEY
  AND L_ORDERKEY = O_ORDERKEY
  AND O_CUSTKEY = C_CUSTKEY
  AND C_NATIONKEY = NATION.N_NATIONKEY
  AND NATION.N_NATIONKEY = R_REGIONKEY
  AND R_REGIONKEY = 2
  AND S_NATIONKEY = NATION1.N1_NATIONKEY
  AND O_SELKEY > 16
  AND P_SELKEY < 3;
Q9

SELECT P_SELKEY
FROM PART, SUPPLIER, LINEITEM, PARTSUPP, ORDERS, NATION
WHERE S_SUPPKEY = L_SUPPKEY
  AND PS_SUPPKEY = L_SUPPKEY
  AND PS_PARTKEY = L_PARTKEY
  AND P_PARTKEY = L_PARTKEY
  AND O_ORDERKEY = L_ORDERKEY
  AND S_NATIONKEY = N_NATIONKEY
  AND P_SELKEY > 251;
Q10
SELECT MIN(CUSTOMER.C_CUSTKEY)
FROM CUSTOMER, ORDERS, LINEITEM, NATION
WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY
  AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY
  AND ORDERS.O_SELKEY = 1
  AND LINEITEM.L_SELKEY < 7
  AND CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY
GROUP BY CUSTOMER.C_CUSTKEY, NATION.N_NATIONKEY;
Q11
SELECT MIN(PARTSUPP.PS_SUPPKEY)
FROM PARTSUPP, SUPPLIER, NATION
WHERE PARTSUPP.PS_SUPPKEY = SUPPLIER.S_SUPPKEY
  AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY
  AND NATION.N_NATIONKEY = 7
GROUP BY PARTSUPP.PS_PARTKEY;
SELECT PARTSUPP.PS_SUPPKEY
FROM PARTSUPP, SUPPLIER, NATION
WHERE PARTSUPP.PS_SUPPKEY = SUPPLIER.S_SUPPKEY
  AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY
  AND NATION.N_NATIONKEY = 7;
Q14
SELECT LINEITEM.L_PARTKEY
FROM LINEITEM, PART
WHERE LINEITEM.L_PARTKEY = PART.P_PARTKEY
  AND LINEITEM.L_SELKEY = 20;
SELECT MIN(CUSTOMER.C_SELKEY)
FROM CUSTOMER, ORDERS, LINEITEM
WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY
  AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY
GROUP BY CUSTOMER.C_CUSTKEY
HAVING CUSTOMER.C_CUSTKEY > 2;
SELECT MIN(PARTSUPP.PS_SUPPKEY)
FROM PARTSUPP, SUPPLIER, NATION
WHERE PARTSUPP.PS_SUPPKEY = SUPPLIER.S_SUPPKEY
  AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY
  AND NATION.N_NATIONKEY = 7;
SELECT MIN(PARTSUPP.PS_SUPPLYCOST)
FROM PARTSUPP, PART, LINEITEM, ORDERS
WHERE PARTSUPP.PS_PARTKEY > 10
  AND PART.P_PARTKEY = PARTSUPP.PS_PARTKEY
  AND LINEITEM.L_PARTKEY = PARTSUPP.PS_PARTKEY
  AND ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY
GROUP BY PART.P_PARTKEY;
SELECT PARTSUPP.PS_SUPPLYCOST
FROM PARTSUPP, LINEITEM, ORDERS
WHERE PARTSUPP.PS_PARTKEY > 10
  AND LINEITEM.L_PARTKEY = PARTSUPP.PS_PARTKEY
  AND ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY;
SELECT PARTSUPP.PS_SUPPLYCOST
FROM PART, SUPPLIER, PARTSUPP, NATION, REGION
WHERE PART.P_PARTKEY = PARTSUPP.PS_PARTKEY
  AND SUPPLIER.S_SUPPKEY = PARTSUPP.PS_SUPPKEY
  AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY
  AND NATION.N_REGIONKEY = REGION.R_REGIONKEY;
SELECT PARTSUPP.PS_SUPPLYCOST
FROM PART, PARTSUPP, LINEITEM, SUPPLIER
WHERE PART.P_PARTKEY = PARTSUPP.PS_PARTKEY
  AND SUPPLIER.S_SUPPKEY = PARTSUPP.PS_SUPPKEY
  AND LINEITEM.L_PARTKEY = PARTSUPP.PS_PARTKEY;
SELECT PARTSUPP.PS_SUPPLYCOST
FROM PARTSUPP, PART, LINEITEM, ORDERS
WHERE PARTSUPP.PS_PARTKEY > 10
  AND PART.P_PARTKEY = PARTSUPP.PS_PARTKEY
  AND LINEITEM.L_PARTKEY = PARTSUPP.PS_PARTKEY
  AND ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY;
SELECT PARTSUPP.PS_SUPPLYCOST
FROM PART, SUPPLIER, PARTSUPP
WHERE PART.P_PARTKEY = PARTSUPP.PS_PARTKEY
  AND SUPPLIER.S_SUPPKEY = PARTSUPP.PS_SUPPKEY;
The sections below give, for each operator, the formula for its I/O cost (in milliseconds).
readtime (ms)
writetime (ms)
seektime (ms)
index fanout
size of a block in kilobytes
CPU speed in MIPS
available main memory (number of blocks)
average number of instructions executed per byte of data: 5

Figure C.1: Constants

size of the input (number of blocks)
size of the input (number of tuples)
size of the output (number of blocks)
size of the output (number of tuples)
number of distinct values in the input

Figure C.2: Cost Formulae Parameters
Result Materialization. Each block of the relation is processed and written once.
Sort. In-memory sort if the relation fits in main memory; otherwise, merge sort with the fan-in determined by the available memory.
Clustered Index Creation on Sorted Relation. The input is already sorted on the relevant attribute. The index B-Tree is created bottom-up; the cost depends on the size of the clustered index (in number of blocks).

Clustered Index Creation on Unsorted Relation. The input is first sorted, and the index B-Tree is then created bottom-up. The overall cost is the total of the sorting cost and the index creation cost.
Selection. The input streaming in is filtered using the predicate, and the result is streamed out. No I/O occurs.
Merge Join. Both the inputs are streaming in already sorted. We introduce an arbitrary factor of 2 to account for merge processing costs per block of output.
Nested Loops Join. Since the inputs are streaming in, we do not pay the read cost for the outer relation. If both inputs are smaller than the available memory, the join occurs in memory without any need for I/O.
Indexed Nested Loops Join. Input 0 is the probe and input 1 is indexed on the join attribute. The total number of block accesses is adjusted to an effective number of block accesses that takes buffering into account.
Hashing based Aggregation. We assume hybrid hashing, with half of the available buffers used in the hybrid portion.
Sort based Aggregation. The input is streaming in sorted, so no I/O is involved.
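As an illustration of how one such formula fits together, the following sketch (Python) gives textbook-style block nested loops accounting under the streaming assumption stated above; it is an illustrative reconstruction, not necessarily the exact formula used by our cost model, and the default constants are placeholders:

    import math

    def nested_loops_join_io_ms(outer_blocks, inner_blocks, mem_blocks,
                                seek_ms=10.0, read_ms=2.0):
        # The outer input streams in, so its read cost is not charged here.
        if outer_blocks + inner_blocks <= mem_blocks:
            return 0.0  # both inputs fit in memory: no I/O at all
        # The inner relation is scanned once per batch of outer blocks that
        # fits in memory (two blocks reserved for input/output buffers).
        batches = math.ceil(outer_blocks / max(mem_blocks - 2, 1))
        return batches * (seek_ms + inner_blocks * read_ms)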
References

[1] Agrawal, S., Chaudhuri, S., and Narasayya, V. Automated selection of materialized views and indexes in Microsoft SQL Server. In Intl. Conf. Very Large Databases (2000).

[2] Ashwin, S., Roy, P., Seshadri, S., and Sudarshan, S. Garbage collection in object-oriented databases using transactional cyclic reference counting. In Intl. Conf. Very Large Databases (1997).

[3] Blakeley, J., Coburn, N., and Larson, P.-Å. Updating derived relations: Detecting irrelevant and autonomously computable updates. In Intl. Conf. Very Large Databases (1986).

[4] Blakeley, J. A., McKenna, W. J., and Graefe, G. Experiences building the Open OODB query optimizer. In ACM SIGMOD Intl. Conf. on Management of Data (Washington, DC, 1993), pp. 287-295.

[5] Bobrowski, S. Using materialized views to speed up queries. Oracle Magazine (Sept. 1999). http://www.oracle.com/oramag/oracle/99-Sep/59bob.html.

[6] Celis, P. The query optimizer in Tandem's new ServerWare SQL product. In Intl. Conf. Very Large Databases (1996).

[7] Chaudhuri, S. An overview of query optimization in relational systems. In ACM SIGACT-SIGART-SIGMOD Symposium on Principles of Database Systems (1998).

[8] Chaudhuri, S., Krishnamurthy, R., Potamianos, S., and Shim, K. Optimizing queries with materialized views. In Intl. Conf. on Data Engineering (Taipei, Taiwan, 1995).

[9] Chaudhuri, S., and Narasayya, V. An efficient cost-driven index selection tool for Microsoft SQL Server. In Intl. Conf. Very Large Databases (1997).

[10] Chen, C. M., and Roussopoulos, N. The implementation and performance evaluation of the ADMS query optimizer: Integrating query result caching and matching. In Intl. Conf. on Extending Database Technology (EDBT) (1994).

[11] Colby, L., Cole, R. L., Haslam, E., Jazayeri, N., Johnson, G., McKenna, W. J., Schumacher, L., and Wilhite, D. Redbrick Vista: Aggregate computation and management. In Intl. Conf. on Data Engineering (1998).

[12] Colby, L., Griffin, T., Libkin, L., Mumick, I. S., and Trickey, H. Algorithms for deferred view maintenance. In ACM SIGMOD Intl. Conf. on Management of Data (1996).

[13] Cosar, A., Lim, E.-P., and Srivastava, J. Multiple query optimization with depth-first branch-and-bound and dynamic query ordering. In Intl. Conf. on Information and Knowledge Management (CIKM) (1993).

[14] Dalvi, N., Sanghai, S., Roy, P., and Sudarshan, S. Pipelining in multi-query optimization. Tech. rep., Indian Institute of Technology, Bombay, 2000. Submitted for publication.

[15] Dar, S., Franklin, M. J., Jónsson, B. T., Srivastava, D., and Tan, M. Semantic data caching and replacement. In Intl. Conf. Very Large Databases (1996).

[16] Deshpande, P. M., Ramasamy, K., Shukla, A., and Naughton, J. F. Caching multidimensional queries using chunks. In ACM SIGMOD Intl. Conf. on Management of Data (1998).

[17] Edmonds, J. Optimum branchings. J. Research of the National Bureau of Standards 71B (1967).

[18] Finkelstein, S. Common expression analysis in database applications. In ACM SIGMOD Intl. Conf. on Management of Data (Orlando, FL, 1982), pp. 235-245.

[19] Ganguly, S. Design and analysis of parametric query optimization algorithms. In Intl. Conf. Very Large Databases (New York City, New York, Aug. 1998).

[20] Gassner, P., Lohman, G. M., Schiefer, K. B., and Wang, Y. Query optimization in the IBM DB2 family. Data Engineering Bulletin 16, 4 (1993).

[21] Graefe, G. Query evaluation techniques for large databases. ACM Computing Surveys 25, 2 (1993).

[22] Graefe, G. The Cascades framework for query optimization. Data Engineering Bulletin 18, 3 (1995).

[23] Graefe, G., and McKenna, W. J. The Volcano optimizer generator: Extensibility and efficient search. In Intl. Conf. on Data Engineering (1993).

[24] Griffin, T., and Libkin, L. Incremental maintenance of views with duplicates. In ACM SIGMOD Intl. Conf. on Management of Data (1995).

[25] Gupta, A., and Mumick, I. S. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Engineering Bulletin (Special Issue on Materialized Views and Data Warehousing) 18, 2 (June 1995).

[26] Gupta, H. Selection of views to materialize in a data warehouse. In Intl. Conf. on Database Theory (1997).

[27] Gupta, H., Harinarayan, V., Rajaraman, A., and Ullman, J. D. Index selection for OLAP. In Intl. Conf. on Data Engineering (Birmingham, UK, Apr. 1997).

[28] Gupta, H., and Mumick, I. S. Selection of views to materialize under a maintenance cost constraint. In Intl. Conf. on Database Theory (1999).

[29] Harinarayan, V., Rajaraman, A., and Ullman, J. D. Implementing data cubes efficiently. In ACM SIGMOD Intl. Conf. on Management of Data (Montreal, Canada, June 1996).

[30] Hulgeri, A., Seshadri, S., and Sudarshan, S. Memory cognizant query optimization. In International Conference on Management of Data (COMAD) (2000). To appear.

[31] Kapitskaia, O., Ng, R. T., and Srivastava, D. Evolution and revolutions in LDAP directory caches. In Intl. Conf. on Extending Database Technology (EDBT) (2000).

[32] Keller, A. M., and Basu, J. A predicate-based caching scheme for client-server database architectures. VLDB Journal 5, 1 (1996).

[33] Kotidis, Y., and Roussopoulos, N. DynaMat: A dynamic view management system for data warehouses. In ACM SIGMOD Intl. Conf. on Management of Data (1999).

[34] Labio, W., Quass, D., and Adelberg, B. Physical database design for data warehouses. In Intl. Conf. on Data Engineering (1997).

[35] Larson, P.-Å., and Yang, H. Z. Computing queries from derived relations. In Intl. Conf. Very Large Databases (Stockholm, 1985), pp. 259-269.

[36] Lehner, W., Sidle, R., Pirahesh, H., and Cochrane, R. Maintenance of automatic summary tables in IBM DB2/UDB. In ACM SIGMOD Intl. Conf. on Management of Data (2000).

[37] Mumick, I. S., Quass, D., and Mumick, B. S. Maintenance of data cubes and summary tables in a warehouse. In ACM SIGMOD Intl. Conf. on Management of Data (1997), pp. 100-111.

[38] Park, J., and Segev, A. Using common subexpressions to optimize multiple queries. In Intl. Conf. on Data Engineering (1988).

[39] Pellenkoft, A., Galindo-Legaria, C. A., and Kersten, M. The complexity of transformation-based join enumeration. In Intl. Conf. Very Large Databases (Athens, Greece, 1997), pp. 306-315.

[40] Pirahesh, H., Hellerstein, J. M., and Hasan, W. Extensible/rule based query rewrite optimization in Starburst. In ACM SIGMOD Intl. Conf. on Management of Data (San Diego, 1992), pp. 39-48.

[41] Poosala, V., Ioannidis, Y., Haas, P., and Shekita, E. Improved histograms for selectivity estimation of range predicates. In ACM SIGMOD Intl. Conf. on Management of Data (1996).

[42] Quass, D., Gupta, A., Mumick, I., and Widom, J. Making views self-maintainable for data warehousing. In Intl. Conf. on Parallel and Distributed Information Systems (1996).

[43] Rao, J., and Ross, K. Reusing invariants: A new strategy for correlated queries. In ACM SIGMOD Intl. Conf. on Management of Data (1998).

[44] Ross, K., Srivastava, D., and Sudarshan, S. Materialized view maintenance and integrity constraint checking: Trading space for time. In ACM SIGMOD Intl. Conf. on Management of Data (May 1996).

[45] Roussopoulos, N. View indexing in relational databases. ACM Trans. on Database Systems 7, 2 (1982), 258-290.

[46] Roy, P., Seshadri, S., Sudarshan, S., and Ashwin, S. Garbage collection in object-oriented databases using transactional cyclic reference counting. VLDB Journal 7, 3 (1998).

[47] Roy, P., Seshadri, S., Sudarshan, S., and Bhobe, S. Efficient and extensible algorithms for multi-query optimization. In ACM SIGMOD Intl. Conf. on Management of Data (2000).

[48] Salem, K., Beyer, K., Cochrane, R., and Lindsay, B. How to roll a join: Asynchronous incremental view maintenance. In ACM SIGMOD Intl. Conf. on Management of Data (2000).

[49] Scheuermann, P., Shim, J., and Vingralek, R. WATCHMAN: A data warehouse intelligent cache manager. In Intl. Conf. Very Large Databases (1996).

[50] Scheuermann, P., Shim, J., and Vingralek, R. Dynamic caching of query results for decision support systems. In Intl. Conf. on Scientific and Statistical Database Management (1999).

[51] Selinger, P., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. Access path selection in a relational database management system. In ACM SIGMOD Intl. Conf. on Management of Data (1979), pp. 23-34.

[52] Sellis, T. Intelligent caching and indexing techniques for relational database systems. Information Systems (1988), 175-185.

[53] Sellis, T., and Ghosh, S. On the multiple-query optimization problem. IEEE Transactions on Knowledge and Data Engineering (June 1990), 262-266.

[54] Sellis, T. K. Multiple query optimization. ACM Transactions on Database Systems 13, 1 (Mar. 1988), 23-52.

[55] Seshadri, P., Pirahesh, H., and Leung, T. Y. C. Complex query decorrelation. In Intl. Conf. on Data Engineering (1996).

[56] Shim, K., Sellis, T., and Nau, D. Improvements on a heuristic algorithm for multiple-query optimization. Data and Knowledge Engineering 12 (1994), 197-222.

[57] Shukla, A., Deshpande, P., and Naughton, J. F. Materialized view selection for multidimensional datasets. In Intl. Conf. Very Large Databases (New York City, NY, 1998).

[58] Soukup, R., and Delaney, K. Inside Microsoft SQL Server 7.0. Microsoft Press, 1999.

[59] Subramanian, S. N., and Venkataraman, S. Cost-based optimization of decision support queries using transient views. In ACM SIGMOD Intl. Conf. on Management of Data (1998).

[60] TPC. TPC-D Benchmark Specification, Version 2.1, Apr. 1999.

[61] Vista, D. Integration of incremental view maintenance into query optimizers. In Intl. Conf. on Extending Database Technology (EDBT) (1998).

[62] Yang, H. Z., and Larson, P.-Å. Query transformation for PSJ-queries. In Intl. Conf. Very Large Databases (Brighton, Aug. 1987), pp. 245-254.

[63] Yang, J., Karlapalem, K., and Li, Q. Algorithms for materialized view design in data warehousing environment. In Intl. Conf. Very Large Databases (1997).

[64] Zhao, Y., Deshpande, P., Naughton, J. F., and Shukla, A. Simultaneous optimization and evaluation of multiple dimensional queries. In ACM SIGMOD Intl. Conf. on Management of Data (Seattle, WA, 1998).
Acknowledgements
I thank S. Sudarshan and S. Seshadri for introducing me to the field of databases, and for their continuous enthusiasm, patience and guidance over the last five years. I have been very fortunate to have Sudarshan as my thesis advisor; his appreciation and understanding were necessary to drive things to the finishing line. Many thanks to Krithi Ramamritham for his interest, encouragement and insights; it was a pleasure working with him. I thank D. B. Phatak for inducting me into the fold of IIT-Bombay; but for him, I would have missed a lot. Moreover, I have valued his encouragement and support during my entire stay.

The Informatics Lab at IIT-Bombay is a fun place to work in, thanks to the excellent graduate and undergraduate students working here. I thank all my labmates, past and present, with whom I have had the chance to work during my stay; in particular, P. P. S. Narayan, who taught me a lot about real system development, and Siddhesh Bhobe, Pradeep Shenoy and Hoshi Mistry, who collaborated with me on parts of this thesis. Thanks to fellow Ph.D. students Bharat Adsul and Arvind Hulgeri for their company. I thank Arvind further for our several technical discussions; they helped a lot.

I am grateful to Paul Larson for calling me all the way to Redmond for a summer internship at Microsoft Research, and for giving me a chance to hack into the Microsoft SQL Server code and prototype my ideas; it was a very valuable experience.

This work was supported in part by an IBM Ph.D. fellowship.
Prasan Roy