13
Query Optimization
Practice Exercises
13.1 Show that the following equivalences hold. Explain how you can apply
them to improve the efficiency of certain queries:
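a. $E_1 \bowtie_\theta (E_2 - E_3) = (E_1 \bowtie_\theta E_2 - E_1 \bowtie_\theta E_3)$.
b. $\sigma_\theta({}_A\mathcal{G}_F(E)) = {}_A\mathcal{G}_F(\sigma_\theta(E))$, where $\theta$ uses only attributes from A.
c. $\sigma_\theta(E_1 \bowtie E_2) = \sigma_\theta(E_1) \bowtie E_2$, where $\theta$ uses only attributes from $E_1$.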
Answer:
a. $E_1 \bowtie_\theta (E_2 - E_3) = (E_1 \bowtie_\theta E_2 - E_1 \bowtie_\theta E_3)$.
Let us rename $E_1 \bowtie_\theta (E_2 - E_3)$ as $R_1$, $E_1 \bowtie_\theta E_2$ as $R_2$, and
$E_1 \bowtie_\theta E_3$ as $R_3$. It is clear that if a tuple t belongs to $R_1$, it also
belongs to $R_2$. If a tuple t belongs to $R_3$, then t[$E_3$'s attributes] belongs
to $E_3$, hence t cannot belong to $R_1$. From these two facts we can say that
$\forall t,\ t \in R_1 \Rightarrow t \in (R_2 - R_3)$
It is clear that if a tuple t belongs to $R_2 - R_3$, then t[$E_2$'s attributes] $\in E_2$
and t[$E_2$'s attributes] $\notin E_3$. Therefore:
$\forall t,\ t \in (R_2 - R_3) \Rightarrow t \in R_1$
The above two implications together give the stated equivalence.
This equivalence is helpful because evaluation of the right hand
side join will produce many tuples which will finally be removed
from the result. The left hand side expression can be evaluated more
efficiently.
b. $\sigma_\theta({}_A\mathcal{G}_F(E)) = {}_A\mathcal{G}_F(\sigma_\theta(E))$, where $\theta$ uses only attributes from A.
$\theta$ uses only attributes from A. Therefore if any tuple t in the output
of ${}_A\mathcal{G}_F(E)$ is filtered out by the selection of the left hand side, all the
tuples in E whose value in A is equal to t[A] are filtered out by the
selection of the right hand side. Therefore:
A ∩ B = A ∪ B − (A − B) − (B − A)
Answer:
a. Use the index to locate the first tuple whose building field has value
“Watson”. From this tuple, follow the pointer chains till the end,
retrieving all the tuples.
b. For this query, the index serves no purpose. We can scan the file
sequentially and select all tuples whose building field is anything
other than “Watson”.
c. This query is equivalent to the query:
$\sigma_{\text{building} \geq \text{'Watson'} \wedge \text{budget} < 5000}(\text{department})$.
Using the building index, we can retrieve all tuples with building
value greater than or equal to “Watson” by following the pointer
chains from the first “Watson” tuple. We also apply the additional
criteria of budget < 5000 on every tuple.
13.7 Consider the query:
select *
from r , s
where upper(r.A) = upper(s.A);
where “upper” is a function that returns its input argument with all
lowercase letters replaced by the corresponding uppercase letters.
a. Find out what plan is generated for this query on the database system
you use.
b. Some database systems would use a (block) nested-loop join for this
query, which can be very inefficient. Briefly explain how hash-join
or merge-join can be used for this query.
Answer:
a. First create relations r and s, and add some tuples to the two
relations, before finding the plan chosen; or use existing relations in
place of r and s. Compare the chosen plan with the plan chosen for a
query directly equating r.A = s.A. Check the estimated statistics too.
Some databases may give the same plan, but with vastly different
statistics.
(On PostgreSQL, we found that the optimizer used the merge join
plan described in the answer to the next part of this question.)
b. To use hash join, hashing should be done after applying the upper()
function to r.A and s.A. Similarly, for merge join, the relations should
be sorted on the result of applying the upper() function to r.A and
s.A. The hash or merge join algorithms can then be used unchanged.
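The hash-join idea in part (b) can be sketched as follows. This is a minimal in-memory illustration, assuming rows are modeled as Python dictionaries; the function name `hash_join_upper` and the sample rows are hypothetical and not taken from any database system.

```python
from collections import defaultdict

def hash_join_upper(r, s, key_r, key_s):
    """Hash join on upper(r[key_r]) = upper(s[key_s]) (illustrative sketch)."""
    buckets = defaultdict(list)
    for row in r:                                  # build phase: hash on upper(r.A)
        buckets[row[key_r].upper()].append(row)
    result = []
    for row in s:                                  # probe phase: hash on upper(s.A)
        for match in buckets.get(row[key_s].upper(), []):
            result.append((match, row))
    return result

# Hypothetical sample data
r = [{"A": "abc"}, {"A": "Xyz"}]
s = [{"A": "ABC"}, {"A": "xyZ"}, {"A": "q"}]
pairs = hash_join_upper(r, s, "A", "A")
print(len(pairs))
```

The join algorithm itself is unchanged; only the value fed to the hash function differs. A merge join would analogously sort both inputs on the upper-cased value.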
13.8 Give conditions under which the following expressions are equivalent
where agg denotes any aggregation operation. How can the above
conditions be relaxed if agg is one of min or max?
Answer: The above expressions are equivalent provided E 2 contains only
attributes A and B, with A as the primary key (so there are no duplicates).
It is OK if E 2 does not contain some A values that exist in the result of
E 1 , since such values will get filtered out in either expression. However, if
there are duplicate values in E 2 .A, the aggregate results in the two cases
would be different.
If the aggregate function is min or max, duplicate A values do not have
any effect. However, there should be no duplicates on (A, B); the first
expression removes such duplicates, while the second does not.
13.9 Consider the issue of interesting orders in optimization. Suppose you are
given a query that computes the natural join of a set of relations S. Given
a subset S1 of S, what are the interesting orders of S1?
Answer: The interesting orders are all orders on subsets of attributes that
can potentially participate in join conditions in further joins. Thus, let T
be the set of all attributes of S1 that also occur in any relation in S − S1.
Then every ordering of every subset of T is an interesting order.
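As a small illustration of this answer, the sketch below computes T and enumerates the orderings of its non-empty subsets. It assumes relation schemas are given as a dictionary from relation name to attribute set; `interesting_orders` and the example schemas are hypothetical names introduced here.

```python
from itertools import permutations

def interesting_orders(schemas, s1):
    """Orderings of non-empty subsets of T, where T holds the attributes of S1
    that also occur in some relation of S - S1 (illustrative sketch)."""
    inside = set().union(*(schemas[r] for r in s1))
    outside = set().union(*(schemas[r] for r in schemas if r not in s1))
    t = sorted(inside & outside)
    return [p for k in range(1, len(t) + 1) for p in permutations(t, k)]

# Hypothetical schemas: r1(A,B), r2(B,C), r3(C,D), r4(A,D)
schemas = {"r1": {"A", "B"}, "r2": {"B", "C"}, "r3": {"C", "D"}, "r4": {"A", "D"}}
orders = interesting_orders(schemas, {"r1", "r2"})
print(orders)
```

Here B is excluded from T because it occurs only inside S1, so no later join can use it.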
13.10 Show that, with n relations, there are (2(n − 1))!/(n − 1)! different join
orders. Hint: A complete binary tree is one where every internal node has
exactly two children. Use the fact that the number of different complete
If you wish, you can derive the formula for the number of complete binary
trees with n nodes from the formula for the number of binary trees with
n nodes. The number of binary trees with n nodes is:
$\frac{1}{n+1}\binom{2n}{n}$
This number is known as the Catalan number, and its derivation can be
found in any standard textbook on data structures or algorithms.
Answer: Each join order is a complete binary tree (every non-leaf node
has exactly two children) with the relations as the leaves. The number
of different complete binary trees with n leaf nodes is
$\frac{1}{n}\binom{2(n-1)}{n-1}$. This is because there is a bijection
between complete binary trees with n leaves and binary trees with n − 1
nodes. Any complete binary tree with n leaves has n − 1 internal nodes.
Removing all the leaf nodes, we get a binary tree with n − 1 nodes.
Conversely, any binary tree with n − 1 nodes can be converted to a
complete binary tree by adding n leaves in a unique way. The number of
binary trees with n − 1 nodes is $\frac{1}{n}\binom{2(n-1)}{n-1}$, known as
the Catalan number. Multiplying this by n! for the number of permutations
of the n leaves, we get the desired result:
$n! \cdot \frac{1}{n}\binom{2(n-1)}{n-1} = \frac{(2(n-1))!}{(n-1)!}$
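The counting argument can be checked numerically. The following sketch compares the Catalan-number product with the closed form from the exercise; the function names are illustrative.

```python
from math import comb, factorial

def catalan(m):
    """Number of binary trees with m nodes: C(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)

def join_orders(n):
    """Tree shapes with n leaves, times the permutations of the n relations."""
    return catalan(n - 1) * factorial(n)

def closed_form(n):
    """(2(n - 1))! / (n - 1)! as stated in the exercise."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in range(1, 8):
    print(n, join_orders(n), closed_form(n))
```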
13.11 Show that the lowest-cost join order can be computed in time $O(3^n)$.
Assume that you can store and look up information about a set of relations
(such as the optimal join order for the set, and the cost of that join order)
in constant time. (If you find this exercise difficult, at least show the looser
time bound of $O(2^{2n})$.)
Answer: Consider the dynamic programming algorithm given in
Section 13.4. For each subset having k + 1 relations, the optimal join order
can be computed in time $2^{k+1}$. That is because for one particular pair of
subsets A and B, we need constant time and there are at most $2^{k+1} - 2$
different subsets that A can denote. Thus, over all the $\binom{n}{k+1}$ subsets
of size k + 1, this cost is $\binom{n}{k+1} 2^{k+1}$. Summing over all k from 1
to n − 1 gives the binomial expansion of $(1 + x)^n$ with x = 2, minus its
first two terms $1 + nx$. Thus the total cost is less than $3^n$.
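The bound can be sanity-checked by counting the splits the dynamic program examines. This is a rough sketch that counts, for every subset of at least two relations, the $2^j - 2$ non-empty proper subsets that A can denote; `dp_split_count` is an illustrative name, and constant-time lookup per split is assumed as in the exercise.

```python
from math import comb

def dp_split_count(n):
    """Total (subset, split) pairs examined over all subsets of size j >= 2."""
    return sum(comb(n, j) * (2**j - 2) for j in range(2, n + 1))

for n in range(2, 11):
    print(n, dp_split_count(n), 3**n)
```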
13.12 Show that, if only left-deep join trees are considered, as in the System R
optimizer, the time taken to find the most efficient join order is around
$n2^n$. Assume that there is only one interesting sort order.
Answer: The derivation of time taken is similar to the general case, except
that instead of considering $2^{k+1} - 2$ subsets of size less than or equal to
a. Write a nested query on the relation account to find, for each branch
with name starting with B, all accounts with the maximum balance
at the branch.
b. Rewrite the preceding query, without using a nested subquery; in
other words, decorrelate the query.
c. Give a procedure (similar to that described in Section 13.4.4) for
decorrelating such queries.
Answer:
create table t1 as
select branch_name, max(balance) as balance
from account
group by branch_name;
select account_number
from account, t1
where account.branch_name like 'B%' and
account.branch_name = t1.branch_name and
account.balance = t1.balance;
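The decorrelated form can be compared against a nested formulation on sample data. The sketch below uses Python's built-in sqlite3 module; the sample rows and the nested query shown are assumptions made for illustration (the nested query is one plausible answer to part (a), not text from the source).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table account(account_number text, branch_name text, balance real);
-- hypothetical sample rows
insert into account values
  ('A-1', 'Brighton', 500), ('A-2', 'Brighton', 900),
  ('A-3', 'Bend', 700),     ('A-4', 'Perryridge', 400);
create table t1 as
  select branch_name, max(balance) as balance
  from account group by branch_name;
""")

# Assumed nested (correlated) formulation
nested = conn.execute("""
  select account_number from account a
  where a.branch_name like 'B%' and
        a.balance = (select max(balance) from account b
                     where b.branch_name = a.branch_name)
  order by account_number""").fetchall()

# Decorrelated formulation using the temporary table t1
decorrelated = conn.execute("""
  select account_number from account, t1
  where account.branch_name like 'B%' and
        account.branch_name = t1.branch_name and
        account.balance = t1.balance
  order by account_number""").fetchall()

print(nested == decorrelated)
```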
select · · ·
from L1
where P1 and
A1 op
(select f(A2)
from L2
where P2)
where f is some aggregate function on attributes A2, and op is some
boolean binary operator. It can be rewritten as
create table t1 as
select f(A2), V
from L2
where P21
group by V
select · · ·
from L1, t1
where P1 and P22 and
A1 op t1.A2
where P21 contains predicates in P2 without selections involving
correlation variables, and P22 introduces the selections involving the
correlation variables. V contains all the attributes that are used in
the selections involving correlation variables in the nested query.
13.14 The set version of the semijoin operator ⋉ is defined as follows:
$r \ltimes_\theta s = \Pi_R(r \bowtie_\theta s)$