CSE 544: Optimizations: Wednesday, 5/10/2006
The three components of an optimizer
We need three things in an optimizer:
• Algebraic laws
• An optimization algorithm
• A cost estimator
Algebraic Laws
• Laws involving selection:
σ_{C AND C'}(R) = σ_C(σ_{C'}(R)) = σ_C(R) ∩ σ_{C'}(R)
σ_{C OR C'}(R) = σ_C(R) ∪ σ_{C'}(R)
σ_C(R ⋈ S) = σ_C(R) ⋈ S
• When C involves only attributes of R
σ_C(R − S) = σ_C(R) − S
σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
• Example: R(A, B, C, D), S(E, F, G)
σ_{F=3}(R ⋈_{D=E} S) = ?
σ_{A=5 AND G=9}(R ⋈_{D=E} S) = ?
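One possible answer to the two questions pushes each condition to the relation whose attributes it mentions: F=3 goes to S, and A=5 AND G=9 splits into A=5 on R and G=9 on S. The sketch below checks both rewrites on small made-up instances. The data, and the helpers `select`, `join` and `same`, are illustrative assumptions and not part of the slides.

```python
from itertools import product

# Hypothetical toy instances of R(A, B, C, D) and S(E, F, G); each row is a dict.
R = [{"A": a, "B": 0, "C": 0, "D": d} for a in (5, 6) for d in (1, 2)]
S = [{"E": e, "F": f, "G": g} for e in (1, 2) for f in (3, 4) for g in (9, 10)]

def select(pred, rel):
    """Selection: keep the rows of rel that satisfy pred."""
    return [t for t in rel if pred(t)]

def join(left, right, pred):
    """Theta-join: concatenate every pair of rows that satisfies pred."""
    return [{**l, **r} for l, r in product(left, right) if pred({**l, **r})]

def same(x, y):
    """Compare two relations as bags of rows, ignoring order."""
    key = lambda t: sorted(t.items())
    return sorted(x, key=key) == sorted(y, key=key)

d_eq_e = lambda t: t["D"] == t["E"]

# Selection on F applied after the join, versus pushed down to S.
lhs1 = select(lambda t: t["F"] == 3, join(R, S, d_eq_e))
rhs1 = join(R, select(lambda t: t["F"] == 3, S), d_eq_e)

# Conjunction applied after the join, versus split and pushed to R and S.
lhs2 = select(lambda t: t["A"] == 5 and t["G"] == 9, join(R, S, d_eq_e))
rhs2 = join(select(lambda t: t["A"] == 5, R),
            select(lambda t: t["G"] == 9, S), d_eq_e)

print(same(lhs1, rhs1), same(lhs2, rhs2))   # prints: True True
```

Agreement on one instance is only a sanity check, not a proof of the laws.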
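• Laws involving projections
π_M(R ⋈ S) = π_M(π_P(R) ⋈ π_Q(S))
(P and Q are the attributes of R and of S that are needed by M or by the join condition)
π_M(π_N(R)) = π_M(R), when M ⊆ N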
Join Trees
• R1 ⋈ R2 ⋈ … ⋈ Rn
• Join tree:
[Figure: a join tree with leaves R3, R1, R2, R4]
• A plan = a join tree
• A partial plan = a subtree of a join tree
Types of Join Trees
• Left deep:
[Figure: left-deep join tree, (((R3 ⋈ R1) ⋈ R5) ⋈ R2) ⋈ R4]
• Bushy:
[Figure: bushy join tree with leaves R3, R2, R4, R1, R5]
• Right deep:
[Figure: right-deep join tree, R3 ⋈ (R1 ⋈ (R5 ⋈ (R2 ⋈ R4)))]
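The three shapes can be told apart mechanically if a join tree is written as nested (left, right) pairs. The sketch below is a minimal illustration: the function `shape` is a hypothetical helper, the left-deep and right-deep trees follow one reading of the figures, and the bushy tree is a made-up shape over the same five relations.

```python
def is_leaf(t):
    """A leaf is a relation name; an inner node is a (left, right) pair."""
    return isinstance(t, str)

def shape(tree):
    """Classify a join tree as left deep, right deep, or bushy."""
    def left_deep(t):
        # Every right input must be a base relation.
        return is_leaf(t) or (is_leaf(t[1]) and left_deep(t[0]))

    def right_deep(t):
        # Every left input must be a base relation.
        return is_leaf(t) or (is_leaf(t[0]) and right_deep(t[1]))

    if left_deep(tree):
        return "left deep"   # a single join counts as left deep here
    if right_deep(tree):
        return "right deep"
    return "bushy"

left = (((("R3", "R1"), "R5"), "R2"), "R4")
right = ("R3", ("R1", ("R5", ("R2", "R4"))))
bushy = (("R3", "R1"), (("R2", "R4"), "R5"))   # hypothetical shape

print(shape(left), "|", shape(right), "|", shape(bushy))
```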
Optimization Algorithms
• Heuristic based
• Cost based
– Dynamic programming: System R
– Rule-based optimizations: DB2, SQL-Server
Heuristic Based Optimizations
• Query rewriting based on algebraic laws
• Results in better queries most of the time
• Heuristic number 1:
– Push selections down
• Heuristic number 2:
– Sometimes push selections up, then down
Predicate Pushdown
[Figure: two query plans, each projecting pname over a join on maker=name, with the selections price>100 and city="Seattle"]
SELECT list
FROM list
WHERE cond1 AND cond2 AND . . . AND condk
• Heuristics: selections down, projections up
• Dynamic programming: join reordering
Dynamic Programming
• Given: a query R1 ⋈ R2 ⋈ … ⋈ Rn
• Assume we have a function cost() that gives us the cost of every join tree
• Find the best join tree for the query
• Idea: for each subset of {R1, …, Rn},
compute the best plan for that subset
• In increasing order of set cardinality:
– Step 1: for {R1}, {R2}, …, {Rn}
– Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
– …
– Step n: for {R1, …, Rn}
• It is a bottom-up strategy
• A subset of {R1, …, Rn} is also called a subquery
• For each subquery Q ⊆ {R1, …, Rn} compute the following:
– Size(Q)
– A best plan for Q: Plan(Q)
– The cost of that plan: Cost(Q)
• Step 1: For each {Ri} do:
– Size({Ri}) = B(Ri)
– Plan({Ri}) = Ri
– Cost({Ri}) = (cost of scanning Ri)
• Step i: For each Q ⊆ {R1, …, Rn} of cardinality i do:
– Compute Size(Q) (later…)
– For every pair of subqueries Q', Q'' s.t. Q = Q' ∪ Q'', compute cost(Plan(Q') ⋈ Plan(Q''))
– Cost(Q) = the smallest such cost
– Plan(Q) = the corresponding plan
• Return Plan({R1, …, Rn})
To illustrate, we will make the following simplifications:
• Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + size(intermediate result(s))
• Intermediate results:
– If P1 = a join, then the size of the intermediate result is size(P1), otherwise the size is 0
– Similarly for P2
• Cost of a scan = 0
• Example:
• Cost(R5 ⋈ R7) = 0 (no intermediate results)
• Cost((R2 ⋈ R1) ⋈ R7)
= Cost(R2 ⋈ R1) + Cost(R7) + size(R2 ⋈ R1)
= size(R2 ⋈ R1)
• Relations: R, S, T, U
• Number of tuples: 2000, 5000, 3000, 1000
• Size estimation: T(A ⋈ B) = 0.01*T(A)*T(B)
Subquery Size Cost Plan
RS
RT
RU
ST
SU
TU
RST
RSU
RTU
STU
RSTU
Subquery   Size   Cost            Plan
RS         100k   0               RS
RT         60k    0               RT
RU         20k    0               RU
ST         150k   0               ST
SU         50k    0               SU
TU         30k    0               TU
RST        3M     60k             (RT)S
RSU        1M     20k             (RU)S
RTU        0.6M   20k             (RU)T
STU        1.5M   30k             (TU)S
RSTU       30M    60k+50k=110k    (RT)(SU)
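The bottom-up procedure and the simplified cost model can be sketched in a few lines and run on the R, S, T, U example. This is a minimal illustration under the stated simplifications (scan cost 0, an input adds its size only if it is itself a join, every join has selectivity 0.01). The names `CARD`, `size` and `best_plans` are made up for the sketch, and plans are printed as nested parentheses.

```python
from itertools import combinations
from math import prod

CARD = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}   # tuples per relation

def size(q):
    """Estimated size of joining the relations in q.

    Each join multiplies by 0.01, written here as integer division by 100.
    """
    return prod(CARD[r] for r in q) // 100 ** (len(q) - 1)

def best_plans(relations):
    """Bottom-up dynamic programming: maps each subquery to (cost, plan)."""
    best = {frozenset([r]): (0, r) for r in relations}   # cost of a scan = 0
    for k in range(2, len(relations) + 1):               # increasing cardinality
        for combo in combinations(relations, k):
            q = frozenset(combo)
            candidates = []
            for i in range(1, k // 2 + 1):
                for left in combinations(combo, i):
                    q1 = frozenset(left)
                    q2 = q - q1
                    (c1, p1), (c2, p2) = best[q1], best[q2]
                    # Only an input that is itself a join is an intermediate result.
                    inter = sum(size(s) for s in (q1, q2) if len(s) > 1)
                    candidates.append((c1 + c2 + inter, f"({p1} {p2})"))
            best[q] = min(candidates, key=lambda c: c[0])
    return best

best = best_plans(["R", "S", "T", "U"])
for q in ("RST", "RSU", "RTU", "STU", "RSTU"):
    print(q, size(q), *best[frozenset(q)])
# last line printed: RSTU 30000000 110000 ((R T) (S U))
```

The sketch enumerates all splits of a subquery, so it considers bushy plans; restricting `left` to single relations would limit it to linear trees.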
Reducing the Search Space
• Left-linear trees vs. bushy trees
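To get a feel for how much the restriction to left-linear trees shrinks the search space, the two counts can be compared directly. The sketch assumes that two trees differing only in the order of a join's inputs count as different plans. Under that convention there are n! left-linear trees and (2n − 2)!/(n − 1)! trees of arbitrary shape over n relations. The function names are illustrative.

```python
from math import factorial

def left_deep_count(n: int) -> int:
    """Left-linear join trees over n relations: one per ordering of the leaves."""
    return factorial(n)

def bushy_count(n: int) -> int:
    """Join trees of any shape over n relations, with ordered inputs.

    Equal to n! * Catalan(n - 1) = (2n - 2)! / (n - 1)!.
    """
    return factorial(2 * n - 2) // factorial(n - 1)

for n in (2, 4, 8):
    print(n, left_deep_count(n), bushy_count(n))
# prints:
# 2 2 2
# 4 24 120
# 8 40320 17297280
```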
Dynamic Programming: Summary
• Handles only join queries:
– Selections are pushed down (i.e. early)
– Projections are pulled up (i.e. late)
• R ⋉ S = π_{A1,…,An}(R ⋈ S)
• Where the schemas are:
– Input: R(A1,…,An), S(B1,…,Bm)
– Output: T(A1,…,An)
Optimizations Based on Semijoins
Semijoins: a bit of theory (see [AHV])
• Given a conjunctive query:
relation
• Example: Q :- R1(A,B), R2(B,C), R3(C,D)
R3(C,D) := R3(C,D), R2(B,C)
R2(B,C) := R2(B,C), R3(C,D)
R1(A,B) := R1(A,B), R2(B,C)
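Each of the three statements in the example above keeps only the tuples of one relation that have a join partner in another, which is a semijoin. The sketch below runs them in order on small made-up instances. The data and the helper `semijoin` (with columns given by position) are assumptions for illustration.

```python
def semijoin(r, s, r_col, s_col):
    """Tuples of r whose value in column r_col appears in column s_col of s."""
    keys = {t[s_col] for t in s}
    return {t for t in r if t[r_col] in keys}

# Hypothetical instances of R1(A,B), R2(B,C), R3(C,D) as sets of pairs.
R1 = {(1, 10), (2, 20), (3, 30)}
R2 = {(10, 100), (20, 200), (40, 400)}
R3 = {(100, "x"), (500, "y")}

# The three statements of the example, in order.
R3 = semijoin(R3, R2, 0, 1)   # R3(C,D) := R3(C,D), R2(B,C)
R2 = semijoin(R2, R3, 1, 0)   # R2(B,C) := R2(B,C), R3(C,D)
R1 = semijoin(R1, R2, 1, 0)   # R1(A,B) := R1(A,B), R2(B,C)

print(R1, R2, R3)   # prints: {(1, 10)} {(10, 100)} {(100, 'x')}
```

On this instance every remaining tuple takes part in the final join result, so the join itself touches no dangling tuples.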
• Example: Q :- R1(A,B), R2(B,C), R3(A,C)
• Semijoins in [Chaudhuri’98]
CREATE VIEW DepAvgSal AS (
  SELECT E.did, Avg(E.Sal) AS avgsal
  FROM Emp E
  GROUP BY E.did)
SELECT E.eid, E.sal
FROM Emp E, Dept D, DepAvgSal V
WHERE E.did = D.did AND E.did = V.did
  AND E.age < 30 AND D.budget > 100k
  AND E.sal > V.avgsal
• First idea:
CREATE VIEW LimitedAvgSal AS (
  SELECT E.did, Avg(E.Sal) AS avgsal
  FROM Emp E, Dept D
  WHERE E.did = D.did AND D.budget > 100k
  GROUP BY E.did)
SELECT E.eid, E.sal
FROM Emp E, Dept D, LimitedAvgSal V
WHERE E.did = D.did AND E.did = V.did
  AND E.age < 30 AND D.budget > 100k
  AND E.sal > V.avgsal
• Better: full reducer
CREATE VIEW PartialResult AS
  (SELECT E.eid, E.sal, E.did
   FROM Emp E, Dept D
   WHERE E.did = D.did AND E.age < 30
     AND D.budget > 100k)
CREATE VIEW Filter AS
  (SELECT DISTINCT P.did FROM PartialResult P)
CREATE VIEW LimitedAvgSal AS
  (SELECT E.did, Avg(E.Sal) AS avgsal
   FROM Emp E, Filter F
   WHERE E.did = F.did GROUP BY E.did)
SELECT P.eid, P.sal
FROM PartialResult P, LimitedAvgSal V
WHERE P.did = V.did AND P.sal > V.avgsal
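A quick way to check that the rewritten query returns the same answer as the original is to run both on a small database. The sketch below uses Python's built-in sqlite3 module. The table definitions and rows are made up, 100k is written as 100000, the final query refers to the view as LimitedAvgSal, and the view name Filter is quoted to keep it clear of SQL keywords.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dept(did INTEGER, budget INTEGER);
CREATE TABLE Emp(eid INTEGER, did INTEGER, sal INTEGER, age INTEGER);
-- Hypothetical data: departments 1 and 3 have budget > 100000.
INSERT INTO Dept VALUES (1, 200000), (2, 50000), (3, 300000);
INSERT INTO Emp VALUES
  (1, 1, 90, 25), (2, 1, 50, 45), (3, 1, 40, 28),
  (4, 2, 80, 22), (5, 2, 20, 50),
  (6, 3, 70, 60), (7, 3, 30, 61);

-- View used by the original query.
CREATE VIEW DepAvgSal AS
  SELECT E.did, AVG(E.sal) AS avgsal FROM Emp E GROUP BY E.did;

-- Views used by the rewritten query.
CREATE VIEW PartialResult AS
  SELECT E.eid, E.sal, E.did FROM Emp E, Dept D
  WHERE E.did = D.did AND E.age < 30 AND D.budget > 100000;
CREATE VIEW "Filter" AS
  SELECT DISTINCT P.did FROM PartialResult P;
CREATE VIEW LimitedAvgSal AS
  SELECT E.did, AVG(E.sal) AS avgsal FROM Emp E, "Filter" F
  WHERE E.did = F.did GROUP BY E.did;
""")

original = con.execute("""
  SELECT E.eid, E.sal FROM Emp E, Dept D, DepAvgSal V
  WHERE E.did = D.did AND E.did = V.did
    AND E.age < 30 AND D.budget > 100000 AND E.sal > V.avgsal
  ORDER BY E.eid""").fetchall()

rewritten = con.execute("""
  SELECT P.eid, P.sal FROM PartialResult P, LimitedAvgSal V
  WHERE P.did = V.did AND P.sal > V.avgsal
  ORDER BY P.eid""").fetchall()

print(original, rewritten)   # prints: [(1, 90)] [(1, 90)]
```

In the rewritten form the average is computed only for department 1, the one department that survives the age and budget filters.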
Modern Query Optimizers
• Volcano
– Rewrite rules
– Extensible
• Starburst
– Keeps query blocks
– Interblock, intrablock optimizations
Size Estimation
Ramakrishnan book, chapter 15.2
Estimating the size of a projection
• Easy: T(π_L(R)) = T(R)
• This is because a projection doesn’t eliminate duplicates
Estimating the size of a selection
• S = σ_{A=c}(R)
– T(S) can be anything from 0 to T(R) − V(R,A) + 1
– Estimate: T(S) = T(R)/V(R,A)
– When V(R,A) is not available, estimate T(S) = T(R)/10
• S = σ_{A<c}(R)
– T(S) can be anything from 0 to T(R)
– Estimate: T(S) = T(R) · (c − Low(R,A))/(High(R,A) − Low(R,A))
– When Low, High unavailable, estimate T(S) = T(R)/3
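The selection estimates above translate directly into two small functions. This is a sketch, not a definitive implementation: the range estimate is read as a fraction of the [Low, High] interval scaled by T(R), the clamping to [0, 1] is an added safeguard, and the function names and sample statistics are made up.

```python
def est_eq(t_r, v_ra=None):
    """Estimate T(S) for an equality selection A = c.

    Uses T(R)/V(R,A), or T(R)/10 when V(R,A) is not available.
    """
    return t_r / v_ra if v_ra else t_r / 10

def est_lt(t_r, c=None, low=None, high=None):
    """Estimate T(S) for a range selection A < c.

    Uses the fraction of the [Low, High] range below c, or T(R)/3 when
    Low and High are not available.
    """
    if c is None or low is None or high is None or high == low:
        return t_r / 3
    frac = (c - low) / (high - low)
    return t_r * min(1.0, max(0.0, frac))   # clamp: added safeguard

# Hypothetical statistics: T(R) = 10000, V(R,A) = 50, A ranges over 0..100.
print(est_eq(10000, 50))                      # prints: 200.0
print(est_eq(10000))                          # prints: 1000.0
print(est_lt(10000, c=25, low=0, high=100))   # prints: 2500.0
print(est_lt(9000))                           # prints: 3000.0
```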
Estimating the size of a natural join, R ⋈_A S
• When the sets of A values are disjoint, then T(R ⋈_A S) = 0
• When A is a key in S and a foreign key in R, then T(R ⋈_A S) = T(R)
• When A has a unique value, the same in R and S, then T(R ⋈_A S) = T(R) T(S)
Assumptions:
• Containment of values: if V(R,A) <= V(S,A), then the set of A values of R is included in the set of A values of S
– Note: this indeed holds when A is a foreign key in R, and a key in S
• Preservation of values: for any other attribute B,
V(R ⋈_A S, B) = V(R, B) (or V(S, B))
Assume V(R,A) <= V(S,A)
• Then each tuple t in R joins some tuple(s) in S
– How many?
– On average T(S)/V(S,A)
– t will contribute T(S)/V(S,A) tuples in R ⋈_A S
• Hence T(R ⋈_A S) = T(R) T(S) / V(S,A)
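The derivation above covers the case V(R,A) <= V(S,A). Swapping the roles of R and S gives T(R) T(S) / V(R,A) in the other case, so a single function can divide by the larger of the two distinct-value counts. That symmetric form is an assumption of this sketch, as are the function name and the sample numbers.

```python
def est_join(t_r, t_s, v_ra, v_sa):
    """Estimate the size of the join of R and S on A, assuming containment of values."""
    return t_r * t_s / max(v_ra, v_sa)

# Hypothetical statistics with V(R,A) <= V(S,A).
print(est_join(10000, 20000, 100, 200))   # prints: 1000000.0

# A is a key in S (V(S,A) = T(S)) and a foreign key in R: the estimate is T(R).
print(est_join(10000, 500, 400, 500))     # prints: 10000.0
```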
Histograms
Employee(ssn, name, salary, phone)
• Maintain a histogram on salary:
• Eqwidth
0..20   20..40   40..60   60..80   80..100
2       104      9739     152      3
• Eqdepth
0..44   44..48   48..50   50..56   55..100
2000    2000     2000     2000     2000
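A histogram can replace the single Low/High range when estimating a range selection. The sketch below uses the equi-width buckets from the slide and assumes values are spread uniformly inside each bucket, a common simplification that the slides do not state. The function name `est_range` and the query range 30..50 are made up.

```python
# Equi-width histogram from the slide: (low, high, tuple count) per bucket.
EQWIDTH = [(0, 20, 2), (20, 40, 104), (40, 60, 9739), (60, 80, 152), (80, 100, 3)]

def est_range(hist, lo, hi):
    """Estimated tuples with lo <= value < hi, assuming uniformity within a bucket."""
    total = 0.0
    for b_lo, b_hi, count in hist:
        # Length of the part of this bucket that falls inside [lo, hi).
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo))
        total += count * overlap / (b_hi - b_lo)
    return total

print(est_range(EQWIDTH, 30, 50))   # prints: 4921.5
```

Without the histogram, a uniform spread of the 10000 tuples over 0..100 would give 2000 for the same range.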
The Independence Assumption
SELECT *
FROM R
WHERE R.Age = '35' and R.City = 'Seattle'
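Under the usual reading of the independence assumption, the selectivities of the two equality conditions multiply, so the estimate for the query above is T(R) / (V(R,Age) · V(R,City)). The slide shows only the query, so this formula, the function name, and the statistics below are assumptions for illustration.

```python
def est_conjunction(t_r, distinct_counts):
    """Estimate the size of a conjunction of equality selections.

    Assumes the attributes are independent, so each condition divides the
    estimate by that attribute's number of distinct values.
    """
    est = t_r
    for v in distinct_counts:
        est /= v
    return est

# Hypothetical statistics: T(R) = 100000, V(R,Age) = 50, V(R,City) = 20.
print(est_conjunction(100000, [50, 20]))   # prints: 100.0
```

If Age and City are correlated, the true result can be far from this estimate.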