CSE 544: Optimizations: Wednesday, 5/10/2006

The document discusses three components of an optimizer: a space of plans using algebraic laws, an optimization algorithm, and a cost estimator. It provides examples of algebraic laws for different relational algebra operations like selection, projection, join, grouping, and aggregation. Dynamic programming is presented as an optimization algorithm that finds the optimal join ordering by considering the cost of all subplans.


CSE 544:

Optimizations
Wednesday, 5/10/2006

1
The three components
of an optimizer
We need three things in an optimizer:

• A space of plans (e.g. algebraic laws)
• An optimization algorithm
• A cost estimator
2
Algebraic Laws
• Commutative and Associative Laws
R ∪ S = S ∪ R,  R ∪ (S ∪ T) = (R ∪ S) ∪ T
R × S = S × R,  R × (S × T) = (R × S) × T
R ⋈ S = S ⋈ R,  R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
• Distributive Laws
R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)

3
Algebraic Laws
• Laws involving selection:
σ_{C AND C'}(R) = σ_C(σ_{C'}(R)) = σ_C(R) ∩ σ_{C'}(R)
σ_{C OR C'}(R) = σ_C(R) ∪ σ_{C'}(R)
σ_C(R ⋈ S) = σ_C(R) ⋈ S
• When C involves only attributes of R:
σ_C(R – S) = σ_C(R) – S
σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
σ_C(R ∩ S) = σ_C(R) ∩ S
4
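To see the pushdown law σ_C(R ⋈ S) = σ_C(R) ⋈ S in action (for a condition C over R's attributes only), here is a small Python check on toy relations; the list-of-dicts encoding and the attribute names are illustrative assumptions, not from the slides:

```python
def join(r, s, cond):
    """Theta-join: every combined tuple that satisfies cond."""
    return [{**t, **u} for t in r for u in s if cond(t, u)]

def select(rel, pred):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]

# Toy instances of R(A, B) and S(C, D)
R = [{"A": a, "B": a % 3} for a in range(10)]
S = [{"C": c, "D": 2 * c} for c in range(5)]

on = lambda t, u: t["B"] == u["C"]   # join condition B = C
c = lambda t: t["A"] > 4             # C mentions only R's attribute A

late = select(join(R, S, on), c)     # selection above the join
early = join(select(R, c), S, on)    # selection pushed below the join
print(late == early)                 # → True
```

Both orders produce the same tuples; the pushed-down version simply filters R before the join does any work.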
Algebraic Laws
• Example: R(A, B, C, D), S(E, F, G)

σ_{F=3}(R ⋈_{D=E} S) = ?
σ_{A=5 AND G=9}(R ⋈_{D=E} S) = ?

5
Algebraic Laws
• Laws involving projections
π_M(R ⋈ S) = π_M(π_P(R) ⋈ π_Q(S))
(where P and Q contain the attributes in M plus the join attributes)
π_M(π_N(R)) = π_{M∩N}(R)

• Example R(A,B,C,D), S(E, F, G)

π_{A,B,G}(R ⋈_{D=E} S) = π_?(π_?(R) ⋈_{D=E} π_?(S))
6
Algebraic Laws
• Laws involving grouping and aggregation:
δ(γ_{A, agg(B)}(R)) = γ_{A, agg(B)}(R)
γ_{A, agg(B)}(δ(R)) = γ_{A, agg(B)}(R) if agg is "duplicate insensitive"

• Which of the following are "duplicate insensitive"?
sum, count, avg, min, max

γ_{A, agg(D)}(R(A,B) ⋈_{B=C} S(C,D)) =
γ_{A, agg(D)}(R(A,B) ⋈_{B=C} γ_{C, agg(D)}(S(C,D)))

7
Join Trees
• R1 || R2 || …. || Rn
• Join tree:

R3 R1 R2 R4
• A plan = a join tree
• A partial plan = a subtree of a join tree

8
Types of Join Trees
• Left deep:

[left-deep join tree: (((R3 ⋈ R1) ⋈ R5) ⋈ R2) ⋈ R4]

9
Types of Join Trees
• Bushy:

[bushy join tree diagram over R1, R2, R3, R4, R5]
10
Types of Join Trees
• Right deep:

[right-deep join tree: R3 ⋈ (R1 ⋈ (R5 ⋈ (R2 ⋈ R4)))]

11
Optimization
Algorithms
• Heuristic based
• Cost based
– Dynamic programming: System R
– Rule-based optimizations: DB2,
SQL-Server

12
Heuristic Based
Optimizations
• Query rewriting based on algebraic laws
• Results in better queries most of the time
• Heuristic number 1:
– Push selections down
• Heuristic number 2:
– Sometimes push selections up, then down
13
Predicate Pushdown

[two query plans over Product ⋈_{maker=name} Company:
 before: π_pname(σ_{price>100 AND city="Seattle"}(Product ⋈_{maker=name} Company))
 after:  π_pname(σ_{price>100}(Product) ⋈_{maker=name} σ_{city="Seattle"}(Company))]

The earlier we process selections, the fewer tuples we need to manipulate higher up in the tree (but pushing a selection down may cause us to lose an important ordering of the tuples, if we use indexes).
14
Dynamic Programming
Originally proposed in System R
• Only handles single block queries:

SELECT list
FROM list
WHERE cond1 AND cond2 AND … AND condk

• Heuristics: selections down, projections up
• Dynamic programming: join reordering
15
Dynamic Programming
• Given: a query R1 || R2 || …
|| Rn
• Assume we have a function
cost() that gives us the cost
of every join tree
• Find the best join tree for
the query

16
Dynamic Programming
• Idea: for each subset of {R1, …, Rn},
compute the best plan for that subset
• In increasing order of set cardinality:
– Step 1: for {R1}, {R2}, …, {Rn}
– Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
– …
– Step n: for {R1, …, Rn}
• It is a bottom-up strategy
• A subset of {R1, …, Rn} is also called
a subquery

17
Dynamic Programming
• For each subquery Q {R1, …,
Rn} compute the following:
– Size(Q)
– A best plan for Q: Plan(Q)
– The cost of that plan: Cost(Q)

18
Dynamic Programming
• Step 1: For each {Ri} do:
– Size({Ri}) = B(Ri)
– Plan({Ri}) = Ri
– Cost({Ri}) = (cost of scanning
Ri)

19
Dynamic Programming
• Step i: For each Q {R1, …, Rn}
of cardinality i do:
– Compute Size(Q) (later…)
– For every pair of subqueries Q’, Q’’

s.t. Q = Q’  Q’’
compute cost(Plan(Q’) || Plan(Q’’))
– Cost(Q) = the smallest such cost
– Plan(Q) = the corresponding plan
20
Dynamic Programming
• Return Plan({R1, …, Rn})

21
Dynamic Programming
To illustrate, we will make the following
simplifications:
• Cost(P1 || P2) = Cost(P1) + Cost(P2) +

size(intermediate result(s))
• Intermediate results:
– If P1 = a join, then the size of the
intermediate result is size(P1), otherwise
the size is 0
– Similarly for P2
• Cost of a scan = 0
22
Dynamic Programming
• Example:
• Cost(R5 || R7) = 0 (no
intermediate results)
• Cost((R2 || R1) || R7)
= Cost(R2 || R1) + Cost(R7) +
size(R2 || R1)
= size(R2 || R1)

23
Dynamic Programming
• Relations: R, S, T, U
• Number of tuples: 2000, 5000, 3000, 1000
• Size estimation: T(A ⋈ B) = 0.01 * T(A) * T(B)

24
Subquery   Size   Cost   Plan
RS
RT
RU
ST
SU
TU
RST
RSU
RTU
STU
RSTU
25
Subquery   Size   Cost           Plan
RS         100k   0              RS
RT         60k    0              RT
RU         20k    0              RU
ST         150k   0              ST
SU         50k    0              SU
TU         30k    0              TU
RST        3M     60k            (RT)S
RSU        1M     20k            (RU)S
RTU        0.6M   20k            (RU)T
STU        1.5M   30k            (TU)S
RSTU       30M    60k+50k=110k   (RT)(SU)
26
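The whole dynamic program on the R, S, T, U example can be written in a few lines of Python. This is a sketch under the slides' simplified cost model (scans are free, a join pays for the sizes of its joined inputs), not System R's actual code; integer division by 100 per join keeps the 0.01 selectivity exact:

```python
from itertools import combinations

T = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}   # tuple counts

def size(subset):
    """T(A join B) = 0.01 * T(A) * T(B): divide by 100 per join."""
    n = 1
    for rel in subset:
        n *= T[rel]
    return n // 100 ** (len(subset) - 1)

# best maps each subquery (a frozenset of relations) to (cost, plan)
best = {frozenset([r]): (0, r) for r in T}          # cost of a scan = 0

rels = list(T)
for k in range(2, len(rels) + 1):
    for subset in combinations(rels, k):
        q = frozenset(subset)
        cands = []
        for i in range(1, k):                       # every split q = q1 u q2
            for left in combinations(subset, i):
                q1, q2 = frozenset(left), q - frozenset(left)
                inter = (size(q1) if len(q1) > 1 else 0) + \
                        (size(q2) if len(q2) > 1 else 0)
                cands.append((best[q1][0] + best[q2][0] + inter,
                              (best[q1][1], best[q2][1])))
        best[q] = min(cands, key=lambda c: c[0])

cost, plan = best[frozenset("RSTU")]
print(size(frozenset("RSTU")), cost, plan)
# → 30000000 110000 (('R', 'T'), ('S', 'U'))
```

Running it reproduces the table's last row: size 30M and cost 60k + 50k = 110k for the bushy plan (R⋈T)⋈(S⋈U).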
Reducing the Search
Space
• Left-linear trees vs. bushy trees

• Trees without Cartesian products

Example: R(A,B) ⋈ S(B,C) ⋈ T(C,D)

Plan (R(A,B) ⋈ T(C,D)) ⋈ S(B,C) has a Cartesian product – most query optimizers will not consider it

27
Dynamic Programming:
Summary
• Handles only join queries:
– Selections are pushed down (i.e. early)
– Projections are pulled up (i.e. late)

• Takes exponential time in general, BUT:
– Restricting to left-linear joins may reduce time
– Excluding Cartesian products may reduce time further
28
Rule-Based Optimizers
• Extensible collection of rules
  Rule = algebraic law with a direction
• Algorithm for firing these rules
  Generate many alternative plans, in some order
  Prune by cost

• Volcano (later SQL Server)
• Starburst (later DB2)
29
Completing the
Physical Query Plan
• Choose algorithm to implement
each operator
– Need to account for more than cost:
• How much memory do we have ?
• Are the input operand(s) sorted ?
• Decide for each intermediate
result:
– To materialize
– To pipeline
30
Optimizations Based on
Semijoins
Semi-join based optimizations

• R ⋉ S = π_{A1,…,An}(R ⋈ S)
• Where the schemas are:
– Input: R(A1,…,An), S(B1,…,Bm)
– Output: T(A1,…,An)
31
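As a sketch, the semijoin R ⋉ S keeps exactly the R-tuples that have a join partner in S; the list-of-dicts relation encoding and the single shared attribute A1 below are illustrative assumptions, not from the slides:

```python
def semijoin(r, s, attr):
    """R semijoin S on attr: the R-tuples with at least one match in S."""
    s_values = {u[attr] for u in s}
    return [t for t in r if t[attr] in s_values]

R = [{"A1": 1, "A2": "x"}, {"A1": 2, "A2": "y"}, {"A1": 3, "A2": "z"}]
S = [{"A1": 2, "B1": 10}, {"A1": 3, "B1": 20}, {"A1": 3, "B1": 30}]

reduced = semijoin(R, S, "A1")
print(reduced)   # the A1=1 tuple is dangling and gets dropped
```

Note that the A1=3 tuple appears once even though it has two partners in S: the projection back onto R's attributes is what distinguishes a semijoin from a join.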
Optimizations Based on
Semijoins
Semijoins: a bit of theory (see [AHV])
• Given a conjunctive query:

Q :- R1, R2, …, Rn

• A full reducer for Q is a program:

Ri1 := Ri1 ⋉ Rj1
Ri2 := Ri2 ⋉ Rj2
.....
Rip := Rip ⋉ Rjp

• Such that no dangling tuples remain in any relation
32
Optimizations Based on
Semijoins
• Example: Q :- R1(A,B), R2(B,C), R3(C,D)

• A full reducer is:

R2(B,C) := R2(B,C) ⋉ R1(A,B)
R3(C,D) := R3(C,D) ⋉ R2(B,C)
R2(B,C) := R2(B,C) ⋉ R3(C,D)
R1(A,B) := R1(A,B) ⋉ R2(B,C)
33
Optimizations Based on
Semijoins
• Example: Q :- R1(A,B), R2(B,C), R3(A,C)

• Doesn't have a full reducer (we can reduce forever)

• Theorem: a query has a full reducer iff it is "acyclic"
34
Optimizations Based on
Semijoins
• Semijoins in [Chaudhuri'98]

CREATE VIEW DepAvgSal As (
  SELECT E.did, Avg(E.Sal) AS avgsal
  FROM Emp E
  GROUP BY E.did)

SELECT E.eid, E.sal
FROM Emp E, Dept D, DepAvgSal V
WHERE E.did = D.did AND E.did = V.did
  AND E.age < 30 AND D.budget > 100k
  AND E.sal > V.avgsal
35
Optimizations Based on
Semijoins
• First idea:

CREATE VIEW LimitedAvgSal As (
  SELECT E.did, Avg(E.Sal) AS avgsal
  FROM Emp E, Dept D
  WHERE E.did = D.did AND D.budget > 100k
  GROUP BY E.did)

SELECT E.eid, E.sal
FROM Emp E, Dept D, LimitedAvgSal V
WHERE E.did = D.did AND E.did = V.did
  AND E.age < 30 AND D.budget > 100k
  AND E.sal > V.avgsal
36
Optimizations Based on
Semijoins
• Better: full reducer

CREATE VIEW PartialResult AS
  (SELECT E.id, E.sal, E.did
   FROM Emp E, Dept D
   WHERE E.did = D.did AND E.age < 30
     AND D.budget > 100k)

CREATE VIEW Filter AS
  (SELECT DISTINCT P.did FROM PartialResult P)

CREATE VIEW LimitedAvgSal AS
  (SELECT E.did, Avg(E.Sal) AS avgsal
   FROM Emp E, Filter F
   WHERE E.did = F.did GROUP BY E.did)
37
Optimizations Based on
Semijoins

SELECT P.eid, P.sal
FROM PartialResult P, LimitedDepAvgSal V
WHERE P.did = V.did AND P.sal > V.avgsal

38
Modern Query
Optimizers
• Volcano
– Rewrite rules
– Extensible

• Starburst
– Keeps query blocks
– Interblock, intrablock
optimizations
39
Size Estimation
RAMAKRISHNAN BOOK CHAPT. 15.2

• The problem: Given an expression E, compute T(E) and V(E, A)

• This is hard without computing E

• Will 'estimate' them instead

40
Size Estimation
Estimating the size of a projection
• Easy: T(π_L(R)) = T(R)
• This is because a projection doesn't eliminate duplicates

41
Size Estimation
Estimating the size of a selection
• S = σ_{A=c}(R)
– T(S) can be anything from 0 to T(R) – V(R,A) + 1
– Estimate: T(S) = T(R)/V(R,A)
– When V(R,A) is not available, estimate T(S) = T(R)/10

• S = σ_{A<c}(R)
– T(S) can be anything from 0 to T(R)
– Estimate: T(S) = T(R) · (c − Low(R,A)) / (High(R,A) − Low(R,A))
– When Low, High unavailable, estimate T(S) = T(R)/3

42
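These estimates are simple enough to state directly in code. A sketch, where the 1/10 and 1/3 fallback constants are the ones quoted on the slide:

```python
def est_eq(t_r, v_r_a=None):
    """T(sigma_{A=c}(R)): T(R)/V(R,A), or T(R)/10 without statistics."""
    return t_r / v_r_a if v_r_a else t_r / 10

def est_lt(t_r, c=None, low=None, high=None):
    """T(sigma_{A<c}(R)): interpolate in [Low, High], or T(R)/3."""
    if None not in (c, low, high):
        return t_r * (c - low) / (high - low)
    return t_r / 3

print(est_eq(10000, 50))            # → 200.0
print(est_lt(10000, 40, 0, 100))    # → 4000.0
print(est_lt(10000))                # no stats: ≈ 3333.3
```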
Size Estimation
Estimating the size of a natural join, R ⋈_A S
• When the sets of A values are disjoint, then T(R ⋈_A S) = 0
• When A is a key in S and a foreign key in R, then T(R ⋈_A S) = T(R)
• When A has a unique value, the same in R and S, then T(R ⋈_A S) = T(R) · T(S)
43
Size Estimation
Assumptions:
• Containment of values: if V(R,A) <= V(S,A), then the set of A values of R is included in the set of A values of S
– Note: this indeed holds when A is a foreign key in R, and a key in S
• Preservation of values: for any other attribute B, V(R ⋈_A S, B) = V(R, B) (or V(S, B))

44
Size Estimation
Assume V(R,A) <= V(S,A)
• Then each tuple t in R joins some tuple(s) in S
– How many? On average T(S)/V(S,A)
– t will contribute T(S)/V(S,A) tuples to R ⋈_A S
• Hence T(R ⋈_A S) = T(R) · T(S) / V(S,A)

In general: T(R ⋈_A S) = T(R) · T(S) / max(V(R,A), V(S,A))
45
Size Estimation
Example:
• T(R) = 10000, T(S) = 20000
• V(R,A) = 100, V(S,A) = 200
• How large is R ⋈_A S ?

Answer: T(R ⋈_A S) = 10000 · 20000 / 200 = 1M
46
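The general formula, checked against the slide's numbers (a sketch; integer division is safe here because the counts divide evenly):

```python
def est_join(t_r, t_s, v_r_a, v_s_a):
    """T(R join_A S) = T(R) * T(S) / max(V(R,A), V(S,A))."""
    return t_r * t_s // max(v_r_a, v_s_a)

print(est_join(10000, 20000, 100, 200))   # → 1000000, as on the slide
```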
Histograms
• Statistics on data maintained
by the RDBMS
• Makes size estimation much
more accurate (hence, cost
estimations are more
accurate)

47
Histograms
Employee(ssn, name, salary, phone)
• Maintain a histogram on salary:

Salary: 0..20k | 20k..40k | 40k..60k | 60k..80k | 80k..100k | > 100k
Tuples:    200 |      800 |     5000 |    12000 |      6500 |    500

• T(Employee) = 25000, but now we know the distribution

48
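With the bucket counts above, a range predicate can be estimated by interpolating inside each overlapping bucket. A sketch; the uniform-within-bucket assumption and the all-or-nothing handling of the open-ended last bucket are my additions:

```python
# (low, high, tuple count) buckets for Employee.salary, from the slide
buckets = [
    (0, 20_000, 200), (20_000, 40_000, 800), (40_000, 60_000, 5000),
    (60_000, 80_000, 12000), (80_000, 100_000, 6500), (100_000, None, 500),
]

def hist_estimate(lo, hi):
    """Estimated number of tuples with lo <= salary < hi."""
    total = 0.0
    for b_lo, b_hi, n in buckets:
        if b_hi is None:                      # open bucket: all or nothing
            total += n if lo <= b_lo < hi else 0
            continue
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo))
        total += n * overlap / (b_hi - b_lo)  # uniform within the bucket
    return total

print(hist_estimate(50_000, 70_000))   # → 8500.0
```

The query for salaries in [50k, 70k) takes half of the 40k..60k bucket (2500) plus half of the 60k..80k bucket (6000).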
Histograms
• Eqwidth
Bucket: 0..20 | 20..40 | 40..60 | 60..80 | 80..100
Tuples:     2 |    104 |   9739 |    152 |       3
• Eqdepth
Bucket: 0..44 | 44..48 | 48..50 | 50..56 | 55..100
Tuples:  2000 |   2000 |   2000 |   2000 |     2000
49
The Independence
Assumption
SELECT *
FROM R
WHERE R.Age = '35' and R.City = 'Seattle'

• A histogram on Age tells us that a fraction p1 of the tuples have age 35: p1 · |R|
• A histogram on City tells us that a fraction p2 live in Seattle: p2 · |R|
• Lacking more information, use independence: p1 · p2 · |R|
50
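The independence estimate itself is one multiplication. A sketch with illustrative selectivities (the 5% and 20% figures are made up for the example, not from the slides):

```python
def est_and(t_r, p1, p2):
    """sigma_{Age=35 AND City='Seattle'}(R) under independence: p1*p2*|R|."""
    return p1 * p2 * t_r

# say the Age histogram gives p1 = 0.05 and the City histogram p2 = 0.20
print(est_and(1_000_000, 0.05, 0.20))   # roughly 10000 tuples
```

If Age and City are correlated (e.g. a city with an unusually young population), this estimate can be off by orders of magnitude, which motivates the next slide.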
Correlated Attributes
• A better idea is to use a 2-dimensional
histogram
• Generalize: k-dimensional
• But this uses too much space

• Getoor paper: find “conditionally


independent attributes”
• Goal: several 2 or 3 dimensional
histograms “capture” a high dimensional
histogram
51
