Fundamentals of Database Systems: (Query Optimization - I)

The document discusses the basics of query optimization in a database system. It explains that a query optimizer's goal is to minimize the cost of evaluating a query by choosing an efficient execution plan without requiring users to write optimized queries. It provides an example query and outlines some of the statistical information and techniques used by query optimizers to estimate the costs of different execution plans, such as relation statistics, size estimation for selection operations, and catalog information about relations.


Outline Basics Statistics for Query Evaluation Problems

Fundamentals of Database Systems


[Query Optimization – I]

Malay Bhattacharyya

Assistant Professor

Machine Intelligence Unit


Indian Statistical Institute, Kolkata
December, 2020

1 Basics

2 Statistics for Query Evaluation


Relation Statistics
Query Statistics

3 Problems

Basics of query optimization

Why optimize a query?

So that it is processed efficiently.
What is meant by efficiently?
Minimizing the cost of query evaluation.
Who will minimize?
The system, not the user.

Query optimization enables a system to construct a query-evaluation plan that processes a query efficiently, without expecting users to write efficient queries.

How does a query optimizer work?


An example query

Find the basic pay and grade pay of ISI employees who are single.

$$\pi_{BPay,\,GPay}\Big(\sigma_{status = \text{"single"}}\big(HRA \bowtie (BASIC \bowtie GRADE)\big)\Big)$$

Note: Marital status is a part of HRA table.



An example query – revised

Find the basic pay and grade pay of ISI employees who are single.

$$\pi_{BPay,\,GPay}\Big(\sigma_{status = \text{"single"}}(HRA) \bowtie (BASIC \bowtie GRADE)\Big)$$

Note: Marital status is a part of HRA table.
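
The two forms of the example query return the same tuples; the revised form applies the selection to HRA first, so the join above it receives fewer tuples. A minimal Python sketch of this equivalence follows. The slides do not give the schemas or the join attributes, so the shared key `EmpID` and all rows are assumptions made only for illustration.

```python
# Hypothetical toy data: EmpID as the common join attribute and every row
# below are assumptions, not taken from the slides.
HRA = [
    {"EmpID": 1, "status": "single"},
    {"EmpID": 2, "status": "married"},
    {"EmpID": 3, "status": "single"},
]
BASIC = [{"EmpID": 1, "BPay": 50}, {"EmpID": 2, "BPay": 60}, {"EmpID": 3, "BPay": 70}]
GRADE = [{"EmpID": 1, "GPay": 5}, {"EmpID": 2, "GPay": 6}, {"EmpID": 3, "GPay": 7}]


def natural_join(r, s):
    """Nested-loop natural join on all attributes the two inputs share."""
    out = []
    for t in r:
        for u in s:
            common = set(t) & set(u)
            if all(t[a] == u[a] for a in common):
                out.append({**t, **u})
    return out


def select(rows, pred):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rows if pred(t)]


def project(rows, attrs):
    """Projection onto the listed attributes (duplicates are kept)."""
    return [tuple(t[a] for a in attrs) for t in rows]


def is_single(t):
    return t["status"] == "single"


# Original plan: join everything, then select.
joined = natural_join(HRA, natural_join(BASIC, GRADE))
plan_a = project(select(joined, is_single), ["BPay", "GPay"])

# Revised plan: select on HRA first, then join.
filtered = select(HRA, is_single)
plan_b = project(natural_join(filtered, natural_join(BASIC, GRADE)), ["BPay", "GPay"])

same_result = sorted(plan_a) == sorted(plan_b)
top_join_left_input = (len(HRA), len(filtered))  # tuples fed to the top join
```

With this data the top join receives 2 HRA tuples instead of 3; on realistic table sizes the saving grows with the selectivity of the predicate.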


Evaluating query cost

The basic parameters for estimating the query cost are:

The number of seek operations performed
The number of blocks read
The number of blocks written
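
One common textbook way to combine these three counts into a single figure is to weight seeks by an average seek time and block transfers by an average transfer time. The sketch below follows that idea; the timing constants and the `write_factor` parameter are assumed values chosen for illustration, not figures from the slides.

```python
def query_cost(seeks, blocks_read, blocks_written,
               t_seek=4e-3, t_transfer=1e-4, write_factor=1.0):
    """Estimated I/O time in seconds.

    t_seek, t_transfer and write_factor are illustrative assumptions;
    real values depend on the storage device.
    """
    transfers = blocks_read + write_factor * blocks_written
    return seeks * t_seek + transfers * t_transfer


# A sequential scan (1 seek, 1000 block reads) versus 200 random block reads.
scan_cost = query_cost(seeks=1, blocks_read=1000, blocks_written=0)
random_cost = query_cost(seeks=200, blocks_read=200, blocks_written=0)
```

Under these assumed timings the plan with fewer block reads is still the more expensive one, because each random read pays for a seek.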

Intuitive (naive) approach to finding the least-cost plan:

1 Generate expressions that are logically equivalent to the original expression.
2 Annotate the resulting expressions in alternative ways to generate alternative query-evaluation plans.
3 Repeat from step 1 until no new expression is generated.

The plan with the lowest estimated cost is then chosen.
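
The exhaustive idea can be sketched for join expressions as follows. This is a simplified illustration, not a real optimizer: it uses only two equivalence rules (join commutativity and associativity), skips the annotation step, and ranks plans with a toy cost. The cardinalities and the uniform join selectivity of 1/100 are hypothetical.

```python
def rewrites(expr):
    """Yield expressions one rewrite step away from expr.

    An expression is a relation name (str) or a join pair (left, right).
    """
    if isinstance(expr, str):
        return
    left, right = expr
    yield (right, left)                    # commutativity
    if not isinstance(left, str):          # (a join b) join c -> a join (b join c)
        a, b = left
        yield (a, (b, right))
    if not isinstance(right, str):         # a join (b join c) -> (a join b) join c
        b, c = right
        yield ((left, b), c)
    for new_left in rewrites(left):        # rewrite inside the subtrees
        yield (new_left, right)
    for new_right in rewrites(right):
        yield (left, new_right)


def equivalent_expressions(expr):
    """Apply the rules repeatedly until no new expression appears."""
    seen = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for candidate in rewrites(current):
            if candidate not in seen:
                seen.add(candidate)
                frontier.append(candidate)
    return seen


CARD = {"HRA": 100, "BASIC": 1000, "GRADE": 10}  # hypothetical cardinalities


def size(expr):
    """Estimated result size, assuming every join keeps 1/100 of the pairs."""
    if isinstance(expr, str):
        return CARD[expr]
    return size(expr[0]) * size(expr[1]) // 100


def cost(expr):
    """Toy cost: total size of all join results the plan produces."""
    if isinstance(expr, str):
        return 0
    return cost(expr[0]) + cost(expr[1]) + size(expr)


plans = equivalent_expressions(("HRA", ("BASIC", "GRADE")))
best = min(plans, key=cost)
```

For three relations the closure contains 12 join orders. The count grows very quickly with the number of relations, which is why this approach is called naive.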

Catalog information

For a given relation R, we can store the following relevant information in the catalog:
$N_R$ – the number of tuples
$B_R$ – the number of blocks containing tuples of relation R
$L_R$ – the size of a tuple in bytes
$F_R$ – the number of tuples that fit into one block (blocking factor)
$V(X, R)$ – the number of distinct values for attribute X
$H_R$ – the height of the B+-tree indices for R
$LB_R$ – the number of leaf pages in the B+-tree indices for R

Note: In general, $V(X, R)$ equals the size of $\pi_X(R)$; if X is a key, then $V(X, R) = N_R$.

The storage of data in B+-trees

The B+-tree is a balanced multiway search tree that follows a multi-level index format. It supports both random access and sequential access to the data items.

[Figure: a B+-tree of order 5 and depth 3 containing 59 data items]


Other statistical information

If the tuples of a relation R are stored together physically in a file, then the following relation holds:

$$B_R = \left\lceil \frac{N_R}{F_R} \right\rceil$$
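
The relation between the number of blocks, the number of tuples and the blocking factor can be checked with a short calculation. The sketch below assumes fixed-size tuples that never span two blocks, so that the blocking factor is the block size divided by the tuple size, rounded down; the block and tuple sizes are made-up values.

```python
def blocking_factor(block_size_bytes, tuple_size_bytes):
    """F_R, assuming fixed-size tuples that do not span blocks."""
    return block_size_bytes // tuple_size_bytes


def blocks_needed(n_tuples, f_r):
    """B_R = ceil(N_R / F_R), computed with integer arithmetic."""
    return (n_tuples + f_r - 1) // f_r


f_r = blocking_factor(4096, 100)    # hypothetical 4 KB blocks, 100-byte tuples
b_r = blocks_needed(10_001, f_r)    # the last block holds a single tuple
```

The rounding up matters: 10,001 tuples at 40 tuples per block need 251 blocks, not 250.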

Special statistical information:

Histogram –

Final comments

Some facts about the relation statistics:

– Recomputing relation statistics on every update keeps them accurate but can be a huge overhead; they should be recomputed at least during periods of light system load.
– In real-world systems, optimizers often maintain further statistical information to improve the accuracy of their cost estimates of evaluation plans.

Size estimation for selection operation

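Assumption: Attribute values are uniformly distributed.

$S(\sigma_{X = x}(R)) = \frac{N_R}{V(X, R)}$
$S(\sigma_{X \le x}(R)) = \frac{N_R \cdot (x - \min(X, R) + 1)}{\max(X, R) - \min(X, R) + 1}$, where $\min(X, R)$ and $\max(X, R)$ denote the minimum and maximum values of the attribute X in R, respectively
$S(\sigma_{\theta_1 \wedge \theta_2 \wedge \ldots \wedge \theta_n}(R)) = \frac{S_1 \cdot S_2 \cdots S_n}{N_R^{\,n-1}}$, where $S_i$ denotes the estimated size of the selection operation $\sigma_{\theta_i}(R)$
$S(\sigma_{\theta_1 \vee \theta_2 \vee \ldots \vee \theta_n}(R)) = \frac{N_R^{\,n} - (N_R - S_1)(N_R - S_2) \cdots (N_R - S_n)}{N_R^{\,n-1}}$, where $S_i$ denotes the estimated size of the selection operation $\sigma_{\theta_i}(R)$
$S(\sigma_{\neg\theta}(R)) = N_R - S(\sigma_{\theta}(R))$

Note: $\theta$ denotes a predicate. The $+1$ terms in the range formula count the integer values in a range; the worked examples in these slides use the variant without them, $\frac{N_R \cdot (x - \min(X, R))}{\max(X, R) - \min(X, R)}$.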
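
The estimators above translate directly into small functions. The following is a sketch under the same uniform-distribution assumption. The function names, the `discrete` flag (which switches between the range formula with the +1 terms and the variant without them) and the clamping of out-of-range constants are additions made here for illustration.

```python
from fractions import Fraction
from math import prod


def eq_size(n_r, distinct):
    """S(sigma_{X=x}(R)) = N_R / V(X, R)."""
    return Fraction(n_r, distinct)


def le_size(n_r, x, lo, hi, discrete=True):
    """S(sigma_{X<=x}(R)); discrete=True uses the form with the +1 terms."""
    if x < lo:                       # constant below the minimum: no tuples
        return Fraction(0)
    if x >= hi:                      # constant at or above the maximum: all tuples
        return Fraction(n_r)
    extra = 1 if discrete else 0
    return Fraction(n_r * (x - lo + extra), hi - lo + extra)


def and_size(n_r, sizes):
    """Conjunction: S_1 * ... * S_n / N_R^(n-1)."""
    return prod(sizes, start=Fraction(1)) / Fraction(n_r) ** (len(sizes) - 1)


def or_size(n_r, sizes):
    """Disjunction: (N_R^n - prod(N_R - S_i)) / N_R^(n-1)."""
    n = len(sizes)
    misses = prod((n_r - s for s in sizes), start=Fraction(1))
    return (Fraction(n_r) ** n - misses) / Fraction(n_r) ** (n - 1)


def not_size(n_r, size):
    """Negation: N_R - S(sigma_theta(R))."""
    return n_r - size


# Hypothetical relation: 100 tuples, attribute X with 10 distinct values in [1, 10].
s_eq = eq_size(100, 10)
s_le = le_size(100, 5, 1, 10)
s_and = and_size(100, [s_eq, s_le])
s_or = or_size(100, [s_eq, s_le])
s_not = not_size(100, s_eq)
```

Using `Fraction` keeps the estimates exact, so fractional sizes such as 3/2 are not rounded away before they are combined.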



Size estimation for selection operation

The size estimates are unreliable in the following cases:
Values of the attribute on which the selection is applied are not uniformly distributed.
The number of tuples is very small.
The number of distinct values for an attribute is very small.
The predicates connected (by logical AND or OR) under the selection depend on each other.

Size estimation for selection – A conceptual example


Consider the following table and its system catalog information:

Table: TOY
ID   COLOR   COST
T1   Blue    10
T2   Blue    10
T3   Red     20
T4   Blue    20
T5   Red     30
T6   Red     30

Catalog information:
$N_{TOY} = 6$
$V(COLOR, TOY) = 2$
$V(COST, TOY) = 3$
$\min(COST, TOY) = 10$
$\max(COST, TOY) = 30$
$H_{TOY} = 3$

Estimates:
$S(\sigma_{COLOR = Red}(TOY)) = \frac{N_{TOY}}{V(COLOR, TOY)} = 3$
$S(\sigma_{COST \le 20}(TOY)) = \frac{N_{TOY} \cdot (20 - \min(COST, TOY))}{\max(COST, TOY) - \min(COST, TOY)} = 3$
$S(\sigma_{(COLOR = Red) \wedge (COST \le 20)}(TOY)) = \frac{S(\sigma_{COLOR = Red}(TOY)) \cdot S(\sigma_{COST \le 20}(TOY))}{N_{TOY}} = 1.5$
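
Because the TOY table is tiny, the estimates can be compared with the true result sizes. The sketch below derives the catalog values from the six rows, recomputes the three estimates, and counts the tuples that actually qualify; the variable names are chosen here for illustration.

```python
from fractions import Fraction

TOY = [
    ("T1", "Blue", 10), ("T2", "Blue", 10), ("T3", "Red", 20),
    ("T4", "Blue", 20), ("T5", "Red", 30), ("T6", "Red", 30),
]

# Catalog values derived from the table itself.
n = len(TOY)
v_color = len({color for _, color, _ in TOY})
costs = [cost for _, _, cost in TOY]
lo, hi = min(costs), max(costs)

# Estimates, following the formulas used in the example.
est_red = Fraction(n, v_color)
est_le20 = Fraction(n * (20 - lo), hi - lo)
est_both = est_red * est_le20 / n

# True result sizes, obtained by scanning the table.
actual_red = sum(1 for _, color, _ in TOY if color == "Red")
actual_le20 = sum(1 for _, _, cost in TOY if cost <= 20)
actual_both = sum(1 for _, color, cost in TOY if color == "Red" and cost <= 20)
```

The estimate for COLOR = Red matches the true count of 3, while COST ≤ 20 is estimated at 3 against a true count of 4, and the conjunction at 1.5 against a true count of 1. Gaps like these are expected for a table with so few tuples.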

Size estimation for Cartesian product and natural join

Cartesian product:
$S(R_1 \times R_2) = N_{R_1} \cdot N_{R_2}$ (each tuple occupies $L_{R_1} + L_{R_2}$ bytes).

Natural join, where $A(R)$ denotes the set of attributes of R:
If $A(R_1) \cap A(R_2) = \emptyset$: $S(R_1 \bowtie R_2) = N_{R_1} \cdot N_{R_2}$
If $A(R_1) \cap A(R_2)$ is a key for $R_1$: $S(R_1 \bowtie R_2) \le N_{R_2}$
If $A(R_1) \cap A(R_2)$ is a key for $R_2$: $S(R_1 \bowtie R_2) \le N_{R_1}$
If $A(R_1) \cap A(R_2) = X$ is a key for neither $R_1$ nor $R_2$: $S(R_1 \bowtie R_2) = \min\left(\frac{N_{R_1} \cdot N_{R_2}}{V(X, R_1)}, \frac{N_{R_1} \cdot N_{R_2}}{V(X, R_2)}\right)$

Size estimation for other operations

Projection: $S(\pi_X(R)) = V(X, R)$, where X is a set of attributes.
Aggregation: grouping on X produces $V(X, R)$ tuples, one per distinct value of X.
Set union: $S(R_1 \cup R_2) \le N_{R_1} + N_{R_2}$.
Set intersection: $S(R_1 \cap R_2) \le \min(N_{R_1}, N_{R_2})$.
Set difference: $S(R_1 - R_2) \le N_{R_1}$.
Query size estimation – Example I

Given a relation R with 60 tuples, an attribute Age whose values lie in the range [20, 30], and an attribute Height with 15 distinct values, the minimum of which is 170, estimate the size of the query $\sigma_{(Age \le 23) \vee (Height = 170)}(R)$.
Solution: It is given that $N_R = 60$, $\min(Age, R) = 20$, $\max(Age, R) = 30$, $\min(Height, R) = 170$ and $V(Height, R) = 15$. Therefore, under the uniform distribution assumption, the size of the query can be estimated as

$$\begin{aligned}
S(\sigma_{(Age \le 23) \vee (Height = 170)}(R))
&= \frac{N_R^2 - \big(N_R - S(\sigma_{Age \le 23}(R))\big) \cdot \big(N_R - S(\sigma_{Height = 170}(R))\big)}{N_R} \\
&= \frac{N_R^2 - \Big(N_R - \frac{N_R \cdot (23 - \min(Age, R))}{\max(Age, R) - \min(Age, R)}\Big) \cdot \Big(N_R - \frac{N_R}{V(Height, R)}\Big)}{N_R} \\
&= \frac{60^2 - (60 - 18)(60 - 4)}{60} \\
&= 20.8.
\end{aligned}$$
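
The arithmetic of this example can be checked with exact fractions. The snippet below plugs the given statistics into the disjunction formula, using the range estimate without the +1 terms as the solution does; the variable names are chosen here for illustration.

```python
from fractions import Fraction

n_r = 60
s_age = Fraction(n_r * (23 - 20), 30 - 20)    # S(sigma_{Age<=23}(R))
s_height = Fraction(n_r, 15)                  # S(sigma_{Height=170}(R))

# Disjunction of two predicates: (N_R^2 - (N_R - S_1)(N_R - S_2)) / N_R
estimate = (n_r**2 - (n_r - s_age) * (n_r - s_height)) / n_r
```

The two partial estimates are 18 and 4 tuples, and the combined estimate is 104/5, which is the 20.8 stated in the solution.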
Query size estimation – Example II


Let R1(ID, Name) and R2(Roll, CGPA) be a pair of relations. If ID is the primary key of R1 and the attribute Roll has a minimum value of 118002001, estimate the size of the query $\sigma_{ID = 11}(R1) \bowtie \sigma_{Roll \le 118002001}(R2)$.
Solution: Since ID is the primary key of R1, its values are distinct, so $V(ID, R1) = N_{R1}$. Since $\min(Roll, R2) = 118002001$, the value of Roll cannot go below 118002001, so the condition $Roll \le 118002001$ reduces to an equality check on this attribute, whose size can be estimated under the uniform distribution assumption. Because R1 and R2 share no attributes, the natural join is a Cartesian product. Hence, the size of the given query can be estimated as

$$\begin{aligned}
S(\sigma_{ID = 11}(R1) \bowtie \sigma_{Roll \le 118002001}(R2))
&= S(\sigma_{ID = 11}(R1)) \times S(\sigma_{Roll = 118002001}(R2)) \\
&= \frac{N_{R1}}{V(ID, R1)} \times \frac{N_{R2}}{V(Roll, R2)}
= \frac{N_{R1}}{N_{R1}} \times \frac{N_{R2}}{V(Roll, R2)}
= \frac{N_{R2}}{V(Roll, R2)}.
\end{aligned}$$

Problems

1 Given a pair of relations R1 and R2, wherein the primary key of R1 (say K) is the only foreign key of R2, estimate the size of the following queries. Assume that the numbers of tuples in R1 and R2 are t1 and t2, respectively.
i $R1 \bowtie R2$.
ii $R1 \times R2$.
iii $\sigma_{K = \text{'19BM6JP010'}}(R1) \bowtie R2$.
