Advanced Database Management System
Query processing and optimization
Database security and authorization
Transaction processing concepts
MID SEMESTER EXAMINATION
Concurrency control techniques
Database recovery techniques
Distributed database system
Query Processing and
Optimization
Database system features
Crash Recovery
Integrity Checking
Security
Concurrency Control
Query Processing and Optimization
File Organization and Optimization
2
Query Processing and Optimization
Translating SQL Queries into Relational Algebra
Basic Algorithms for Executing Query Operations
Using Heuristic in Query Optimization
Using Selectivity and Cost Estimates in Query
Optimization
Semantic Query Optimization
•Query languages: Allow manipulation and
retrieval
of data from a database.
3
Relational algebra-operations on a relation
Operations that could be applied on relations
•
Selection ( ) Selects a subset of rows from a relation.
Projection ( ) Deletes unwanted columns from a relation.
Renaming: assigning intermediate relation for a single operation
Cross-Product ( x ) Allows to concatenate a tuple from one
relation with all the tuples from the other relation.
Set-Difference ( - ) Tuples in relation R1, but not in relation
R2.
Union ( ) Tuples in relation R , or in relation R .
1 2
Intersection () Tuples in relation R and in relation R
1 1
Join Tuples joined from two relations based on a condition
Query processing :
The aim of query processing is to find information in one or more
databases and deliver it to the user.
4
Query Processing and Optimization
A query expressed in a high-level query language such as
SQL must first be scanned, parsed, and validated.
The scanner identifies the language tokens—such as
SQL keywords, attribute names, and relation names.
The parser checks the query syntax based on the rules of
the query language.
The query must also be validated, by checking that all
attribute and relation names are valid and semantically
meaningful names in the schema of the particular
database being queried.
5
Query Processing and Optimization
An internal representation of the query is then
created, usually as a tree data structure called a
query tree.
It is also possible to represent the query using a
graph data structure called a query graph.
The DBMS must then devise an execution strategy
for retrieving the result of the query from the
database files. A query typically has many possible
execution strategies, and the process of choosing a
suitable one for processing a query is known as
query optimization.
6
Query Processing and Optimization
The query optimizer module has the task of
producing an execution plan.
The code generator generates the code to
execute that plan.
The runtime database processor has the task
of running the query code.
7
Query Processing (cont…)
8
Query Processing (cont…)
Translating SQL queries into relational
algebra
Query block:
The basic unit that can be translated into the
algebraic operators
A query block contains a single SELECT-FROM-
WHERE expression, as well as GROUP BY and
HAVING clause if these are part of the block.
Nested queries within a query are identified as
separate query blocks.
9
Query Processing (cont…)
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);
SELECT LNAME, FNAME SELECT MAX (SALARY)
FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY > C WHERE DNO = 5
πLNAME, FNAME (σSALARY>C(EMPLOYEE)) ℱMAX SALARY (σDNO=5 (EMPLOYEE))
10
Basic algorithms for executing query
operations
External Sorting
Implementing the SELECT Operation
Implementing the JOIN Operation
Implementing PROJECT and Set Operations
11
Basic algorithms for executing query
operations
External sorting:
Refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in
main memory, such as most database files.
Sorting is one of the primary algorithms used in query
processing. For example, whenever an SQL query
specifies an ORDER BY-clause, the query result must be
sorted. Sorting is also a key component in sort-merge
algorithms used for JOIN and other operations (such as
UNION and INTERSECTION), and in duplicate
elimination algorithms for the PROJECT operation (when
an SQL query specifies the DISTINCT option in the
SELECT clause).
12
External sorting:
13
External sorting:
External sorting uses Sort-Merge strategy:
Starts by sorting small subfiles (runs) of the main file
and then merges the sorted runs, creating larger sorted
subfiles that are merged in turn. 2 phases:
Sorting phase:
Number of subfiles (runs) nR = (b/nB)
Merging phase:
Degree of merging(dM) = Min (nB-1, nR); nP = (logdM(nR))
14
Sort-Merge strategy (cont…)
nR: number of initial runs;
b: number of file blocks;
nB: available buffer space;
dM: degree of merging;
P: number of passes
The size of a run and number of initial run depends on the
number of file blocks (b) and available buffer space (nB)
Example: if nB=5 blocks and size of the file=1024 blocks,
nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks
except the last run which will have 4 blocks.
Hence, after the sort phase, 205 sorted runs are stored
as temporary subfiles on disk
15
Sort-Merge strategy (cont…)
In the merging phase, the sorted runs are merged during
one or more passes.
The degree of merging (dM) is the number of runs that
can be merged in each pass.
dM=min((nB-1) and nR))
The number of passes=[(logdM (nR))]
In each pass, one buffer block is needed to hold one
block from each of the runs being merged and one block
is needed for containing one block of the merge result
16
Sort-Merge strategy (cont…)
In the above example, dM=4(four way merging)
Hence, the 205 initial sorted runs would be merged into:
52 at the end of the first pass
13 at the end of the second pass
4 at the end of the third pass
1 at the end of the fourth pass
The minimum dM of 2 gives the worst case performance
of the algorithm, which is
2*b +2(*(b*(log2 (nR) )
The first term represents the number of block access for
the sort phase and the second term represents the
number of block access for the merge phase
17
Exercise using the relation given below
Assume available buffer
space is 3 blocks.
Find DM?
Find P?
Find the number of
merges in each pass?
18
Merging phase
Step 1 (only for run 1 and run 2)
Read a (from first run) and b(from second run) and load to buffer
a is written into an output buffer
Read d(from first run) and load to buffer
a is written to disk, and b is written into an output buffer
Read c (from second run) and load to buffer
b is written to disk, and c is written into an output buffer
Read e(from second run) and load to buffer
c is written to disk, d is written to an output buffer
Read g (from first run) and load to buffer
d is written to disk, e is written to an output buffer
And write g to disk
Do for Step 1 (for run 3 and run 4) and the whole for step 2
19
Implementing the SELECT Operation
There are many options for executing a SELECT operation
Some options depend on the file having specific access
paths and may apply only to certain types of selection
conditions
Examples:
(OP1): SSN='123456789' (EMPLOYEE)
(OP2): DNUMBER>5(DEPARTMENT)
(OP3): DNO=5(EMPLOYEE)
(OP4): DNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)
20
Implementing the SELECT Operation (cont…)
Search Methods for implementing Simple Selection:
S1 Linear search (brute force):
Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
S2 Binary search:
If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used. An example
is OP1 if SSN is the ordering attribute for EMPLOYEE file
S3 Using a primary index or hash key to retrieve a
single record:
If the selection condition involves an equality comparison on a
key attribute with a primary index (or a hash key), use the
primary index (or the hash key) to retrieve the record.
For Example, OP1 use primary index to retrieve the record
21
Implementing the SELECT Operation (cont…)
Search Methods for implementing Simple Selection
S4 Using a primary index to retrieve multiple
records:
If the comparison condition is >, ≥, <, or ≤ on a key
field with a primary index, use the index to find the
record satisfying the corresponding equality
condition, then retrieve all subsequent records in
the (ordered) file. (see OP2)
S5 Using a clustering index to retrieve multiple
records:
If the selection condition involves an equality
comparison on a non-key attribute with a clustering
index, use the clustering index to retrieve all the
records satisfying the selection condition. (See OP3)
22
Implementing the SELECT Operation (cont…)
• S6: using a secondary index on an equality comparison:
• This search method can be used to retrieve a single record if
the indexing field is a key (has unique values) or to retrieve
multiple records if the indexing field is not a key
• Search Methods for implementing complex Selection like OP4
• S7: Conjunctive selection using an individual index :
If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether
each retrieved record satisfies the remaining simple
conditions in the conjunctive condition
23
Implementing the SELECT Operation
(summarized)
Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
If an access path exists, the method corresponding
to that access path is used;
Otherwise, the “brute force” linear search approach
of method S1 is used
For conjunctive selection conditions, whenever
more than one of the attributes involved in the
conditions have an access path, query optimization
should be done to choose the access path that
retrieves the fewest records in the most efficient way
24
Implementing the SELECT Operation
(summarized)
Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical
connective rather than AND.
Compared to conjunctive selection, It is much harder to
process and optimize.
Example: DNO=5 OR SALARY>3000 OR SEX=‘F’(EMPLOYEE)
Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions.
If any of the individual conditions does not have an
access path, we are compelled to use the brute force
approach.
25
Implementing the JOIN Operation:
The join operation is one of the most time
consuming operation in query processing
Join
two–way join: a join on two files
e.g. R A=B S
multi-way joins: joins involving more than two files
e.g. R A=B S C=D T
Examples
(OP5): EMPLOYEE DNO=DNUMBER DEPARTMENT
(OP6): DEPARTMENT MGRSSN=SSN EMPLOYEE
26
Implementing the JOIN Operation(cont…)
Methods for implementing joins:
J1 Nested-loop join (brute force):
For each record t in R (outer loop), retrieve every
record s from S (inner loop) and test whether the
two records satisfy the join condition t[A] = s[B]
J2 Single-loop join (Using an access structure to
retrieve the matching records):
If an index (or hash key) exists for one of the two
join attributes — say, B of S — retrieve each record
t in R, one at a time, and then use the access
structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].
27
Implementing the JOIN Operation (cont…)
Methods for implementing joins:
J3 Sort-merge join:
If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
Both files are scanned in order of the join attributes,
matching the records that have the same values for
A and B.
In this method, the records of each file are scanned
only once each for matching with the other file—
unless both A and B are non-key attributes
28
Implementing the JOIN Operation (cont’d)
Other type of JOIN algorithms
Partition hash join
Partitioning phase:
Each file (R and S) is first partitioned into M partitions using a
partitioning hash function on the join attributes:
R1 , R2 , R3 , ...... Rm and S1 , S2 , S3 , ...... Sm
Minimum number of in-memory buffers needed for the
partitioning phase: M.
A disk sub-file is created per partition to store the tuples for
that partition.
Joining or probing phase:
Involves M iterations, one per partitioned file.
Iteration i involves joining partitions Ri and Si.
29
Algorithms for PROJECT and SET Operations
Algorithm for PROJECT operation
<attribute list>(R)
1. If <attribute list> has a key of relation R, extract all tuples from
R with only the values for the attributes in <attribute list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.
Methods to remove duplicate tuples
1. Sorting
2. Hashing
Note: in SQL query, the default is not to remove duplicates.
To remove duplicates we have to use the key word
‘DISTINCT’.
30
Set operations
UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN
PRODUCT
CARTESIAN PRODUCT: too expensive. If R has n records and j
attributes and S has m records and k attributes, the result
relation will have n * m records and j + k attributes. Hence, it is
important to avoid the CARTESIAN PRODUCT operation and
substitute other equivalent operations during query optimization.
UNION, INTERSECTION, and SET DIFFERENCE: apply with
the same number of attributes and the same attribute domains.
The customary way to implement these operations is to use:
sort-merge technique: the two relations are sorted on the
same attributes, and, after sorting, a single scan through each
relation is sufficient to produce the result.
31
Query Optimization
There are two main techniques for implementing query
optimization.
Heuristic rules
A heuristic is a rule that works well in most cases but is
not guaranteed to work well in every possible case. The
rules typically reorder the operations in a query tree.
Cost estimation
Determining the cost of different execution strategies and
choosing the execution plan with the lowest cost estimate.
The two techniques are usually combined in a query
optimizer
32
General Transformation rules for
relational algebra
1. Cascade of : A conjunctive selection condition can be
broken up into a cascade (sequence) of individual
operations:
c1 AND c2 AND ... AND cn(R) = c1 (c2 (...(cn(R))...) )
2. Commutativity of :
The operation is commutative:
c1 (c2(R)) = c2 (c1(R))
3. Cascade of : In a cascade (sequence) of operations, all
but the last one can be ignored:
List1 (List2 (...(Listn(R))...) ) = List1(R)
33
Transformation Rules (cont…)
4. Commuting with :
If the selection condition c involves only the
attributes A1, ..., An in the projection list, the two
operations can be commuted:
A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))
5. Commuting of and X: both are commutative
R cS = S c R, RXS=SXR
34
Transformation Rules (cont…)
6. Commuting with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and
B1, ..., Bm are attributes of S and the join condition c
involves only attributes in L, the two operations can be
commuted as follows:
L ( R C S ) = (A1, ..., An (R)) C ( B1, ..., Bm (S))
7. Converting a (, x) sequence into : If the condition c of a
that follows a x Corresponds to a join condition, convert
the (, x) sequence into a as follows:
(C (R x S)) = (R C S)
35
Transformation Rules (cont…)
8. Commutativity of the Set Operations: UNION and
INTERSECTION but not SET DIFFERENCE
RS=SR and RS=SR
9. Associativity of the THETA JOIN, CARTESIAN
PRODUCT, UNION and INTERSECTION.
(R S) T=R (S T) where is one of the
operations
10. Commuting SELECTION with SET OPERATIONS
c (R S)= (c(R) c(S)) where is one of the operations
11. Commuting PROJECTION with UNION
L1 (SR )= L1 (S ) L1 (R )
36
Using Heuristics in Query Optimization
The heuristic approach of optimization will make use of:
Properties of individual operators
Association between operators
Query Tree: a graphical representation of the operators, relations,
attributes and predicates and processing sequence during query
processing.
Query tree is composed of three main parts:
The Leafs: the base relations used for processing the query/
extracting the required information
The Root: the final result/relation as an output based on the
operation on the relations used for query processing
Nodes: intermediate results or relations before reaching the final
result.
Sequence of execution of operation in a query tree will start from the leaves
and continues to the intermediate nodes and ends at the root
37
Using Heuristics in Query Optimization (cont…)
Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query
The main heuristic is to apply first the operations that
reduce the size of intermediate results
E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
38
Using Heuristics in Query Optimization (cont…)
Heuristic Optimization of Query Trees:
The same query could correspond to many different relational algebra
expressions — and hence many different query trees.
The task of heuristic optimization of query trees is to find a final query
tree that is efficient to execute.
Example:
Project (Pname, Pnumber), Employee (Bdate, Ssn, Fname, Lname)
and Works_on (Pno, Essn). Project and Employee are related by
Works_on table.
List Last Name of employees who are working in AQUARIUS project
and born after 1957-12-31.
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND ESSN=SSN
AND BDATE > ‘1957-12-31’;
39
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q
40
Steps in converting a query tree during heuristic optimization:
Moving the select operation down the tree
41
Steps in converting a query tree during heuristic optimization:
Applying the more restrictive select operation first
42
Steps in converting a query tree during heuristic optimization:
Replacing Cartesian product and select with join
43
Steps in converting a query tree during heuristic optimization:
Moving project operations down the query tree
44
Using Heuristics in Query Optimization
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and join operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)
45
Exercise
Consider the following schemas and the query, where the
EMPLOYEE and the PROJECT relations are related by the
WORKS_ON relation.
EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)
PROJECT (PProjID, PName, PLocation, PFund, PManagerID)
WORKS_ON (WEmpID, WProjID
Query: The manager of the company working on road construction
would like to view employees name born before January 1, 1965 who
are working on the project named Ring Road.
Formulate SQL query, relational algebra and finally construct the
optimum query tree (show the steps you passed to reach the final
tree). You must justify to simplify your tree.
46
Using Selectivity and Cost Estimates in Query
Optimization
Cost-based query optimization:
Estimate and compare the costs of executing a query
using different execution strategies and choose the
strategy with the lowest cost estimate
Cost Components for Query Execution
1. Access cost to secondary storage
2. Computation cost
3. Memory usage cost
4. Communication cost
Note: Different database systems may focus on different
cost components.
47
Using Selectivity and Cost Estimates in
Query Optimization
Catalog Information Used in Cost Functions
Information about the size of a file
number of records (tuples) (r),
record size (R),
number of blocks (b)
blocking factor (bfr) –a factor of block size and record size
Information about indexes and indexing attributes of a file
Number of levels (x) of each multilevel index
Number of first-level index blocks (bI1)
Number of distinct values (d) of an attribute
Selectivity (sl) of an attribute-r/d
48
Semantic Query Optimization
Semantic Query Optimization:
Uses constraints specified on the database schema in order to
modify one query into another query that is more efficient to
execute
Consider the following SQL query,
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E M
WHERE E.SUPERSSN=M.SSN AND E.SALARY>M.SALARY
Explanation:
Suppose that we had a constraint on the database schema that
stated that no employee can earn more than his or her direct
supervisor. If the semantic query optimizer checks for the
existence of this constraint, it need not execute the query at all
because it knows that the result of the query will be empty.
49