Unit 3 - Query Optimisation
SEMESTER 1
O02CA504
DATABASE MANAGEMENT SYSTEM
TABLE OF CONTENTS
1 Introduction
1.1 Objectives
2 Query Execution Algorithm
2.1 External sorting
2.2 Implementing the SELECT operation
2.3 Methods to implement JOIN operation
2.4 Project and Set operations implementation
2.5 Aggregate operations implementation
3 Heuristics in Query Optimisation
3.1 Notation for query trees and query graphs
3.2 General transformation rules for relational algebraic operations
3.3 Conversion of query trees into the query execution plans
4 Semantic Query Optimisation
5 Multi-Query Optimisation and Application
6 Execution Strategies for SQL Sub Queries
7 Query Processing for SQL Updates
8 Summary
9 Glossary
10 Terminal Questions
11 Answers
11.1 Self-Assessment Questions
11.2 Terminal Questions
12 References
1. INTRODUCTION
You have already studied DBMS and SQL in the previous units, so query optimisation, which this
unit covers, builds on ground that is already familiar to you.
Query optimisation is a technique that helps the DBMS reduce query execution time. Nowadays,
most database software includes an optimising SQL compiler that first analyses the SQL query,
rewrites it if required, and finally develops an efficient plan to retrieve the data from the
database. This module of the SQL compiler is called the Query Optimiser. The optimisation is
based on rules derived from the cost of each operation in the query.
We start our discussion with an overview of various algorithms for query operations in the context
of an RDBMS. We then discuss the heuristics used in query optimisation, followed by a brief
overview of semantic query optimisation, multi-query optimisation and its applications. The latter
part of the unit deals with execution strategies for SQL sub-queries and query processing for SQL
updates.
1.1. Objectives
After studying this unit, you should be able to:
• Describe the algorithms for executing query operations and discuss the heuristics in query
optimisation.
• Explain briefly semantic query optimisation.
• Identify multi-query optimisation and application.
• Explain the execution strategies for SQL sub-queries and discuss query processing for SQL
updates.
2. QUERY EXECUTION ALGORITHM
An RDBMS provides various algorithms for implementing the different types of relational operations
that appear in a query execution strategy.
These include external sorting (sort-merge) and the algorithms that implement SELECT, JOIN,
PROJECT, the set operations (UNION, INTERSECTION, SET DIFFERENCE) and the aggregate operations
MIN, MAX, COUNT, AVERAGE and SUM. Let’s discuss these in detail.
2.1. External sorting
Internal sorting works entirely in main memory, so it is fast but limited by the amount of
comparatively expensive memory available, whereas external sorting is slower but relies on cheaper
secondary storage devices. External sorting is appropriate for large files stored on disk that
cannot fit completely in main memory. Internal sorting algorithms are suitable for sorting data
structures that fit completely in memory.
An external sorting algorithm uses a sort-merge strategy. The sort-merge strategy divides the main
file into smaller sub-files termed runs. These runs are sorted first and then merged to make larger
sorted runs, which are merged in turn until a single sorted file remains. External sorting needs
buffer space in main memory to carry out the actual sorting and merging of the runs.
The external sorting algorithm carries out the operation in the following two phases:
Phase 1: Sorting phase: In this phase, portions of the file that fit in the available buffer space
are read into main memory, sorted using an internal sorting algorithm, and written back to disk as
temporary sorted runs. The number of initial runs (nR) and the size of each run are governed by the
number of file blocks (b) and the available buffer space (nB): each run holds nB blocks, so
nR = ⌈b / nB⌉.
For instance,
if nB = 5 blocks
b = 1024 blocks
then nR = ⌈1024 / 5⌉ = 205.
Hence, after the sort phase, 205 sorted runs are stored as temporary sub-files on disk.
Phase 2: Merging phase: In this phase, the sorted runs are merged over one or more passes. The
number of runs that can be merged together in each pass is termed the degree of merging (dM). In
each pass, one buffer block is required to hold a block from each run being merged, and one further
buffer block is required to hold a block of the merged result.
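The two phases can be sketched in a few lines of Python. This is a minimal in-memory illustration rather than a disk-based implementation: the file is modelled as a list of blocks, each a short list of keys, and the names `external_sort`, `blocks` and `n_buffers` are chosen here for illustration only. The sketch assumes at least three buffers, so that two runs can be merged while one buffer holds output.

```python
import heapq


def external_sort(blocks, n_buffers):
    """Sort a 'file' given as a list of blocks (each block is a list of keys)."""
    if n_buffers < 3:
        raise ValueError("need at least 3 buffers: 2 input runs + 1 output")

    # Phase 1 (sorting): read n_buffers blocks at a time, sort them in memory
    # and keep the result as one temporary sorted run.
    runs = []
    for start in range(0, len(blocks), n_buffers):
        chunk = [key for block in blocks[start:start + n_buffers] for key in block]
        runs.append(sorted(chunk))

    # Phase 2 (merging): one buffer is reserved for output, so at most
    # n_buffers - 1 runs are merged together in each step.
    degree = n_buffers - 1
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[i:i + degree]))
            for i in range(0, len(runs), degree)
        ]
    return runs[0] if runs else []


# Hypothetical file of five blocks with two keys each, sorted with 3 buffers.
FILE_BLOCKS = [[9, 4], [7, 1], [8, 2], [6, 3], [5, 0]]
SORTED_KEYS = external_sort(FILE_BLOCKS, 3)
```

With three buffers the sort phase produces two runs (six keys and four keys), and a single two-way merge pass yields the final sorted file.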
The worst-case performance occurs at the minimum value of dM, which is 2. The number of block
accesses is then (2*b) + (2*(b*(log_dM b))). Here, the 1st term is the number of block accesses in
the sort phase, since each file block is accessed twice: once when it is read into memory and once
when the sorted records are written back to disk. The 2nd term is the number of block accesses in
the merge phase, assuming the worst case of dM = 2.
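The arithmetic of the example can be checked with a short calculation. The sketch below is an illustration under stated assumptions: it takes dM as the smaller of (nB - 1) and nR, which follows from reserving one buffer for output, and it counts the merge passes directly instead of using the logarithm, charging one read and one write of every block per pass. The function name `external_sort_cost` is hypothetical.

```python
import math


def external_sort_cost(b, n_b):
    """Return (nR, dM, merge passes, block accesses) for b blocks and n_b buffers."""
    if n_b < 3:
        raise ValueError("need at least 3 buffers to merge")
    n_runs = math.ceil(b / n_b)        # nR: initial sorted runs
    d_m = min(n_b - 1, n_runs)         # dM: runs merged together per pass (assumed)
    passes = 0
    runs = n_runs
    while runs > 1:                    # each pass merges groups of d_m runs
        runs = math.ceil(runs / d_m)
        passes += 1
    # Sort phase: every block read once and written once (2*b).
    # Each merge pass: every block read once and written once again.
    block_accesses = 2 * b + 2 * b * passes
    return n_runs, d_m, passes, block_accesses


EXAMPLE = external_sort_cost(1024, 5)
```

For b = 1024 and nB = 5 this gives 205 initial runs, a degree of merging of 4, and four merge passes (205 runs become 52, then 13, then 4, then 1).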
2.2. Implementing the SELECT operation
Now, let’s write a query that selects all the records with EMPNO 3276 from table EMP (Table 4.1):
(Query1): σ EMPNO = ‘3276’ (EMP)
If you want to retrieve all the records with a DNO greater than four from the table DEPT (Table
4.2), then the query will be framed like:
(Query2): σ DNO > 4 (DEPT)
Table 4.2: Instance of DEPT Table
DNAME        DNO   MGRENO
Research     4     4001
Production   5     5001
Sales        1     1001
If you want to select all the records from table EMP where the DNO field is 2, then it can be written
as:
(Query3): σ DNO = 2 (EMP)
The SELECT operation can retrieve data with multiple criteria as well. For example, select all the
records about the female employees whose SALARY is more than 10000 from department 3 with
reference to the EMP table; it will be framed like this:
Search Methods for Simple Selection: Numerous search algorithms can be used to select records from
a file. These are termed file scans because they scan the records of a file to find and retrieve
those that satisfy the selection condition. If the search algorithm uses an index, the search is
termed an index scan. The basic search algorithms are explained below.
Linear search (brute force): It retrieves every record in the file and checks whether its
attribute values satisfy the selection condition. This process continues until the end of the file.
So the above-mentioned Query 1, σ EMPNO = ‘3276’ (EMP), may be executed using the linear search
algorithm.
Binary search: It relies on the file being ordered on some field. If the selection condition is an
equality comparison on the key attribute on which the file is ordered, binary search can be used.
Binary search is faster and more efficient than linear search because the search space is halved
by each comparison.
So, if EMPNO is the ordering attribute of the EMP file, then Query 1, σ EMPNO = ‘3276’ (EMP), may
be executed using binary search.
Using primary index (or hash key): If the selection condition is an equality comparison on a key
attribute that has a primary index (or hash key), then the index (or hash) search can be used. If,
in Query 1, the EMP file is hashed on EMPNO, then a hash search may be used on this field. A search
on a primary index (or hash key) retrieves at most a single record.
Depending upon the cost associated with each method, the query optimiser picks the most
appropriate method for executing a SELECT operation. Now let us extend our discussion to the JOIN
operation.
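The three search methods can be compared side by side in a small Python sketch. The EMP records below are hypothetical (Table 4.1 is not reproduced here), a Python dict stands in for a hash file, and the function names are chosen for illustration only.

```python
# Hypothetical EMP records, stored in EMPNO order so that binary search is valid.
EMP = [
    {"EMPNO": "1120", "NAME": "Asha"},
    {"EMPNO": "2045", "NAME": "Ravi"},
    {"EMPNO": "3276", "NAME": "Meena"},
    {"EMPNO": "4410", "NAME": "John"},
]


def linear_search(records, attribute, value):
    """File scan: test every record against the selection condition."""
    return [r for r in records if r[attribute] == value]


def binary_search(ordered_records, attribute, value):
    """Equality search on the ordering key; halves the search space each step."""
    low, high = 0, len(ordered_records) - 1
    while low <= high:
        mid = (low + high) // 2
        key = ordered_records[mid][attribute]
        if key == value:
            return ordered_records[mid]
        if key < value:
            low = mid + 1
        else:
            high = mid - 1
    return None


def build_hash_index(records, attribute):
    """Map each key value to its record (at most one record per key)."""
    return {r[attribute]: r for r in records}


HASH_ON_EMPNO = build_hash_index(EMP, "EMPNO")
QUERY1_LINEAR = linear_search(EMP, "EMPNO", "3276")
QUERY1_BINARY = binary_search(EMP, "EMPNO", "3276")
QUERY1_HASH = HASH_ON_EMPNO.get("3276")
```

All three calls return the same EMP record for Query 1; they differ only in how many records are examined on the way.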
2.3. Methods to implement JOIN operation
There are various algorithms for implementing JOIN operations. The ones discussed below join two
files X and Y on the condition X.M = Y.N, where t denotes a record of X and s a record of Y:
Nested loop join (brute force): For each record t in X (outer loop), retrieve every record s from
Y (inner loop) and check whether the two records satisfy the join condition t[M] = s[N].
Single loop join (using an access structure to retrieve the matching records): If a hash key
or index exists for one of the two join attributes (say, N of Y), retrieve each record t in X, one
at a time, and then use the access structure to retrieve directly all matching records s from Y
that satisfy the condition s[N] = t[M].
Sort-merge join: JOIN can be implemented more efficiently if the records of the X and Y files are
physically sorted by the values of the join attributes M and N respectively. Both files are then
scanned concurrently in join-attribute order, and records with equal values of M and N are matched.
Hash join: The records of both files X and Y are hashed to the same hash file, using the same
hashing function on the join attributes M of X and N of Y as hash keys.
Firstly, the partitioning phase takes place: a single pass through the file with fewer records
(say X) hashes its records into the hash file buckets.
Secondly, the probing phase takes place: each record of the other file (say Y) is hashed to probe
the appropriate bucket and is combined with all the matching records from X in that bucket.
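The four join methods can be sketched in Python as follows. This is an in-memory illustration only: files are lists of dict records, a dict of lists stands in for the index or hash file, and the sample EMP rows are hypothetical (the DEPT rows follow Table 4.2). In every function `m` and `n` name the join attributes of the first and second file.

```python
from collections import defaultdict


def nested_loop_join(x, y, m, n):
    """For each t in X (outer loop), scan all s in Y (inner loop)."""
    return [(t, s) for t in x for s in y if t[m] == s[n]]


def build_index(y, n):
    """Access structure on attribute n of Y: value -> matching records."""
    index = defaultdict(list)
    for s in y:
        index[s[n]].append(s)
    return index


def single_loop_join(x, y_index, m):
    """One loop over X; matching Y records come straight from the index."""
    return [(t, s) for t in x for s in y_index.get(t[m], [])]


def sort_merge_join(x, y, m, n):
    """Sort both files on the join attributes, then scan them together."""
    xs, ys = sorted(x, key=lambda t: t[m]), sorted(y, key=lambda s: s[n])
    result, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i][m] < ys[j][n]:
            i += 1
        elif xs[i][m] > ys[j][n]:
            j += 1
        else:
            value, j_end = xs[i][m], j
            while j_end < len(ys) and ys[j_end][n] == value:
                j_end += 1
            while i < len(xs) and xs[i][m] == value:
                result.extend((xs[i], s) for s in ys[j:j_end])
                i += 1
            j = j_end
    return result


def hash_join(x, y, m, n):
    """Partition X into buckets, then probe the buckets with each Y record."""
    buckets = defaultdict(list)
    for t in x:                                   # partitioning phase
        buckets[t[m]].append(t)
    return [(t, s) for s in y for t in buckets.get(s[n], [])]  # probing phase


def pairs(result):
    """Order-independent summary of a join result."""
    return sorted((t["ENO"], s["DNAME"]) for t, s in result)


EMP = [{"ENO": 1, "DNO": 4}, {"ENO": 2, "DNO": 5},
       {"ENO": 3, "DNO": 4}, {"ENO": 4, "DNO": 9}]
DEPT = [{"DNAME": "Research", "DNO": 4}, {"DNAME": "Production", "DNO": 5},
        {"DNAME": "Sales", "DNO": 1}]
EXPECTED = [(1, "Research"), (2, "Production"), (3, "Research")]
```

All four functions return the same set of matching record pairs; only the number of comparisons and the need for sorting or an access structure differ.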
2.4. Project and Set operations implementation
The PROJECT operation, π <attribute list> (R), retrieves only the listed attributes of relation R.
If <attribute list> includes the key of relation R, then the output has the same number of tuples
as R, but with only the values of the attributes in <attribute list>.
If <attribute list> does not include the key of R, then the output may contain duplicate tuples.
These duplicates must be eliminated, which is usually done by sorting: after sorting, duplicate
tuples appear consecutively, and all but one copy of each are removed.
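Duplicate elimination by sorting can be sketched as below. The records and the function name `project` are illustrative assumptions; the point is only that after sorting, equal tuples sit next to each other and can be dropped in a single pass.

```python
def project(records, attributes):
    """PROJECT onto `attributes`, removing duplicates via sorting."""
    # Keep only the requested attribute values, then sort the tuples.
    projected = sorted(tuple(r[a] for a in attributes) for r in records)
    result = []
    for row in projected:
        # Duplicates are consecutive after sorting, so compare with the last kept row.
        if not result or result[-1] != row:
            result.append(row)
    return result


# Hypothetical EMP rows: ENO is the key, DNO is not.
ROWS = [{"ENO": 1, "DNO": 4}, {"ENO": 2, "DNO": 1}, {"ENO": 3, "DNO": 4}]
WITHOUT_KEY = project(ROWS, ["DNO"])
WITH_KEY = project(ROWS, ["ENO", "DNO"])
```

Projecting on DNO alone collapses the two rows of department 4 into one, whereas including the key ENO keeps all three tuples.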
Among the set operations (UNION, INTERSECTION, SET DIFFERENCE and CARTESIAN PRODUCT), the
CARTESIAN PRODUCT operation R × S is the most costly, for two reasons. Firstly, its output
contains one record for every combination of a record from R with a record from S. Secondly, every
attribute of R and S is present in the output.
If R has i records and x attributes and S has j records and y attributes, the output of the
CARTESIAN PRODUCT carries (i*j) records and (x+y) attributes. Because the output is so large, it
is wise to avoid the CARTESIAN PRODUCT operation and use an equivalent operation instead.
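The growth in size is easy to confirm with a tiny example. The relations R and S below are hypothetical tuples chosen for illustration; `cartesian_product` simply concatenates every record of R with every record of S.

```python
from itertools import product


def cartesian_product(r, s):
    """Every record of R concatenated with every record of S."""
    return [t + u for t, u in product(r, s)]


R = [("Research", 4), ("Sales", 1)]        # i = 2 records, x = 2 attributes
S = [(4001,), (5001,), (1001,)]            # j = 3 records, y = 1 attribute
R_TIMES_S = cartesian_product(R, S)
```

Here the output has 2 * 3 = 6 records, each with 2 + 1 = 3 attributes.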
Set operations like UNION, INTERSECTION and SET DIFFERENCE apply only to union-compatible
relations. Such relations must have the same number of attributes, and each pair of corresponding
attributes must be drawn from the same domain.
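A union-compatibility check can be sketched as follows. The representation is an assumption made for illustration: a schema is a list of (attribute name, domain) pairs with Python types standing in for domains, and relations are Python sets of tuples so that the built-in set operators can play the role of UNION, INTERSECTION and SET DIFFERENCE.

```python
def union_compatible(schema_r, schema_s):
    """Same number of attributes, and matching domains position by position."""
    return len(schema_r) == len(schema_s) and all(
        domain_r == domain_s
        for (_, domain_r), (_, domain_s) in zip(schema_r, schema_s)
    )


SCHEMA_R = [("STUDENTS", str), ("MARKS", int)]
SCHEMA_S = [("NAME", str), ("SCORE", int)]
SCHEMA_T = [("NAME", str)]

R = {("Seema", 95), ("Monika", 55)}
S = {("Seema", 95), ("Sangeetha", 91)}

if union_compatible(SCHEMA_R, SCHEMA_S):
    UNION = R | S
    INTERSECTION = R & S
    DIFFERENCE = R - S
```

Attribute names may differ between the two relations; only the number of attributes and their domains matter for the check.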
2.5. Aggregate operations implementation
SQL has many built-in functions for performing calculations on data. SQL aggregate functions
return a single value, calculated from the values in a column selected from the table. Some of
the useful aggregate functions are:
AVG() - Returns the average value of a numeric column.
COUNT() - Returns the number of rows.
MAX() - Returns the largest value of the selected column.
MIN() - Returns the smallest value of the selected column.
SUM() - Returns the total sum of a numeric column.
Let us take one example. The SQL syntax for using MAX() is:
SELECT MAX(column_name) FROM table_name
Table STD
STD_Id   STUDENTS    SUBJECTS    MARKS
1        Seema       Maths       95
2        Sangeetha   Maths       91
3        Seema       Physics     97
4        Monika      Physics     55
5        Sangeetha   Physics     63
6        Seema       Chemistry   70
Here STD is the table name and MARKS is a column of the table STD. The statement
SELECT MAX(MARKS) AS MAXIMMARKS
FROM STD
selects the highest (maximum) marks from the column MARKS of the table STD, and the result looks
like this:
MAXIMMARKS
97
The MIN operation works in the same way. If a SQL query is made for table STD as:
SELECT MIN(MARKS) AS MINMARKS
FROM STD
then the statement selects the lowest (minimum) marks from the column MARKS of the table STD, and
the result looks like this:
MINMARKS
55
Aggregate functions often need an added GROUP BY clause. The GROUP BY clause is used in
conjunction with the aggregate functions to group the result set by one or more columns. If you
want to find the total marks of each student in the table STD, then you have to use GROUP BY to
group the rows by STUDENTS.
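The aggregate queries above can be tried out with Python's built-in `sqlite3` module. The sketch loads the STD table from this section into an in-memory database; the column alias TOTAL and the ORDER BY clause in the GROUP BY query are additions made here so that the output is easy to read.

```python
import sqlite3

STD_ROWS = [
    (1, "Seema", "Maths", 95),
    (2, "Sangeetha", "Maths", 91),
    (3, "Seema", "Physics", 97),
    (4, "Monika", "Physics", 55),
    (5, "Sangeetha", "Physics", 63),
    (6, "Seema", "Chemistry", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STD (STD_Id INTEGER PRIMARY KEY, STUDENTS TEXT, "
    "SUBJECTS TEXT, MARKS INTEGER)"
)
conn.executemany("INSERT INTO STD VALUES (?, ?, ?, ?)", STD_ROWS)

# Single-value aggregates over the whole MARKS column.
max_marks = conn.execute("SELECT MAX(MARKS) AS MAXIMMARKS FROM STD").fetchone()[0]
min_marks = conn.execute("SELECT MIN(MARKS) AS MINMARKS FROM STD").fetchone()[0]

# One SUM per student: GROUP BY splits the rows into groups before aggregating.
totals = conn.execute(
    "SELECT STUDENTS, SUM(MARKS) AS TOTAL FROM STD "
    "GROUP BY STUDENTS ORDER BY STUDENTS"
).fetchall()
conn.close()
```

The GROUP BY query returns one row per student: Monika 55, Sangeetha 154 and Seema 262.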
SELF-ASSESSMENT QUESTIONS – 1
Activity I
Discuss the reasons for converting SQL queries into relational algebra queries before
optimization is done.
3. HEURISTICS IN QUERY OPTIMISATION
In query optimisation, heuristic rules are applied to modify the internal representation of a
query, which may take the form of a query graph or a query tree, with the objective of improving
performance. A high-level query is first parsed into an initial internal representation, which is
then optimised as per the heuristic rules. Later on, depending upon the access paths available on
the files involved in the query, the query execution plan is generated.
The basic heuristic rule says that SELECT and PROJECT operations should be applied before a JOIN
or any other binary operation. This is because SELECT and PROJECT operations reduce the size of a
file, while a JOIN operation usually increases it.
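The effect of applying SELECT before JOIN can be illustrated with a small experiment. The sketch below evaluates the same query in two orders and reports how many rows each plan has to carry as an intermediate result. The EMP rows, the SALARY threshold and the function names are hypothetical; the DEPT rows follow Table 4.2.

```python
EMP = [
    {"ENO": 1, "DNO": 4, "SALARY": 9000},
    {"ENO": 2, "DNO": 5, "SALARY": 12000},
    {"ENO": 3, "DNO": 4, "SALARY": 15000},
    {"ENO": 4, "DNO": 1, "SALARY": 8000},
]
DEPT = [
    {"DNAME": "Research", "DNO": 4},
    {"DNAME": "Production", "DNO": 5},
    {"DNAME": "Sales", "DNO": 1},
]


def join(left, right, attribute):
    """Simple nested loop equi-join on a shared attribute name."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]


def join_then_select(emp, dept):
    """Unoptimised order: join everything, then filter."""
    joined = join(emp, dept, "DNO")
    result = [row for row in joined if row["SALARY"] > 10000]
    return result, len(joined)          # second value: intermediate rows


def select_then_join(emp, dept):
    """Heuristic order: filter first, then join the smaller input."""
    selected = [row for row in emp if row["SALARY"] > 10000]
    result = join(selected, dept, "DNO")
    return result, len(selected)        # second value: intermediate rows


LATE = join_then_select(EMP, DEPT)
EARLY = select_then_join(EMP, DEPT)
```

Both plans return the same two rows, but the first builds a four-row intermediate result while the second only passes two rows into the join.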
3.1. Notation for query trees and query graphs
Query graphs and query trees are the data structures used for the internal representation of
queries. A query graph represents a relational calculus expression, and a query tree represents a
relational algebra expression.
Executing a query tree consists of executing an internal node operation whenever its operands are
available and then replacing that internal node with the relation that results from the operation.
Execution ends when the root node is executed and produces the result relation for the query.
Figure 4.1 illustrates a query tree for query block ‘Q’: for every project located in ‘Sikkim’,
retrieve the project number, the controlling department number, and the department manager’s
surname, address and birth date.
In Figure 4.1(a), the leaf nodes P, D and E represent the PROJECT, DEPT and EMP relations
respectively, and the operations are represented by the internal tree nodes. When this query tree
is executed, the nodes marked (1), (2) and (3) in Figure 4.1(a) are executed in sequence, because
the tuples produced by each node are the input of the node that follows it. A query tree thus
arranges the operations in a specific order for query execution.
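A query tree and its bottom-up execution can be sketched with a few small classes. This is an illustration under assumptions: Figure 4.1 is not reproduced here, so the attribute names (PNUMBER, PLOCATION, DNUM, DNO, MGRENO, ENO, LNAME) and the sample rows are hypothetical, and only a subset of the output attributes of query Q is projected.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Relation:
    """Leaf node: a stored relation."""
    rows: list

    def execute(self):
        return self.rows


@dataclass
class Select:
    """Internal node: keep the child's rows that satisfy the predicate."""
    predicate: Callable
    child: object

    def execute(self):
        return [r for r in self.child.execute() if self.predicate(r)]


@dataclass
class Join:
    """Internal node: equi-join of the two children."""
    left: object
    right: object
    left_attr: str
    right_attr: str

    def execute(self):
        right_rows = self.right.execute()
        return [{**l, **r} for l in self.left.execute() for r in right_rows
                if l[self.left_attr] == r[self.right_attr]]


@dataclass
class Project:
    """Internal node (here the root): keep only the listed attributes."""
    attributes: list
    child: object

    def execute(self):
        return [{a: r[a] for a in self.attributes} for r in self.child.execute()]


P = Relation([{"PNUMBER": 10, "PLOCATION": "Sikkim", "DNUM": 4},
              {"PNUMBER": 20, "PLOCATION": "Delhi", "DNUM": 5}])
D = Relation([{"DNO": 4, "MGRENO": 4001}, {"DNO": 5, "MGRENO": 5001}])
E = Relation([{"ENO": 4001, "LNAME": "Rao"}, {"ENO": 5001, "LNAME": "Das"}])

TREE = Project(
    ["PNUMBER", "DNUM", "LNAME"],
    Join(Join(Select(lambda p: p["PLOCATION"] == "Sikkim", P), D, "DNUM", "DNO"),
         E, "MGRENO", "ENO"),
)
RESULT = TREE.execute()
```

Calling `execute()` on the root forces each child to run first, so the selection, the two joins and the projection run in the order fixed by the shape of the tree.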
Figure 4.2 represents the query graph for the relational algebra expression given below:
In the query graph shown in Figure 4.2, the single circles represent the relations, whereas the
double circles denote the constant values. The graph edges represent the selection and join
conditions, and the square brackets specify the attributes to be retrieved from each relation.
Moreover, a query graph does not indicate any order in which the operations are to be executed.
Only a single graph corresponds to each query.
3.2. General transformation rules for relational algebraic operations
Many rules exist to transform relational algebra operations into equivalent ones. The symbols used
in these rules are defined in the table below.
Union               ∪
Intersection        ∩
Cartesian Product   ×
3.3. Conversion of query trees into the query execution plans
The transformation of query Q2 into the relational algebra expression will be like:
Moving ahead, the query tree for query Q2 is shown in Figure 4.3.
To convert this tree into an execution plan, the optimiser might choose:
• an index search for the SELECT operation,
• a table scan as the access method for EMPLOYEE,
• a nested loop join algorithm for the JOIN, and
• a scan of the JOIN result for the PROJECT operator.
Additionally, the optimiser must decide whether to use materialised or pipelined evaluation for
executing the query.
Materialised (stored) evaluation means that the result of an operation is stored as a temporary
relation (table). For example, the output of the JOIN operation can be stored as a temporary
relation, which is then read as input by the PROJECT operation to produce the query result table.
With pipelined evaluation, the tuples produced by one operation are forwarded directly to the next
operation in the sequence. This saves cost, because the intermediate results need not be written
to disk and read back for the next operation.
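The difference between the two evaluation styles maps naturally onto Python lists and generators. In the sketch below a list stands in for a stored temporary relation, and chained generators stand in for a pipeline. For brevity a SELECT feeds the PROJECT instead of a JOIN; the operators, rows and salary threshold are hypothetical.

```python
def select(rows, predicate):
    """Yield qualifying rows one at a time."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, attributes):
    """Yield each incoming row reduced to the listed attributes."""
    for row in rows:
        yield {a: row[a] for a in attributes}


def materialised_plan(rows):
    # The intermediate result is fully built and stored before PROJECT starts.
    temporary = list(select(rows, lambda r: r["SALARY"] > 10000))
    return list(project(temporary, ["LNAME"]))


def pipelined_plan(rows):
    # Each qualifying row flows straight into PROJECT; nothing is stored between.
    return list(project(select(rows, lambda r: r["SALARY"] > 10000), ["LNAME"]))


EMPLOYEE = [
    {"LNAME": "Rao", "SALARY": 30000},
    {"LNAME": "Das", "SALARY": 9000},
    {"LNAME": "Nair", "SALARY": 15000},
]
```

Both plans produce the same rows; the pipelined plan simply never holds the whole intermediate result at once.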
SELF-ASSESSMENT QUESTIONS – 2
5. The size of the file can be reduced by SELECT and ____________ operations.
6. _____________ represents a relational calculus expression.
7. The query graph representation also indicates the order in which operations are to be
performed. (True/False).
4. SEMANTIC QUERY OPTIMISATION
Let’s discuss this approach with the help of the example given below:
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS M
WHERE E.SUPERNO = M.ENO AND E.SALARY > M.SALARY
This query retrieves the names of employees who earn more than their supervisors. Suppose the
database schema carries a constraint stating that no employee can earn more than his or her direct
supervisor. A semantic query optimiser that checks for this constraint knows that the result of
the query will be empty and need not execute the query at all. If the constraint check is done
efficiently, this approach can save a lot of time.
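The idea can be sketched as a short-circuit in front of the normal evaluation. The function below is illustrative only: the flag `constraint_holds` stands in for whatever mechanism a real optimiser uses to consult schema constraints, and the EMPLOYEE rows are hypothetical.

```python
EMPLOYEE = [
    {"ENO": 1, "LNAME": "Rao", "SUPERNO": None, "SALARY": 30000},
    {"ENO": 2, "LNAME": "Das", "SUPERNO": 1, "SALARY": 20000},
    {"ENO": 3, "LNAME": "Nair", "SUPERNO": 2, "SALARY": 15000},
]


def employees_earning_more_than_supervisor(employees, constraint_holds=False):
    """Return (employee surname, supervisor surname) pairs for the example query."""
    if constraint_holds:
        # The constraint guarantees an empty answer, so the self-join is skipped.
        return []
    by_eno = {e["ENO"]: e for e in employees}
    result = []
    for e in employees:
        supervisor = by_eno.get(e["SUPERNO"])
        if supervisor is not None and e["SALARY"] > supervisor["SALARY"]:
            result.append((e["LNAME"], supervisor["LNAME"]))
    return result


FULL_EVALUATION = employees_earning_more_than_supervisor(EMPLOYEE)
SHORT_CIRCUIT = employees_earning_more_than_supervisor(EMPLOYEE, constraint_holds=True)
```

On data that satisfies the constraint, both calls return an empty result, but the second one does so without reading a single row.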
SELF-ASSESSMENT QUESTIONS – 3
8. Semantic query optimisation helps in efficient query __________ by modifying one query
into another.
9. Relational database constraints are used in semantic query optimisation technique.
(True/False)
5. MULTI-QUERY OPTIMISATION AND APPLICATION
In stream query processing, the workload is shared among multiple concurrently active queries by
sharing computation and state. Query evaluation techniques that exploit this property are known as
MQO (Multi-Query Optimisation) techniques. MQO saves evaluation cost and execution time by
executing the operations that are common to a set of queries only once, which can significantly
improve system performance.
Consider the following two queries that retrieve information from an order processing database.
a)
SELECT name, customer.custkey, orders.orderkey, orderdate, totalprice
FROM customer, orders, lineitem
WHERE orders.custkey = customer.custkey
AND lineitem.orderkey = orders.orderkey
AND lineitem.quantity = '24';
b)
SELECT name, customer.custkey, orders.orderkey, orderdate, totalprice
FROM customer, orders, lineitem
WHERE orders.custkey = customer.custkey
AND lineitem.orderkey = orders.orderkey
AND lineitem.quantity = '24'
AND orders.orderstatus = 'shipping';
The first query retrieves customer and order information for orders that contain a line item with
the specified quantity. The second query retrieves the same information, but only for those orders
whose status is shipping.
The output of the second query is a subset of the output of the first, so it can be computed
quickly from the first query’s result. MQP (Multiple Query Processing) exploits this by first
finding the customers and orders whose lineitem quantity is 24 and then reusing that result,
applying the additional constraint that the orderstatus is shipping.
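The sharing of work between the two queries can be sketched as follows. The join of customer, orders and lineitem with the quantity filter is computed once, and both answers are derived from that shared result. The sample rows and helper names are hypothetical, and the quantity is stored as an integer here rather than as the string used in the SQL text.

```python
CUSTOMER = [{"custkey": 1, "name": "Asha"}, {"custkey": 2, "name": "Ravi"}]
ORDERS = [
    {"orderkey": 10, "custkey": 1, "orderdate": "2024-01-05",
     "totalprice": 500, "orderstatus": "shipping"},
    {"orderkey": 11, "custkey": 2, "orderdate": "2024-01-07",
     "totalprice": 900, "orderstatus": "pending"},
]
LINEITEM = [
    {"orderkey": 10, "quantity": 24},
    {"orderkey": 11, "quantity": 24},
    {"orderkey": 11, "quantity": 3},
]
OUTPUT = ["name", "custkey", "orderkey", "orderdate", "totalprice"]


def shared_join(customers, orders, lineitems, quantity):
    """Common sub-expression of both queries: three-way join plus quantity filter."""
    customer_by_key = {c["custkey"]: c for c in customers}
    order_by_key = {o["orderkey"]: o for o in orders}
    rows = []
    for item in lineitems:
        if item["quantity"] != quantity:
            continue
        order = order_by_key.get(item["orderkey"])
        if order is None or order["custkey"] not in customer_by_key:
            continue
        rows.append({**order, "name": customer_by_key[order["custkey"]]["name"]})
    return rows


def answer(rows, keep=lambda row: True):
    """Apply an optional extra filter, then keep only the output columns."""
    return [{column: row[column] for column in OUTPUT} for row in rows if keep(row)]


COMMON = shared_join(CUSTOMER, ORDERS, LINEITEM, 24)      # computed once
RESULT_A = answer(COMMON)
RESULT_B = answer(COMMON, lambda row: row["orderstatus"] == "shipping")
```

Query (b) costs only one extra pass over the shared rows instead of a second three-way join.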
Multi-query optimisation techniques can be used to design efficient algorithms for problems such
as view/index selection, query result caching and maintenance.
For example, multi-query optimisation can be applied in a mobile database system to batch pull
(on-demand) requests. The resulting view can answer several queries at once. It is broadcast over
a view channel dedicated to the common answers of multiple queries instead of being transmitted
over individual downlink channels.
The greedy algorithm is a cost-based heuristic algorithm. It makes each decision on the basis of
the current situation and never reconsiders that decision, whatever situation may arise later.
To look for a good solution, the greedy algorithm repeatedly selects a node to be materialised and
commits to that choice; the step is repeated over the remaining candidate nodes to build up the
set of nodes to be materialised.
A greedy strategy works in a top-down manner. It reduces each problem to a smaller one by making
one greedy choice after another. This approach gives an optimal solution in some cases; in others
it only provides a compromise that produces an acceptable approximation.
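A greedy choice of nodes to materialise can be sketched with a toy cost model. Everything in the model is an assumption made for illustration: each candidate node (a shared sub-expression) has a cost to compute, a number of queries that use it, and a cost to write and to read its stored result. At every step the sketch commits to the node that lowers the total cost the most and stops when no node helps.

```python
# Hypothetical candidate nodes and their costs.
NODES = {
    "A": {"compute": 100, "uses": 3, "write": 40, "read": 10},
    "B": {"compute": 20, "uses": 1, "write": 30, "read": 5},
    "C": {"compute": 50, "uses": 2, "write": 20, "read": 10},
}


def total_cost(nodes, materialised):
    """Cost of answering all queries given the set of materialised nodes."""
    cost = 0
    for name, node in nodes.items():
        if name in materialised:
            cost += node["write"] + node["uses"] * node["read"]
        else:
            cost += node["uses"] * node["compute"]
    return cost


def greedy_materialise(nodes):
    """Repeatedly commit to the single node with the largest cost reduction."""
    chosen = []
    while True:
        best_name, best_cost = None, total_cost(nodes, chosen)
        for name in nodes:
            if name in chosen:
                continue
            candidate_cost = total_cost(nodes, chosen + [name])
            if candidate_cost < best_cost:
                best_name, best_cost = name, candidate_cost
        if best_name is None:          # no remaining node lowers the cost
            return chosen
        chosen.append(best_name)       # greedy choice, never reconsidered


CHOSEN = greedy_materialise(NODES)
```

With these numbers the algorithm picks A first, then C, and leaves B unmaterialised because storing it would cost more than recomputing it once; the total cost falls from 420 to 130.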
SELF-ASSESSMENT QUESTIONS – 4
10. The key to achieving good stream processing performance is to optimise ____________
together.
11. MQO (Multi Query Optimisation) saves the evaluation cost and execution time by
executing the common operations once over a set of queries (True/False)
Activity II
With the help of internet find out some more practical applications of multi query
optimisation.
6. EXECUTION STRATEGIES FOR SQL SUB QUERIES
Navigational strategies: To execute a sub-query, navigational strategies rely on nested loop
joins. There are two classes of navigational strategies: forward lookup and reverse lookup.
Forward lookup starts by executing the outer query and invokes the sub-query as outer rows are
generated. Reverse lookup starts with the sub-query and processes one sub-query row at a time.
Set-oriented strategies: Set-oriented processing requires that the query can be effectively
de-correlated. If it can, set-oriented operations such as hash join and merge join can be used to
execute the query.
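The strategies can be contrasted on a small correlated sub-query: find the employees whose department is named Research. The sketch is illustrative only; the EMP rows are hypothetical, the DEPT rows follow Table 4.2, and the set-oriented version assumes the sub-query has already been de-correlated into a semi-join on DNO.

```python
EMP = [{"ENO": 1, "DNO": 4}, {"ENO": 2, "DNO": 5}, {"ENO": 3, "DNO": 4}]
DEPT = [
    {"DNAME": "Research", "DNO": 4},
    {"DNAME": "Production", "DNO": 5},
    {"DNAME": "Sales", "DNO": 1},
]


def forward_lookup(emp, dept):
    """Outer query drives: the sub-query is re-run for every outer row."""
    return [e for e in emp
            if any(d["DNO"] == e["DNO"] and d["DNAME"] == "Research" for d in dept)]


def reverse_lookup(emp, dept):
    """Sub-query drives: each sub-query row looks up its matching outer rows.

    DNO is assumed to be a key of DEPT, so no outer row is produced twice.
    """
    result = []
    for d in dept:
        if d["DNAME"] == "Research":
            result.extend(e for e in emp if e["DNO"] == d["DNO"])
    return result


def set_oriented(emp, dept):
    """De-correlated form: evaluate the sub-query once, then hash semi-join."""
    research_dnos = {d["DNO"] for d in dept if d["DNAME"] == "Research"}
    return [e for e in emp if e["DNO"] in research_dnos]


def enos(rows):
    return sorted(r["ENO"] for r in rows)
```

All three functions return employees 1 and 3; the set-oriented version scans DEPT only once, whereas forward lookup scans it once per outer row.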
SELF-ASSESSMENT QUESTIONS – 5
7. QUERY PROCESSING FOR SQL UPDATES
Update operations must be validated against the declared constraints, such as CHECK and
uniqueness constraints. These operations must also keep the underlying storage structures intact.
Delta stream: A delta stream is a set of rows that encodes the changes to a specific base table.
It is simply a relation with a well-defined schema, so any relational operator can process it.
Figure 4.4 shows a general template for the execution plan of update statements.
Fig 4.4: General Template for the Execution Plan of Update Statement
The plan has two components. The first component is read-only and produces the delta stream. The
second component consumes the delta stream, applies the changes to the base table, and then
performs all the actions that the DML (Data Manipulation Language) statement implicitly fires.
The series of actions to be executed is determined by enumerating all the active dependencies on
the target table and then filtering them according to the requirements of the current statement.
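The two-component shape of an update plan can be sketched in Python. This is a simplified illustration, not the template of Figure 4.4: the base table is a dict keyed by ENO, the delta stream is a list of rows holding the old and new versions of each changed record, and the only implicit action modelled is a CHECK-style constraint. All names and values are hypothetical.

```python
EMP = {
    1: {"DNO": 4, "SALARY": 9000},
    2: {"DNO": 5, "SALARY": 12000},
    3: {"DNO": 4, "SALARY": 19500},
}


def compute_delta(table, predicate, change):
    """Read-only component: describe the changes without touching the table."""
    return [
        {"key": key, "old": row, "new": {**row, **change(row)}}
        for key, row in table.items()
        if predicate(row)
    ]


def apply_delta(table, delta, constraint):
    """Consuming component: validate every new row, then apply the changes."""
    for entry in delta:
        if not constraint(entry["new"]):
            raise ValueError(f"constraint violated for key {entry['key']}")
    for entry in delta:
        table[entry["key"]] = entry["new"]
    return len(delta)


def salary_check(row):
    """Stand-in for a CHECK constraint: SALARY must not exceed 20000."""
    return row["SALARY"] <= 20000


# UPDATE EMP SET SALARY = SALARY + 500 WHERE DNO = 4
DELTA = compute_delta(EMP, lambda r: r["DNO"] == 4,
                      lambda r: {"SALARY": r["SALARY"] + 500})
CHANGED = apply_delta(EMP, DELTA, salary_check)
```

Because the delta stream is an ordinary list of rows, the constraint check is just another operator that reads it; a violating update is rejected before any row of the base table is modified.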
SELF-ASSESSMENT QUESTIONS – 6
14. It is required to validate update operations against stated relational database constraints
(True/False)
15. ____________ is defined as a set of rows that encode the changes made to a specific
base table.
8. SUMMARY
Let us recapitulate the important points of this unit:
Sorting is one of the primary algorithms used in query processing. It is of two types: internal sorting
and external sorting.
There are multiple categories of query execution algorithms, such as external sorting and the
linear, binary and primary index (or hash key) search methods.
The SELECT operation (represented by the symbol σ) performs the task of retrieving the desired
records from the database. There are various search methods, such as linear search, binary search
and primary index (or hash key) search.
JOIN operation is used to join two database tables/relations. There are various algorithms for
implementing the JOIN operations, such as Nested loop join, single loop joins, sort-merge join and
hash join.
Query graphs and query trees are the data structures used for the internal representation of
queries.
There are two types of sub-query optimisation strategies. First is the navigational strategy, and
second is the set-oriented strategy.
9. GLOSSARY
Query optimisation - Query optimisation refers to the procedure for selecting the best execution
strategy amongst the various options available.
11. ANSWERS
Answer 3: Semantic query optimisation is a different approach to query optimisation that uses
various constraints to modify one query into another to make it more efficient to execute. Refer to
Section 4.4 for more details.
Answer 4: To achieve good stream processing performance, multiple queries are optimised
together rather than individually. This is multi-query optimisation. Refer to Section 4.5 for more
details.
Answer 5: There are mainly two techniques for SQL sub-queries execution, namely navigational
strategies and set-oriented strategies. Refer to Section 4.6 for more details.
12. REFERENCES
• Elmasri, Navathe, Somayajulu, Gupta, (2006) Fundamentals of Database Systems, (6th
Ed.), India: Pearson Education.
• Peter Rob, Carlos Coronel, (2004). Database Systems: Design, Implementation, and
Management, (7th Ed.), US: Thomson Learning.
• Silberschatz, Korth, Sudarshan, (2011). Database System Concepts, (6th Ed.), McGraw-
Hill.
E-references
• http://www.cs.iusb.edu/technical_reports/TR-20080105-1.pdf
• http://research.microsoft.com/pubs/76059/pods98-tutorial.pdf
• http://infolab.stanford.edu/~hyunjung/cs346/ioannidis.pdf