Query Processing and Optimization
There is order
E.g., PRIMARY INDEX, CLUSTERING INDEX, SECONDARY INDEX
Why not index all fields?
Searching algorithms
Linear search (about #recs/2 record accesses on average) or binary search (about log2(#recs) accesses)
Use an index or not?
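The two cost figures above can be checked by counting comparisons. The following is a minimal sketch, assuming an in-memory sorted list stands in for the file; the function names and the 1024-key file are hypothetical and only illustrate the #recs/2 versus log2(#recs) gap.

```python
import math

def linear_search(records, key):
    """Scan records in order; return (index, comparisons made)."""
    comparisons = 0
    for i, value in enumerate(records):
        comparisons += 1
        if value == key:
            return i, comparisons
    return -1, comparisons

def binary_search(records, key):
    """Search a sorted list by halving; return (index, probes made)."""
    lo, hi = 0, len(records) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if records[mid] == key:
            return mid, probes
        if records[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

# Hypothetical sorted file of 1024 distinct keys.
records = list(range(1024))
avg_linear = sum(linear_search(records, k)[1] for k in records) / len(records)
max_binary = max(binary_search(records, k)[1] for k in records)
print(avg_linear, max_binary, math.log2(len(records)))
```

The average linear cost grows with the file size, while the worst binary search stays within one probe of log2(#recs). Binary search needs the file to be ordered on the search field, which is why the "use index or not" question depends on how the file is organized.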
Joins
Nested loop join R JOIN S
For each tr in R do
  For each ts in S do
    Test pair (tr, ts) to see if they can be joined
    If so, add to result
  Loop
Loop
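The nested loop join above translates almost line for line into code. This is a minimal sketch, assuming relations are lists of dictionaries; the relation contents, the attribute names, and the `can_join` predicate are hypothetical.

```python
def nested_loop_join(R, S, can_join):
    """Test every pair (tr, ts) and keep the pairs that can be joined."""
    result = []
    for tr in R:                      # outer loop over R
        for ts in S:                  # inner loop over S
            if can_join(tr, ts):      # test the pair
                result.append({**tr, **ts})
    return result

# Hypothetical relations; attribute names are kept distinct so the merged
# tuple does not overwrite any value.
R = [{"ssn": 1, "dno": 5}, {"ssn": 2, "dno": 4}, {"ssn": 3, "dno": 5}]
S = [{"dnumber": 5, "dname": "Research"}, {"dnumber": 1, "dname": "HQ"}]

joined = nested_loop_join(R, S, lambda tr, ts: tr["dno"] == ts["dnumber"])
for row in joined:
    print(row)
```

The inner test runs once per pair, so the work is proportional to #recs(R) × #recs(S) regardless of how many pairs actually join.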
π_{salary}(σ_{salary < 25000}(Employee))
For the selection salary < 25000: use the secondary index on salary, or use binary search
Query Optimization
Reorder the operations in the internal representation of a query (tree or graph) to improve performance
A heuristic rule works well in MOST cases but it is NOT GUARANTEED to work in ALL possible cases
Find the costs of the different execution strategies and choose the one with the lowest cost
Computationally intensive
Which is better?
σ_{R.A=?}(R ⋈_{R.B=S.B} S)
(σ_{R.A=?}(R)) ⋈_{R.B=S.B} S
What if S is very small compared to R, and there is an index on B in R but no index on A?
(1) Join, then reject the results that don't satisfy the selection
(2) Apply the selection first: a linear scan of all of R
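To make the comparison concrete, the sketch below counts tuple tests for both orders using plain nested loops. It is an illustration under stated assumptions: the relations are hypothetical, no index is modelled, and the counter is a crude stand-in for real I/O cost. With an index on R.B and a very small S, option (1) could instead probe R once per S tuple and avoid the full scan, which is the scenario the question above raises.

```python
def join_then_select(R, S, a):
    """Option (1): nested-loop join, then reject rows failing R.A = a."""
    tests, out = 0, []
    for tr in R:
        for ts in S:
            tests += 1
            if tr["B"] == ts["B"] and tr["A"] == a:
                out.append((tr["id"], ts["B"]))
    return out, tests

def select_then_join(R, S, a):
    """Option (2): linear scan of R for R.A = a, then join the survivors."""
    tests = len(R)                    # one test per record in the linear scan
    selected = [tr for tr in R if tr["A"] == a]
    out = []
    for tr in selected:
        for ts in S:
            tests += 1
            if tr["B"] == ts["B"]:
                out.append((tr["id"], ts["B"]))
    return out, tests

# Hypothetical data: 1000 tuples in R, 10 tuples in S.
R = [{"id": i, "A": i % 100, "B": i % 10} for i in range(1000)]
S = [{"B": b} for b in range(10)]

out1, tests1 = join_then_select(R, S, 7)
out2, tests2 = select_then_join(R, S, 7)
print(out1 == out2, tests1, tests2)
```

Both orders return the same rows; only the amount of work differs, and which one wins depends on relation sizes and available access paths.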
Query Representation
Query tree
Tree data structure that corresponds to a relational algebra expression
Input relations of the query as leaf nodes of the tree
The relational algebra operations as internal nodes
Sample Query
Example: For every project located in Stafford, retrieve the project number, the controlling department number, and the department manager's last name, address, and birthdate
SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM = D.DNUMBER AND D.MGRSSN = E.SSN AND P.PLOCATION = 'STAFFORD';
Relational algebra:
π_{PNUMBER, DNUM, LNAME, ADDRESS, BDATE}(((σ_{PLOCATION='STAFFORD'}(PROJECT)) ⋈_{DNUM=DNUMBER} (DEPARTMENT)) ⋈_{MGRSSN=SSN} (EMPLOYEE))
Internal nodes (operations) are
- Executed when their inputs are ready
- Replaced by their results
Leaf nodes are the input relations
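A query tree for the sample query can be sketched as a small data structure. This is an illustrative sketch, not a DBMS's internal format: the `Node` class and the `execution_order` function are hypothetical names. A post-order walk lists each internal node only after its inputs, which matches the "executed when inputs are ready" rule.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str                       # "relation", "select", "join", or "project"
    arg: str                      # relation name, condition, or attribute list
    children: List["Node"] = field(default_factory=list)

def execution_order(node):
    """Post-order walk: a node appears only after all of its inputs."""
    order = []
    for child in node.children:
        order.extend(execution_order(child))
    order.append(f"{node.op}({node.arg})")
    return order

# Tree for the relational algebra expression of the sample query.
tree = Node("project", "PNUMBER, DNUM, LNAME, ADDRESS, BDATE", [
    Node("join", "MGRSSN=SSN", [
        Node("join", "DNUM=DNUMBER", [
            Node("select", "PLOCATION='STAFFORD'",
                 [Node("relation", "PROJECT")]),
            Node("relation", "DEPARTMENT"),
        ]),
        Node("relation", "EMPLOYEE"),
    ]),
])

for step in execution_order(tree):
    print(step)
```

The leaves are the three input relations, and the projection at the root is the last operation to run.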
A different representation of the same algebra expression; it is assumed to be the initial form
There are many trees for the same query
- Each tree imposes a strict order among its operations
- Query optimization must find the best order
The main heuristic is to first apply the operations that reduce the size of intermediate results
E.g., Apply SELECT and PROJECT operations before applying the JOIN or other binary operations
1- Push selections down
2- Apply more restrictive selections first
3- Combine cross products and selections to become joins
4- Push projections down
Select the names of employees working on the Aquarius project and born after 1957
SELECT Lname
FROM Employee, Works_On, Project
WHERE Pname = 'Aquarius' AND Bdate > '1957-12-31' AND SSN = ESSN AND Pnumber = PNO
Usually, the initial tree starts with cross products, followed by selections, followed by projections
1- Push selections down
2- Apply more restrictive selections first, e.g., equalities before range queries
3- Combine cross products and selections to become joins
4- Push projections down
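The effect of these heuristics on intermediate result size can be seen on the Aquarius query with a few rows. This is a sketch with hypothetical data: the employee, works_on, and project rows are invented for illustration, and dates are ISO strings so that string comparison matches date order.

```python
from itertools import product

# Hypothetical rows, for illustration only.
employee = [
    {"SSN": 1, "Lname": "Smith", "Bdate": "1965-01-09"},
    {"SSN": 2, "Lname": "Wong", "Bdate": "1955-12-08"},
    {"SSN": 3, "Lname": "Zelaya", "Bdate": "1968-01-19"},
]
works_on = [{"ESSN": 1, "PNO": 10}, {"ESSN": 2, "PNO": 10}, {"ESSN": 3, "PNO": 20}]
project = [{"Pnumber": 10, "Pname": "Aquarius"}, {"Pnumber": 20, "Pname": "Other"}]

# Initial form: cross products, then one big selection, then the projection.
cross = [{**e, **w, **p} for e, w, p in product(employee, works_on, project)]
naive = [t["Lname"] for t in cross
         if t["Pname"] == "Aquarius" and t["Bdate"] > "1957-12-31"
         and t["SSN"] == t["ESSN"] and t["Pnumber"] == t["PNO"]]

# Heuristic form: selections pushed to the leaves, and each cross product
# combined with its selection into a join.
born_after_1957 = [e for e in employee if e["Bdate"] > "1957-12-31"]
aquarius = [p for p in project if p["Pname"] == "Aquarius"]
on_aquarius = [{**w, **p} for w in works_on for p in aquarius
               if w["PNO"] == p["Pnumber"]]
pushed = [e["Lname"] for e in born_after_1957 for t in on_aquarius
          if e["SSN"] == t["ESSN"]]

print(naive, len(cross))
print(pushed, len(on_aquarius))
```

Both forms return the same names, but the initial form materializes every combination of the three relations, while the heuristic form never builds an intermediate result larger than its inputs in this example.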
Transformation Rules
Transformation rules transform one relational algebra expression to AN EQUIVALENT ONE
General Transformation Rules:
Used by the query optimizer to optimize the query tree
Any rule, if applied, guarantees that the resulting tree is equivalent, so the resulting execution plan is equivalent
(2) Commutativity of σ: The σ operation is commutative:
σ_{c1}(σ_{c2}(R)) = σ_{c2}(σ_{c1}(R))
(6.a) Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only the attributes of one of the relations being joined, say R, the two operations can be commuted as follows:
σ_{c}(R ⋈ S) = (σ_{c}(R)) ⋈ S
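Rules (2) and (6.a) can be spot-checked on small relations. This is a sketch, assuming relations are lists of dictionaries; the `select` and `join` helpers and the sample conditions are hypothetical. Agreement on one data set illustrates the rules but does not prove them.

```python
def select(cond, rel):
    """Selection: keep the tuples of rel that satisfy cond."""
    return [t for t in rel if cond(t)]

def join(R, S, cond):
    """Join: merge every pair (r, s) that satisfies cond."""
    return [{**r, **s} for r in R for s in S if cond(r, s)]

# Hypothetical relations R(A, B) and S(B2, C).
R = [{"A": a, "B": b} for a in range(4) for b in range(3)]
S = [{"B2": b, "C": b * 10} for b in range(3)]

c1 = lambda t: t["A"] >= 1            # involves only attributes of R
c2 = lambda t: t["B"] == 2            # involves only attributes of R
on = lambda r, s: r["B"] == s["B2"]   # join condition

# Rule (2): the order of two selections does not matter.
rule2_holds = select(c1, select(c2, R)) == select(c2, select(c1, R))

# Rule (6.a): a selection on R's attributes commutes with the join.
rule6a_holds = select(c1, join(R, S, on)) == join(select(c1, R), S, on)

print(rule2_holds, rule6a_holds)
```

The right-hand side of rule (6.a) joins fewer tuples, which is why the optimizer prefers it when the condition touches only one input.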
1- Break up any select operations with conjunctive conditions into a cascade of select operations
2- Move each select operation as far down the query tree as is permitted by the attributes involved in the selection condition
3- Rearrange the leaf nodes of the tree so that the leaf node relations with the most restrictive select operations are executed first in the query tree representation
4- Combine a cross product operation with a subsequent select operation in the tree into a join operation
5- Break down and move lists of projection attributes down the tree as far as possible by creating new project operations as needed
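The first two steps (cascade the conjunctive selection, then move each select down) can be sketched on a tuple-based tree. This is a simplified illustration, not a full optimizer: the tree encoding, the `attrs` and `push_down` helpers, and the attribute sets are assumptions, and the later steps (leaf reordering, forming joins, pushing projections) are not modelled. It uses the Aquarius query from earlier.

```python
def attrs(node):
    """Attributes produced by a subtree."""
    kind = node[0]
    if kind == "rel":                       # ("rel", name, attribute_set)
        return node[2]
    if kind == "select":                    # ("select", cond, child)
        return attrs(node[2])
    return attrs(node[1]) | attrs(node[2])  # ("cross", left, right)

def push_down(cond, node):
    """Place one selection as low as the attributes it uses allow."""
    used = cond[1]
    if node[0] == "cross":
        left, right = node[1], node[2]
        if used <= attrs(left):
            return ("cross", push_down(cond, left), right)
        if used <= attrs(right):
            return ("cross", left, push_down(cond, right))
    if node[0] == "select":                 # selections commute, so go below
        return ("select", node[1], push_down(cond, node[2]))
    return ("select", cond, node)

E = ("rel", "EMPLOYEE", {"SSN", "Lname", "Bdate"})
W = ("rel", "WORKS_ON", {"ESSN", "PNO"})
P = ("rel", "PROJECT", {"Pnumber", "Pname"})

# Step 1: the conjunctive WHERE clause as a cascade of single conditions,
# each paired with the attributes it uses.
conds = [
    ("Pname='Aquarius'", {"Pname"}),
    ("Bdate>'1957-12-31'", {"Bdate"}),
    ("SSN=ESSN", {"SSN", "ESSN"}),
    ("Pnumber=PNO", {"Pnumber", "PNO"}),
]

# Step 2: push each selection down the initial cross-product tree.
tree = ("cross", ("cross", E, W), P)
for cond in conds:
    tree = push_down(cond, tree)
print(tree)
```

After the loop, the two single-relation conditions sit directly above EMPLOYEE and PROJECT, and each equality sits directly above the cross product whose inputs it compares, which is exactly the shape that step 4 turns into joins.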
Apply algorithm
Issues
Cost function
Number of execution strategies to be considered
Much better for compiled queries where optimization is done once at compile time and the query is executed many times
PreparedStatements VS Statements
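The PreparedStatement idea (prepare once, execute many times with different values) has a rough analogue in Python's standard `sqlite3` module, which keeps a per-connection cache of compiled statements keyed by the SQL text. The sketch below is an analogue under that assumption, not JDBC itself; the table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (ssn INTEGER PRIMARY KEY, lname TEXT, salary INTEGER)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Smith", 30000), (2, "Wong", 40000), (3, "Zelaya", 25000)],
)

# One SQL text with a placeholder: the text stays identical across calls,
# so the compiled statement can be reused and only the bound value changes.
query = "SELECT lname FROM employee WHERE salary > ? ORDER BY lname"
results = {
    limit: [row[0] for row in conn.execute(query, (limit,))]
    for limit in (20000, 28000, 35000)
}
print(results)
conn.close()
```

Building a new SQL string for each value would instead present the engine with a different statement every time, so the parsing and optimization work could not be shared across executions.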
Access cost to secondary storage
Searching, reading, writing, updating, etc.
Memory usage cost
Number of memory buffers needed for the query
Storage cost
Storing any intermediate files that are generated by an execution strategy for the query
Communication cost
Shipping the results from the database site to the user's site
Computation cost
Performing in-memory operations on the data buffers during the execution plan (searching, sorting, joining, arithmetic)
Apply algorithm
Find the Lname and SSN of all employees in the Design department working on project 5 who earn more than the highest-paid employee working on the Project X project