SF8 - Unit 2 DDB

II. QUERY PROCESSING AND DECOMPOSITION
 Objective of Query Processing in Distributed
Databases: Transform a high-level query into an
efficient execution strategy using a low-level language
across local databases.
 Query Languages: The high-level language is
relational calculus, while the low-level language is an
extension of relational algebra with communication.
 Parallel Execution: Query operations can be executed in parallel at different sites, so the response time of a query may be significantly less than its total cost.
 Total Cost Components:
 CPU Cost: Time for executing operations on data in
main memory.
 I/O Cost: Time for disk access, minimized by efficient
data access methods and memory management.
 Communication Cost: Time for exchanging
data between sites, including message processing
and data transmission.
 Distributed vs. Centralized DBMS Costs:
 Centralized systems consider only CPU and I/O costs.
 Distributed systems also account for communication costs.
 Early Assumptions: Early distributed query
optimization focused on minimizing
communication costs, assuming communication
dominated local processing costs due to slow
networks.
 Modern Networks: With communication speeds now comparable to disk bandwidths, recent research considers a weighted combination of all three cost components.
 Communication Overhead: Despite faster networks, communication still incurs overhead costs (e.g., protocol processing), making it a crucial factor.
 Query Optimization: Multiple execution
strategies are possible for a single query, but the
one minimizing resource consumption (total cost
and response time) should be selected.
 Measures of Resource Consumption:
 Total Cost: Sum of the time incurred in query processing across all sites, including communication.
 Response Time: Elapsed time taken to execute the query.
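The distinction between the two measures can be made concrete with a small sketch (the function names and all coefficients here are illustrative assumptions, not values from the text):

```python
def total_cost(cpu_secs, io_secs, msg_count, bytes_sent,
               msg_setup=0.001, per_byte=1e-7):
    """Total cost: sum of CPU, I/O and communication time across all sites.
    Communication cost has a fixed per-message part plus a per-byte part."""
    comm = msg_count * msg_setup + bytes_sent * per_byte
    return cpu_secs + io_secs + comm

def response_time(site_times):
    """Response time: with operations running in parallel at different
    sites, elapsed time is bounded by the slowest site, not the sum."""
    return max(site_times)

# Two sites working in parallel: total cost adds up, response time does not.
print(total_cost(1.0, 2.0, 10, 1_000_000))
print(response_time([1.5, 2.2]))
```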
CHARACTERIZATION OF QUERY
PROCESSORS
1. LANGUAGES
 Early Focus on Relational DBMSs: Most early work on
query processing was done for relational database systems
because their high-level languages allow for more
optimization.
 Input Language for Query Processing: In relational
DBMSs, the input language for the query processor is based
on relational calculus.
 Object DBMSs: In object database systems, the input
language is based on object calculus, which is an
extension of relational calculus. These queries are broken
down into object algebra.
 XML Data Model: For XML databases, the main query
languages are XQuery and XPath.
 Query Decomposition: Queries written in relational
calculus need to be converted (decomposed) into
relational algebra for further processing.
 Distributed Systems: In distributed
databases, the output language is usually a
version of relational algebra with added
commands for communication between
different sites.
 Efficient Query Mapping: The query
processor must efficiently convert the input
query language into the output language for
smooth execution in the system.
2. TYPES OF OPTIMIZATION
 Goal of Query Optimization: The aim is to find
the best execution plan from all possible strategies
to run a query efficiently.
 Exhaustive Search Method: One way is to check all possible strategies, predict their costs, and choose the one with the lowest cost. However, this can take a lot of time and resources.
 Large Solution Space: The number of possible strategies increases quickly, especially when dealing with more than 5 or 6 database tables or fragments.
 Exhaustive Search is Sometimes Useful:
Although costly, exhaustive search can be useful if
the query is optimized once and used multiple
times later.
 Randomized Strategies: Methods like iterative
improvement and simulated annealing try to
find a good solution without checking all options,
reducing the time and memory needed.
 Heuristics for Faster Optimization: Heuristics reduce the solution space by focusing on fewer strategies. A common method is to minimize the size of intermediate results during query execution.
 Heuristic in Distributed Systems: In distributed databases, a key approach is to use semijoins instead of joins to reduce the amount of data transferred between sites, saving communication costs.
3. OPTIMIZATION TIMING
 When Query Optimization Happens:
 It can be done either before the query runs
(static) or during query execution (dynamic).
 Static Query Optimization:
 Happens before query execution, at query compilation time.
 Its cost can be spread across multiple executions of the query.
 It works well with exhaustive search, but requires estimating the size of intermediate results using database statistics, which can sometimes be inaccurate.
 Dynamic Query Optimization:
 Happens during query execution.
 Decisions are based on the actual results of previous
steps, so no need for database statistics to guess
intermediate result sizes.
 More accurate, as real-time information is available,
but expensive since it must be repeated every time
the query is run.
 Best suited for ad-hoc queries (one-time queries).
 Hybrid Query Optimization:
 Combines the benefits of static and dynamic
optimization.
 Optimization starts statically, but switches to
dynamic if there's a big difference between
predicted and actual sizes of intermediate
results during query execution.
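The switch-over condition used by hybrid optimization can be sketched as follows; the function, the ratio test, and the threshold value are assumptions for illustration only:

```python
def should_reoptimize(estimated, actual, threshold=2.0):
    """Trigger dynamic re-optimization when the observed size of an
    intermediate result strays too far (in either direction) from the
    optimizer's static estimate."""
    ratio = max(estimated, actual) / max(1, min(estimated, actual))
    return ratio > threshold

print(should_reoptimize(1000, 1200))   # estimate close enough: keep static plan
print(should_reoptimize(1000, 50000))  # big mismatch: switch to dynamic
```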
4. STATISTICS
 Query optimization depends on accurate database
statistics to make good decisions.
 Uses statistics to decide the order in which query
operations should be performed.
 Needs even more detailed statistics to estimate the
size of intermediate results during query execution.
 Include details about data fragments, like their size,
number of rows (cardinality), and unique values of
each attribute.
 Sometimes, more detailed information, like histograms
of attribute values, is used to improve accuracy, but it
increases management costs.
 For accuracy, statistics need to be updated regularly.
 In static optimization, if the data changes significantly,
the query may need to be optimized again.
5. DECISION SITES
 In static optimization, either one site or multiple sites can participate in choosing the best way to answer a query.
 Most systems use a centralized approach, where one site decides the strategy for the entire system.
 In a distributed approach, several sites share the decision-making process, using only local information.
 The centralized approach is simpler but needs complete knowledge of the whole database, while the distributed approach uses only local knowledge.
 Hybrid approaches are common, where one site makes the main decisions and other sites handle local decisions.
6. EXPLOITATION OF THE NETWORK TOPOLOGY
 Network topology is used by distributed query processors to
improve query execution.
 In wide area networks (WANs), data communication costs are the
most important factor to reduce. This makes query optimization
simpler, as it focuses on two things:
 Deciding the overall execution plan based on communication
between sites.
 Deciding local execution plans using normal query processing
methods.
 In local area networks (LANs), communication costs are similar to
input/output (I/O) costs. So, it's okay to increase parallel execution
even if communication costs go up.
 Broadcasting capabilities in some LANs can be used to make join
operations faster.
 Specialized algorithms for different network types (like star or
satellite networks) can be used to optimize query processing.
 In a client-server environment, the client computer's power can
be used to process some database operations. The challenge is to
decide which tasks should be done on the client and which on the
server.
7. EXPLOITATION OF REPLICATED FRAGMENTS
 A distributed relation is usually split into smaller
pieces called fragments.
 When a query is made on the whole relation, it is
converted into queries on the individual fragments.
This process is called localization because it helps
identify where the data is stored.
 To improve reliability and speed, fragments are often
replicated (copied) across different sites.
 Most optimization methods handle localization
separately from the rest of the optimization process.
 Some algorithms use replicated fragments during
query execution to reduce communication time, but
this makes the optimization process more
complicated because there are more strategies to
choose from.
8. USE OF SEMIJOINS
 The semijoin operator helps reduce the size of the data
being processed.
 In distributed databases, when communication costs are
important, semijoins are helpful because they reduce the
amount of data transferred between sites during join
operations.
 However, using semijoins can increase the number of
messages exchanged and the local processing time.
 Early distributed systems, like SDD-1, used semijoins a lot
because they were designed for slow networks.
 Newer systems, like R*, use faster networks and avoid
semijoins. Instead, they perform joins directly to reduce
local processing costs.
 Even with fast networks, semijoins are useful when they
significantly reduce the size of the data involved in the join.
 Some query algorithms try to find the best combination of
joins and semijoins to optimize performance.
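The data-reduction effect of a semijoin can be sketched in a few lines of Python (relations modeled as lists of dicts; this is an illustration, not any particular system's implementation):

```python
def semijoin(r, s, attr):
    """R semijoin S on attr: the tuples of r that have a matching
    value of attr in s.  In a distributed setting, only the (small)
    set of attr values of s needs to cross the network."""
    s_keys = {t[attr] for t in s}
    return [t for t in r if t[attr] in s_keys]

emp = [{"ENO": "E1", "ENAME": "J. Doe"}, {"ENO": "E2", "ENAME": "M. Smith"}]
asg = [{"ENO": "E2", "PNO": "P1"}]

print(semijoin(emp, asg, "ENO"))   # [{'ENO': 'E2', 'ENAME': 'M. Smith'}]
```

Only the ENO values of ASG are shipped to EMP's site; the matching EMP tuples (here just one) are then sent to the join site, instead of transferring all of EMP.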
LAYERS OF QUERY PROCESSING
1. QUERY DECOMPOSITION
 First Layer: Transforms the calculus query into an algebraic
query based on global relations, using information from the
global schema. This layer treats the database as centralized
and doesn’t consider data distribution.
 Four Steps in Query Decomposition:
 Normalization: Rewrite the query into a standard form for easier
manipulation, adjusting logical operators and query qualifiers.
 Semantic Analysis: Check if the query is correct and reject any
incorrect queries early. This involves using graphs to understand the
query's meaning.
 Elimination of Redundancy: Make the query simpler by removing
unnecessary parts. Redundant parts often come from transformations
applied to the original query.
 Restructuring: Convert the calculus query into an algebraic query,
improving it for better performance. This involves translating the
query into relational operators and then optimizing it through
transformation rules.
 Optimization: The algebraic query generated is generally good
but may not be optimal because it doesn’t yet consider data
distribution or how data is fragmented.
2. DATA LOCALIZATION
 Second Layer Input: Starts with an algebraic query based on
global relations.
 Main Role: Localizes the query data using information about
data distribution and fragmentation from the fragment schema.
 Fragmentation: Data is divided into smaller pieces called
fragments, each stored at different sites. The second layer
figures out which fragments are needed for the query and
transforms the query to work with these fragments.
 Two Steps to Generate a Fragment Query:
 Mapping: Replace each global relation in the query with its
reconstruction program, which details how to access the
fragments.
 Simplification and Restructuring: Simplify and adjust the
fragment query to improve it, following similar rules as in
the first layer.
 Limitations: The final query on fragments is usually not
optimal because it doesn’t fully use the information about data
fragmentation.
3. GLOBAL QUERY OPTIMIZATION

 Third Layer Input: Takes an algebraic query on fragments (pieces of data).
 Goal: Find a nearly optimal execution strategy for the query; finding the absolute best strategy is often too complex.
 Execution Strategy: Includes relational algebra operators and communication actions (like sending and receiving data between sites).
 Previous Optimization: Earlier layers have already optimized the query by removing redundant parts, but this does not account for fragment specifics or communication needs.
 Optimization Task: Determine the best order for the
operators in the query, including communication actions,
to minimize costs such as disk space, I/O, CPU, and
communication.
 Cost Function: Measures various costs, often combining
I/O, CPU, and communication costs. Early systems
focused mainly on communication costs due to slow
networks, but now communication may be cheaper than
I/O.
 Predicting Costs: Costs are estimated using statistics
about the fragments and formulas for estimating results.
This helps in static optimization.
 Join Ordering: Arranging joins in the query can
significantly impact performance. Using semijoins can
reduce data size and communication costs but may
increase local processing costs.
 Output: The result of this layer is an optimized query
plan that includes communication actions, saved for
future use as a distributed query execution plan.
4. DISTRIBUTED QUERY EXECUTION
 Last Layer: Each site that stores fragments involved in the query performs this step.
 Local Queries: Each site runs and optimizes its part of the query, called a local query.
 Local Optimization: Uses the site's local schema to choose the best algorithms for processing.
 Algorithms: Relies on the standard algorithms used in centralized systems for processing relational operators.
QUERY DECOMPOSITION
 Query decomposition is the first step in processing a query.
 It transforms a query written in relational calculus (a formal language) into a query in relational algebra (another formal language).
 Both the input and output queries refer to global relations, without regard to where the data is stored.
 This means the process is the same whether you are working on a centralized or a distributed system.
 During this step, the query is assumed to be correctly
written. Once the process is done, the query should
be both logically correct and efficient, avoiding
unnecessary work. The four steps in query
decomposition are:
 Normalization: Adjusting the query into a standard

form.
 Analysis: Checking the query for correctness.

 Elimination of redundancy: Removing


unnecessary parts of the query.
 Rewriting: Converting the query into relational

algebra.
 The first three steps ensure the query is optimized by applying transformations that are equivalent in meaning but may improve performance. The final step rewrites the query using relational algebra.
NORMALIZATION
The input query can be arbitrarily complex, depending on the features of the language being used.
 The goal of normalization is to transform the query into a standard form that is easier to process.
 In languages like SQL, the most important part of this transformation concerns the WHERE clause of the query.
 The WHERE clause may be complex, but it is built from simple conditions combined by logical connectives, possibly preceded by quantifiers ("for all" ∀ or "there exists" ∃).
 There are two possible normal forms for the conditions:
 Conjunctive normal form (CNF): This form emphasizes "AND" (∧). It is a conjunction of disjunctions ("OR" groups) of simple predicates:
(p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
where each pij is a simple predicate.
 Disjunctive normal form (DNF): This form emphasizes "OR" (∨). It is a disjunction of conjunctions ("AND" groups) of simple predicates:
(p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)
 The transformation uses well-known equivalence rules for the logical operations AND (∧), OR (∨), and NOT (¬):
 1. p1 ∧ p2 ⇔ p2 ∧ p1
 2. p1 ∨ p2 ⇔ p2 ∨ p1
 3. p1 ∧ (p2 ∧ p3) ⇔ (p1 ∧ p2) ∧ p3
 4. p1 ∨ (p2 ∨ p3) ⇔ (p1 ∨ p2) ∨ p3
 5. p1 ∧ (p2 ∨ p3) ⇔ (p1 ∧ p2) ∨ (p1 ∧ p3)
 6. p1 ∨ (p2 ∧ p3) ⇔ (p1 ∨ p2) ∧ (p1 ∨ p3)
 7. ¬(p1 ∧ p2) ⇔ ¬p1 ∨ ¬p2
 8. ¬(p1 ∨ p2) ⇔ ¬p1 ∧ ¬p2
 9. ¬(¬p) ⇔ p

 These rules help transform the query to make it easier to process and improve its performance.
ENGINEERING DATABASE
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET, LOC, CNAME)
ASG(ENO, PNO, RESP, DUR)
PAY(TITLE, SAL)
“Find the names of employees who have
been working on project P1 for 12 or 24
months”
SELECT ENAME FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND
ASG.PNO = "P1" AND DUR = 12 OR DUR = 24
 The qualification in conjunctive normal form is
EMP.ENO = ASG.ENO ∧ ASG.PNO = “P1” ∧ (DUR = 12 ∨ DUR = 24)
 while the qualification in disjunctive normal form is
(EMP.ENO = ASG.ENO ∧ ASG.PNO = “P1” ∧ DUR = 12) ∨ (EMP.ENO = ASG.ENO ∧ ASG.PNO = “P1” ∧ DUR = 24)
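A quick mechanical check (treating the four simple predicates as independent booleans; the helper names are made up for this sketch) confirms that the two normal forms are logically equivalent:

```python
from itertools import product

# p_join = EMP.ENO = ASG.ENO, p_pno = ASG.PNO = "P1",
# d12 = DUR = 12, d24 = DUR = 24.
def cnf(p_join, p_pno, d12, d24):
    return p_join and p_pno and (d12 or d24)

def dnf(p_join, p_pno, d12, d24):
    return (p_join and p_pno and d12) or (p_join and p_pno and d24)

# Enumerate all 16 truth assignments: the two forms never disagree.
assert all(cnf(*v) == dnf(*v) for v in product([False, True], repeat=4))
print("CNF and DNF are equivalent")
```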
ANALYSIS
 Query analysis helps identify and reject queries that cannot
be processed further.
 A query is rejected if:
 It is type incorrect (wrong data types or undefined names are
used).
 It is semantically incorrect (the meaning of the query is invalid).
 If the query is incorrect, it is returned to the user with an
explanation.
 If the query is correct, processing continues.
 To detect incorrect queries:
 A type incorrect query occurs when:
 Attribute or relation names used in the query don’t exist in the global
schema.
 Operations are applied to the wrong data types (e.g., trying to add a
string to a number).
 The method used to find type errors is similar to type checking in programming languages.
 However, in a database query, the type information comes from the global schema (the structure of the database), not from declarations in the query itself.
EXAMPLE
 The following SQL query on the engineering database is type incorrect for two reasons.
 First, attribute E# is not declared in the schema.
 Second, the operation “>200” is incompatible with the type string of ENAME.
 SELECT E# FROM EMP
WHERE ENAME > 200;
 A query is semantically incorrect if its parts do not contribute to producing the correct result.
Relational Calculus and Semantic Correctness:
 In relational calculus, it is generally hard to tell whether a query is semantically correct.
 However, for a large class of queries (those without disjunction or negation), semantic correctness can be checked.
Use of Graphs in Queries:
 To analyze these queries, they are represented as a type of graph called a query graph or connection graph.
 Query Graph Structure:
 The graph contains nodes and edges:
 Nodes:
 One node represents the result of the query.
 Other nodes represent the operand relations (data tables).
 Edges:
 An edge between two non-result nodes represents a join operation.
 An edge connecting a node to the result node represents a projection operation.
 A non-result node may be labeled with a selection operation or a self-join.
 Join Graph:
 An important subgraph of the query graph is the join graph, which contains only the join operations.
 The join graph is very helpful during the query optimization phase.
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET, LOC, CNAME)
ASG(ENO, PNO, RESP, DUR)
PAY(TITLE, SAL)
 “Find the names and responsibilities of programmers who have been working on the CAD/CAM project for more than 3 years.”
The query expressed in SQL is
 SELECT ENAME, RESP FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"
AND DUR ≥ 36
AND TITLE = "Programmer"
Its query graph is connected, so the query is semantically correct.
 A semantically incorrect query yields a disconnected graph. Let us consider the following SQL query:
SELECT ENAME, RESP FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO AND
PNAME = "CAD/CAM" AND DUR ≥ 36 AND TITLE = "Programmer"
ELIMINATION OF REDUNDANCY
 Relational languages: These are used to handle and control
semantic data (data that has meaning).
 User queries: When a user writes a query (a request for
information), it is often done on a "view" (a virtual table based
on a database).
 Adding conditions: Sometimes, extra conditions (called
predicates) are added to the query to make sure the data is
accurate, secure, and corresponds to the right view.
 Redundancy: Adding too many conditions can create
unnecessary repetitions, leading to extra work for the system.
 Naive evaluation: If the system processes these conditions
without recognizing the repetitions, it may do the same work
multiple times.
 Solution: To avoid this problem, the system can use special
rules (called idempotency rules) to simplify the query and
remove the unnecessary, repeated conditions.
 Result: By simplifying, the system can work more efficiently,
reducing duplicated efforts.
IDEMPOTENCY RULES
 1. p ∧ p ⇔ p
 2. p ∨ p ⇔ p
 3. p ∧ true ⇔ p
 4. p ∨ false ⇔ p
 5. p ∧ false ⇔ false
 6. p ∨ true ⇔ true
 7. p ∧ ¬p ⇔ false
 8. p ∨ ¬p ⇔ true
 9. p1 ∧ (p1 ∨ p2) ⇔ p1
 10. p1 ∨ (p1 ∧ p2) ⇔ p1

 SELECT TITLE FROM EMP
WHERE (NOT (TITLE = "Programmer") AND
(TITLE = "Programmer" OR TITLE = "Elect. Eng.")
AND NOT (TITLE = "Elect. Eng.")) OR
ENAME = "J. Doe"
 Let p1 be TITLE = “Programmer”, p2 be TITLE = “Elect. Eng.”, and p3 be ENAME = “J. Doe”.
 The query qualification is
(¬p1 ∧ (p1 ∨ p2) ∧ ¬p2) ∨ p3
⇔ ((¬p1 ∧ p1 ∧ ¬p2) ∨ (¬p1 ∧ p2 ∧ ¬p2)) ∨ p3
⇔ (false ∨ false) ∨ p3
⇔ p3
 SELECT TITLE FROM EMP

WHERE ENAME = "J. Doe"
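The reduction above can be verified by a truth-table check over the three simple predicates (a small sketch, not part of any real optimizer):

```python
from itertools import product

def qualification(p1, p2, p3):
    """The original WHERE clause, with p1 = TITLE = "Programmer",
    p2 = TITLE = "Elect. Eng.", p3 = ENAME = "J. Doe"."""
    return (not p1 and (p1 or p2) and not p2) or p3

# For every truth assignment the qualification equals p3 alone.
assert all(qualification(p1, p2, p3) == p3
           for p1, p2, p3 in product([False, True], repeat=3))
print("qualification <=> p3")
```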


REWRITING
 Query Rewriting: The last part of decomposing a query is to convert it into relational algebra.
 Operator Tree:
 The query is represented as a tree, called an operator tree.
 The leaf nodes (bottom) are actual tables from the database.
 The non-leaf nodes (above) are intermediate results produced by relational algebra operations (like join, select, etc.).
 The tree starts at the leaves and ends at the root, which produces the final answer.
 Steps to Build the Tree:
 A leaf is created for each table (relation) mentioned in the query.
 The root node represents the final result (from the SQL SELECT clause).
 The conditions (from the SQL WHERE clause) are translated into relational algebra operations, connecting the leaves to the root.
 SELECT ENAME FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO AND
ASG.PNO = PROJ.PNO AND ENAME != "J. Doe"
AND PROJ.PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
 Tree Variations:
 There can be multiple valid versions of the operator tree, all
achieving the same result.
 There are six key rules (equivalence rules) to help rewrite
and optimize these trees.
 Important Rules:
 Commutativity: Operations like join and Cartesian product
can be switched around (e.g., R × S is the same as S × R).
 Associativity: You can group operations differently without
changing the result (e.g., (R × S) × T is the same as R × (S
× T)).
 Idempotence: Repeated operations can be grouped or split
to simplify the query.
 Commuting selection with projection: You can reorder
operations like selection and projection to improve
performance.
 Commuting selection with binary operators: Selection can be commuted with binary operators such as Cartesian product and join.
 Commuting projection with binary operators: Projection can likewise be commuted with Cartesian product and join.
 Optimization:
 These rules help optimize queries by reducing unnecessary
operations or grouping them efficiently.
 The goal is to avoid "bad" trees that take more time or
resources to compute.
 Practical Use:
 Unary operations (like select or project) are performed as
early as possible to reduce the size of data.
 Binary operations (like joins) are ordered for efficiency.
 The process helps make queries run faster and use fewer
resources.
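The "unary operations early" heuristic can be illustrated with toy relations (lists of dicts; the tuples are made up): performing the selection before the join is equivalent, but shrinks the input that the join has to process.

```python
def select(rows, pred):
    return [r for r in rows if pred(r)]

def join(r, s, attr):
    # Nested-loop natural join on one attribute.
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

proj = [{"PNO": "P1", "PNAME": "CAD/CAM"}, {"PNO": "P2", "PNAME": "DB"}]
asg = [{"PNO": "P1", "ENO": "E1"},
       {"PNO": "P2", "ENO": "E2"},
       {"PNO": "P2", "ENO": "E3"}]

is_cadcam = lambda r: r["PNAME"] == "CAD/CAM"
late = select(join(proj, asg, "PNO"), is_cadcam)   # join first: 3 intermediate tuples
early = join(select(proj, is_cadcam), asg, "PNO")  # select first: 1 tuple enters the join
assert late == early                               # commuting select with join is safe
print(len(select(proj, is_cadcam)))                # 1
```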
LOCALIZATION LAYER
 In the layering of the query processor, the localization layer translates an algebraic query on global relations into an algebraic query expressed on physical fragments, using information stored in the fragment schema.
 Fragmentation is defined through fragmentation rules, which can be expressed as relational queries.
 A global relation can be reconstructed by applying the reconstruction (or reverse fragmentation) rules, deriving a relational algebra program whose operands are the fragments. This process is called localization.
A basic (naive) way to localize a distributed query is to replace each global relation in the query with its localized version.
 This can be thought of as replacing each global relation with the smaller program that reconstructs it from its fragments.
 The query produced in this way is called the "localized query."
 This method is not efficient, because the localized query can usually still be simplified and improved considerably.
 The following reduction techniques are applied by the localization layer to make the query efficient.
1. REDUCTION FOR PRIMARY
HORIZONTAL FRAGMENTATION
The horizontal fragmentation function distributes a relation based on selection predicates.
E.g., EMP(ENO, ENAME, TITLE) is split into three horizontal fragments EMP1, EMP2, and EMP3.
 The localization program for a horizontally fragmented relation is the union of its fragments:
EMP = EMP1 ∪ EMP2 ∪ EMP3
 Reducing queries on horizontally fragmented

data mainly involves finding parts that will give


empty results after restructuring and removing
them.
 Horizontal fragmentation can help make both

selection and join operations simpler.


A) REDUCTION WITH SELECTION
 Selections on fragments whose defining predicate contradicts the selection predicate of the query produce empty results.
 Given a relation R that has been horizontally fragmented as R1, R2, ..., Rw, where Rj = σpj(R), the rule can be stated formally as follows:
Rule 1: σpi(Rj) = φ if ∀x in R: ¬(pi(x) ∧ pj(x))
Example: SELECT * FROM EMP WHERE ENO = "E5"
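A sketch of this pruning, with hypothetical ENO-range predicates for EMP1, EMP2, and EMP3 (the ranges are assumptions; the text does not give them):

```python
# Fragment predicates as Python callables: each fragment holds the
# EMP tuples whose ENO falls in its (assumed) range.
fragments = {
    "EMP1": lambda eno: eno <= "E3",
    "EMP2": lambda eno: "E3" < eno <= "E6",
    "EMP3": lambda eno: eno > "E6",
}

# SELECT * FROM EMP WHERE ENO = "E5": a fragment whose predicate
# contradicts ENO = "E5" can never contribute, so drop it unread.
query_eno = "E5"
needed = [name for name, pred in fragments.items() if pred(query_eno)]
print(needed)   # ['EMP2'] — EMP1 and EMP3 are pruned
```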
B) REDUCTION WITH JOIN
 Joins on horizontally fragmented data can be simplified when the fragments are based on the same attribute used in the join.
 This simplification involves distributing the join over unions and removing useless joins.
 Simplification Rule:
 Given two fragments R1 and R2 of a relation R, and a second relation S, the join of (R1 ∪ R2) with S can be split into two separate joins:
(R1 ∪ R2) ⋈ S = (R1 ⋈ S) ∪ (R2 ⋈ S)
 This transformation pushes the union above the joins, making all fragment joins explicit.
 Useless joins are identified when the predicates defining two fragments contradict each other, producing no result (an empty set).
 Rule 2: Given fragments Ri and Rj defined by predicates pi and pj on the same attribute, Ri ⋈ Rj = φ if ∀x in Ri, ∀y in Rj: ¬(pi(x) ∧ pj(y)).

 This rule allows the join of two relations to be computed as partial joins executed in parallel.
 The reduced query is better when there are few partial joins, which happens when many fragment predicates contradict each other.
 In the worst case, every fragment of one relation must be joined with every fragment of the other, resembling a Cartesian product of fragments; the reduction pays off when the two relations are fragmented with matching predicates on the join attribute.
EXAMPLE
 Assume that relation EMP is fragmented into EMP1, EMP2, and EMP3, as above, and that relation ASG is fragmented into ASG1 and ASG2 such that EMP1 and ASG1 are defined by the same predicate, and the predicate defining ASG2 is the union of the predicates defining EMP2 and EMP3.
SELECT * FROM EMP, ASG WHERE EMP.ENO = ASG.ENO;
The query, reduced by distributing joins over unions and applying Rule 2, can be implemented as a union of three partial joins that can be done in parallel.
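The reduction mechanism can be sketched on toy fragments (the tuples and matching predicates are assumed; only the mechanism matters):

```python
def join(r, s):
    """Nested-loop join on ENO."""
    return [(a, b) for a in r for b in s if a["ENO"] == b["ENO"]]

# EMP = EMP1 ∪ EMP2 and ASG = ASG1 ∪ ASG2, with matching predicates:
emp1 = [{"ENO": "E2"}]                    # same (assumed) predicate as ASG1
emp2 = [{"ENO": "E5"}]                    # same (assumed) predicate as ASG2
asg1 = [{"ENO": "E2", "PNO": "P1"}]
asg2 = [{"ENO": "E5", "PNO": "P2"}]

full = join(emp1 + emp2, asg1 + asg2)     # join on the whole relations
reduced = join(emp1, asg1) + join(emp2, asg2)  # EMP1⋈ASG2, EMP2⋈ASG1 dropped (Rule 2)
assert full == reduced                    # same answer, partial joins run in parallel
print(len(reduced))                       # 2
```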
2. REDUCTION FOR VERTICAL FRAGMENTATION
 Vertical fragmentation divides a relation by splitting it on subsets of its attributes (columns).
 To bring these fragments back together, a join operation is used.
 For vertically fragmented data, the localization program joins the fragments on the common (key) attribute they all share.
 Just like horizontal fragmentation, queries on vertical fragments can be simplified by finding and removing useless parts.
 Projections on a vertical fragment that shares no attributes with the projection attributes of the query (except the key) are useless, even though they are not empty.
 Given a relation R with attributes A = {A1, A2, ..., An}, a vertical fragment Ri = πA'(R) is defined over a subset A' of those attributes (A' ⊆ A); a projection on Ri is useless if the attributes requested by the query are not in A'.
Example: SELECT ENAME FROM EMP
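A minimal sketch with a hypothetical vertical split of EMP into (ENO, ENAME) and (ENO, TITLE):

```python
# Two vertical fragments sharing the key ENO (an assumed split).
emp1 = [{"ENO": "E1", "ENAME": "J. Doe"}]        # EMP1(ENO, ENAME)
emp2 = [{"ENO": "E1", "TITLE": "Programmer"}]    # EMP2(ENO, TITLE)

# Localization program: EMP = EMP1 join EMP2 on ENO.
emp = [{**a, **b} for a in emp1 for b in emp2 if a["ENO"] == b["ENO"]]

# SELECT ENAME FROM EMP only needs attributes of EMP1, so the
# reduced query projects EMP1 alone and never accesses EMP2.
assert [r["ENAME"] for r in emp] == [r["ENAME"] for r in emp1]
print([r["ENAME"] for r in emp1])   # ['J. Doe']
```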
3. REDUCTION FOR DERIVED FRAGMENTATION
 The join operation is very common and costly, so it can be optimized using primary horizontal fragmentation.
 If the joined relations are fragmented on the same join attribute, the join can be done as a union of partial joins.
 However, this method prevents one of the relations from being fragmented on other attributes that are useful for selection.
 Derived horizontal fragmentation is another approach that can improve both select and join operations.
 In derived horizontal fragmentation, relation R is fragmented based on the fragmentation of relation S, so that fragments with the same join attribute values are stored at the same site.
 Relation S can also be fragmented based on a selection condition.
 Derived fragmentation works best for one-to-many relationships (S → R), where one tuple of S matches multiple tuples of R, but each tuple of R matches exactly one tuple of S.
 Derived fragmentation could be used for many-to-many relationships, but this requires replicating tuples of S, which is hard to keep consistent.
 To keep things simple, it is recommended to use derived fragmentation only for hierarchical (one-to-many) relationships.
The localization program for a horizontally
fragmented relation is the union of the fragments.
In our example, we have
ASG = ASG1∪ ASG2
 Queries on derived fragments can also be simplified.
 Since derived fragmentation helps optimize join queries, one way to simplify is to distribute joins over the unions (used in the localization program) and apply Rule 2.
 Rule 2 says that if the fragmentation predicates conflict, the join produces an empty result.
 For example, the predicates of ASG1 and EMP2 conflict, so joining them gives an empty result: ASG1 ⋈ EMP2 = φ.
 Unlike the earlier join reductions, the reduced query here is always better than the localized query, because the number of partial joins usually equals the number of fragments of relation R.
 The reduction by derived fragmentation is illustrated by applying it to the following SQL query, which retrieves all attributes of tuples from EMP and ASG that have the same value of ENO and the title “Mech. Eng.”:
 SELECT * FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND TITLE = "Mech. Eng."
REDUCTION FOR HYBRID FRAGMENTATION
 Hybrid fragmentation is created by combining different fragmentation methods (horizontal, vertical, etc.).
 The goal of hybrid fragmentation is to efficiently handle queries involving projection (selecting columns), selection (filtering rows), and joins.
 Optimizing one operation or a combination of operations may make other operations less efficient.
 For example, if hybrid fragmentation is based on selection and projection, it may make selection alone or projection alone less efficient than pure horizontal or vertical fragmentation would.
 The localization program for a hybrid fragmented relation combines unions and joins of fragments.
 Queries on hybrid fragments can be reduced
by combining the rules used, respectively, in
primary horizontal, vertical, and derived
horizontal fragmentation. These rules can be
summarized as follows:
 Remove empty relations generated by
contradicting selections on horizontal
fragments.
 Remove useless relations generated by
projections on vertical fragments.
 Distribute joins over unions in order to isolate and remove useless joins.
QUERY OPTIMIZATION
 Query optimization is about improving how a query is executed, whether in a centralized or distributed environment.
 The query is first rewritten in relational algebra after being converted from a calculus expression.
Query Optimization:
 It is the process of creating a query execution plan (QEP) that minimizes cost: a strategy for executing a query efficiently.
 Three Main Parts of a Query Optimizer:
 Search space: the set of alternative execution plans for the query. All of them give the same result but differ in performance.
 Cost model: predicts the cost of each execution plan based on knowledge of the environment.
 Search strategy: explores the alternative execution plans and picks the best one using the cost model.
 Transformation rules (from relational algebra) are applied to generate the alternative execution plans.
 Whether the environment is centralized or distributed affects the search space and the cost model.
1. SEARCH SPACE
 Query execution plans are represented by
operator trees, which show the order of
operations for a query.
 Operator trees also include details like the best
algorithm chosen for each operation.
 Search Space for Queries:
 The search space is made up of all possible
operator trees for a given query, created using
transformation rules.
 Focus is often on "join trees" (operator trees
with join or Cartesian product) because the
order of joins impacts performance.
•Below are three equivalent join trees for that query,
obtained by exploiting the associativity of binary operators.
•Each tree can be assigned a cost based on the cost of its
operators.
•One tree (with a Cartesian product) may have a much higher
cost than the others.
Handling Complex Queries:
 For complex queries, there can be many possible
operator trees.
 For N relations, the number of possible join trees
is very large (O(N!)), making optimization time-consuming.
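The factorial growth can be seen directly; a tiny Python illustration (assuming we count only linear join orders and ignore Cartesian products):

```python
# How the number of linear join orders grows with N relations: O(N!).
import math

for n in (3, 5, 8, 10):
    # number of linear join orders for n relations
    print(n, math.factorial(n))
```

Even at 10 relations the optimizer would face millions of candidate orders, which is why the heuristics below are needed.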
Optimizers Reduce Search Space:
 Optimizers limit the search space by using
heuristics:
 Perform selection and projection early.
 Avoid unnecessary Cartesian products (e.g., operator
tree (c) would not be considered).
Types of Join Trees:
Linear trees: At least one operand in each operation is a
base relation. This reduces the search space to O(2^N).
Bushy trees: More general, where both operands can be
intermediate results. These are useful in distributed
systems for parallel execution.
Parallelism in Bushy Trees:
In a distributed environment, bushy trees allow operations
to run in parallel, improving performance (e.g., join tree (b)).
SEARCH STRATEGY
Dynamic Programming in Query Optimization:
 The most common search strategy is dynamic programming,
which is deterministic.
 It builds query plans step-by-step, starting from base relations
and adding one relation at a time until the complete plan is
formed.
 Dynamic programming creates all possible plans and selects
the best one.
 Pruning is used to discard partial plans that are unlikely to be
optimal, reducing optimization costs.
Greedy Algorithm (Another Deterministic Strategy):
 Unlike dynamic programming, the greedy algorithm builds
only one plan at a time, following a depth-first approach.
Pros and Cons of Dynamic Programming:
 It guarantees finding the best plan but is practical only for
queries with few relations (around 5 or 6).
 For more complex queries, it becomes too expensive in terms of
time and memory.
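The step-by-step construction above can be sketched in Python. This is a toy illustration, not an actual optimizer: the relation names, cardinalities, cost model, and the assumed 1% join selectivity are all invented for the example.

```python
# Dynamic-programming sketch: build the cheapest plan for ever-larger
# subsets of relations, adding one relation at a time and keeping only
# the best (cheapest) partial plan per subset (pruning).
from itertools import combinations

card = {"EMP": 1000, "ASG": 2000, "PROJ": 100}   # hypothetical cardinalities

def join_cost(left_card, right_card):
    return left_card * right_card                 # toy cost model

# base case: each single relation is "free" to access
best = {frozenset([r]): (0, card[r], r) for r in card}  # (cost, card, plan)

relations = list(card)
for size in range(2, len(relations) + 1):
    for subset in combinations(relations, size):
        s = frozenset(subset)
        for r in subset:                          # extend a smaller plan by r
            rest = s - {r}
            if rest not in best:
                continue
            c, n, plan = best[rest]
            total = c + join_cost(n, card[r])
            if s not in best or total < best[s][0]:
                # assume 1% join selectivity for the result size estimate
                best[s] = (total, n * card[r] // 100, f"({plan} ⋈ {r})")

print(best[frozenset(relations)])   # cheapest complete plan
```

Note how pruning works: for each subset only the cheapest partial plan survives, so dominated orderings are discarded before they can be extended.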
Randomized Strategies for Complex Queries:
 Randomized strategies are used for complex queries with
more relations.
 These strategies reduce the complexity of optimization but
don't always guarantee the best plan.
 They allow the optimizer to balance between optimization
time and execution time.
Examples of Randomized Strategies:
 Iterative Improvement focuses on improving a plan by
making small changes to it.
 These algorithms start with one or more initial plans,
created using a greedy approach, and try to improve them
by exploring nearby plans.
 A typical change might involve swapping two relations
randomly.
Performance of Randomized Strategies:
 Randomized strategies tend to perform better than
deterministic strategies when the query involves many
relations.
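A minimal sketch of Iterative Improvement, under stated assumptions: a toy left-deep cost model with a made-up 1% join selectivity and invented relation sizes. The random swap of two relations is the "small change" described above.

```python
# Iterative improvement: start from an initial join order and repeatedly
# try a random swap of two relations, keeping the change only if it
# lowers the (toy) cost.
import random

random.seed(42)
card = {"R1": 1000, "R2": 10, "R3": 500, "R4": 50}   # hypothetical sizes

def cost(order):
    # toy model: cost of a left-deep plan = sum of intermediate sizes
    total, inter = 0, card[order[0]]
    for r in order[1:]:
        inter = inter * card[r] // 100   # assume 1% join selectivity
        total += inter
    return total

plan = list(card)                        # initial plan (greedy step skipped)
best_cost = cost(plan)
for _ in range(100):
    i, j = random.sample(range(len(plan)), 2)
    plan[i], plan[j] = plan[j], plan[i]  # swap two relations at random
    c = cost(plan)
    if c < best_cost:
        best_cost = c                    # accept the improvement
    else:
        plan[i], plan[j] = plan[j], plan[i]  # undo the swap

print(plan, best_cost)
```

Unlike dynamic programming, nothing guarantees the final plan is optimal; the loop simply trades optimization time for plan quality.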
DISTRIBUTED COST MODEL
An optimizer's cost model helps predict how
long it will take to run a query. It includes:
 Cost functions: These predict the time
needed for each operation in the query.
 Statistics and base data: These help
estimate the cost by using information about
the data.
 Formulas: These calculate the sizes of the
results produced during different steps of the
query.
 The goal is to estimate the execution time
of the query.
1. COST FUNCTIONS
Cost of Distributed Execution Strategy:
 The cost can be measured in two ways:
 Total time: The sum of all time components involved in the
query.
 Response time: The time from starting the query to
completing it.
Formula for Total Time:
 Total time is calculated using this formula:
Total_time = TCPU * #insts + TI/O * #I/Os + TMSG * #msgs + TTR * #bytes
 TCPU: Time per CPU instruction.
 TI/O: Time per disk input/output (I/O).
 TMSG: Time to send or receive a message.
 TTR: Time to transfer a unit of data between sites.
 Local processing time includes CPU and I/O operations.
 Communication time includes sending messages and transferring
data between sites.
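Plugging illustrative numbers into the total-time formula (all coefficients and counts below are made up for the example, not real measurements):

```python
# Total time = CPU time + I/O time + message time + data-transfer time.
T_CPU, T_IO, T_MSG, T_TR = 1e-6, 1e-3, 0.1, 1e-5   # seconds per unit (assumed)

insts, ios, msgs, bytes_ = 100_000, 50, 4, 10_000   # assumed counts

total_time = T_CPU * insts + T_IO * ios + T_MSG * msgs + T_TR * bytes_
print(total_time)  # local processing + communication, in seconds
```

With these numbers the message and transfer terms dominate, which matches the wide-area-network discussion below.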
Communication Time:
 Communication time is based on the number of
bytes sent and received. It is often assumed to be
constant for simplicity, but this may vary in wide-area
networks.
Cost in Networks:
 In wide-area networks (like the Internet),
communication time dominates due to long
distances.
 In local-area networks, the cost of local
processing and communication is more balanced.
RESPONSE TIME:
 Response time counts the longest sequence of
tasks that must be done one after another.
 Work done in parallel (processing or
communication) does not add to the response time.
 The goal is to reduce response time by doing
tasks in parallel.
EXAMPLE OF TOTAL TIME VS.
RESPONSE TIME:
 When transferring data from two sites to a third
site:
 Total time adds up the time for both data transfers.
 Response time considers the longer of the two
transfers, as they can happen in parallel.
Trade-Off Between Total Time and Response
Time:
 Minimizing response time often requires more
parallel execution, which can increase total time.
 Minimizing total time improves resource usage
but might not reduce response time.
 A balance between minimizing total time and
response time is usually the goal.
 Let us illustrate the difference between total
cost and response time using the example in the
figure, which computes the answer to a
query at site 3 with data from sites 1 and 2.
For simplicity, we assume that only
communication cost is considered.
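The figure's example can be computed directly. In the sketch below, x and y are assumed amounts of data sent from sites 1 and 2 to site 3, and the coefficients are illustrative:

```python
# Only communication cost is counted: site 1 sends x units to site 3 and
# site 2 sends y units to site 3.
T_MSG, T_TR = 0.1, 0.01          # assumed per-message and per-unit costs
x, y = 20, 30                    # assumed amounts of data transferred

total_time    = 2 * T_MSG + T_TR * (x + y)               # both transfers add up
response_time = max(T_MSG + T_TR * x, T_MSG + T_TR * y)  # transfers overlap

print(total_time, response_time)
```

Because the two transfers happen in parallel, response time is governed by the slower of the two, while total time sums both.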
 Database Statistics:
 The size of intermediate results affects performance,
especially when operations are done at different
locations.
 The goal is to estimate the size of these intermediate
results to reduce data transfer over the network.
 Database statistics help predict the size of
intermediate results, like the number of tuples (rows)
and the distinct values of attributes (columns).
 Cardinality Estimation:
 Cardinality refers to the number of tuples in the
result of a database query operation.
 Estimating cardinalities helps improve query
performance.
 Formulas are used to estimate the number of results
from operations like selection, projection, join, and
others.
Query Operations:
 For selection (choosing specific rows), formulas
estimate how many rows match a condition.
 For projection (choosing specific columns), if duplicate rows
are removed, it's harder to estimate the result size.
 Cartesian product simply multiplies the number of rows in
two tables.
 Join operations combine data from two tables, and the result
size depends on how well the data matches.
Histograms:
 Histograms help in estimating query results by capturing data
distribution, especially when the data is not evenly
distributed.
 They divide the data into buckets to represent the frequency
and range of values.
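A minimal equi-width histogram sketch (values are made up), showing how bucket counts give a cardinality estimate for a range predicate:

```python
# Build an equi-width histogram over a column, then use a bucket count to
# estimate how many rows a range predicate selects.
values = [3, 7, 12, 15, 18, 22, 25, 31, 34, 38]   # made-up attribute values
lo, hi, n_buckets = 0, 40, 4
width = (hi - lo) / n_buckets

buckets = [0] * n_buckets
for v in values:
    buckets[min(int((v - lo) / width), n_buckets - 1)] += 1

# Estimate card(10 <= A < 20): bucket 1 covers exactly [10, 20) here.
estimate = buckets[1]
print(buckets, estimate)
```

Real systems use more refined variants (e.g., equi-depth buckets), but the estimation idea is the same.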
Query Optimization:
 Centralized query optimization involves optimizing
queries on a single database system.
 Dynamic query optimization builds and optimizes queries
during execution without relying on a pre-defined cost model.
CENTRALIZED QUERY
OPTIMIZATION
 Centralized query optimization is the process of
improving the performance of queries in a single-
computer system.
 Distributed queries are broken down into local queries,
which are processed centrally.
 Distributed optimization often builds on centralized
techniques.
 Centralized optimization is simpler because it doesn't
involve minimizing communication costs.
Classification by Timing:
 Dynamic: Optimization happens while the query is
running. Eg: INGRES Algorithm
 Static: Optimization is done before the query is
executed. Eg: System R Algorithm
 Hybrid: A mix of dynamic and static methods.
DYNAMIC QUERY OPTIMIZATION
 What is Dynamic Query Optimization?
 It combines two phases: breaking down the query and optimizing
it, then executing it directly.
 No need to estimate the query cost beforehand.
 How it Works:
 The optimizer breaks down the main query into smaller sub-queries.
 Each sub-query focuses on one table at a time.
 These sub-queries are executed step-by-step.
 Choosing the Best Method:
 The optimizer selects the best way to access data, like using an index or
scanning the table.
 For example, if you want a specific value from column A, the index on
column A will be used.
 Breaking Down the Query:
 If the query has many conditions, it’s split into smaller parts.
 Each smaller part uses a common table, making the query easier to
execute.
 Example:
 If you want to get the names of employees working on a specific project:
 First, get the project number of the desired project.
 Then, find employees linked to this project number.
 Handling Complex Queries:
 For complex queries, the algorithm reduces the number of tables
involved in each sub-query.
 It replaces the rows in a table with their actual values, making the
query smaller.
 Steps for Dynamic Query Optimization:
 Split the query into smaller, single-table queries (if possible).
 Apply conditions and extract data early to reduce the size of
intermediate results.
 Use efficient data structures to store intermediate results.
 Handle complex sub-queries by substituting tuples (rows) one by
one.
 Example of Tuple Substitution:
 If the query has two employees to be matched, it will create two
separate sub-queries for each employee.
 Then, execute these sub-queries to get the final result.
 When to Use Dynamic Query Optimization:
 It is useful when queries are complex and involve multiple tables and
conditions.
 Helps reduce execution time by breaking down the query step-by-
step.
 This method is called Dynamic-QOA and processes the query through recursive steps.
INGRES-QOA
Input:
 MRQ: Multi-relation query with n relations.
Output:
 output: Final result after execution.
 Initialization:
 Set output as an empty set: output ← φ.
 Check if the Query has Only One Relation:
 If n = 1, execute MRQ directly:
 output ← run(MRQ).
 Return the result as the final output.
 Recursive Case:
 If the MRQ has more than one relation:
 The MRQ is decomposed into m one-relation queries
(ORQs) and a smaller MRQ (MRQ').
 Each ORQ is executed using the run function, and the
results are merged into output.
 A relation R is chosen from the remaining MRQ
(MRQ') for tuple substitution.
 For each tuple t in R:
 The values of t are substituted into MRQ' to
create a new MRQ (MRQ'').
 The algorithm is recursively called on MRQ''
to obtain its result (output').
 The result of MRQ'' (output') is merged into output.
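The recursion above can be sketched compactly in Python. This illustrates the detachment and tuple-substitution idea only, not the actual INGRES code; the relations and the query are invented:

```python
# Dynamic-QOA sketch: detach a one-relation query (ORQ), then substitute
# each of its tuples into the remaining query, reducing it to further
# one-relation queries.
ASG = [{"ENO": 1, "PNO": "P1"}, {"ENO": 2, "PNO": "P2"}]
EMP = [{"ENO": 1, "ENAME": "Smith"}, {"ENO": 2, "ENAME": "Jones"}]

def run_orq(rel, pred):
    """Execute a one-relation query: a simple selection."""
    return [t for t in rel if pred(t)]

# MRQ: employees assigned to project "P1".
# Step 1: detachable ORQ on ASG (selection on PNO).
asg_p1 = run_orq(ASG, lambda t: t["PNO"] == "P1")

# Steps 2-3: tuple substitution — for each qualifying ASG tuple, the
# remaining query becomes a one-relation query on EMP with ENO bound
# to a constant.
output = []
for t in asg_p1:
    output += run_orq(EMP, lambda e, eno=t["ENO"]: e["ENO"] == eno)

print(output)
```
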
STATIC APPROACH: SYSTEM R
CENTRALIZED ALGORITHM
 Execute Joins:
 Determine the possible orderings of joins.
 Determine the cost of each ordering.
 Choose the join ordering with the minimal
cost.
Example:
“Names of employees working on the CAD/CAM
project”
PROJ * ASG * EMP
EMP has an index on ENO
ASG has an index on PNO
PROJ has an index on PNO and an index on PNAME

• The label ENO on edge EMP-ASG
stands for the predicate
EMP.ENO = ASG.ENO.
• The label PNO on edge ASG-PROJ
stands for the predicate
ASG.PNO = PROJ.PNO.
STEP 1: SELECT THE BEST SINGLE-RELATION ACCESS PATHS
EMP: sequential scan (because there is no selection on
EMP)
ASG: sequential scan (because there is no selection on
ASG)
PROJ: index on PNAME (because there is a selection on
PROJ based on PNAME).
STEP 2: SELECT THE BEST JOIN
ORDERING FOR EACH RELATION
• The operations marked "pruned" are dynamically eliminated.
• (EMP × PROJ) and (PROJ × EMP) are pruned because they are
Cartesian products.
• We assume that (EMP ⋈ ASG) and (ASG ⋈ PROJ) have a cost higher than
(ASG ⋈ EMP) and (PROJ ⋈ ASG), respectively.
• The best total join order is ((PROJ ⋈ ASG) ⋈ EMP), since it uses the indexes
best:
* Select PROJ using the index on PNAME.
* Join with ASG using the index on PNO.
SYSTEM R ALGORITHM
 Input:
 You have a query tree (QT) that involves n different
relations (tables).
 Output:
 The goal is to determine the best Query Execution Plan
(QEP) that has the lowest cost.
 Step 1: Calculate the Cost of Each Access Path
 For every relation Ri in the query tree:
 Identify each possible access path APij to that relation (e.g., a
sequential scan or using an index).
 Compute the cost for each access path APij using a cost
function that takes into account factors like I/O, CPU usage,
etc.
 Select the access path APij that has the lowest cost for that
relation and store it as the "best access path" for Ri.
 Step 2: Explore Different Orders of Joins
 There are many different ways to join n relations.
The algorithm considers each possible order of
joins:
 For each order of relations (e.g., (R1, R2, ..., Rn)),
construct a query execution plan using the best
access paths found in Step 1.
 Compute the cost of this specific QEP.
 Step 3: Determine the Best QEP
 Out of all the possible query execution plans,
select the one with the minimum cost.
 Output:
 The QEP with the minimum cost is the best one
and is returned as the final output.
DISTRIBUTED QUERY
OPTIMIZATION- INGRES DYNAMIC
APPROACH
 Step 1: Break down the multi-relation query into
individual monorelation queries and execute
them.
 Step 2: Reduce the multi-relation query into

irreducible queries.
 Step 3: While there are irreducible queries:
choose a query, decide where to process it,
move data fragments to the processing sites,
and execute the query.
 Step 4: Return the result of the last executed
query.
DISTRIBUTED QUERY OPTIMIZATION-
INGRES DYNAMIC APPROACH
 Input:
 The algorithm takes a multi-relation query
(MRQ) as input.
 MRQ consists of multiple relations (tables)
that need to be processed.
 Output:
 The result of the last multi-relation query
execution.
Step 1: Execute Each Monorelation Query
(ORQ)
 For each detachable query ORQi in MRQ:
 ORQi is a monorelation query (a query involving only a
single relation).
 Run each ORQi individually to get initial results.
Step 2: Reduce the Query
 Reduce the multi-relation query (MRQ) to a list of
irreducible queries (MRQ' list).
 These irreducible queries are smaller subqueries
that cannot be simplified further.
Step 3: While There Are Irreducible Queries (n ≠ 0)
 The variable n keeps track of the number of irreducible queries
left to process.
Step 3.1: Select the Next Query
 Choose the next irreducible query MRQ' that involves the smallest
fragments (smallest pieces of data).
Step 3.2: Determine Transfer and Processing Strategy
 Determine how to transfer fragments (data pieces) and where to
process the selected query (MRQ').
 Use a function SELECT_STRATEGY to decide which fragments to move
and where to process them.
Step 3.3: Move Selected Fragments to the Chosen Sites
 For each fragment F and site S in the fragment-site list:
 Move fragment F to the site S (a specific server or database location).
Step 3.4: Execute the Query MRQ'
 Execute the selected irreducible query MRQ' with the transferred
fragments.
Step 3.5: Update n
 Decrease n by 1 to reflect that one more irreducible query has been
processed.
 Step 4: Return the Final Output
 The result of the last executed query (MRQ')
is the final result of the original multi-relation
query (MRQ).