Ch-9 Multidatabase Query Processing
by:
Kay Kay Chetty
Jason Eversole
Introduction
Issues in Multidatabase Query Processing
The autonomy of the component DBMSs poses problems, so we characterize this
autonomy along three main dimensions.
Communication: a component DBMS communicates with others at its own discretion
and may terminate its services at any time.
Design: design autonomy may restrict the availability and accuracy of the cost
information needed for query optimization.
Execution: it is difficult to apply some of the query optimization strategies
discussed previously, because communication with component DBMSs occurs at a
high level of the DBMS API.
In distributed DBMSs, query processors have to deal only with data distribution
across multiple sites.
In a distributed multidatabase environment, data is distributed not only across
sites but also across multiple databases.
This difference increases the number of parties involved in processing a query
from two in the distributed DBMS to three in the distributed multidatabase
environment.
Multidatabase Query Processing Architecture
Mediator/Wrapper Architecture
Each component database has a wrapper that exports the source schema, data, and
query processing capabilities.
A mediator centralizes the information provided by the wrappers in a unified
view of the available data, stored in a global data dictionary.
The mediator also performs query processing, using the wrappers to access the
component DBMSs.
The data model used by the mediator can be:
Relational
Object-Oriented
Semi-Structured
The mediator/wrapper architecture has several advantages:
The specialized components allow various concerns of different kinds of users to be
handled separately
Mediators typically specialize in a related set of component databases with similar
data
Three main layers are involved in multidatabase query processing.
The first two layers map the input query into an optimized distributed query
execution plan (QEP).
They perform the functions of query rewriting, query optimization, and some
query processing.
The third layer performs query translation and execution using the wrappers.
Query Rewriting Using Views
Datalog Terminology
Consider the relations EMP(ENO, ENAME, TITLE, CITY) and ASG(ENO, PNO, DUR).
Assume ENO is the primary key of EMP and (ENO, PNO) is the primary key of ASG.
SQL Query:
SELECT ENO, TITLE, PNO
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND (TITLE = "Programmer" OR DUR = 24)
Query in Datalog:
Q(ENO, TITLE, PNO) :- EMP(ENO, ENAME, "Programmer", CITY), ASG(ENO, PNO, DUR)
Q(ENO, TITLE, PNO) :- EMP(ENO, ENAME, TITLE, CITY), ASG(ENO, PNO, 24)
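The union of the two Datalog rules can be evaluated directly over sample tuples; a minimal Python sketch (the tuple data below is made up for illustration):

```python
# Sample EMP(ENO, ENAME, TITLE, CITY) and ASG(ENO, PNO, DUR) tuples (made up).
EMP = [("E1", "Smith", "Programmer", "Paris"),
       ("E2", "Jones", "Analyst", "London")]
ASG = [("E1", "P1", 12), ("E2", "P2", 24)]

# Union of the two conjunctive queries: join on ENO, then
# TITLE = "Programmer" OR DUR = 24.
result = {
    (eno, title, pno)
    for (eno, ename, title, city) in EMP
    for (eno2, pno, dur) in ASG
    if eno == eno2 and (title == "Programmer" or dur == 24)
}
print(sorted(result))  # [('E1', 'Programmer', 'P1'), ('E2', 'Analyst', 'P2')]
```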
Rewriting in GAV
Is this approach, the global schema is expressed in terms of the data sources
Each global relation is defined as a view over the local relation
The rewriting technique that will be used is called unfolding
It replaces each global relation invoked with it’s corresponding view
This is done by applying the view definition rules to the query and producing
a union of conjunctive queries
Unfolding can however generate redundant queries that need to be
eliminated
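Unfolding is mechanical substitution; a minimal sketch in Python (the view definitions are condensed from the chapter's EMPASG example, with attribute mappings omitted):

```python
# Minimal sketch of GAV unfolding: each global relation maps to a union of
# conjunctive definitions over source relations.
GLOBAL_VIEWS = {
    "EMPASG": [                   # union of two conjunctive definitions
        ["db1.EMP", "db2.ASG"],   # join of two sources
        ["db3.EMPASG"],           # a source storing the relation directly
    ],
}

def unfold(query_atoms):
    """Replace every global atom by its view definition, producing a
    union (list) of conjunctive queries over source relations only."""
    rewritings = [[]]
    for atom in query_atoms:
        defs = GLOBAL_VIEWS.get(atom, [[atom]])  # source atoms pass through
        rewritings = [r + d for r in rewritings for d in defs]
    return rewritings

# A query over one global relation unfolds into two conjunctive queries.
print(unfold(["EMPASG"]))  # [['db1.EMP', 'db2.ASG'], ['db3.EMPASG']]
```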
Rewriting in LAV
The MiniCon algorithm addresses the limitations of the bucket algorithm by
considering the query globally and examining how each predicate in the query
interacts with the views.
First, views are selected that contain subgoals corresponding to the subgoals
of query Q.
The algorithm finds a mapping that unifies a subgoal q of Q with a subgoal v in
a view V.
It then considers the join predicates in Q and finds the minimal set of
additional subgoals of Q that must be mapped to subgoals in V.
This set of subgoals of Q is captured by a MiniCon description (MCD) associated
with V.
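A highly simplified illustration of the MiniCon idea, with variable unification and head-variable handling omitted and the view names V1..V3 invented: an MCD records which query subgoals a view can cover, and rewritings combine MCDs whose covered subgoals partition the query's subgoals.

```python
# Query subgoals and view definitions, reduced to bare relation names.
QUERY = {"EMP", "ASG"}
VIEWS = {"V1": {"EMP"}, "V2": {"ASG"}, "V3": {"EMP", "ASG"}}

def mcds(query, views):
    """One MCD per relevant view: the query subgoals the view covers."""
    return {v: query & goals for v, goals in views.items() if query & goals}

def rewritings(query, views):
    """Combine one or two MCDs that exactly partition the query subgoals."""
    descr = mcds(query, views)
    names = list(descr)
    found = []
    for i, a in enumerate(names):
        if descr[a] == query:                 # a single view covers all of Q
            found.append((a,))
        for b in names[i + 1:]:
            if descr[a] | descr[b] == query and not descr[a] & descr[b]:
                found.append((a, b))
    return found

print(rewritings(QUERY, VIEWS))  # [('V1', 'V2'), ('V3',)]
```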
Query Optimization and Execution
Heterogeneous Cost Modeling
The definition of the global cost function, and the associated problem of
obtaining cost-related information from component DBMSs, is the most studied
of the three problems.
We are primarily interested in determining the cost of the lower levels of a
query execution tree, which correspond to the parts of the query executed at
the component DBMSs.
Three approaches:
I. Black Box Approach
II. Customized Approach
III. Dynamic Approach
Black Box Approach
This approach treats each component DBMS as a block box, running some test
queries against it and then determines the necessary cost information
Cost functions are expressed logically rather than on the basis of physical
characteristics.
The cost function for component DBMSs is expressed as:
The individual terms of this formula differ for different operators.
The difficulty lies in determining the coefficients of the formula, since they
change with different component DBMSs.
The major drawback of this approach is that the cost model is common to all
component DBMSs and may not capture their individual specifics.
Thus, it might fail to accurately estimate the cost of a query executed at a
component DBMS that exhibits unforeseen behavior.
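The calibration idea behind the black-box approach can be sketched as fitting the coefficients of a simple cost formula to measured test-query times. The formula shape (cost = c0 + c1 * cardinality) and the sample numbers below are illustrative, not from the chapter:

```python
def fit_linear(samples):
    """Least-squares fit of cost = c0 + c1 * card from (card, cost) samples."""
    n = len(samples)
    sx = sum(c for c, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(c * c for c, _ in samples)
    sxy = sum(c * t for c, t in samples)
    c1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c0 = (sy - c1 * sx) / n
    return c0, c1

# Synthetic test-query measurements, exactly linear: cost = 5 + 2 * card.
samples = [(10, 25.0), (100, 205.0), (1000, 2005.0)]
c0, c1 = fit_linear(samples)
print(round(c0, 3), round(c1, 3))  # 5.0 2.0
```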
Customized Approach
Uses previous knowledge about the component DBMSs and their external
characteristics to subjectively determine the cost information.
The premise is that the query processors of the component DBMSs are too
different to be represented by a single common cost model.
Assumes that the ability to accurately estimate the cost of local subqueries
improves global optimization.
Provides a framework to integrate the component DBMSs' cost models into the
mediator query optimizer.
Extends the wrapper interface so that the mediator gets specific cost
information from each of the wrappers.
The wrapper developer is free to provide a cost model, partially or entirely.
This poses the challenge of integrating the cost description into the mediator
query optimizer.
There are two main solutions.
One: provide logic within the wrapper to compute three cost estimates:
I. The time to initiate the query process and receive the first result item
(reset_cost)
II. The time to get the next item (advance_cost)
III. The result cardinality
This makes the total query cost:
Total_access_cost = reset_cost + (cardinality - 1) * advance_cost
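As a sanity check, the wrapper cost model can be written directly as a function (names from the slide; the numbers below are illustrative):

```python
def total_access_cost(reset_cost, advance_cost, cardinality):
    """Total query cost: initiation, then one advance per remaining item."""
    return reset_cost + (cardinality - 1) * advance_cost

# E.g., 10.0 to get the first item, 0.5 per further item, 101 items in all.
print(total_access_cost(10.0, 0.5, 101))  # 60.0
```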
Dynamic Approach
One approach extends the sampling method, treating user queries as new samples
in environments where the network connection, data storage, and available
memory change over time.
Query response time is measured to adjust the cost model parameters at run
time for subsequent queries.
This avoids the overhead of periodically processing sample queries.
However, it still requires heavy computation to solve the cost model equations
and does not guarantee that the cost model's precision improves over time.
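The run-time adjustment can be caricatured as nudging a cost coefficient toward the ratio of observed to predicted response time. This sketch (including the smoothing factor `alpha`) is illustrative and not the chapter's actual equations:

```python
def adjust(coeff, predicted, observed, alpha=0.2):
    """Exponential smoothing: scale the coefficient toward observed/predicted."""
    return coeff * ((1 - alpha) + alpha * observed / predicted)

# The model predicted 100 ms but the query took 150 ms, so the
# coefficient is nudged upward for subsequent queries.
c = 1.0
c = adjust(c, predicted=100.0, observed=150.0)
print(round(c, 2))  # 1.1
```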
Heterogeneous Query Optimization
One component DBMS may support only simple select operations, while another
may support complex queries involving joins and aggregates.
Depending on how the wrappers export such capabilities, query processing at
the mediator level can be more or less complex.
1---Query-based:
The wrappers support the same query capability, e.g., a subset of SQL, which
is translated to the capability of the component DBMS.
2---Operator-based:
The wrappers export the capabilities of the component DBMSs through
compositions of relational operators.
This gives more flexibility in defining the division of functionality between
the mediator and the wrappers; however, any functionality not supported by a
component DBMS (e.g., join) must be implemented at the mediator.
Query-based Approach
Since the component DBMSs appear homogeneous to the mediator, one approach is
to use a distributed cost-based query optimization algorithm with a
heterogeneous cost model.
Extensions are needed to convert the distributed execution plan into
subqueries for the component DBMSs.
Hybrid two-step optimization technique:
1. At compile time, a static plan is produced by a centralized cost-based
query optimizer.
2. At startup time, an execution plan is produced by carrying out site
selection and allocating the subqueries to the sites.
The cost-based query optimizer first generates a left-linear join tree and
then converts it into a bushy tree.
A hybrid algorithm concurrently performs a bottom-up and a top-down sweep of
the left-linear join execution tree, transforming it, step by step, into a
bushy tree.
The algorithm maintains two pointers:
Bottom UAN (upper anchor node): initially set to the grandparent of the
leftmost leaf node.
Top UAN: initially set to the root.
For each UAN, the algorithm selects a lower anchor node (LAN).
The LAN is chosen such that the response time of its right child subtree is
close to that of the corresponding UAN's right child subtree.
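The two tree shapes can be pictured as nested pairs; a toy sketch with invented relation names R1..R4, showing that the bushy form is shallower (and hence offers more parallelism):

```python
# A left-linear (left-deep) join tree vs. a bushy tree over R1..R4,
# represented as nested (left, right) tuples; leaves are relation names.
left_deep = ((("R1", "R2"), "R3"), "R4")
bushy = (("R1", "R2"), ("R3", "R4"))

def depth(tree):
    """Height of the join tree; a leaf relation has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))

print(depth(left_deep), depth(bushy))  # 3 2
```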
Operator-based Approach
Expressing the capabilities of the component DBMSs through relational
operators allows tight integration of query processing between mediator and
wrappers.
In this approach, the capabilities of the component DBMSs are expressed by
the wrappers as planning functions that can be directly called by a
centralized query optimizer.
Consider the following SQL query submitted to mediator m:
SELECT ENAME, PNAME, DUR
FROM EMPASG
WHERE CITY = "Paris" AND DUR > 24
In the GAV approach, the global view EMPASG(ENAME, CITY, PNAME, DUR) is
defined as:
EMPASG = (db1.EMP ⋈ db2.ASG) ∪ db3.EMPASG
• The operator-based approach is also used to model non-relational data
sources such as web sites.
1. Search space generation:
The query is decomposed into a number of QEPs, which constitute the search
space for query optimization.
The search space is generated using a traditional search strategy, such as
dynamic programming.
2. QEP decomposition:
Each QEP is decomposed into a forest of n wrapper QEPs and a composition QEP.
Each wrapper QEP is the largest part of the initial QEP that can be entirely
executed by the wrapper.
Operators that cannot be performed by a wrapper are moved up to the
composition QEP.
The composition QEP combines the results of the wrapper QEPs into the final
answer, typically through unions and joins of the intermediate results
produced by the wrappers.
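The decomposition step can be caricatured as a capability filter. This sketch ignores the plan's tree structure and uses an invented capability set; real decomposition walks the QEP and cuts it at the operators a wrapper cannot execute:

```python
# Operators this (hypothetical) wrapper can execute itself.
WRAPPER_CAPS = {"scan", "select", "project"}

def decompose(plan_ops):
    """Split a flat list of plan operators into the part a wrapper can run
    (wrapper QEP) and the part the mediator must run (composition QEP)."""
    wrapper_qep = [op for op in plan_ops if op in WRAPPER_CAPS]
    composition_qep = [op for op in plan_ops if op not in WRAPPER_CAPS]
    return wrapper_qep, composition_qep

print(decompose(["scan", "select", "join", "union"]))
# (['scan', 'select'], ['join', 'union'])
```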
Eddy Approach:
Eddy is a general framework for adaptive query processing. It was developed in
the context of the Telegraph project with the goal of running queries on large
volumes of online data with unpredictable input rates and fluctuations in the
running environment.
Figure: a query execution plan with Eddy.
Query Translation and Execution
Query translation and execution is performed by the wrappers using the component
DBMSs. A wrapper encapsulates the details of one or more component databases,
each supported by the same DBMS (or file system). It also exports to the mediator
the component DBMS capabilities and cost functions in a common interface. One
of the major practical uses of wrappers has been to allow an SQL-based DBMS to
access non-SQL databases.
Wrapper interfaces
The main function of a wrapper is conversion between the common interface and
the DBMS-dependent interface.
First, the wrapper must translate the input query into the component DBMS's
own interface; second, it must translate the results into the common interface
format so that they can be returned to the mediator for integration.
In addition, the wrapper can execute operations that are not supported by the
component DBMS.
Consider the relation EMP(ENO, ENAME, CITY) stored in a very simple component
database, on server ComponentDB, built with Unix text files. Each EMP tuple
can then be stored as a line in a file, e.g., with the attributes separated
by ":". In SQL/MED, the definition of the local schema for this relation,
together with the mapping to a Unix file, can be declared as a foreign
relation with the following statement:
CREATE FOREIGN TABLE EMP (
ENO INTEGER, ENAME VARCHAR(30), CITY VARCHAR(20))
SERVER ComponentDB
OPTIONS (Filename '/usr/EngDB/emp.txt', Delimiter ':')
Then, the mediator can send SQL statements to the wrapper supporting access to
this relation. For instance, the query:
SELECT ENAME FROM EMP
can be translated by the wrapper into the following Unix shell command, which
extracts the relevant attribute:
cut -d: -f2 /usr/EngDB/emp.txt
Additional processing, e.g., for type conversion, can then be done using
programming code.
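The same translation can be sketched in Python instead of shell (the file contents below are made up; the field layout follows EMP(ENO, ENAME, CITY) with ":" as delimiter):

```python
import io

def project_ename(emp_file):
    """Equivalent of `cut -d: -f2`: extract ENAME from each EMP line."""
    return [line.rstrip("\n").split(":")[1] for line in emp_file]

# Simulate the Unix text file with an in-memory stream.
data = io.StringIO("1:Smith:Paris\n2:Jones:London\n")
print(project_ename(data))  # ['Smith', 'Jones']
```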
1. Wrappers are mostly used for read-only queries, which makes query
translation and wrapper construction relatively easy.
2. Wrapper construction typically relies on CASE tools with reusable
components to generate most of the wrapper code.
Problem:
The main problem of updating through a wrapper is to guarantee
component database consistency by rejecting all updates that violate
integrity constraints, whether they are explicit or implicit.
Solution:
A software engineering solution to this problem uses a CASE tool with reverse
engineering techniques to identify, within the application code, the implicit
integrity constraints, which are then translated into validation code in the
wrappers.
Conclusion
Query processing in multidatabase systems is significantly more complex than
in tightly integrated and homogeneous distributed DBMSs.