Database Integration
Bottom-Up Design Methodology
• Bottom-up design is the process by which information from participating
databases is (physically or logically) integrated to form a single cohesive
multidatabase.
• There are two alternative approaches. In some cases, the global conceptual (or
mediated) schema is defined first, in which case bottom-up design involves
mapping the local conceptual schemas (LCSs) to this schema.
• This is the case in data warehouses, but the practice is not restricted to these, and
other data integration methodologies may follow the same strategy. In other cases,
the global conceptual schema (GCS) is defined as an integration of parts of the
LCSs. In this case, bottom-up design involves both the generation of the GCS and
the mapping of individual LCSs to this GCS.
Database Integration Process
• The schema generation process consists of the following steps:
1. Schema matching to determine the syntactic and semantic
correspondences among the translated LCS elements or between
individual LCS elements and the pre-defined GCS elements.
2. Integration of the common schema elements into a global conceptual
(mediated) schema if one has not yet been defined.
3. Schema mapping that determines how to map the elements of each
LCS to the elements of the GCS.
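The three steps above can be sketched end to end on toy data. This is a minimal illustration, not a real integration system: the relation and attribute names (EMP, WORKER, EMPLOYEE, and so on) are hypothetical, and the correspondence table that a real matcher would compute is supplied by hand.

```python
# Two translated LCSs, represented as {relation: [attributes]} dicts.
lcs1 = {"EMP": ["eno", "ename", "salary"]}
lcs2 = {"WORKER": ["wnumber", "wname", "pay"]}

# Step 1: schema matching -- here a hand-supplied correspondence table.
matches = {("EMP", "eno"): ("WORKER", "wnumber"),
           ("EMP", "ename"): ("WORKER", "wname"),
           ("EMP", "salary"): ("WORKER", "pay")}

# Step 2: integration -- merge matched elements into one GCS relation.
gcs = {"EMPLOYEE": ["eno", "ename", "salary"]}

# Step 3: schema mapping -- record how every LCS element maps to the GCS.
mapping = {("EMP", a): ("EMPLOYEE", a) for a in lcs1["EMP"]}
mapping.update({matches[k]: ("EMPLOYEE", k[1]) for k in matches})

print(mapping[("WORKER", "pay")])   # -> ('EMPLOYEE', 'salary')
```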
Schema Matching
• Schema matching determines which concepts of one schema match
those of another. If the GCS has already been defined, then one of
these schemas is typically the GCS, and the task is to match each LCS
to the GCS. Otherwise, matching is done on two LCSs. The matches
determined in this phase are then used in schema mapping to
produce a set of directed mappings, which, when applied to the source
schema, map its concepts to the target schema.
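A crude matcher can be sketched using nothing but name similarity. This is an assumption-laden toy: real matchers also exploit data types, instance data, and structural information, while the sketch below relies only on the standard library's difflib string similarity, with made-up attribute names and an arbitrary threshold.

```python
from difflib import SequenceMatcher

def match_schemas(source_attrs, target_attrs, threshold=0.6):
    """Return (source, target, score) triples whose name similarity
    meets the threshold; everything below it is left unmatched."""
    found = []
    for s in source_attrs:
        for t in target_attrs:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                found.append((s, t, round(score, 2)))
    return found

lcs = ["EmpNumber", "EmpName", "Salary"]          # hypothetical LCS attributes
gcs = ["employee_number", "employee_name", "pay"]  # hypothetical GCS attributes
print(match_schemas(lcs, gcs))
```

Note that "Salary" and "pay" are a genuine semantic match that pure name similarity misses, which is exactly the subjectivity problem listed above.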
Schema Matching issues
Aside from schema heterogeneity, other issues that complicate the matching process are
the following:
• Insufficient schema and instance information
• Unavailability of schema documentation
• Subjectivity of matching
Schema Integration
Once schema matching is done, the correspondences between the various LCSs
have been identified. The next step is to create the GCS, and this is referred to as
schema integration. As indicated earlier, this step is only necessary if a GCS has
not already been defined and matching was performed on individual LCSs. If the
GSC was defined up-front, then the matching step would determine correspondences
between it and each of the LCSs and there would be no need for the integration step.
If the GCS is created as a result of the integration of LCSs based on correspondences
identified during schema matching, then, as part of integration, it is important to
identify the correspondences between the GCS and the LCSs.
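One simple integration policy can be sketched as follows: union the attributes of two matched LCS relations under canonical names. The relation names and the correspondence table are illustrative assumptions; the table is what schema matching would have produced.

```python
def integrate(lcs_a, lcs_b, correspondences):
    """Merge two attribute lists into one GCS relation.
    correspondences maps lcs_b attribute names to their lcs_a equivalents."""
    gcs = list(lcs_a)
    for attr in lcs_b:
        canonical = correspondences.get(attr, attr)
        if canonical not in gcs:          # unmatched attributes are added as-is
            gcs.append(canonical)
    return gcs

emp = ["eno", "ename", "title"]           # hypothetical LCS 1
worker = ["wnumber", "wname", "dept"]     # hypothetical LCS 2
corr = {"wnumber": "eno", "wname": "ename"}
print(integrate(emp, worker, corr))       # -> ['eno', 'ename', 'title', 'dept']
```

Because the GCS is built this way, the correspondences between it and each LCS fall out of the construction, as the text notes.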
Schema Mapping
Once a GCS (or mediated schema) is defined, it is necessary to identify how the
data from each of the local databases (source) can be mapped to the GCS (target) while
preserving semantic consistency (as defined by both the source and the target).
Although schema matching has identified the correspondences between the LCSs
and the GCS, it may not have identified explicitly how to obtain the global database
from the local ones. This is what schema mapping is about.
In the case of data warehouses, schema mappings are used to explicitly extract data
from the sources, and translate them to the data warehouse schema for populating it.
In the case of data integration systems, these mappings are used in the query processing
phase by both the query processor and the wrappers.
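The distinction between a match and a mapping can be made concrete: a mapping actually transforms source data into the GCS vocabulary. The sketch below reduces a mapping to attribute renaming over dictionaries; real systems express mappings as queries, and all names here are hypothetical.

```python
def apply_mapping(tuples, mapping):
    """Rewrite each source tuple (a dict) into GCS terms,
    dropping attributes that the mapping does not export."""
    return [{mapping[k]: v for k, v in t.items() if k in mapping}
            for t in tuples]

source_rows = [{"wnumber": 7, "wname": "Ada", "dept": "R&D"}]
mapping = {"wnumber": "eno", "wname": "ename"}   # dept is not exported to the GCS
print(apply_mapping(source_rows, mapping))
# -> [{'eno': 7, 'ename': 'Ada'}]
```

In a warehouse, such a transformation runs once at load time; in a data integration system, it runs at query time inside the wrappers.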
Data Cleaning
Errors in source databases are inevitable, requiring cleaning in order to correctly answer user queries.
Data cleaning is a problem that arises in both data warehouses and data integration systems, but in different
contexts.
In data warehouses where data are actually extracted from local operational databases and materialized as a
global database, cleaning is performed as the global database is created.
In the case of data integration systems, data cleaning is a process that needs to be performed during query
processing when data are returned from the source databases.
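A toy cleaning pass of the kind run while populating a warehouse might normalize formats, drop duplicates, and reject rows that fail sanity checks. The column names and rules below are illustrative assumptions, not a standard cleaning procedure.

```python
def clean(rows):
    """Normalize names, reject negative salaries, drop exact duplicates."""
    seen, out = set(), []
    for r in rows:
        name = r["name"].strip().title()      # normalize whitespace and casing
        if r["salary"] < 0:                   # sanity check: reject bad values
            continue
        key = (name, r["salary"])
        if key in seen:                       # duplicate elimination
            continue
        seen.add(key)
        out.append({"name": name, "salary": r["salary"]})
    return out

dirty = [{"name": "  alice ", "salary": 100},
         {"name": "Alice", "salary": 100},    # duplicate after normalization
         {"name": "bob", "salary": -5}]       # fails the sanity check
print(clean(dirty))  # -> [{'name': 'Alice', 'salary': 100}]
```

In a data integration system, the same checks would instead run on tuples as they stream back from the sources during query processing.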
View Management
One of the main advantages of the relational model is that it provides full logical data independence.
External schemas enable user groups to have their particular view of the database. In a relational system, a
view is a virtual relation, defined as the result of a query on base relations (or real relations), but not
materialized like a base relation, which is stored in the database. A view is a dynamic window in the sense
that it reflects all updates to the database.
An external schema can be defined as a set of views and/or base relations. Besides their use in external
schemas, views are useful for ensuring data security in a simple way.
By selecting a subset of the database, views hide some data. If users may only access the database through
views, they cannot see or manipulate the hidden data, which are therefore secure.
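Both properties above, a view as a dynamic window and as a security mechanism, can be demonstrated with the standard library's sqlite3 module. The table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (eno INT, ename TEXT, salary INT)")
conn.execute("INSERT INTO emp VALUES (1, 'Ada', 100), (2, 'Bob', 90)")

# The view hides the salary column; it is a stored query, not a copy of the data.
conn.execute("CREATE VIEW emp_public AS SELECT eno, ename FROM emp")

# Dynamic window: an update to the base relation is visible through the view.
conn.execute("INSERT INTO emp VALUES (3, 'Eve', 120)")
print(conn.execute("SELECT * FROM emp_public").fetchall())
# -> [(1, 'Ada'), (2, 'Bob'), (3, 'Eve')]
```

A user granted access only to emp_public can never see or manipulate salaries, which is the security use the text describes.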
Data Security
Data security is an important function of a database system that protects data against unauthorized access.
Data security includes two aspects:
• Data protection
• Access control
Data protection
Data protection is required to prevent unauthorized users from understanding the physical content of data.
This function is typically provided by file systems in the context of centralized and distributed operating
systems.
The main data protection approach is data encryption, which is useful both for information stored on disk
and for information exchanged on a network. Encrypted (encoded) data can be decrypted (decoded) only by
authorized users who “know” the code.
The two main schemes are the Data Encryption Standard [NBS, 1977] and the public-key encryption
schemes.
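The core idea, that only holders of the key can recover the plaintext, can be shown with a deliberately toy symmetric cipher. This repeating-key XOR is for illustration only; it is neither DES nor a public-key scheme, and real systems use vetted algorithms such as AES.

```python
from itertools import cycle

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """XOR with a repeating key; applying it twice restores the input.
    NOT secure -- a classroom stand-in for a real cipher."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"secret"
ciphertext = xor_crypt(b"account balance: 1000", key)
assert ciphertext != b"account balance: 1000"   # stored form is unreadable
print(xor_crypt(ciphertext, key))               # -> b'account balance: 1000'
```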
Access control
Access control must guarantee that only authorized users perform operations they are allowed to perform on
the database.
Many different users may have access to a large collection of data under the control of a single centralized
or distributed system.
The centralized or distributed DBMS must thus be able to restrict the access of a subset of the database to a
subset of the users.
Access control has long been provided by operating systems, and more recently, by distributed operating
systems as services of the file system.
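The restriction described above is often modeled as an authorization matrix mapping (user, relation) pairs to permitted operations. The sketch below is a minimal version of such a check; the users, relations, and rights are illustrative.

```python
# Hypothetical authorization matrix: (user, relation) -> allowed operations.
grants = {("alice", "emp"): {"select", "update"},
          ("bob", "emp"): {"select"}}

def authorized(user, relation, op):
    """True iff the user holds the right to perform op on the relation."""
    return op in grants.get((user, relation), set())

print(authorized("bob", "emp", "select"))   # -> True
print(authorized("bob", "emp", "update"))   # -> False
```

In a distributed DBMS the same check must be enforced consistently at every site that stores a fragment of the relation.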
Semantic Integrity Control
Another important and difficult problem for a database system is how to guarantee database consistency. A
database state is said to be consistent if the database satisfies a set of constraints, called semantic integrity
constraints. Maintaining a consistent database requires various mechanisms such as concurrency control,
reliability, protection, and semantic integrity control, which are provided as part of transaction management.
Semantic integrity control ensures database consistency by rejecting update transactions that lead to
inconsistent database states, or by activating specific actions on the database state, which compensate for the
effects of the update transactions.
Note that the updated database must satisfy the set of integrity constraints.
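The rejection behavior described above can be sketched directly: an update is applied to a candidate state, the constraints are checked, and the update is rejected if the result would be inconsistent. The constraint and data below are made-up examples.

```python
def check_constraints(db):
    # Hypothetical semantic integrity constraint: salaries lie in [0, 500].
    return all(0 <= emp["salary"] <= 500 for emp in db.values())

def update_salary(db, eno, new_salary):
    """Apply the update to a candidate state; commit only if consistent."""
    candidate = dict(db)
    candidate[eno] = {**db[eno], "salary": new_salary}
    if not check_constraints(candidate):
        raise ValueError("update rejected: constraint violated")
    return candidate                      # the committed, consistent state

db = {1: {"name": "Ada", "salary": 100}}
db = update_salary(db, 1, 200)            # consistent, so it is accepted
try:
    update_salary(db, 1, 9000)            # would violate the constraint
except ValueError as e:
    print(e)                              # -> update rejected: constraint violated
```

The alternative mentioned in the text, compensating actions, would instead modify the candidate state until the constraints hold rather than rejecting the transaction outright.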
THANK YOU