Schema Matching
The terms schema matching and mapping are often used interchangeably for a database process. For this
article, we differentiate the two as follows: schema matching is the process of identifying that two objects
are semantically related (the scope of this article), while mapping refers to the transformations between the
objects. For example, given the two schemas DB1.Student (Name, SSN, Level, Major, Marks) and
DB2.Grad-Student (Name, ID, Major, Grades), possible matches would be DB1.Student ≈ DB2.Grad-Student,
DB1.SSN = DB2.ID, etc., and a possible transformation or mapping would be DB1.Marks to DB2.Grades
(100-90 A; 90-80 B; etc.).
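To make the distinction concrete, the following Python sketch records the matches from the example as plain data and expresses the Marks-to-Grades mapping as a function. The names MATCHES and marks_to_grade are hypothetical, and the treatment of the boundary value 90 and of marks below 80 is an assumption, since the example leaves both unspecified.

```python
from typing import Optional

# A match only states that two schema elements are semantically related.
MATCHES = [
    ("DB1.Student", "DB2.Grad-Student"),
    ("DB1.SSN", "DB2.ID"),
]


def marks_to_grade(marks: int) -> Optional[str]:
    """Mapping (transformation) from DB1.Marks to DB2.Grades.

    Assumption: 90 counts as an A. The ranges below 80 are elided in the
    example ("etc."), so this sketch returns None for them.
    """
    if 90 <= marks <= 100:
        return "A"
    if 80 <= marks < 90:
        return "B"
    return None


print(marks_to_grade(95))  # A
print(marks_to_grade(85))  # B
```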
Automating these two processes has been one of the fundamental tasks of data integration. In general, it is
not possible to determine the different correspondences between two schemas fully automatically,
primarily because the semantics of the two schemas differ and are often not made explicit or documented.
Impediments
Common challenges to automating matching and mapping have previously been classified by Kim and
Seo,[1] especially for relational DB schemas, and by Sheth and Kashyap,[2] who give a fairly comprehensive
list of heterogeneities that is not limited to the relational model and that distinguishes schematic from
semantic differences. Most of these heterogeneities exist because schemas use different representations or
definitions to represent the same information (schema conflicts), or because different expressions, units,
and precision result in conflicting representations of the same data (data conflicts).[1] Research in schema
matching seeks to provide automated support for the process of finding semantic matches between two
schemas. This process is made harder by heterogeneities at the following levels:[3]
Syntactic heterogeneity – differences in the language used for representing the elements
Structural heterogeneity – differences in the types and structures of the elements
Model / Representational heterogeneity – differences in the underlying models (database, ontologies) or their representations (key-value pairs, relational, document, XML, JSON, triples, graph, RDF, OWL)
Semantic heterogeneity – where the same real-world entity is represented using different terms, or vice versa
Schema matching
[4][5][6][7][8]
Methodology
Batini, Lenzerini, and Navathe discuss a generic methodology for the task of schema integration and the
activities involved.[5] According to the authors, one can view the integration.
Approaches
Approaches to schema integration can be broadly classified as ones that exploit either just schema
information or schema and instance level information.[4][5]
Schema-level matchers only consider schema information, not instance data. The available information
includes the usual properties of schema elements, such as name, description, data type, relationship types
(part-of, is-a, etc.), constraints, and schema structure. Working at the element (atomic elements like
attributes of objects) or structure level (matching combinations of elements that appear together in a
structure), these properties are used to identify matching elements in two schemas. Language-based or
linguistic matchers use names and text (i.e., words or sentences) to find semantically similar schema
elements. Constraint-based matchers exploit constraints often contained in schemas. Such constraints are
used to define data types and value ranges, uniqueness, optionality, relationship types and cardinalities, etc.
Constraints in two input schemas are matched to determine the similarity of the schema elements.
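A minimal sketch of an element-level, schema-only matcher is given below. It combines a linguistic criterion (string similarity of element names, computed with Python's difflib) with a simple constraint criterion (equal data types). The function names, the 0.8 threshold, and the data types assigned to the example attributes are assumptions made for illustration and do not come from any particular matching tool.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Linguistic criterion: similarity of two element names in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def schema_level_matches(s1: dict, s2: dict, threshold: float = 0.8) -> list:
    """Return (element1, element2, score) for pairs with similar names
    and identical data types (a very simple constraint check)."""
    matches = []
    for n1, t1 in s1.items():
        for n2, t2 in s2.items():
            score = name_similarity(n1, n2)
            if score >= threshold and t1 == t2:
                matches.append((n1, n2, score))
    # Highest-scoring candidates first.
    return sorted(matches, key=lambda m: -m[2])


# Hypothetical data types for the example schemas.
student = {"Name": "string", "SSN": "string", "Level": "string",
           "Major": "string", "Marks": "integer"}
grad_student = {"Name": "string", "ID": "string",
                "Major": "string", "Grades": "string"}

for elem1, elem2, score in schema_level_matches(student, grad_student):
    print(f"DB1.{elem1} ~ DB2.{elem2} ({score:.2f})")
```

With these inputs the sketch finds only Name and Major. It misses SSN and ID because their names share no characters, which illustrates why name-based matching alone is often insufficient.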
Instance-level matchers use instance-level data to gather important insight into the contents and meaning
of the schema elements. These are typically used in addition to schema-level matchers in order to boost the
confidence in match results, more so when the information available at the schema level is insufficient.
Matchers at this level use linguistic and constraint-based characterization of instances. For example, using
linguistic techniques, it might be possible to look at the Dept, DeptName and EmpName instances to
conclude that DeptName is a better match candidate for Dept than EmpName. Constraints such as "zip codes
must be 5 digits long" or the format of phone numbers may allow matching of such types of instance data.[9]
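The constraint-based characterization of instances can be sketched as follows. Each column is labeled by the first pattern that most of its values satisfy, and columns with the same label become match candidates. The regular expressions (a 5-digit zip code and a US-style phone number), the 0.9 threshold, and all column names and sample values are assumptions made for illustration.

```python
import re

# Hypothetical value patterns; real systems would use richer profiles.
PATTERNS = {
    "zipcode": re.compile(r"\d{5}"),
    "phone": re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}"),
}


def characterize(values: list, min_fraction: float = 0.9):
    """Return the label of the first pattern that at least min_fraction
    of the values fully match, or None if no pattern qualifies."""
    if not values:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= min_fraction:
            return label
    return None


def instance_level_matches(cols1: dict, cols2: dict) -> list:
    """Pair columns whose instance data receive the same label."""
    matches = []
    for n1, v1 in cols1.items():
        kind = characterize(v1)
        if kind is None:
            continue
        for n2, v2 in cols2.items():
            if characterize(v2) == kind:
                matches.append((n1, n2, kind))
    return matches


db1 = {"Zip": ["45435", "10001", "94305"],
       "Tel": ["937-555-0100", "(212) 555-0199"]}
db2 = {"PostalCode": ["30332", "02139"],
       "Contact": ["650-555-0123"],
       "Name": ["Ann", "Bob"]}

print(instance_level_matches(db1, db2))
```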
Hybrid matchers directly combine several matching approaches to determine match candidates based on
multiple criteria or information sources. Most of these techniques also employ additional information such
as dictionaries, thesauri, and user-provided match or mismatch information.[10]
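One simple way to combine criteria is a weighted sum of individual matcher scores, sketched below. The thesaurus entries, the weights, and the function names are hypothetical; published hybrid matchers use a variety of combination strategies, and this sketch does not reproduce any specific one.

```python
from difflib import SequenceMatcher

# Hypothetical thesaurus of element names treated as synonyms.
SYNONYMS = {frozenset({"ssn", "id"}), frozenset({"marks", "grades"})}


def name_score(a: str, b: str) -> float:
    """Name similarity, using the thesaurus before string comparison."""
    a, b = a.lower(), b.lower()
    if frozenset({a, b}) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def type_score(t1: str, t2: str) -> float:
    """Constraint criterion: 1.0 for identical data types, else 0.0."""
    return 1.0 if t1 == t2 else 0.0


def hybrid_score(e1: tuple, e2: tuple,
                 w_name: float = 0.7, w_type: float = 0.3) -> float:
    """Weighted combination of both criteria for (name, type) elements."""
    return (w_name * name_score(e1[0], e2[0])
            + w_type * type_score(e1[1], e2[1]))


print(hybrid_score(("SSN", "string"), ("ID", "string")))
print(hybrid_score(("Marks", "integer"), ("Grades", "string")))
```

Here the thesaurus lets SSN and ID receive the maximum score even though their names are dissimilar, while the type mismatch lowers the score for Marks and Grades.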
Reusing matching information. Another initiative has been to reuse previous matching information as
auxiliary information for future matching tasks. The motivation for this work is that structures or
substructures often repeat, for example in schemas in the e-commerce domain. Such reuse of previous
matches, however, needs to be a careful choice. It is possible that such reuse makes sense only for some
part of a new schema or only in some domains. For example, Salary and Income may be considered
identical in a payroll application but not in a tax reporting application. There are several open-ended
challenges in such reuse that deserve further work.
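The domain dependence of reuse can be illustrated with a small sketch in which previously confirmed matches are stored per domain and consulted only for that domain. The store layout and the names PREVIOUS_MATCHES and reusable_match are hypothetical.

```python
# Hypothetical store of previously confirmed matches, keyed by domain.
PREVIOUS_MATCHES = {
    "payroll": {frozenset({"Salary", "Income"})},
    "tax_reporting": set(),
}


def reusable_match(a: str, b: str, domain: str) -> bool:
    """Reuse an earlier match only if it was confirmed in the same domain."""
    return frozenset({a, b}) in PREVIOUS_MATCHES.get(domain, set())


print(reusable_match("Salary", "Income", "payroll"))        # True
print(reusable_match("Salary", "Income", "tax_reporting"))  # False
```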
Sample prototypes. Typically, implementations of such matching techniques can be classified as
either rule-based or learner-based systems. The complementary nature of these approaches has
prompted a number of applications that combine techniques depending on the nature of the
domain or application under consideration.[4][5]
Identified relationships
The relationship types between objects that are identified at the end of a matching process are typically
those with set semantics, such as overlap, disjointness, exclusion, equivalence, or subsumption. The logical
encodings of these relationships define their meaning. Among others, an early attempt to use description
logics for schema integration and for identifying such relationships was presented by Savasere et al.[11]
Several state-of-the-art matching tools today[4][7] and those benchmarked in the Ontology Alignment
Evaluation Initiative[12] are capable of identifying many such simple (1:1 / 1:n / n:1 element-level) and
complex (n:1 / n:m element- or structure-level) matches between objects.
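The set semantics of these relationships can be illustrated by comparing the instance sets (extensions) of two objects, as in the sketch below. Deriving the relationship from sample instances is an assumption made for illustration, since matchers may instead reason over schema-level definitions. The function name and sample data are hypothetical, and exclusion is not distinguished from disjointness here.

```python
def set_relationship(a: set, b: set) -> str:
    """Classify the set-theoretic relationship between two extensions."""
    if a == b:
        return "equivalence"
    if a < b:
        return "subsumption (second subsumes first)"
    if a > b:
        return "subsumption (first subsumes second)"
    if a & b:
        return "overlap"
    return "disjointness"


students = {"ann", "bob", "carol", "dave"}
grad_students = {"carol", "dave"}
employees = {"dave", "erin"}
courses = {"db101"}

print(set_relationship(grad_students, students))
print(set_relationship(students, employees))
print(set_relationship(students, courses))
```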
Evaluation of quality
The quality of schema matching is commonly measured by precision and recall. Precision is the fraction of
matched pairs that are correct, while recall is the fraction of the actual (correct) pairs that have been
matched.
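Both measures can be computed by comparing the set of pairs returned by a matcher with a reference set of correct pairs. The following is a minimal sketch; the function name and the sample pairs are hypothetical, and returning 0.0 for an empty set is a convention chosen here.

```python
def precision_recall(found: set, reference: set) -> tuple:
    """Return (precision, recall) of found pairs against reference pairs."""
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical matcher output and reference alignment.
found = {("Name", "Name"), ("Major", "Major"), ("Level", "ID")}
reference = {("Name", "Name"), ("Major", "Major"),
             ("SSN", "ID"), ("Marks", "Grades")}

p, r = precision_recall(found, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Two of the three returned pairs are correct (precision 2/3), and two of the four reference pairs were found (recall 1/2).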
See also
Data integration
Dataspaces
Federated database system
Minimal mappings
Ontology alignment
Schema crosswalk
References
1. Kim, W. & Seo, J. (Dec 1991). "Classifying Schematic and Data Heterogeneity in
Multidatabase Systems". Computer 24, 12.
2. Sheth, A. P. & Kashyap, V. (1993). "So Far (Schematically) yet So Near (Semantically)". In
Proceedings of the IFIP WG 2.6 Database Semantics Conference on interoperable
Database Systems.
3. Sheth, A. P. (1999). "Changing Focus on Interoperability in Information Systems: From
System, Syntax, Structure to Semantics". In Interoperating Geographic Information Systems.
M. F. Goodchild, M. J. Egenhofer, R. Fegeas, and C. A. Kottman (eds.), Kluwer Academic
Publishers.
4. Rahm, E. & Bernstein, P. (2001). "A survey of approaches to automatic schema matching".
The VLDB Journal 10, 4.
5. Batini, C., Lenzerini, M., and Navathe, S. B. (1986). "A comparative analysis of
methodologies for database schema integration". ACM Comput. Surv. 18, 4.
6. Doan, A. & Halevy, A. (2005). "Semantic-integration research in the database community". AI
Mag. 26, 1.
7. Kalfoglou, Y. & Schorlemmer, M. (2003). "Ontology mapping: the state of the art". Knowl.
Eng. Rev. 18, 1.
8. Choi, N., Song, I., and Han, H. (2006). "A survey on ontology mapping". SIGMOD Rec. 35, 3.
9. Pereira Nunes, Bernardo; Mera, Alexander; Casanova, Marco Antonio; P. Paes Leme, Luis
Andre; Dietze, Stefan (2013). "Complex Matching of RDF Datatype Properties" (http://www.r
epo.uni-hannover.de/handle/123456789/1358). Database and Expert Systems Applications
- 24th International Conference. Lecture Notes in Computer Science. 8055: 195–208.
doi:10.1007/978-3-642-40285-2_18 (https://doi.org/10.1007%2F978-3-642-40285-2_18).
ISBN 978-3-642-40284-5.
10. Hamdaqa, Mohammad; Tahvildari, Ladan (2014). "Prison Break: A Generic Schema
Matching Solution to the Cloud Vendor Lock-in Problem". IEEE 8th International Symposium
on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems: 37–46.
doi:10.1109/MESOCA.2014.13 (https://doi.org/10.1109%2FMESOCA.2014.13). ISBN 978-
1-4799-6152-8. S2CID 14499875 (https://api.semanticscholar.org/CorpusID:14499875).
11. Ashoka Savasere; Amit P. Sheth; Sunit K. Gala; Shamkant B. Navathe; H. Markus (1993).
"On Applying Classification to Schema Integration". RIDE-IMS.
12. Ontology Alignment Evaluation Initiative::2006 (http://oaei.ontologymatching.org/2006/)
External links
Early work in schema matching (http://knoesis.wright.edu/library/download/S04-Dagstuhl-Early-Work.pdf)