In the name of God
Data Integration Management System
for Distributed Database
using Virtual Database Technology and MapReduce
By: Saeid Masoumi
For Advanced Database course
University of Tabriz
Fall semester, 18 December 2012
Table of Contents
Introduction
Solution
Associated Studies
Database Virtualization
Homogeneous
Heterogeneous
Common Schema Generation
Query Conversion
System Framework
The MapReduce Process
Conclusion
References
Introduction
Massive amounts of data are available
Knowledge and trends can be located and accessed
using data mining techniques
Valuable for supporting analysis and decision-making in businesses
Ubiquitous databases: distributed and placed anywhere
Analysts spend much of their time on database selection and data collection
instead of concentrating on analysis and rule extraction
Solution
Develop a database virtualization technique
Data analysts
Users who apply data mining methods
Data collection from Internet databases
Data cleansing work
Examine the advantages of XML Schema
Describe ubiquitous database schemas in a unified fashion using XML Schema
Manage distributed databases of the same type and of different types
Provide location transparency
Develop a common schema generation method
Propose a virtual database query language
Associated Studies
Metadata (creation)
Independent of the database model (e.g. RDB or OODB)
Heavy workload to create at the initial stage
No standardized definition and manipulation language
UML and E-R model (design technique)
Independent of the data model (e.g. table or object)
XML Schema (mapping)
Used to exchange information
Definitions and manipulations are well standardized
Can be used flexibly with various Internet databases
Database Virtualization
Databases of many kinds
Data model differences:
Table type of relational databases (RDB)
XML-representation type of XML databases (XMLDB)
Object-oriented databases (OODB)
Vendor differences:
For RDBs, for example: MySQL, PostgreSQL, SQL Server, …
Undesired result of the different data models: extra time and labor
Virtualization of databases built on different models
Reduces workload and cost
Facilitates management
Provides transparency for users (database structure or location)
Database Virtualization (continued)
Users are able to use databases of all kinds
For virtualization of ubiquitous databases: describe the schema information of the real databases
Virtualization of Homogeneous Distributed Databases
Method of building a virtual database management system for RDBs provided by different vendors
XML conversion program
XML Export/Import
RDB schema conversion into XML
RDB data conversion into XML
Homogeneous: RDB schema conversion into XML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
  <rdb Name="mysql">
    <database Name="questionnaire">
      <table_structure Name="member">
        <field Field="samplenum" Type="integer" Null="FALSE" Default="" />
        <field Field="answerday" Type="text" Null="FALSE" Default="" />
        …
      </table_structure>
      <schema>
        <constraint Type="PRIMARY KEY" Table="member" Column="samplenum" />
        …
        <constraint Type="FOREIGN KEY" Table="questionnaire" Column="samplenum" Retable="member" ReColumn="samplenum" />
        …
      </schema>
    </database>
  </rdb>
</root>
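The export step that produces this layout can be sketched in Python. This is only an illustration under stated assumptions: the function name `schema_to_xml` and the shape of its `tables` and `constraints` arguments are hypothetical, and a real exporter would read the same information from the vendor's catalog rather than from literals.

```python
import xml.etree.ElementTree as ET

def schema_to_xml(rdb_name, db_name, tables, constraints):
    """Build the <root>/<rdb>/<database> layout used on the slide.

    tables:      {table_name: [(field, sql_type, nullable, default), ...]}
    constraints: list of dicts; each dict becomes the attributes of
                 one <constraint> element.
    (Both argument shapes are assumptions made for this sketch.)
    """
    root = ET.Element("root")
    rdb = ET.SubElement(root, "rdb", {"Name": rdb_name})
    database = ET.SubElement(rdb, "database", {"Name": db_name})
    for table_name, fields in tables.items():
        structure = ET.SubElement(database, "table_structure", {"Name": table_name})
        for field, sql_type, nullable, default in fields:
            ET.SubElement(structure, "field", {
                "Field": field,
                "Type": sql_type,
                "Null": "TRUE" if nullable else "FALSE",
                "Default": default,
            })
    schema = ET.SubElement(database, "schema")
    for constraint in constraints:
        ET.SubElement(schema, "constraint", dict(constraint))
    return root

# Sample values taken from the slide's "member" table.
tables = {"member": [("samplenum", "integer", False, ""),
                     ("answerday", "text", False, "")]}
constraints = [{"Type": "PRIMARY KEY", "Table": "member", "Column": "samplenum"}]
schema_root = schema_to_xml("mysql", "questionnaire", tables, constraints)
print(ET.tostring(schema_root, encoding="unicode"))
```

Building the tree with ElementTree instead of string concatenation keeps attribute quoting and escaping correct by construction.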
Homogeneous: RDB data conversion into XML
<root>
  <dataset dbname="mysql">
    <data tblname="member" samplenum="10001" answerday="'2007/7/6'" answertime="' 13:07:19:499'" />
    <data tblname="member" samplenum="10002" answerday="'2007/7/6'" answertime="' 13:10:33:507'" />
    …
  </dataset>
</root>
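A minimal Python sketch of the data export follows. It assumes rows arrive as dictionaries keyed by column name; the function name `rows_to_xml` is hypothetical, and the rule that text values are wrapped in single quotes while numbers stay bare is inferred from the sample above rather than stated in the slides.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(db_name, table_name, rows):
    """Emit one <data> element per row, with columns as attributes."""
    root = ET.Element("root")
    dataset = ET.SubElement(root, "dataset", {"dbname": db_name})
    for row in rows:
        attrs = {"tblname": table_name}
        for column, value in row.items():
            # Assumption: text values keep SQL-style single quotes,
            # numeric values are written bare, as in the slide's sample.
            attrs[column] = f"'{value}'" if isinstance(value, str) else str(value)
        ET.SubElement(dataset, "data", attrs)
    return root

rows = [
    {"samplenum": 10001, "answerday": "2007/7/6"},
    {"samplenum": 10002, "answerday": "2007/7/6"},
]
data_root = rows_to_xml("mysql", "member", rows)
print(ET.tostring(data_root, encoding="unicode"))
```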
Virtualization of Heterogeneous Distributed Databases
Virtualization of databases built on different data models
Describe the schema information of each model using a single common schema (XML Schema)
Heterogeneous: SQL and associated XML
Any XMLDB is already described in the XML format (without conversion)
An OODB is fundamentally an extension of RDB

Item                     SQL                                          XML
Table definition         CREATE TABLE table name…                     <xsd:element name="table name"…
Column definition        CREATE TABLE… column name...                 <xsd:element name="column name"…
Data type definition     CREATE TABLE… data type..                    <xsd:element… type="data type"…
Default values           CREATE TABLE… column name DEFAULT value      <xsd:element… default="value"…
Primary key constraint   PRIMARY KEY                                  <xsd:key…
Unique constraint        UNIQUE                                       <xsd:unique …
Foreign key constraint   FOREIGN KEY                                  <xsd:keyref … refer=…
NOT NULL                 NOT NULL                                     <xsd:… nillable="false"...
Method                   CREATE METHOD
Inheritance              CREATE TABLE… UNDER upper level table name   <xsd:complexType …
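The column-level rows of this mapping can be sketched in Python as follows. The helper names and the `TYPE_MAP` entries are assumptions made for illustration; the slides give only the general correspondence (name, type, default, nillable), not a concrete type table.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD)

# Illustrative SQL-to-XSD type map (an assumption, not from the slides).
TYPE_MAP = {"int": "xsd:integer", "varchar": "xsd:string"}

def column_to_element(name, sql_type, not_null=False, default=None):
    """Map one column definition to an xsd:element."""
    attrs = {"name": name, "type": TYPE_MAP[sql_type]}
    if not_null:
        attrs["nillable"] = "false"      # NOT NULL -> nillable="false"
    if default is not None:
        attrs["default"] = str(default)  # DEFAULT value -> default="value"
    return ET.Element(f"{{{XSD}}}element", attrs)

def table_to_element(table_name, columns):
    """Map a table definition to an xsd:element wrapping a sequence of columns."""
    table = ET.Element(f"{{{XSD}}}element", {"name": table_name})
    complex_type = ET.SubElement(table, f"{{{XSD}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XSD}}}sequence")
    for column in columns:
        sequence.append(column_to_element(**column))
    return table

employee = table_to_element("EmployeeTable", [
    {"name": "EmployeeID", "sql_type": "int", "not_null": True},
    {"name": "Name", "sql_type": "varchar", "not_null": True},
    {"name": "Salary", "sql_type": "int", "default": 0},
])
print(ET.tostring(employee, encoding="unicode"))
```

Key, unique, and foreign-key constraints are left out here because they map to separate `xsd:key`, `xsd:unique`, and `xsd:keyref` declarations rather than to attributes of the column element.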
Common Schema Generation
Example of SQL/CREATE
CREATE DATABASE EmployeeDB;

CREATE TABLE EmployeeTable (
  EmployeeID int PRIMARY KEY,
  Name varchar(50) NOT NULL,
  Salary int CHECK(0 < Salary),
  AffiliationID int REFERENCES AffiliationTable(AffiliationID)
    ON UPDATE CASCADE
    ON DELETE CASCADE
);

CREATE TABLE AffiliationTable (
  AffiliationID int PRIMARY KEY,
  Affiliation varchar(5) UNIQUE
);

Example of the common schema
<!-- … -->
<xs:element name="AffiliationTable" type="AffiliationTableType"/>
<xs:element name="EmployeeTable" type="EmployeeTableType"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="AffiliationTableType">
<xs:annotation>
<xs:appinfo>
<r:index index-key="AffiliationID" primary="yes"/>
<r:index index-key="Affiliation" unique="yes"/>
</xs:appinfo>
<!-- … -->
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="1" name="AffiliationID" r:nullable="false" …
<xs:element minOccurs="0" name="Affiliation" r:sqltype="varchar" …
<!-- … -->
<xs:complexType name="EmployeeTableType">
<xs:annotation>
<xs:appinfo>
<r:index index-key="EmployeeID" primary="yes"/>
<r:check check-column="Salary" rule="0&lt;Salary"/>
<r:fkey fkey-column="AffiliationID" refcolumn="AffiliationID" ref-table="Affiliation" …
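How table constraints might be turned into the `r:` annotations of the common schema can be sketched in Python. The helper `annotation` is hypothetical; the attribute names and values are the ones visible in the slide's example. One detail worth noting is that a CHECK rule such as `0<Salary` has to be escaped when it is written as an XML attribute value.

```python
from xml.sax.saxutils import quoteattr

def annotation(tag, **attrs):
    """Render one <r:...> annotation element as text.

    Keyword names use underscores and are written with hyphens,
    e.g. index_key -> index-key. quoteattr() escapes characters such
    as '<' that are not allowed inside an attribute value.
    """
    parts = [f"{key.replace('_', '-')}={quoteattr(value)}"
             for key, value in attrs.items()]
    return f"<r:{tag} {' '.join(parts)}/>"

employee_annotations = [
    annotation("index", index_key="EmployeeID", primary="yes"),
    annotation("check", check_column="Salary", rule="0<Salary"),
]
for line in employee_annotations:
    print(line)
```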
Query Conversion
Develop a query language to access the virtual databases
Designed as an extension of the existing XQuery
Queries are converted from this XQuery-based language into SQL, XQuery, or … depending on the target database
Sample of the virtual database query
for $employee in common-schema()/DB1/sample_db1/employee
for $manager in common-schema()/DB2/sample_db2/affiliation.xml
where $employee/EmployeeID = $manager/Affiliation/
return $employee/Name
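The conversion of one such path into SQL can be sketched in Python. This is a simplified illustration, not the system's conversion module: it assumes that a `common-schema()` path has the form server/database/table, which is inferred from the sample query, and the function name `path_to_sql` is hypothetical.

```python
def path_to_sql(path, column):
    """Resolve a common-schema() path and build a SELECT for one column.

    Assumed path layout: common-schema()/<server>/<database>/<table>.
    Returns the server label (where to send the query) and the SQL text.
    """
    prefix = "common-schema()/"
    if not path.startswith(prefix):
        raise ValueError(f"not a common-schema path: {path}")
    parts = path[len(prefix):].split("/")
    if len(parts) != 3:
        raise ValueError(f"expected server/database/table, got: {path}")
    server, database, table = parts
    return server, f"SELECT {column} FROM {database}.{table}"

target, sql = path_to_sql("common-schema()/DB1/sample_db1/employee", "Name")
print(target, sql)
```

A path that ends in an XML document, such as the second `for` clause of the sample, would stay in XQuery rather than being rewritten to SQL.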
System Framework
[Figure: system framework diagram. Labels recoverable from the figure, from top to bottom: Clients; Parse Engines with Session Managers; Info Center; Virtual Database Manager with the Common Schema, the Query Conversion Module and the Schema Conversion Module; Execute Engines with Execute Managers and Workers; Archive; Resource Delegates for the RDB, XML and OODB stores, each with Data and Schema.]
The MapReduce Process
Conclusion
Developed the common schema conversion program
for converting RDB schemas into XML Schema
Showed that schema constraints
(such as PRIMARY KEY, CHECK, NOT NULL, ON UPDATE CASCADE, ON DELETE CASCADE, UNIQUE) can be converted
Future research
Develop the program that integrates XML DB schemas into the common schema
Implement the common data manipulation API
(for example, as an extension of the existing XQuery modules) to access the virtual databases
Incorporate location transparency functions into this API
References
1) S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Series in Data Management Systems (1999).
2) S. Amer-Yahia, F. Du, and J. Freire, "A comprehensive solution to the XML-to-relational mapping problem," Proc. 6th Annual ACM International Workshop on Web Information and Data Management, pp. 31-38 (2004).
3) I. Varlamis and M. Vazirgiannis, "Bridging XML-schema and relational databases: a system for generating and manipulating relational databases using valid XML documents," Proc. 2001 ACM Symposium on Document Engineering, pp. 105-114 (2001).
4) P. Bohannon, J. Freire, P. Roy, and J. Simeon, "From XML Schema to Relations: A Cost-Based Approach to XML Storage," Proc. 18th International Conference on Data Engineering, pp. 64-75 (2002).
5) G. Kappel, E. Kapsammer, and W. Retschitzegger, "Integrating XML and Relational Database Systems," World Wide Web: Internet and Web Information Systems, 7, pp. 343-384 (2004).
6) R. Li, Z. Lu, W. Xiao, B. Li, and W. Wu, "Schema Mapping for Interoperability in XML-Based Multidatabase Systems," Proc. 14th International Workshop on Database and Expert Systems Applications (2003).
7) Y. Wada, Y. Watanabe, J. Sawamoto, and T. Katoh, "Database Virtualization Technology in Ubiquitous Computing," Proc. 6th Innovations in Information Technology (Innovations'09), pp. 170-174 (2009-12).
8) Y. Wada, Y. Watanabe, K. Syoubu, J. Sawamoto, and T. Katoh, "Virtualization Technology for Ubiquitous Databases," Proc. 4th Workshop on Engineering Complex Distributed Systems (ECDS 2010) (2010-02) (to appear).
Thank you for your attention
Any questions?