Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
One of the challenges facing Semantic Web for Health Care and Life Sciences is that of converting relational databases into Semantic Web format. The issues and the steps involved in such a conversion have not been well documented. To this end, we have created this document to describe the process of converting SenseLab databases into OWL. SenseLab is a collection of relational (Oracle) databases for neuroscientific research. The conversion of these databases into RDF/OWL format is an important step towards realizing the benefits of Semantic Web in integrative neuroscience research. This document describes how we represented some of the SenseLab databases in Resource Description Framework (RDF) and Web Ontology Language (OWL), and discusses the advantages and disadvantages of these representations. Our OWL representation is based on the reuse and extension of existing standard OWL ontologies developed in the biomedical ontology communities. The purpose of this document is to share our implementation experience with the community.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is an Interest Group Note of the Semantic Web in Health Care and Life Sciences Interest Group (HCLS), part of the W3C Semantic Web Activity. It is considered stable and expected to be published as an Interest Group Note in May 2008. This document serves as a companion to A Prototype Knowledge Base for the Life Sciences and describes the process for integrating new data into an existing biological database. We hope other groups who plan to convert their databases into RDF/OWL format will benefit from this document.
The document was produced by the Semantic Web in Health Care and Life Sciences Interest Group (HCLS), part of the W3C Semantic Web Activity (see charter). Comments may be sent to the publicly archived public-semweb-lifesci@w3.org mailing list. Feedback is encouraged, as is participation in the recently re-chartered HCLSIG. A list of changes since the last publication is available.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the disclosure obligations of the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information to public-semweb-lifesci@w3.org [public archive] in accordance with section 6 of the W3C Patent Policy.
The SenseLab databases can be accessed through a web interface at the SenseLab web site [SENSELAB-WEB]. SenseLab is divided into a number of specialised databases, of which we have converted three to Semantic Web formats. These databases are NeuronDB, BrainPharm and ModelDB. All databases are based on compartmental models of neurons. NeuronDB contains descriptions of anatomic locations, cell architecture and physiologic parameters of neuronal cells. The pilot BrainPharm database is intended to support research on drugs for the treatment of neurological disorders. It enhances the descriptions in a portion of NeuronDB with descriptions of the actions of pathological and pharmacological agents. ModelDB is a large repository of computational neuroscience models and simulations. The mathematical models in ModelDB are annotated with references to NeuronDB. Taken together, these databases allow the researcher to query information and to run simulations pertaining to the function of neurons in healthy and disease states. All databases contain extensive literature references and excerpts from texts that have been used to curate the database entries.
The databases are based on the "entity-attribute-value with classes and relationships" (EAV/CR) schema [EAV-CR]. The data can also be downloaded from the SenseLab Semantic Web development portal [SENSELAB-SW] as a database dump in Microsoft Access format and as text.
Our motivation was to make the SenseLab databases available in RDF(S) [RDFS] (without OWL) and in OWL DL [OWL Overview]. The two versions were developed in parallel in order to compare the difference between the conversion processes and the outcomes. We wanted to explore the issues in mapping relational databases to RDF/OWL structure. In addition, we wanted to explore the possibility of automatic translation from EAV/CR to RDF.
We developed a converter application in Java that queried the SenseLab database and wrote RDF/XML files. The conversion was fully automatic for the RDF version, but required some manual editing for the OWL version.
These conversions were too closely tied to the original database structure, which resulted in inconsistent OWL ontologies. Some shortcomings of the first conversion to OWL were:
http://neuroweb.med.yale.edu/senselab/neuron_ontology.owl#GABA
).
This grave mistake would not have been noticed without the use of OWL reasoning.
¹ Declaring classes as disjoint in OWL asserts that they have no members in common. A reasoner can use these assertions to flag inconsistent models.
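The effect of a disjointness assertion can be illustrated with a small Python sketch. It is a simplified stand-in for what an OWL reasoner does, not the reasoner used in this project: it only inspects directly asserted class memberships and performs no subclass inference. All class and individual names are hypothetical.

```python
from itertools import combinations


def find_disjointness_violations(types_by_individual, disjoint_pairs):
    """Return (individual, class_a, class_b) for every individual that is
    asserted to be a member of two classes declared as disjoint."""
    disjoint = {frozenset(pair) for pair in disjoint_pairs}
    violations = []
    for individual, classes in sorted(types_by_individual.items()):
        # Check every unordered pair of asserted classes.
        for class_a, class_b in combinations(sorted(classes), 2):
            if frozenset((class_a, class_b)) in disjoint:
                violations.append((individual, class_a, class_b))
    return violations


# Hypothetical assertions: ClassA and ClassB are declared disjoint.
disjoint_pairs = [("ClassA", "ClassB")]
types_by_individual = {
    "individual_1": {"ClassA", "ClassB"},  # inconsistent
    "individual_2": {"ClassA"},            # consistent
}

violations = find_disjointness_violations(types_by_individual, disjoint_pairs)
print(violations)
```

A full reasoner would also derive memberships through subclass and property axioms before checking disjointness, which is why such errors are hard to spot by inspection alone.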
The revised OWL conversion was based on the first OWL conversions described above. The design of the revised SenseLab ontologies follows the "ontological realism" approach [SMITH-2004]. This means that the revised ontologies are focused on direct representations of physical objects and processes (e.g., neuronal cells, ionic currents), and not on their abstractions (e.g., concepts or database entries).
Manually correcting the logical inconsistencies in the first version of the OWL ontology; making use of foundational ontologies (BFO, Relation Ontology) where possible; mapping the ontology to other neuroscience ontologies.
An ontology containing basic class hierarchies and relations was manually created, based on the structure of existing SenseLab databases. This basic ontology could not be created from the database structure in an automated process because this would not have resulted in a logically consistent ontology. This ontology was edited by a domain expert, based on inspection and manual editing with Protege 3.2 [PROTEGE] and Topbraid Composer [TOPBRAID]. The ontologies were built upon established foundational ontologies in order to maximize the interoperability with other existing and forthcoming biomedical Semantic Web resources. The foundational ontologies used were:
Based on this manually created basic ontology, the data from the SenseLab databases were then automatically converted to OWL using programs written in Java and Python. The automated export scripts extended the manually created basic ontology through the creation of subclasses, OWL property restrictions and individuals. The resulting ontologies show no clearly distinguishable divide between the 'schema' and 'data'.
The OWL export of NeuronDB was based on a transformation from the EAV/CR model of the SenseLab database to files in RDF/XML syntax by a Java program. The export from ModelDB and BrainPharm was based on a simple flat text file export of the databases. The text file exports were converted to RDF/XML files with a Python script.
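A conversion from a flat text export to RDF/XML might look roughly like the following Python sketch. It is not the script used in the project: the tab-separated column names (`id`, `name`, `parent`) and the `http://example.org/modeldb.owl#` namespace are assumptions made purely for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
OWL_NS = "http://www.w3.org/2002/07/owl#"
BASE = "http://example.org/modeldb.owl#"  # hypothetical namespace


def rows_to_rdfxml(text):
    """Turn tab-separated rows (id, name, parent) into OWL classes in RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rdfs", RDFS_NS)
    ET.register_namespace("owl", OWL_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # One owl:Class per row, identified by the record id.
        cls = ET.SubElement(
            root, f"{{{OWL_NS}}}Class", {f"{{{RDF_NS}}}about": BASE + row["id"]}
        )
        label = ET.SubElement(cls, f"{{{RDFS_NS}}}label")
        label.text = row["name"]
        # Attach the new class below a class of the manually created ontology.
        ET.SubElement(
            cls,
            f"{{{RDFS_NS}}}subClassOf",
            {f"{{{RDF_NS}}}resource": BASE + row["parent"]},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical export with a single record.
sample = "id\tname\tparent\nModel_1\tExample model\tModel\n"
rdfxml = rows_to_rdfxml(sample)
print(rdfxml)
```

Generating subclasses of manually defined classes in this way mirrors the division of labour described above: the basic hierarchy is curated by hand, and the export only extends it.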
For mappings to external bioinformatics databases that did not yet offer stable URIs for reference on the Semantic Web, we used the URI scheme for database record identifiers established by Science Commons [SC-URI]. URIs for database records could simply be generated by concatenating the record identifier to a predefined namespace. For example, the Entrez Gene record with ID '3579' was identified by the URI http://purl.org/commons/record/ncbi_gene/3579, the Uniprot record 'P46663' was identified by http://purl.org/commons/record/uniprotkb/P46663 and the Pubmed record with ID '11160518' was identified by http://purl.org/commons/record/pmid/11160518.
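The concatenation scheme can be sketched in a few lines of Python. The function name `record_uri` is hypothetical; the namespace and the database keys (`ncbi_gene`, `uniprotkb`, `pmid`) are taken from the examples above.

```python
from urllib.parse import quote

RECORD_BASE = "http://purl.org/commons/record/"


def record_uri(database, record_id):
    """Build a record URI by appending the database key and record ID
    to the predefined namespace."""
    # quote() leaves plain alphanumeric IDs unchanged and escapes
    # characters that would not be safe in a URI path segment.
    return f"{RECORD_BASE}{quote(database, safe='')}/{quote(str(record_id), safe='')}"


print(record_uri("ncbi_gene", 3579))
print(record_uri("uniprotkb", "P46663"))
print(record_uri("pmid", 11160518))
```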
The database entries were connected to the ontological representations of real-world entities through relations such as has_nucleotide_sequence_described_by. For example, the gene of the Dopamine Receptor D1 (DRD1) is defined through a reference to NCBI record 1812, which contains a description of the sequence of this specific gene:
<http://purl.org/ycmi/senselab/neuron_ontology.owl#DRD1_Gene>
    owl:equivalentClass _:property_restriction1 .
_:property_restriction1 owl:onProperty
    senselab:has_nucleotide_sequence_described_by .
_:property_restriction1 owl:hasValue
    <http://purl.org/commons/record/ncbi_gene/1812> .
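An export script could emit such restrictions programmatically. The following Python sketch produces the pattern in N-Triples form; it is an illustration, not the project's export code. Two points are assumptions: the property URI under `http://example.org/senselab#` is a placeholder, because the expansion of the `senselab:` prefix is not given here, and the explicit `rdf:type owl:Restriction` triple is added because OWL tools generally expect it.

```python
OWL = "http://www.w3.org/2002/07/owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def has_value_restriction(class_uri, property_uri, value_uri, bnode="r1"):
    """Return N-Triples lines that define class_uri as equivalent to an
    owl:hasValue restriction on property_uri."""
    node = f"_:{bnode}"
    return [
        f"<{class_uri}> <{OWL}equivalentClass> {node} .",
        f"{node} <{RDF_TYPE}> <{OWL}Restriction> .",  # added typing triple
        f"{node} <{OWL}onProperty> <{property_uri}> .",
        f"{node} <{OWL}hasValue> <{value_uri}> .",
    ]


triples = has_value_restriction(
    "http://purl.org/ycmi/senselab/neuron_ontology.owl#DRD1_Gene",
    # Placeholder URI: the real namespace of the senselab: prefix is assumed.
    "http://example.org/senselab#has_nucleotide_sequence_described_by",
    "http://purl.org/commons/record/ncbi_gene/1812",
)
print("\n".join(triples))
```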
Mappings were made to the following ontologies:
The mappings were made with the following cross-ontology relations: owl:equivalentClass, rdfs:subClassOf and the "has part" relation from the OBO relation ontology.
Figure 1: Import hierarchy of OWL ontologies. Ontologies printed in bold have been created by the SenseLab team; other ontologies have been created by other groups. The arrows point from the imported ontology to the importing ontology, e.g., the NeuronDB Ontology imports the Relation Ontology. Import statements are transitive, e.g., the ModelDB Ontology imports both the NeuronDB ontology and the Relation ontology.
Figure 2: Examples of relations ('mappings') spanning between classes from the NeuronDB ontology (in the middle) and classes from external ontologies.
Terse rdfs:labels were replaced by more descriptive ones that can be understood without knowledge of the context. For example, the rdfs:label "Ded" was changed to "Distal part of equivalent dendrite (Ded)". Note that, in this case, the original label was also preserved (in brackets), because it might still be useful for people who do know the context.
The ontology development was moved to a Subversion (SVN) system on a central web server. Before this move, which covers most of the development period, the ontologies were edited on the client side and periodically uploaded via FTP. This led to problems when more than one person was working on the ontologies at the same time. It was also impossible for users to access previous versions of an ontology, since only the most recent version was available on the web site.
The namespaces / ontology locations were changed to PURL-based URIs. For example, the URI http://neuroweb.med.yale.edu/senselab/neuron_ontology.owl#Dopamine was changed to http://purl.org/ycmi/senselab/neuron_ontology.owl#Dopamine ('ycmi' stands for 'Yale Center for Medical Informatics'). PURL-based URIs are easier to maintain when server configurations change or (in the worst case) the original server is unavailable and the ontologies need to be served from a different location. The increased stability of PURLs encourages the re-use of entities in ontologies developed by other groups, which is a key factor in the creation of a coherent Semantic Web.
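The namespace change amounts to a prefix rewrite, which can be sketched in Python as follows. The function name `to_purl` is hypothetical; the two base URIs are the ones from the example above.

```python
OLD_BASE = "http://neuroweb.med.yale.edu/senselab/"
NEW_BASE = "http://purl.org/ycmi/senselab/"


def to_purl(uri):
    """Rewrite a URI from the old server-based namespace to the PURL-based
    namespace; URIs outside the old namespace are returned unchanged."""
    if uri.startswith(OLD_BASE):
        return NEW_BASE + uri[len(OLD_BASE):]
    return uri


print(to_purl("http://neuroweb.med.yale.edu/senselab/neuron_ontology.owl#Dopamine"))
```

Matching on the prefix only, rather than replacing the string anywhere in the URI, avoids accidentally rewriting URIs from other namespaces that happen to contain the old host name.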
A SPARQL endpoint for the SenseLab ontologies was set up using the open source version of the OpenLink Virtuoso server [VIRTUOSO]. A SPARQL endpoint is a service that allows clients to query an RDF store with the SPARQL query language through simple HTTP GET requests. The ontologies were loaded into the triple store of the server to make them accessible to SPARQL queries. Each ontology file was put into a separate labeled graph, and the label of each graph was identical to the URL of the ontology file. For example, the ontology located at http://purl.org/ycmi/senselab/neuron_ontology.owl was loaded into a graph labeled http://purl.org/ycmi/senselab/neuron_ontology.owl. Loading each ontology into a separate graph makes it possible to restrict SPARQL queries to certain graphs and, hence, to certain ontologies. This makes queries more selective and allows them to be executed with better performance.
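As a minimal sketch of a graph-restricted query, the following Python code builds the URL for an HTTP GET request according to the SPARQL protocol, which passes the query text in the `query` parameter. The helper name `build_sparql_get_url` and the example query are illustrative assumptions; the code only constructs the URL and sends no request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sparql_get_url(endpoint, graph_uri, limit=10):
    """Build a GET URL for a SPARQL query restricted to one labeled graph."""
    query = (
        "SELECT ?s ?p ?o\n"
        f"FROM <{graph_uri}>\n"       # restrict the query to a single graph
        "WHERE { ?s ?p ?o }\n"
        f"LIMIT {int(limit)}"
    )
    # urlencode percent-encodes the query text for use in the URL.
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_get_url(
    "http://hcls.deri.ie/sparql",
    "http://purl.org/ycmi/senselab/neuron_ontology.owl",
)
print(url)

# Decoding the URL again shows the query a server would receive.
decoded_query = parse_qs(urlsplit(url).query)["query"][0]
print(decoded_query)
```

The resulting URL could be passed to any HTTP client. Because the graph label equals the ontology URL, selecting an ontology is a matter of changing the `graph_uri` argument.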
The final products of the project are accessible at http://neuroweb.med.yale.edu/senselab/. An SVN repository can be accessed through a web interface at http://neuroweb.med.yale.edu/svn/trunk/ontology/senselab/. The SPARQL endpoint can be accessed at http://hcls.deri.ie/sparql. The SenseLab OWL ontologies are mentioned as an example of the application of OBO ontologies in the article The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration [OBO-ARTICLE].
We experienced the following advantages from using RDF/OWL:
We experienced the following problems while using RDF/OWL:
The SenseLab ontologies will be further integrated with other neuroscientific and biomedical ontologies. User-friendly applications will be developed to query a multitude of interrelated ontologies in a scientifically meaningful way. To this end, we have implemented a prototype Web application called 'Entrez Neuron' that allows the user to query data across multiple sources based on keywords. The user can browse the query results and retrieve more detailed information about neurons based on a 'brain-anatomy/neuron' view. A paper describing this application was presented at the WWW/HCLS2008 workshop. Currently, we are expanding this application to include more views and features.
Based on our experiences we can make the following suggestions for other projects that have similar goals:
We experienced clear benefits from using Semantic Web technologies for the integration of SenseLab data with other neuroscientific data in a consistent, flexible and decentralised manner. The main obstacle in our work was the lack of mature and scalable open source software for editing the complex, expressive ontologies we were dealing with. Since the quality of these tools is rapidly improving, this may cease to be an issue in the near future. The detailed analysis of the experiences with the SenseLab ontologies and other complex biomedical ontologies may help drive the improvement of current ontology editors.
Thanks to Huajun Chen and Ernest Lim who contributed to the SenseLab conversion. Thanks to Gordon Shepherd, Perry Miller, Luis Marenco and Tom Morse for their input, suggestions and support. Thanks to Susie Stephens for her detailed suggestions for improving this document. Thanks to Alan Ruttenberg for his technical suggestions during the conversion process. Thanks to Eric Prud'hommeaux for technical advice and assistance on the creation of this document.