RDF Integration in HTML 5 Web Pages
Gijs Davis
g.davis@student.utwente.nl
ABSTRACT
The adoption of the Semantic Web would benefit greatly if
small chunks of Semantic Web data could be integrated in
normal web pages. As of yet there is no standardized way of
doing this. We will construct a standard for doing this, based on
already existing implementations.
Keywords
OWL, RDF, Semantic Web, N3, N-Triples, XML, RDF/XML,
Microformats, eRDF, HTML 5, XHTML 5, W3C, WHATWG
1. INTRODUCTION
It has been more than ten years since Tim Berners-Lee first
published his vision of the Semantic Web [1]. The idea behind
the Semantic Web is to create web pages that contain the same
data as is available now on the normal web, but written in a way
that is easily understandable by computers instead of
human beings. While a lot has happened in the development of
the Semantic Web, it has yet to catch on.
To facilitate the use of the Semantic Web, several data formats
have emerged that allow small units of data to be embedded in
existing HTML web pages. In early 2007 the World
Wide Web Consortium (W3C) and the Web Hypertext
Application Technology Working Group (WHATWG) began
working on a new revision of HTML: HTML version 5 and XHTML
version 5 [2]. While considerable steps are being taken in
modernizing HTML, nothing remotely resembling Semantic
Web integration has been considered in the latest drafts [3].
2. APPROACH
In this paper, we will try to find a way to modify
(X)HTML 5 to improve RDF support. This will be
accomplished through the following steps:
1.
2.
3.
<http://example.org/RDFpaper.pdf> <http://purl.org/dc/elements/1.1/creator> _:author .
_:author <http://xmlns.com/foaf/0.1/gender> "male" .
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xml:base="http://example.org/"
>
<rdf:Description rdf:about="RDFpaper.pdf">
<dc:creator>
<rdf:Description>
<foaf:gender>male</foaf:gender>
</rdf:Description>
</dc:creator>
</rdf:Description>
</rdf:RDF>
4. RDF IN HTML
The embedding of chunks of semantic web data in HTML has
been done before. The most popular of such formats are
microformats [19-22] and embedded RDF (eRDF) [23].
However, these formats are just a small step in the direction of
the full Semantic Web format RDF [5]. To give the Semantic
Web a proper boost, a standardized way to embed RDF in
HTML is needed.
4.1 Microformats
Microformats started out as an initiative to develop a way for
people to include personal information on their homepages
[24, 25]. A method was developed that allows vCard data to be
embedded in the source of websites. vCard is a popular format for
storing business cards, introduced back in 1998 [26]. This new
embedded format is called hCard. Users can link these
business cards, creating a network of friends and relations which is
independent of commercial profile sites such as Facebook or
MySpace. This network is called the XHTML Friends Network
(XFN) [27].
After the relative success of the hCard format, the microformat
community has developed more formats. The most well-known
is hCalendar, based on iCalendar, for embedding calendar data
in websites [28].
The existing microformats can roughly be divided into two
categories: those that describe something that is on a webpage
(like hCard and hCalendar) and those that describe the webpage
itself, such as the rel-tag microformat. This format allows
tags to be set for a webpage, much like many bloggers do to
categorize their posts. Although all microformats within these
categories share a common philosophy, each microformat
requires its own specification. Microformats have no
connection to RDF and the Semantic Web. It may only seem
that way because XFN is often compared to FOAF, one of the
most widely used RDF vocabularies, which describes people in a
similar way [29, 30].
4.2 eRDF
eRDF (embedded RDF) is a format in which the basic features
of RDF can be embedded in HTML files. It was created in
2005, partly inspired by microformats. Even though eRDF is
based on proper RDF, and microformats are not, their syntaxes
are quite alike. They both use the meta and link elements in the
HTML head and use class attributes to insert predicates [23,
32].
To add eRDF data to an HTML file, one first has to announce the
presence of eRDF data. This is done by adding the eRDF profile
to the HTML head tag:
<head profile="http://purl.org/NET/erdf/profile">
Now the first major difference from proper RDF comes to light:
triples in eRDF can only take one of the four following forms:
While a bit confusing, the above statements all conform to the
HTML standards. The HTML link element is used to add
custom links to an HTML file; by using rel or rev these links can
point in either direction. The meta element is used in HTML to add
custom metadata to an HTML file, and therefore basically behaves
like an RDF triple with the HTML page as subject.
Note that there is no rel or rev in the meta element, but this is
not a problem because a literal cannot be a subject in RDF.
In eRDF there are no blank nodes; all nodes that are described
need to have an identifier. These are added by using the HTML
id attribute. It doesn't matter what kind of HTML element the
identifier is in. Like in microformats, the object can be in either
the content of an HTML element, or in specific attributes. If the
object is a literal, the object goes in the content of an
element and the predicates are added in class attributes. These
can only be used with their prefix as declared in the head; there
is no way to use full URIs. It is also possible to override the
value that is displayed in the HTML page, by using the title
attribute.
If the object or the subject is a reference, it goes in the href
attribute and the rel or rev attribute contains the predicate. Here
the object or subject can only be referred to by the full URI.
If any of these elements overlap, they can be combined, as long
as it remains clear which object belongs to which predicate. This
and other discussed features are shown in figure 11 below.
<html>
<head profile="http://purl.org/NET/erdf/profile">
<title>RDF in HTML</title>
<meta name="dc.title" content="RDF in HTML" />
<link rel="schema.dc"
href="http://purl.org/dc/elements/1.1/" />
<link rel="schema.foaf" href="http://xmlns.com/foaf/0.1/" />
<link rel="dc.creator"
href="#gdavis" />
</head>
<body>
<h1>RDF in HTML</h1>
<p id="gdavis">by
<a class="foaf-name"
rel="foaf-homepage"
href="http://example.org">
<span class="foaf-firstName">Gijs</span>
<span class="foaf-surname">Davis</span>
</a>
</p>
</body>
</html>
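The document-level rules described above (schema links declare prefixes, meta yields a literal about the page, link with rel or rev yields a reference in one of two directions) can be sketched in a few lines. This is only an illustration under stated assumptions, not a conforming eRDF parser: the class name ERDFHeadParser, the page address in PAGE and the choice to write literals in double quotes are all hypothetical, and the body of the page is ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ERDFHeadParser(HTMLParser):
    """Collects document-level eRDF statements from <link> and <meta>."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.prefixes = {}  # prefix -> namespace URI, from rel="schema.*"
        self.pending = []   # (kind, "prefix.name", value), resolved later

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("href"):
            for kind in ("rel", "rev"):
                name = a.get(kind) or ""
                if name.startswith("schema."):
                    self.prefixes[name[len("schema."):]] = a["href"]
                elif "." in name:
                    self.pending.append((kind, name, a["href"]))
        elif tag == "meta" and "." in (a.get("name") or ""):
            self.pending.append(("meta", a["name"], a.get("content") or ""))

    def triples(self):
        """Resolve prefixes and return (subject, predicate, object) tuples."""
        out = []
        for kind, name, value in self.pending:
            prefix, local = name.split(".", 1)
            if prefix not in self.prefixes:
                continue  # undeclared prefix, so not treated as eRDF
            predicate = self.prefixes[prefix] + local
            if kind == "meta":
                # The page is the subject and the content is a literal.
                out.append((self.page_uri, predicate, '"' + value + '"'))
            elif kind == "rel":
                out.append((self.page_uri, predicate,
                            urljoin(self.page_uri, value)))
            else:
                # rev reverses the direction of the statement.
                out.append((urljoin(self.page_uri, value), predicate,
                            self.page_uri))
        return out


PAGE = "http://example.org/paper.html"  # hypothetical page address
HEAD = """
<head profile="http://purl.org/NET/erdf/profile">
<meta name="dc.title" content="RDF in HTML" />
<link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
<link rel="dc.creator" href="#gdavis" />
</head>
"""

parser = ERDFHeadParser(PAGE)
parser.feed(HEAD)
head_triples = parser.triples()
for triple in head_triples:
    print(triple)
```

Prefixes are resolved only after the whole head has been read, because a schema link may appear after the meta element that uses it, as it does in the figure.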
5. (X)HTML 5
Since the dawn of the web, webpages have been written in HTML
(HyperText Markup Language). The first version of HTML,
created by Tim Berners-Lee himself, dates back to 1991. There
have been steady developments, first by the Internet
Engineering Task Force (IETF) [33], later by the W3C. But the
latest version (HTML 4) dates back to 1998. In 2001 a few
amendments were made to HTML 4 and XHTML 1.1 was
released. Both introduced only minor changes. XHTML 2 has been in
development since 2002, but development has almost come to a
stop. There is much criticism of XHTML 2 and it has become
very unpopular even before coming close to a final release [34].
This means that the language in which today's webpages are
written dates back more than 10 years. That is a long time,
especially in computer science terms. In 2004 the Web
Hypertext Application Technology Working Group
(WHATWG) [35] was formed by individuals from three major
browser vendors (Opera, Mozilla and Apple) in response to the
lack of initiative on the part of the W3C. The WHATWG
immediately began working on a new version of HTML. In
2007 the newly formed HTML working group at the W3C
6. REQUIREMENTS
The aim of embedding RDF in webpages is to promote the
use of Semantic Web elements by individuals who have no
experience with the Semantic Web. Normal HTML is used a lot
more than XHTML. Implementing RDF in XML is quite easy
and requires no changes to the HTML spec, thanks to the
extensible nature of XML. The focus therefore must be on
finding an implementation that works for HTML.
6.1 Features
To use the full extent of RDF functionality, it is preferred that as
many as possible of the RDF features described in chapter 3 are
included. This includes shortened statements, prefixes,
base paths and blank nodes. Furthermore, the implementation
should be useful from an HTML point of view, meaning that
information in RDF should be visible in the resulting webpage,
much like in the microformats specifications.
6.1.1 Triples
In RDF the subject is always a reference (a link or URI). This
has the advantage that it is data that doesn't need to be visible
in HTML. The easiest solution is adding a new attribute to
HTML elements in which subjects can be declared. To avoid
changing the HTML 5 specifications too much and to keep in
line with the way current HTML elements are used, this
attribute should be added to existing HTML elements.
The predicate too is always a reference; however, here the
RDF/XML implementation cannot be mimicked. The only way
to add this data in HTML is to add an attribute in which the
predicate can be declared. This implementation, however, raises
a few issues. By only using attributes, there is no way to enforce
that they are used in the proper order. We want to construct
our implementation in such a way that if a webpage is written in
valid HTML, the contained RDF data is valid RDF as well.
In table 1 below all the possible tag nestings are shown. A +
denotes the intended order of the tags, a - denotes a nesting
that results in invalid RDF. This leaves a few special cases:
A.
B.
C.
D.
E.
F.
6.1.2
[Table 1: possible tag nestings. Labels: No RDF tags (root node), No RDF data, Subject, Predicate, Object.]
The only solution for this issue is to add RDF data in new
elements instead of in new attributes. Putting the object in an
always available attribute and putting the predicate in a new
element matches table 1 exactly.
The object can either be a reference or a literal (not a link but a
string of characters). Usually the object doesn't need to be
displayed in a webpage if it is a reference and should be
6.1.3 Shortening
Statement shortening will work just like in RDF/XML. A
subject can contain more than one predicate and a predicate can
contain more than one object. There are also some other tricks
that can be used to make the RDF easier to read and write.
For example, the about attribute can be used on any element, so
if only one pred element is needed, the subject and the predicate
can be combined in that element, as is shown in figure 14 below.
<p>
My name is
<pred about="#me" rel="foaf:name">Gijs</pred>
</p>
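To make the shortening rule concrete, the sketch below extracts triples from markup that uses the pred element and the about attribute. It rests on assumptions that the text above only implies: that an about attribute on an enclosing element sets the subject for every pred element nested inside it, and that the literal object is the text content of the pred element. The names PredParser and extract are hypothetical, and prefixed names such as foaf:name are left unexpanded.

```python
from html.parser import HTMLParser


class PredParser(HTMLParser):
    """Collects (subject, predicate, literal) triples from pred elements."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # one (tag, subject in scope) per open element
        self.current = None      # [subject, predicate, text parts] of open pred
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        inherited = self.open_elements[-1][1] if self.open_elements else None
        # An about attribute replaces the subject inherited from the parent.
        subject = a.get("about") or inherited
        self.open_elements.append((tag, subject))
        if tag == "pred" and a.get("rel") and subject:
            self.current = [subject, a["rel"], []]

    def handle_data(self, data):
        if self.current is not None:
            self.current[2].append(data)

    def handle_endtag(self, tag):
        if tag == "pred" and self.current is not None:
            subject, predicate, parts = self.current
            self.triples.append((subject, predicate, "".join(parts).strip()))
            self.current = None
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                del self.open_elements[i:]
                break


def extract(markup):
    parser = PredParser()
    parser.feed(markup)
    parser.close()
    return parser.triples


# Subject declared on the enclosing element (assumed long form).
NESTED = '<p about="#me">My name is <pred rel="foaf:name">Gijs</pred></p>'
# Subject and predicate combined in one element, as in the figure above.
COMBINED = '<p>My name is <pred about="#me" rel="foaf:name">Gijs</pred></p>'

print(extract(NESTED))
print(extract(COMBINED))
```

Under these assumptions both fragments yield the same single triple, which is what makes the combined form a pure shortening rather than a different statement.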
6.1.4 Prefixes
In XML or XHTML, prefixes would be built in by using the
xmlns attribute. Since this doesn't work in HTML,
something has to be constructed that works with the syntax
that normal (non-XML) HTML uses. The best place for the prefixes
is the webpage header, so that the HTML code
stays clean and as close to HTML as possible. This won't offer
as much flexibility as RDF/XML or N3, where prefixes can be
defined and redefined anywhere in the document, but it should
suffice for the relatively simple RDF in webpages. The link
element is the obvious place to put the prefixes, much like is
now done in eRDF. But the dot-notation eRDF uses to mimic
the colon-notation in XML is rather confusing and requires
extra steps to parse. This can be done more neatly by properly using
the attributes already available on the link element. The prefix
should go in the title attribute, the URL in the href attribute, and
rel should be set to prefix. An example is given in figure 16 below.
When the document is parsed, it is easier to look for all prefixes
if the rel value is a fixed value.
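The benefit of a fixed rel value can be sketched as follows: a parser only has to look for link elements whose rel equals prefix and read title and href from them. The sample markup in HEAD is a reconstruction from the description above, not a copy of the original figure, and the names PrefixCollector and expand are hypothetical.

```python
from html.parser import HTMLParser


class PrefixCollector(HTMLParser):
    """Reads prefix declarations of the form <link rel="prefix" ...>."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}  # prefix (title attribute) -> namespace (href)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "prefix"
                and a.get("title") and a.get("href")):
            self.prefixes[a["title"]] = a["href"]


def expand(name, prefixes):
    """Expand 'prefix:local' to a full URI; return other names unchanged."""
    prefix, separator, local = name.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local
    return name


# Assumed header markup, reconstructed from the description in the text.
HEAD = """
<head>
<link rel="prefix" title="foaf" href="http://xmlns.com/foaf/0.1/">
<link rel="prefix" title="dc" href="http://purl.org/dc/elements/1.1/">
<link rel="stylesheet" href="style.css">
</head>
"""

collector = PrefixCollector()
collector.feed(HEAD)
print(expand("foaf:name", collector.prefixes))
print(expand("#me", collector.prefixes))
```

Compared with the eRDF convention, no string matching on the rel value is needed to tell a prefix declaration from an ordinary link, and the colon notation of the predicate can be split directly.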
6.2 Restrictions
All basic RDF features can be used in the proposed notation,
although some features (prefixes, base paths and blank nodes)
are somewhat restricted. But there are three features implemented in
microformats and eRDF that cannot be used in the proposed
notation.
First of all, microformats and eRDF both have specific shortcuts
for some HTML elements, such as the anchor and img elements.
These shortcuts are meant to prevent duplicate data on a webpage.
Figure 18 demonstrates this feature in both microformats and
eRDF.
<!-- a reference to an image and to a URL in the hCard
microformat -->
<div class="vcard">
<img src="http://example.com/me.jpg" class="photo"/>
<a href="http://example.com" class="url">My Site</a>
</div>
<!-- a reference to an image and a URL in eRDF using foaf -->
<img src="http://example.com/me.jpg" class="foaf-depiction" />
<a href="http://example.com" class="foaf-homepage">My Site</a>
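What such shortcuts mean for a parser can be sketched as a small lookup: for some elements the object is taken from an attribute of the element itself, for others from the title attribute or the text content. The table SHORTCUT_ATTRIBUTE and the function object_of are hypothetical and cover only the two elements shown in the figure; the real microformats and eRDF rules are more extensive.

```python
# Attribute that holds the object for elements with a shortcut
# (assumed subset, limited to the elements shown in the figure).
SHORTCUT_ATTRIBUTE = {"a": "href", "img": "src"}


def object_of(tag, attrs, text=""):
    """Return the object value for one element that carries a predicate."""
    attribute = SHORTCUT_ATTRIBUTE.get(tag)
    if attribute and attrs.get(attribute):
        return attrs[attribute]  # reference taken from the element itself
    if attrs.get("title"):
        return attrs["title"]    # title overrides the displayed text
    return text.strip()          # otherwise the content is the literal


print(object_of("img", {"src": "http://example.com/me.jpg",
                        "class": "foaf-depiction"}))
print(object_of("a", {"href": "http://example.com",
                      "class": "foaf-homepage"}, "My Site"))
print(object_of("span", {"class": "foaf-firstName"}, " Gijs "))
```

Each extra entry in such a table is one more place where a parser has to look for data, which is the cost that comes with avoiding duplicate data in the page.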
6.5 Parsing
RDF in HTML 5 is of no use at all if RDF data cannot be extracted
from HTML 5 pages. This notation therefore must not only be
easy to write, but must also be easy to parse. XML and
XHTML are easily parsed by the many XML parsers that
exist; however, we are dealing with HTML here. Since the data
resides in webpages, most data will be extracted by
web browsers, for example to save contact information from a
website to your contact manager or to add events and dates
published on a website to your agenda.
Browsers keep a model of the webpage in memory, called
the DOM (Document Object Model). The DOM of a webpage
can be seen as a tree, with the html element as its root and each
nested element as a branch or leaf. The DOM can be traversed in
various ways offered by browsers. The most common is
ECMAScript (usually known as JavaScript).
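The tree view of the DOM can be illustrated with a depth-first walk. The sketch below uses Python's xml.dom.minidom on a small well-formed page purely for illustration: a browser would expose the same structure to ECMAScript, and real HTML 5 needs an HTML parser rather than an XML one. The function walk and the sample in PAGE are hypothetical, and this is not the algorithm of Appendix B.

```python
from xml.dom import minidom


def walk(node, depth=0):
    """Yield (depth, tag name) for every element, parents before children."""
    if node.nodeType == node.ELEMENT_NODE:
        yield depth, node.tagName
        depth += 1
    for child in node.childNodes:
        yield from walk(child, depth)


# Small well-formed sample page (assumed for illustration).
PAGE = (
    "<html><head><title>RDF in HTML</title></head>"
    "<body><h1>RDF in HTML</h1>"
    "<p>by <a href='http://example.org'>Gijs</a></p></body></html>"
)

document = minidom.parseString(PAGE)
outline = list(walk(document))
for depth, name in outline:
    print("  " * depth + name)
```

An extraction algorithm can follow the same traversal and carry state such as the current subject down from a parent to its children, which is why a purely DOM-based approach is independent of how the page was serialized.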
In Appendix B an algorithm for extracting RDF data from an
HTML 5 DOM is given. This algorithm may look complex, but
is in reality simpler than the algorithms required for extracting RDF
data from eRDF or extracting data from microformats.
The algorithm for eRDF needs a more intelligent system for
handling prefixes, it needs to take care of document-related
triples in the HTML head, and it has to look for data in different
places in different elements (see Figure 18: Examples of
element specific shortcuts in microformats and in eRDF).
Microformat parsers face the same obstacles, but also have to
deal with the fact that microformats are not based on RDF, so a
special parser has to be written for each specific microformat.
6.6 HTML/XHTML
This new notation was designed to be used with normal HTML
5, but what about XHTML 5? All added functionality (new
elements and attributes) could be used in XHTML the same
way as in HTML. The algorithm in appendix B is purely DOM
based, so it would work exactly the same. However, if one
uses XHTML, it seems a waste not to use the features XML
provides. By using proper XML namespaces, RDF/XML data can
be inserted into XHTML. This data can be easily extracted by
any XML parser.
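As a minimal sketch of that last point, the fragment below locates rdf:RDF islands in an XHTML document with Python's standard xml.etree.ElementTree and reads simple statements from them. The sample document in XHTML is an assumption made for illustration, and only the simplest RDF/XML pattern is handled (an rdf:Description with rdf:about and literal-valued properties); nested descriptions, blank nodes and rdf:resource are ignored.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumed sample: an XHTML page with an RDF/XML island in its head.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <title>RDF in HTML</title>
    <rdf:RDF>
      <rdf:Description rdf:about="http://example.org/RDFpaper.pdf">
        <dc:title>RDF in HTML</dc:title>
      </rdf:Description>
    </rdf:RDF>
  </head>
  <body><p>RDF in HTML</p></body>
</html>"""

root = ET.fromstring(XHTML)
triples = []
for island in root.iter("{%s}RDF" % RDF_NS):
    for description in island.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree names are "{namespace}local"; joining the two
            # parts gives the predicate URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            triples.append((subject, predicate, (prop.text or "").strip()))

for triple in triples:
    print(triple)
```

Because the namespaces separate the RDF elements from the XHTML ones, the parser needs no knowledge of the surrounding page to find the data.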
7. CONCLUSIONS
In this paper we present an extension to HTML 5 which can be
used to include RDF data in webpages. This format is based on
eRDF, microformats and RDF/XML and tries to solve the
biggest issues with those formats. The format is designed to
work with normal HTML (not XHTML). Because some new
HTML elements and attributes are introduced, RDF can be
inserted in a way that is easy to write. To keep the format clear
and easy to understand, some functionality present in
microformats and eRDF has been omitted, but there is nothing
expressible in RDF that can be described in eRDF and can't be
described in the proposed format. Adding RDF data to a
webpage is rather useless if the data can't be extracted. This is
easier to do for our proposed format than for eRDF or
microformats. An algorithm to extract data is given in appendix
B.
8. FURTHER WORK
The work presented in this paper is all theoretical. It needs
testing and the best way to do that is to write a sample parser.
This parser should take an HTML 5 DOM as input and return
the triples contained in that DOM.
In a later phase, a formal proposal for inclusion of the
proposed elements could be written and submitted to the W3C
and WHATWG.
ACKNOWLEDGEMENTS
I would like to thank Maarten Fokkinga for his guidance,
suggestions and valuable feedback during the course of this
research.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
[38]
https://developer.mozilla.org/en/Using_microformats
(visited: 2008-12-10)
J. Reimer. (2007-05-02). Microsoft drops hints about
Internet Explorer 8. [Online]. Available:
http://arstechnica.com/news.ars/post/20070502-microsoft-drops-hints-about-internet-explorer-8.html
(visited: 2008-12-10)