Linked Data
Evolving the Web into a
Global Data Space
Tom Heath
Christian Bizer
SYNTHESIS LECTURES ON
THE SEMANTIC WEB: THEORY AND TECHNOLOGY
James Hendler, Series Editor
Linked Data
Evolving the Web into a Global Data Space
Synthesis Lectures on the
Semantic Web: Theory and
Technology
Editors
James Hendler, Rensselaer Polytechnic Institute
Frank van Harmelen, Vrije Universiteit Amsterdam
Whether you call it the Semantic Web, Linked Data, or Web 3.0, a new generation of Web
technologies is offering major advances in the evolution of the World Wide Web. As the first
generation of this technology transitions out of the laboratory, new research is exploring how the
growing Web of Data will change our world. While topics such as ontology-building and logics remain
vital, new areas such as the use of semantics in Web search, the linking and use of open data on the
Web, and future applications that will be supported by these technologies are becoming important
research areas in their own right. Whether they be scientists, engineers or practitioners, Web users
increasingly need to understand not just the new technologies of the Semantic Web, but also
the principles by which those technologies work, and the best practices for assembling systems that
integrate the different languages, resources, and functionalities that will be important in keeping the
Web the rapidly expanding, and constantly changing, information space that has changed our lives.
Topics to be covered:
Semantic Web Principles, from linked-data to ontology design
Key Semantic Web technologies and algorithms
Semantic Search and language technologies
The Emerging Web of Data and its use in industry, government and university applications
Trust, Social networking and collaboration technologies for the Semantic Web
The economics of Semantic Web application adoption and use
Publishing and Science on the Semantic Web
Semantic Web in health care and life sciences
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in
printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00334ED1V01Y201102WBE001
Lecture #1
Series Editors: James Hendler, Rensselaer Polytechnic Institute
Frank van Harmelen, Vrije Universiteit Amsterdam
First Edition
10 9 8 7 6 5 4 3 2 1
Series ISSN
Synthesis Lectures on the Semantic Web: Theory and Technology
ISSN pending.
Photo credits:
Tom Heath
Talis
Christian Bizer
Freie Universität Berlin
Morgan & Claypool Publishers
ABSTRACT
The World Wide Web has enabled the creation of a global information space comprising linked
documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire
for direct access to raw data not currently available on the Web or bound up in hypertext documents.
Linked Data provides a publishing paradigm in which not only documents, but also data, can be a
first class citizen of the Web, thereby enabling the extension of the Web with a global data space
based on open standards - the Web of Data. In this Synthesis lecture we provide readers with
a detailed technical introduction to Linked Data. We begin by outlining the basic principles of
Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text
is based around two main themes - the publication and consumption of Linked Data. Drawing on a
practical Linked Data scenario, we provide guidance and best practices on: architectural approaches
to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources;
deciding what data to return in a description of a resource on the Web; methods and frameworks for
automated linking of data sets; and testing and debugging approaches for Linked Data deployments.
We give an overview of existing Linked Data applications and then examine the architectures that
are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable
these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as
the basis for application development, research or further study.
KEYWORDS
web technology, databases, linked data, web of data, semantic web, world wide web,
dataspaces, data integration, data management, web engineering, resource description
framework
Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Data Deluge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Rationale for Linked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Structure Enables Sophisticated Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Hyperlinks Connect Distributed Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 From Data Islands to a Global Data Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Introducing Big Lynx Productions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
List of Figures
2.1 URIs are used to identify people and the relationships between them. . . . . . . . . . . 9
3.1 Growth in the number of data sets published on the Web as Linked Data. . . . . . 31
3.2 Linking Open Data cloud as of November 2010. The colors classify data sets
by topical domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.1 The Marbles Linked Data browser displaying data about Tim Berners-Lee.
The colored dots indicate the data sources from which data was merged. . . . . . . . 87
6.2 Sig.ma Linked Data search engine displaying data about Richard Cyganiak. . . . . 88
6.3 Google search results containing structured data in the form of Rich Snippets. . . 89
6.4 Google result answering a query about the birth date of Catherine Zeta-Jones. . 90
6.5 US Global Foreign Aid Mashup combining and visualizing data from
different branches of the US government. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.6 The HTML view of a Talis Aspire List generated from the underlying RDF
representation of the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.7 Architecture of a Linked Data application that implements the crawling pattern. 99
Preface
This book provides a conceptual and technical introduction to the field of Linked Data. It
is intended for anyone who cares about data (using it, managing it, sharing it, interacting with it)
and is passionate about the Web. We think this will include data geeks, managers and owners
of data sets, system implementors and Web developers. We hope that students and teachers of
information management and computer science will find the book a suitable reference point for
courses that explore topics in Web development and data management. Established practitioners of
Linked Data will find in this book a distillation of much of their knowledge and experience, and a
reference work that can bring this to all those who follow in their footsteps.
Chapter 2 introduces the basic principles and terminology of Linked Data. Chapter 3 provides
a 30,000 ft view of the Web of Data that has arisen from the publication of large volumes of Linked
Data on the Web. Chapter 4 discusses the primary design considerations that must be taken into
account when preparing to publish Linked Data, covering topics such as choosing and using URIs,
describing things using RDF, data licensing and waivers, and linking data to external data sets.
Chapter 5 introduces a number of recipes that highlight the wide variety of approaches that can be
adopted to publish Linked Data, while Chapter 6 describes deployed Linked Data applications and
examines their architecture. The book concludes in Chapter 7 with a summary and discussion of the
outlook for Linked Data.
We would like to thank the series editors Jim Hendler and Frank van Harmelen for giving us
the opportunity and the impetus to write this book. Summarizing the state of the art in Linked Data
was a job that needed doing; we are glad they asked us. It has been a long process, throughout which
Mike Morgan of Morgan & Claypool has shown the patience of a saint, for which we are extremely
grateful. Richard Cyganiak wrote a significant portion of the 2007 tutorial How to Publish Linked
Data on the Web, which inspired a number of sections of this book; thank you, Richard. Mike
Bergman, Dan Brickley, Fabio Ciravegna, Ian Dickinson, John Goodwin, Harry Halpin, Frank van
Harmelen, Olaf Hartig, Andreas Harth, Michael Hausenblas, Jim Hendler, Bernadette Hyland,
Toby Inkster, Anja Jentzsch, Libby Miller, Yves Raimond, Matthew Rowe, Daniel Schwabe, Denny
Vrandecic, and David Wood reviewed drafts of the book and provided valuable feedback when we
needed fresh pairs of eyes; they deserve our gratitude. We also thank the European Commission
for supporting the creation of this book by funding the LATC (LOD Around The Clock) project
(Ref. No. 256975). Lastly, we would like to thank the developers of LaTeX and Subversion, without
which this exercise in remote, collaborative authoring would not have been possible.
CHAPTER 1
Introduction
1.1 THE DATA DELUGE
We are surrounded by data: data about the performance of our local schools, the fuel efficiency
of our cars, a multitude of products from different vendors, or the way our taxes are spent. By
helping us make better decisions, this data is playing an increasingly central role in our lives and
driving the emergence of a data economy [47]. Increasing numbers of individuals and organizations
are contributing to this deluge by choosing to share their data with others, including Web-native
companies such as Amazon and Yahoo!, newspapers such as The Guardian and The New York Times,
public bodies such as the UK and US governments, and research initiatives within various scientific
disciplines.
Third parties, in turn, are consuming this data to build new businesses, streamline online
commerce, accelerate scientific progress, and enhance the democratic process. For example:
The online retailer Amazon makes their product data available to third parties via a Web
API.1 In doing so they have created a highly successful ecosystem of affiliates2 who build
micro-businesses based on driving transactions to Amazon sites.
Search engines such as Google and Yahoo! consume structured data from the Web sites of
various online stores, and use this to enhance the search listings of items from these stores.
Users and online retailers benefit through enhanced user experience and higher transaction
rates, while the search engines need expend fewer resources on extracting structured data from
plain HTML pages.
Innovation in disciplines such as Life Sciences requires the world-wide exchange of research
data between scientists, as demonstrated by the progress resulting from cooperative initiatives
such as the Human Genome Project.
The availability of data about the political process, such as members of parliament, voting
records, and transcripts of debates, has enabled the organisation mySociety 3 to create services
such as TheyWorkForYou4 , through which voters can readily assess the performance of elected
representatives.
1 API stands for Application Programming Interface - a mechanism for enabling interaction between different software programs.
2 https://affiliate-program.amazon.co.uk/
3 http://www.mysociety.org/
4 http://www.theyworkforyou.com/
The strength and diversity of the ecosystems that have evolved in these cases demonstrates
a previously unrecognised, and certainly unfulfilled, demand for access to data, and that those
organizations and individuals who choose to share data stand to benefit from the emergence of these
ecosystems. This raises three key questions:
How to enable the discovery of relevant data within the multitude of available data sets?
How to enable applications to integrate data from large numbers of formerly unknown data
sources?
Just as the World Wide Web has revolutionized the way we connect and consume documents,
so can it revolutionize the way we discover, access, integrate and use data. The Web is the ideal
medium to enable these processes, due to its ubiquity, its distributed and scalable nature, and its
mature, well-understood technology stack.
The topic of this book is how a set of principles and technologies, known as Linked Data,
harnesses the ethos and infrastructure of the Web to enable data sharing and reuse on a massive
scale.
To return to the comparison with HTML, the analogous situation would be a search engine
that required a priori knowledge of all Web documents before it could assemble its index. To provide
this a priori knowledge, every Web publisher would need to register each Web page with each search
engine. The ability for anyone to add new documents to the Web at will, and for these documents to
be automatically discovered by search engines and humans with browsers, have historically been key
drivers of the Web's explosive growth. The same principles of linking, and therefore ease of discovery,
can be applied to data on the Web, and Linked Data provides a technical solution to realize such
linkage.
a blog where staff post news items of interest to the television networks and/or freelancers
Information that changes rarely (such as the company overview) is published on the site as
static HTML documents. Frequently changing information (such as listing of productions) is stored
in a relational database and published to the Web site as HTML by a series of PHP scripts developed
for the company. The company blog is based on a blogging platform developed in-house and forms
part of the main Big Lynx site.
In the remainder of this book we will explore how Linked Data can be integrated into the
workflows and technical architectures of Big Lynx, thereby maximising the discoverability of the Big
Lynx data and making it easy for search engines as well as specialized Web sites, such as film and
TV sites, freelancer directories or online job markets, to pick up and integrate Big Lynx data with
data from other companies.
CHAPTER 2
Principles of Linked Data
1. Use URIs as names for things.
2. Use HTTP URIs, so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF,
SPARQL).
4. Include links to other URIs, so that they can discover more things.
The basic idea of Linked Data is to apply the general architecture of the World Wide Web [67]
to the task of sharing structured data on a global scale. In order to understand these Linked Data
principles, it is important to understand the architecture of the classic document Web.
The document Web is built on a small set of simple standards: Uniform Resource Identifiers
(URIs) as a globally unique identification mechanism [20], the Hypertext Transfer Protocol (HTTP)
as universal access mechanism [50], and the Hypertext Markup Language (HTML) as a widely
used content format [97]. In addition, the Web is built on the idea of setting hyperlinks between
Web documents that may reside on different Web servers.
The development and use of standards enables the Web to transcend different technical
architectures. Hyperlinks enable users to navigate between different servers. They also enable search
engines to crawl the Web and to provide sophisticated search capabilities on top of crawled content.
Hyperlinks are therefore crucial in connecting content from different servers into a single global
information space. By combining simplicity with decentralization and openness, the Web seems to
have hit an architectural sweet spot, as demonstrated by its rapid growth over the past 20 years.
Linked Data builds directly on Web architecture and applies this architecture to the task of
sharing data on a global scale.
Figure 2.1: URIs are used to identify people and the relationships between them.
The picture shows a Big Lynx film team at work. Within the picture, Big Lynx Lead
Cameraman Matt Briggs as well as his two assistants, Linda Meyer and Scott Miller, are identified
by HTTP URIs from the Big Lynx namespace. The relationship, that they know each other, is
represented by connecting lines having the relationship type http://xmlns.com/foaf/0.1/knows.
1 Please see copyright page for photo credits.
As discussed above, Linked Data uses only HTTP URIs, avoiding other URI schemes such
as URNs [83] and DOIs [92]. HTTP URIs make good names for two reasons:
1. They provide a simple way to create globally unique names in a decentralized fashion, as
every owner of a domain name, or delegate of the domain name owner, may create new URI
references.
2. They serve not just as a name but also as a means of accessing information describing the
identified entity.
If thinking about HTTP URIs as names for things rather than as addresses for Web documents
feels strange to you, then references [113] and [106] are highly recommended reading and warrant
re-visiting on a regular basis.
1. The client performs an HTTP GET request on a URI identifying a real-world object or
abstract concept. If the client is a Linked Data application and would prefer an RDF/XML
representation of the resource, it sends an Accept: application/rdf+xml header along with
the request. HTML browsers would send an Accept: text/html header instead.
2. The server recognizes that the URI identifies a real-world object or abstract concept. As the
server cannot return a representation of this resource, it answers using the HTTP 303 See
Other response code and sends the client the URI of a Web document that describes the
real-world object or abstract concept in the requested format.
3. The client now performs an HTTP GET request on this URI returned by the server.
4. The server answers with an HTTP response code 200 OK and sends the client the requested
document, describing the original resource in the requested format.
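The four steps above can be sketched as minimal server-side logic. This is an illustrative sketch under assumed names: the resource table, paths, and the function `handle_get` are our own, not Big Lynx's actual implementation, and the format matching is deliberately simplified.

```python
# A minimal sketch of the 303 interaction described in steps 1-4,
# seen from the server side. Resource table and paths are illustrative.
RESOURCES = {
    # URI path of a real-world object -> documents describing it, by format
    "/people/dave-smith": {
        "application/rdf+xml": "/people/dave-smith.rdf",
        "text/html": "/people/dave-smith.html",
    },
}

def handle_get(path, accept):
    """Answer a GET on a real-world object URI with a 303 redirect."""
    docs = RESOURCES.get(path)
    if docs is None:
        return 404, {}
    # Simplified matching: pick the first offered format named in the
    # Accept header (a real server would also honour q-values).
    for media_type, doc_path in docs.items():
        if media_type in accept:
            return 303, {"Location": doc_path, "Vary": "Accept"}
    return 406, {}
```

Calling `handle_get("/people/dave-smith", "application/rdf+xml")` yields a 303 response whose `Location` header points at the RDF document, mirroring step 2.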
This process can be illustrated with a concrete example. Imagine Big Lynx wants to serve data
about their Managing Director Dave Smith on the Web. This data should be understandable for
humans as well as for machines. Big Lynx therefore defines a URI reference that identifies the person
Dave Smith (real-world object) and publishes two documents on its Web server: an RDF document
containing the data about Dave Smith and an HTML document containing a human-readable
representation of the same data. Big Lynx uses the following three URIs to refer to Dave and the
two documents:
http://biglynx.co.uk/people/dave-smith
(URI identifying the person Dave Smith)
http://biglynx.co.uk/people/dave-smith.rdf
(URI identifying the RDF/XML document describing Dave Smith)
http://biglynx.co.uk/people/dave-smith.html
(URI identifying the HTML document describing Dave Smith)
To obtain the RDF data describing Dave Smith, a Linked Data client would connect to the
http://biglynx.co.uk/ server and issue the following HTTP GET request:
GET /people/dave-smith HTTP/1.1
Host: biglynx.co.uk
Accept: text/html;q=0.5, application/rdf+xml
The client sends an Accept: header to indicate that it would take either HTML or RDF, but
would prefer RDF. This preference is indicated by the quality value q=0.5 for HTML. The server
would answer:
HTTP/1.1 303 See Other
Location: http://biglynx.co.uk/people/dave-smith.rdf
Vary: Accept
This is a 303 redirect, which tells the client that a Web document containing a description
of the requested resource, in the requested format, can be found at the URI given in the Location:
response header. Note that if the Accept: header had indicated a preference for HTML, the client
would have been redirected to a different URI. This is indicated by the Vary: header, which is
required so that caches work correctly. Next, the client will try to dereference the URI given in the
response from the server.
GET /people/dave-smith.rdf HTTP/1.1
Host: biglynx.co.uk
Accept: text/html;q=0.5, application/rdf+xml
The Big Lynx Web server would answer this request by sending the client the RDF/XML
document containing data about Dave Smith:
HTTP/1.1 200 OK
Content-Type: application/rdf+xml

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:foaf="http://xmlns.com/foaf/0.1/">

  <rdf:Description rdf:about="http://biglynx.co.uk/people/dave-smith">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:name>Dave Smith</foaf:name>
    ...
The 200 status code tells the client that the response contains a representation of the requested
resource. The Content-Type: header indicates that the representation is in RDF/XML format. The
rest of the message contains the representation itself, in this case an RDF/XML description of Dave
Smith. Only the beginning of this description is shown. The RDF data model, in general, will be
described in Section 2.4.1, while the RDF/XML syntax itself will be described in Section 2.4.2.
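The quality values seen in the Accept: headers above can be handled with a small parser. The sketch below assumes the simplified `type;q=value` syntax used in the listings (the full Accept header grammar is richer), and the function name is our own.

```python
# Minimal content-negotiation helper: order media types by q-value,
# so a server can pick the representation the client prefers.
def parse_accept(header):
    """Return media types sorted by descending q-value (default q=1.0)."""
    prefs = []
    for part in header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip()
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs.append((q, media_type))
    prefs.sort(key=lambda p: -p[0])
    return [media_type for _, media_type in prefs]
```

For the request shown above, `parse_accept("text/html;q=0.5, application/rdf+xml")` ranks `application/rdf+xml` first, because its implicit q-value of 1.0 outweighs HTML's 0.5.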
2.3.2 HASH URIS
A widespread criticism of the 303 URI strategy is that it requires two HTTP requests to retrieve a
single description of a real-world object. One option for avoiding these two requests is provided by
the hash URI strategy.
The hash URI strategy builds on the characteristic that URIs may contain a special part that is
separated from the base part of the URI by a hash symbol (#). This special part is called the fragment
identifier.
When a client wants to retrieve a hash URI, the HTTP protocol requires the fragment part
to be stripped off before requesting the URI from the server. This means a URI that includes a
hash cannot be retrieved directly and therefore does not necessarily identify a Web document. This
enables such URIs to be used to identify real-world objects and abstract concepts, without creating
ambiguity [98].
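This client-side stripping can be observed with Python's standard library, which separates the fragment identifier from the part of the URI that is actually sent to the server:

```python
from urllib.parse import urldefrag

# The fragment identifier never reaches the server: clients strip it
# off before issuing the HTTP request.
url, fragment = urldefrag("http://biglynx.co.uk/vocab/sme#Team")
print(url)       # http://biglynx.co.uk/vocab/sme
print(fragment)  # Team
```

The document URI (`.../vocab/sme`) is what gets dereferenced; the fragment (`Team`) identifies the real-world concept within the returned description.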
Big Lynx has defined various vocabulary terms in order to describe the company in data
published on the Web. They may decide to use the hash URI strategy to serve an RDF/XML file
containing the definitions of all these vocabulary terms. Big Lynx therefore assigns the URI
http://biglynx.co.uk/vocab/sme
to the file (which contains a vocabulary of Small and Medium-sized Enterprises) and appends
fragment identifiers to the file's URI in order to identify the different vocabulary terms that are
defined in the document. In this fashion, URIs such as the following are created for the vocabulary
terms:
http://biglynx.co.uk/vocab/sme#SmallMediumEnterprise
http://biglynx.co.uk/vocab/sme#Team
To dereference any of these URIs, the HTTP communication between a client application
and the server would look as follows:
First, the client truncates the URI, removing the fragment identifier (e.g., #Team). Then, it
connects to the server at biglynx.co.uk and issues the following HTTP GET request:
GET /vocab/sme HTTP/1.1
Host: biglynx.co.uk
Accept: application/rdf+xml
The server answers by sending the requested RDF/XML document, an abbreviated version
of which is shown below:
HTTP/1.1 200 OK
Content-Type: application/rdf+xml; charset=utf-8

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

  <rdf:Description rdf:about="http://biglynx.co.uk/vocab/sme#SmallMediumEnterprise">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Description>

  <rdf:Description rdf:about="http://biglynx.co.uk/vocab/sme#Team">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Description>
  ...
This demonstrates that the returned document contains not only a description
of the vocabulary term http://biglynx.co.uk/vocab/sme#Team but also of the
term http://biglynx.co.uk/vocab/sme#SmallMediumEnterprise. The Linked Data-aware
client will now inspect the response and find triples that tell it more about the
http://biglynx.co.uk/vocab/sme#Team resource. If it is not interested in the triples describ-
ing the second resource, it can discard them before continuing to process the retrieved data.
One way to think of a set of RDF triples is as an RDF graph. The URIs occurring as subject
and object are the nodes in the graph, and each triple is a directed arc that connects the subject and the
object. As Linked Data URIs are globally unique and can be dereferenced into sets of RDF triples,
it is possible to imagine all Linked Data as one giant global graph, as proposed by Tim Berners-Lee
in [17]. Linked Data applications operate on top of this giant global graph and retrieve parts of it
by dereferencing URIs as required.
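The graph view of a triple set can be illustrated with a few lines of code. The triples below are invented for illustration and the URIs are abbreviated; in practice each node would be a full HTTP URI.

```python
# Each triple is a directed arc from subject to object; together the
# arcs form a graph. URIs abbreviated, data invented for illustration.
triples = {
    ("biglynx:dave-smith", "rdf:type", "foaf:Person"),
    ("biglynx:dave-smith", "foaf:name", "Dave Smith"),
    ("biglynx:dave-smith", "foaf:knows", "biglynx:matt-briggs"),
}

def outgoing(graph, node):
    """All arcs leaving `node`, as (predicate, object) pairs."""
    return {(p, o) for s, p, o in graph if s == node}
```

Dereferencing a Linked Data URI effectively retrieves the `outgoing` arcs of one node in the giant global graph, which is why applications can explore the graph piece by piece.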
2.4.1.1 Benefits of using the RDF Data Model in the Linked Data Context
The main benefits of using the RDF data model in a Linked Data context are that:
1. By using HTTP URIs as globally unique identifiers for data items as well as for vocabulary
terms, the RDF data model is inherently designed for use at a global scale and enables
anybody to refer to anything.
2. Clients can look up any URI in an RDF graph over the Web to retrieve additional information.
Thus each RDF triple is part of the global Web of Data and each RDF triple can be used as
a starting point to explore this data space.
3. The data model enables you to set RDF links between data from different sources.
4. Information from different sources can easily be combined by merging the two sets of triples
into a single graph.
5. RDF allows you to represent information that is expressed using different schemata in a single
graph, meaning that you can mix terms from different vocabularies to represent data. This
practice is explained in Section 4.4.
6. Combined with schema languages such as RDF-Schema [37] and OWL [79], the data model
allows the use of as much or as little structure as desired, meaning that tightly structured data
as well as semi-structured data can be represented. A short introduction to RDF Schema and
OWL is also given in Section 4.4.
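Benefit 4 can be shown in miniature. Because both sources use the same URI for the same resource, merging their descriptions is a plain set union; the shared URI acts as the join key. URIs are abbreviated and the data is invented for illustration.

```python
# Two descriptions of the same resource, retrieved from different sources.
source_a = {
    ("biglynx:dave-smith", "rdf:type", "foaf:Person"),
    ("biglynx:dave-smith", "foaf:name", "Dave Smith"),
}
source_b = {
    ("biglynx:dave-smith", "foaf:name", "Dave Smith"),  # overlapping triple
    ("biglynx:dave-smith", "foaf:workplaceHomepage", "http://biglynx.co.uk/"),
}

# Merging is set union: duplicate triples collapse automatically.
merged = source_a | source_b
```

No schema mapping or record matching is needed at this stage; triples about the same URI simply accumulate in one graph.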
1. RDF reification should be avoided, as reified statements are rather cumbersome to query
with the SPARQL query language [95]. Instead of using reification to publish metadata
about individual RDF statements, meta-information should instead be attached to the Web
document containing the relevant triples, as explained in Section 4.3.
2. RDF collections and RDF containers are also problematic if the data needs to be queried with
SPARQL. Therefore, in cases where the relative ordering of items in a set is not significant,
the use of multiple triples with the same predicate is recommended.
3. The scope of blank nodes is limited to the document in which they appear, meaning it is
not possible to create RDF links to them from external documents, reducing the potential
for interlinking between different Linked Data sources. In addition, it becomes much more
difficult to merge data from different sources when blank nodes are used, as there is no URI
to serve as a common key. Therefore, all resources in a data set should be named using URI
references.
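Recommendation 3 can be sketched as a rewriting step that replaces blank nodes with URIs in the publisher's namespace, so that external documents can link to the resources they denote. The naming scheme, namespace, and function name below are our own illustrative choices.

```python
def name_blank_nodes(triples, base="http://biglynx.co.uk/resource/"):
    """Replace blank nodes (written '_:id') with URIs minted under
    `base`, so the resources become linkable from other documents."""
    def fix(term):
        # Blank node labels start with '_:'; everything else passes through.
        return base + term[2:] if term.startswith("_:") else term
    return {(fix(s), p, fix(o)) for s, p, o in triples}
```

After this step every resource in the data set has a URI that can serve as a common key when merging data from different sources.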
2.4.2 RDF SERIALIZATION FORMATS
It is important to remember that RDF is not a data format, but a data model for describing resources
in the form of subject, predicate, object triples. In order to publish an RDF graph on the Web, it must
first be serialized using an RDF syntax. This simply means taking the triples that make up an RDF
graph, and using a particular syntax to write these out to a file (either in advance for a static data set
or on demand if the data set is more dynamic). Two RDF serialization formats - RDF/XML and
RDFa - have been standardized by the W3C. In addition, several other non-standard serialization
formats are used to fulfill specific needs. The relative advantages and disadvantages of the different
serialization formats are discussed below, along with a code sample showing a simple graph expressed
in each serialization.
2.4.2.1 RDF/XML
The RDF/XML syntax [9] is standardized by the W3C and is widely used to publish Linked
Data on the Web. However, the syntax is also viewed as difficult for humans to read and write,
and, therefore, consideration should be given to using other serializations in data management and
curation workflows that involve human intervention, and to the provision of alternative serializations
for consumers who may wish to eyeball the data. The RDF/XML syntax is described in detail
as part of the W3C RDF Primer [76]. The MIME type that should be used for RDF/XML
within HTTP content negotiation is application/rdf+xml. The listing below shows the RDF/XML
serialization of two RDF triples. The first states that there is a thing, identified by the URI
http://biglynx.co.uk/people/dave-smith, of type Person. The second triple states that this thing
has the name Dave Smith.
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:foaf="http://xmlns.com/foaf/0.1/">

  <rdf:Description rdf:about="http://biglynx.co.uk/people/dave-smith">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:name>Dave Smith</foaf:name>
  </rdf:Description>

</rdf:RDF>
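To make the processing of this serialization concrete, the following Python sketch extracts the two triples from a listing like the one above using only the standard library. It is an illustrative, hypothetical parser covering just this tiny subset of RDF/XML (rdf:Description blocks with rdf:about, rdf:resource object properties, and literal values); production code should use a full RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The RDF/XML listing above, as a string.
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://biglynx.co.uk/people/dave-smith">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:name>Dave Smith</foaf:name>
  </rdf:Description>
</rdf:RDF>"""

def parse_rdfxml(xml_text):
    """Extract (subject, predicate, object) triples from basic RDF/XML."""
    triples = []
    root = ET.fromstring(xml_text)
    for desc in root.findall("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; the
            # predicate URI is the concatenation of the two parts.
            ns, local = prop.tag[1:].split("}")
            predicate = ns + local
            resource = prop.get("{%s}resource" % RDF_NS)
            obj = resource if resource is not None else prop.text
            triples.append((subject, predicate, obj))
    return triples
```

Running parse_rdfxml(DOC) yields the same rdf:type and foaf:name triples that appear in the Turtle and N-Triples listings later in this section.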
2.4.2.2 RDFa
RDFa [1] is a serialization format that embeds RDF triples in HTML documents. The RDF data is
not embedded in comments within the HTML document, as was the case with some early attempts
to mix RDF and HTML, but is interwoven within the HTML Document Object Model (DOM). This
means that existing content within the page can be marked up with RDFa by modifying the HTML
code, thereby exposing structured data to the Web. A detailed introduction to RDFa is given in
the W3C RDFa Primer [1].
2.4. PROVIDING USEFUL RDF INFORMATION 19
RDFa is popular in contexts where data publishers are able to modify HTML templates
but have relatively little additional control over the publishing infrastructure. For example, many
content management systems will enable publishers to configure the HTML templates used to
expose different types of information, but may not be flexible enough to support 303 redirects and
HTTP content negotiation. When using RDFa to publish Linked Data on the Web, it is important
to maintain the unambiguous distinction between the real-world objects described by the data and
the HTML+RDFa document that embodies these descriptions. This can be achieved by using the
RDFa about= attribute to assign URI references to the real-world objects described by the data. If
these URIs are first defined in an RDFa document they will follow the hash URI pattern.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:foaf="http://xmlns.com/foaf/0.1/">

  <head>
    <meta http-equiv="Content-Type"
          content="application/xhtml+xml; charset=UTF-8" />
    <title>Profile Page for Dave Smith</title>
  </head>

  <body>
    <div about="http://biglynx.co.uk/people#dave-smith" typeof="foaf:Person">
      <span property="foaf:name">Dave Smith</span>
    </div>
  </body>

</html>
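The following Python sketch shows, in simplified form, what an RDFa-aware consumer does with such a page: it walks the markup and emits triples from the about=, typeof=, and property= attributes. It is a hypothetical illustration handling only this attribute subset, with a hardcoded prefix map standing in for proper CURIE resolution; real consumers should use a compliant RDFa distiller.

```python
from html.parser import HTMLParser

# Assumed prefix map; a real consumer resolves CURIEs from the
# namespace declarations (or @prefix attributes) in the document.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def expand(curie):
    """Expand a CURIE like foaf:name using the prefix map."""
    prefix, local = curie.split(":", 1)
    return PREFIXES.get(prefix, prefix + ":") + local

class RDFaExtractor(HTMLParser):
    """Collect triples from a small subset of RDFa attributes."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.subject = None   # most recent about= value
        self.pending = None   # property= predicate awaiting its text

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
        if "typeof" in a and self.subject:
            self.triples.append((self.subject, RDF_TYPE,
                                 expand(a["typeof"])))
        if "property" in a:
            self.pending = expand(a["property"])

    def handle_data(self, data):
        # The element's text content becomes the literal object.
        if self.pending and self.subject and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None
```

Feeding the profile page above to RDFaExtractor yields the familiar rdf:type and foaf:name triples about Dave Smith.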
2.4.2.3 Turtle
Turtle is a plain text format for serializing RDF data. Due to its support for namespace prefixes
and various other shorthands, Turtle is typically the serialization format of choice for reading RDF
triples or writing them by hand. A detailed introduction to Turtle is given in the W3C Team
Submission document Turtle - Terse RDF Triple Language [10]. The MIME type for Turtle is
text/turtle;charset=utf-8.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .
2.4.2.4 N-Triples
N-Triples is a subset of Turtle, minus features such as namespace prefixes and shorthands. The result
is a serialization format with lots of redundancy, as all URIs must be specified in full in each triple.
20 2. PRINCIPLES OF LINKED DATA
Consequently, N-Triples files can be rather large relative to Turtle and even RDF/XML. However,
this redundancy is also the primary advantage of N-Triples over other serialization formats, as it
enables N-Triples files to be parsed one line at a time, making it ideal for loading large data files that
will not fit into main memory. The redundancy also makes N-Triples very amenable to compression,
thereby reducing network traffic when exchanging files. These two factors make N-Triples the de
facto standard for exchanging large dumps of Linked Data, e.g., for backup or mirroring purposes.
The complete definition of the N-Triples syntax is given as part of the W3C RDF Test Cases
recommendation2.
<http://biglynx.co.uk/people/dave-smith> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://biglynx.co.uk/people/dave-smith> <http://xmlns.com/foaf/0.1/name> "Dave Smith" .
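The one-triple-per-line property makes a streaming reader almost trivial. The sketch below is a hypothetical minimal reader (it ignores blank nodes, datatype and language tags, and escape sequences) that processes input one line at a time, so even files far larger than main memory can be handled.

```python
import re

# One triple per line: <s> <p> <o> .  or  <s> <p> "literal" .
TRIPLE_RE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.\s*$')

def stream_ntriples(lines):
    """Yield (subject, predicate, object) triples one line at a time.

    Because N-Triples carries exactly one complete triple per line,
    arbitrarily large dumps can be parsed without ever holding the
    whole file in memory.
    """
    for line in lines:
        m = TRIPLE_RE.match(line)
        if m:
            s, p, o_uri, o_lit = m.groups()
            yield (s, p, o_uri if o_uri is not None else o_lit)
```

In practice the `lines` argument would be an open file object, which Python already iterates line by line without buffering the whole file.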
2.4.2.5 RDF/JSON
RDF/JSON refers to efforts to provide a JSON (JavaScript Object Notation) serialization for RDF,
the most widely adopted of which is the Talis specification3 [4]. Availability of a JSON serialization
of RDF is highly desirable, as a growing number of programming languages provide native JSON
support, including staples of Web programming such as JavaScript and PHP. Publishing RDF data
in JSON therefore makes it accessible to Web developers without the need to install additional
software libraries for parsing and manipulating RDF data. It is likely that further efforts will be
made in the near future to standardize a JSON serialization of RDF4.
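As an illustration, the two Dave Smith triples could be written in the resource-centric JSON shape used by such serializations: subjects map to predicates, which map to lists of typed object descriptors. The exact key names here follow my reading of the Talis-style format and should be checked against the specification itself.

```python
import json

# Hypothetical RDF/JSON rendering of the two triples about Dave Smith:
# subject URI -> predicate URI -> list of {"type", "value"} descriptors,
# where "type" distinguishes "uri", "literal", and "bnode" objects.
rdf_json = {
    "http://biglynx.co.uk/people/dave-smith": {
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            {"type": "uri", "value": "http://xmlns.com/foaf/0.1/Person"}
        ],
        "http://xmlns.com/foaf/0.1/name": [
            {"type": "literal", "value": "Dave Smith"}
        ],
    }
}

def triples_from_rdf_json(doc):
    """Flatten an RDF/JSON-style document back into (s, p, o) triples."""
    return [(s, p, obj["value"])
            for s, preds in doc.items()
            for p, objs in preds.items()
            for obj in objs]
```

Because the structure is plain JSON, json.dumps(rdf_json) is all a PHP or JavaScript client needs to exchange it, with no RDF library involved.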
The example above demonstrates how Big Lynx uses RDF links pointing at related entities
to enrich the data it publishes about its Managing Director Dave Smith. In order to provide back-
ground information about the place where he lives, the example contains an RDF link stating that
Dave is based_near something identified by the URI http://sws.geonames.org/3333125/. Linked
Data applications that look up this URI will retrieve an extensive description of Birmingham from
Geonames5 , a data source that provides names of places (in different languages), geo-coordinates, and
5 http://www.geonames.org/
information about administrative structures. The Geonames data about Birmingham will contain
a further RDF link pointing at http://dbpedia.org/resource/Birmingham.
By following this link, applications can find population counts, postal codes, descriptions in
90 languages, and lists of famous people and bands that are related to Birmingham. The description
of Birmingham provided by DBpedia, in turn, contains RDF links pointing at further data sources
that contain data about Birmingham. Therefore, by setting a single RDF link, Big Lynx has enabled
applications to retrieve data from a network of interlinked data sources.
1. Different opinions. URI aliases have an important social function on the Web of Data as they
are dereferenced to descriptions of the same resource provided by different data publishers and
thus allow different views and opinions to be expressed.
2. Traceability. Using different URIs allows consumers of Linked Data to know what a particular
publisher has to say about a specific entity by dereferencing the URI that is used by this
publisher to identify the entity.
3. No central points of failure. If all the things in the world were each to have one, and only one, URI,
this would entail the creation and operation of a centralized naming authority to assign URIs.
The coordination complexity and the administrative and bureaucratic overhead that this would
introduce would create a major barrier to growth in the Web of Data.
The last point becomes especially clear when one considers the size of many data sets that
are part of the Web of Data. For instance, the Geonames data set provides information about over
eight million locations. If, in order to start publishing their data on the Web of Data, the Geonames
team first had to discover the commonly accepted URIs for all of these places, the effort required
would likely prevent Geonames from publishing their data set as Linked Data at all. Defining URIs
for the locations in their own namespace lowers the barrier to entry, as they do not need to know
about other people's URIs for these places. Later, they, or somebody else, may invest effort into
finding and publishing owl:sameAs links pointing to data about these places in other data sets,
enabling progressive adoption of the Linked Data principles.
Therefore, in contrast to relying on upfront agreement on URIs, the Web of Linked Data relies
on solving the identity resolution problem in an evolutionary and distributed fashion: evolutionary,
in that more and more owl:sameAs links can be added over time; and distributed, in that different
data providers can publish owl:sameAs links independently, so the overall effort of creating these
links can be shared between the different parties.
There has been significant uncertainty in recent years about whether owl:sameAs or other
predicates should be used to express identity links [53]. A major source of this uncertainty is that the
OWL semantics [93] treat RDF statements as facts rather than as claims by different information
providers. Today, owl:sameAs is widely used in the Linked Data context and hundreds of millions
of owl:sameAs links are published on the Web. Therefore, we recommend also using owl:sameAs
to express identity links, but always keeping in mind that the Web is a social system and that all
its content must be treated as claims by different parties rather than as facts (see Section 6.3.5
on Data Quality Assessment). This guidance is also supported by members of the W3C Technical
Architecture Group (TAG)6.
6 http://lists.w3.org/Archives/Public/www-tag/2007Jul/0032.html
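Since owl:sameAs is symmetric and transitive, a consumer that chooses to trust a set of identity links typically merges the descriptions of all connected URIs. The following is a hypothetical sketch of that grouping step, using a simple union-find over links that may have been published independently by different parties.

```python
def alias_groups(sameas_links):
    """Group URIs into the equivalence classes implied by owl:sameAs.

    Treats each (a, b) link as symmetric and transitive, so links
    contributed by different publishers merge into connected
    components. Implemented as a union-find with path halving.
    """
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in sameas_links:
        parent[find(a)] = find(b)

    groups = {}
    for u in list(parent):
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())
```

A consumer would then treat the union of the descriptions retrieved for each URI in a group as one merged description, while remembering which publisher contributed which claim.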
2.5.3 VOCABULARY LINKS
The promise of the Web of Data is not only to enable client applications to discover new data
sources by following RDF links at run-time but also to help them to integrate data from these
sources. Integrating data requires bridging between the schemata that are used by different data
sources to publish their data. The term schema is understood in the Linked Data context as the
mixture of distinct terms from different RDF vocabularies that are used by a data source to publish
data on the Web. This mixture may include terms from widely used vocabularies (see Section 4.4.4)
as well as proprietary terms.
The Web of Data takes a two-fold approach to dealing with heterogeneous data representation [22].
On the one hand, it tries to avoid heterogeneity by advocating the reuse of terms from widely
deployed vocabularies. As discussed in Section 4.4.4, a set of vocabularies for describing common
things like people, places, or projects has emerged in the Linked Data community. Thus, whenever
these vocabularies already contain the terms needed to represent a specific data set, they should be
used. This helps to avoid heterogeneity by relying on ontological agreement.
On the other hand, the Web of Data tries to deal with heterogeneity by making data as
self-descriptive as possible. Self-descriptiveness [80] means that a Linked Data application which
discovers some data on the Web that is represented using a previously unknown vocabulary should
be able to find all the meta-information it requires to translate the data into a representation that
it understands and can process. Technically, this is realized in a twofold manner: first, by making
the URIs that identify vocabulary terms dereferenceable, so that client applications can look up the
RDFS and OWL definitions of terms; in other words, every vocabulary term links to its own
definition [23]. Second, mappings between terms from different vocabularies are published in the
form of RDF links [80]. Together, these techniques enable Linked Data applications to discover the
meta-information that they need to integrate data in a follow-your-nose fashion along RDF links.
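The follow-your-nose step itself is plain HTTP. The sketch below only constructs the request a client would send when dereferencing a vocabulary term, asking for an RDF serialization via the Accept header; it does not contact the network, and the exact media types a given server honours vary.

```python
from urllib.request import Request

def rdf_request(uri):
    """Prepare an HTTP request that asks for RDF rather than HTML.

    A Linked Data client dereferences a term URI with an Accept
    header listing RDF serializations; the server's content
    negotiation then returns the machine-readable RDFS/OWL definition
    instead of a human-oriented page. The request is only constructed
    here, not sent.
    """
    return Request(uri, headers={
        "Accept": "text/turtle, application/rdf+xml;q=0.9"})
```

Passing the returned Request to urllib.request.urlopen would perform the actual lookup, following any 303 redirect the server issues.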
Linked Data publishers should therefore adopt the following workflow: first, search for terms
from widely used vocabularies that could be reused to represent data (as described in Section 4.4.4);
if widely deployed vocabularies do not provide all terms that are needed to publish the complete
content of a data set, the required terms should be defined as a proprietary vocabulary (as described in
Section 4.4.6) and used in addition to terms from widely deployed vocabularies. Wherever possible,
the publisher should seek wider adoption for the new, proprietary vocabulary from others with related
data.
If at a later point in time, the data publisher discovers that another vocabulary contains the
same term as the proprietary vocabulary, an RDF link should be set between the URIs identifying
the two vocabulary terms, stating that these URIs actually refer to the same concept (= the term). The
Web Ontology Language (OWL) [79], RDF Schema (RDFS) [37] and the Simple Knowledge Or-
ganization System (SKOS) [81] define RDF link types that can be used to represent such mappings.
owl:equivalentClass and owl:equivalentProperty can be used to state that terms in different vo-
cabularies are equivalent. If a looser mapping is desired, then rdfs:subClassOf, rdfs:subPropertyOf,
skos:broadMatch, and skos:narrowMatch can be used.
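Once such mappings have been discovered, a client can normalize foreign terms into ones it understands. The following hypothetical sketch applies a simple mapping table, as might be harvested from owl:equivalentClass or rdfs:subClassOf links, to a set of triples; a real integrator must additionally remember that subclass and subproperty mappings hold in one direction only.

```python
def translate(triples, mappings):
    """Rewrite terms a client does not know into mapped known terms.

    `mappings` pairs a foreign term URI with a known term URI, as
    would be harvested from owl:equivalentClass/owl:equivalentProperty
    (or, more loosely, rdfs:subClassOf/rdfs:subPropertyOf) links found
    by dereferencing the foreign vocabulary's term URIs.
    """
    def m(term):
        # Follow mapping chains (A -> B -> C), guarding against cycles.
        seen = set()
        while term in mappings and term not in seen:
            seen.add(term)
            term = mappings[term]
        return term

    return [(m(s), m(p), m(o)) for s, p, o in triples]
```

For example, mapping the Big Lynx term for small and medium enterprises onto the DBpedia Company class lets an application that only knows DBpedia terms consume Big Lynx data unchanged.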
The example below illustrates how the proprietary vocabulary term
http://biglynx.co.uk/vocab/sme#SmallMediumEnterprise is interlinked with related terms
from the DBpedia, Freebase, UMBEL, and OpenCyc vocabularies.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix co: <http://biglynx.co.uk/vocab/sme#> .

<http://biglynx.co.uk/vocab/sme#SmallMediumEnterprise>
    rdf:type rdfs:Class ;
    rdfs:label "Small or Medium-sized Enterprise" ;
    rdfs:subClassOf <http://dbpedia.org/ontology/Company> ;
    rdfs:subClassOf <http://umbel.org/umbel/sc/Business> ;
    rdfs:subClassOf <http://sw.opencyc.org/concept/Mx4rvVjQNpwpEbGdrcN5Y29ycA> ;
    rdfs:subClassOf <http://rdf.freebase.com/ns/m/0qb7t> .
Just as owl:sameAs links can be used to incrementally interconnect URI aliases, term-level links
between different vocabularies can also be set over time by different parties. The more links that are
set between vocabulary terms, the better client applications can integrate data that is represented
using different vocabularies. Thus, the Web of Data relies on a distributed, pay-as-you-go approach
to data integration, which enables the integration effort to be split over time and between different
parties [51][74][34]. This type of data integration is discussed in more detail in Section 6.4.
2.6 CONCLUSIONS
This chapter has outlined the basic principles of Linked Data and has described how the principles
interplay in order to extend the Web with a global data space. Similar to the classic document Web,
the Web of Data is built on a small set of standards and the idea to use links to connect content from
different sources. The extent of its dependence on URIs and HTTP demonstrates that Linked Data
is not disjoint from the Web at large, but simply an application of its principles and key components
to novel forms of usage. Far from being an additional layer on top of but separate from the Web,
Linked Data is just another warp or weft being steadily interwoven with the fabric of the Web.
Structured data is made available on the Web today in a variety of forms. Data is published as CSV
data dumps, Excel spreadsheets, and in a multitude of domain-specific data formats. Structured data is
embedded into HTML pages using Microformats7. Various data providers have started to allow
direct access to their databases via Web APIs.
So what is the rationale for adopting Linked Data instead of, or in addition to, these well-
established publishing techniques? In summary, Linked Data provides a more generic, more flexible
publishing paradigm which makes it easier for data consumers to discover and integrate data from
large numbers of data sources. In particular, Linked Data provides:
A unifying data model. Linked Data relies on RDF as a single, unifying data model. By
providing for the globally unique identification of entities and by allowing different schemata
7 http://microformats.org/
to be used in parallel to represent data, the RDF data model has been especially designed for
the use case of global data sharing. In contrast, the other methods for publishing data on the
Web rely on a wide variety of different data models, and the resulting heterogeneity needs to
be bridged in the integration process.
A standardized data access mechanism. Linked Data commits itself to a specific pattern of
using the HTTP protocol. This agreement allows data sources to be accessed using generic
data browsers and enables the complete data space to be crawled by search engines. In contrast,
Web APIs are accessed using different proprietary interfaces.
Hyperlink-based data discovery. By using URIs as global identifiers for entities, Linked
Data allows hyperlinks to be set between entities in different data sources. These data links
connect all Linked Data into a single global data space and enable Linked Data applications
to discover new data sources at run-time. In contrast, Web APIs as well as data dumps in
proprietary formats remain isolated data islands.
Self-descriptive data. Linked Data eases the integration of data from different sources by
relying on shared vocabularies, making the definitions of these vocabularies retrievable, and by
allowing terms from different vocabularies to be connected to each other by vocabulary links.
Compared to the other methods of publishing data on the Web, these properties of the Linked
Data architecture make it easier for data consumers to discover, access and integrate data. However,
it is important to remember that the various publication methods represent a continuum of benefit,
from making data available on the Web in any form, to publishing Linked Data according to the
principles described in this chapter.
Progressive steps can be taken towards Linked Data publishing, each of which makes it easier
for third parties to consume and work with the data. These steps range from making data available on
the Web in any format but under an open license, to using structured, machine-readable formats
that are preferably non-proprietary, to adopting open standards such as RDF, and to including
links to other data sources.
Tim Berners-Lee has described this continuum in terms of a five-star rating scheme [16],
whereby data publishers can nominally award stars to their data sets according to the following
criteria:
1 Star: data is available on the web (whatever format), but with an open license.
2 Stars: data is available as machine-readable structured data (e.g., Microsoft Excel instead of
a scanned image of a table).
3 Stars: data is available as (2) but in a non-proprietary format (e.g., CSV instead of Excel).
4 Stars: data is available according to all the above, plus the use of open standards from the
W3C (RDF and SPARQL) to identify things, so that people can link to it.
5 Stars: data is available according to all the above, plus outgoing links to other people's data
to provide context.
Crucially, each rating can be obtained in turn, representing a progressive transition to Linked
Data rather than a wholesale adoption in one operation.
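Because the criteria are cumulative, the scheme can be read as a simple checklist. The sketch below is an illustrative function, not part of the scheme itself; it stops counting at the first unmet criterion, mirroring the rule that each star presupposes all previous ones.

```python
def star_rating(open_license, structured, non_proprietary,
                uses_rdf_standards, links_to_others):
    """Award 0-5 stars per Tim Berners-Lee's scheme.

    Each criterion builds on all previous ones, so counting stops at
    the first criterion a data set fails to meet.
    """
    stars = 0
    for met in (open_license, structured, non_proprietary,
                uses_rdf_standards, links_to_others):
        if not met:
            break
        stars += 1
    return stars
```

For example, an openly licensed CSV dump (open license, structured, non-proprietary, but neither RDF nor linked) would rate three stars.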
CHAPTER 3
The Web of Data
1. The Web of Data is generic and can contain any type of data.
3. The Web of Data is able to represent disagreement and contradictory information about
an entity.
4. Entities are connected by RDF links, creating a global data graph that spans data sources and
enables the discovery of new data sources. This means that applications do not have to be
implemented against a fixed set of data sources, but they can discover new data sources at
run-time by following RDF links.
5. Data publishers are not constrained in their choice of vocabularies with which to represent
data.
7. The use of HTTP as a standardized data access mechanism and RDF as a standardized data
model simplifies data access compared to Web APIs, which rely on heterogeneous data models
and access interfaces.
30 3. THE WEB OF DATA
3.1 BOOTSTRAPPING THE WEB OF DATA
The origins of this Web of Data lie in the efforts of the Semantic Web research community and
particularly in the activities of the W3C Linking Open Data (LOD) project 1 , a grassroots community
effort founded in January 2007. The founding aim of the project, which has spawned a vibrant and
growing Linked Data community, was to bootstrap the Web of Data by identifying existing data
sets available under open licenses, converting them to RDF according to the Linked Data principles,
and publishing them on the Web. As a point of principle, the project has always been open to anyone
who publishes data according to the Linked Data principles. This openness is a likely factor in the
success of the project in bootstrapping the Web of Data.
Figure 3.1 and Figure 3.2 demonstrate how the number of data sets published on the Web
as Linked Data has grown since the inception of the Linking Open Data project. Each node in the
diagram represents a distinct data set published as Linked Data. The arcs indicate the existence of
links between items in the two data sets. Heavier arcs correspond to a greater number of links, while
bidirectional arcs indicate that each data set contains outward links to the other.
Figure 3.2 illustrates the November 2010 scale of the Linked Data Cloud originating from the
Linking Open Data project and classifies the data sets by topical domain, highlighting the diversity
of data sets present in the Web of Data. The graphic shown in this figure is available online at
http://lod-cloud.net. Updated versions of the graphic will be published on this website at
regular intervals. More information about each of these data sets can be found by exploring the
LOD Cloud Data Catalog 2 which is maintained by the LOD community within the Comprehensive
Knowledge Archive Network (CKAN)3 , a generic catalog that lists open-license datasets represented
using any format.
If you publish a Linked Data set yourself, please also add it to this catalog so that it will be
included in the next version of the cloud diagram. Instructions on how to add data sets to the
catalog can be found in the ESW wiki4.
1 http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
2 http://www.ckan.net/group/lodcloud
3 http://www.ckan.net/
4 http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
5 http://lod-cloud.net/state/
3.2. TOPOLOGY OF THE WEB OF DATA 31
Figure 3.1: Growth in the number of data sets published on the Web as Linked Data.
Figure 3.2: Linking Open Data cloud as of November 2010. The colors classify data sets by topical domain.
which on a regular basis compiles summary statistics about the data sets that are cataloged within
the LOD Cloud Data Catalog on CKAN.
6 http://dbpedia.org/
7 http://www.freebase.com
Further cross-domain data sets include UMBEL8 , YAGO [104], and OpenCyc9 . These are,
in turn, linked with DBpedia, helping to facilitate data integration across a wide range of interlinked
sources.
17 http://www.bbc.co.uk/wildlifefinder/
18 http://news.bbc.co.uk/sport1/hi/football/world_cup_2010/default.stm
19 http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
20 http://data.nytimes.com/
21 http://www.opencalais.com/
22 http://data.australia.gov.au/
23 http://www.data.govt.nz/
24 http://data.gov.uk
U.S.A.25 , have led to a significant increase in the amount of governmental and public-sector data
that is made accessible on the Web. Making this data easily accessible enables organisations and
members of the public to work with the data, analyse it to discover new insights, and build tools
that help communicate these findings to others, thereby helping citizens make informed choices and
hold public servants to account.
The potential of Linked Data for easing access to government data is increasingly understood,
with both the data.gov.uk26 and data.gov27 initiatives publishing significant volumes of data
in RDF. The approach taken in the two countries differs slightly: to date the latter has converted
very large volumes of data, while the former has focused on the creation of core data-level infras-
tructure for publishing Linked Data, such as stable URIs to which increasing amounts of data can
be connected [101].
A very interesting initiative is being pursued by the UK Civil Service28, which has started
to mark up job vacancies using RDFa. By providing information about open positions in a struc-
tured form, it becomes easier for external job portals to incorporate civil service jobs [25]. If more
organizations were to follow this example, transparency in the labor market could be increased
significantly [31].
Further high-level guidance on "Putting Government Data online" can be found in [18]. In
order to provide a forum for coordinating the work on using Linked Data and other Web standards
to improve access to government data and increase government transparency, the W3C has formed
an eGovernment Interest Group29.
25 http://www.data.gov
26 http://data.gov.uk/linked-data
27 http://www.data.gov/semantic
28 http://www.civilservice.gov.uk/
29 http://www.w3.org/egov/wiki/Main_Page
30 http://id.loc.gov/authorities/about.html
31 http://blog.libris.kb.se/semweb/?p=7
32 http://blogs.talis.com/nodalities/2009/01/libris-linked-library-data.php
The Open Library, a project that aims to provide "a page for every book ever published"33, publishes
its catalogue in RDF, with incoming links from data sets such as ProductDB (see Section 3.2.7 below).
Scholarly articles from journals and conferences are also well represented in the Web of Data
through community publishing efforts such as DBLP as Linked Data343536 , RKBexplorer37 , and
the Semantic Web Dogfood Server38 [84].
An application that facilitates this scholarly data space is Talis Aspire39. The application
supports educators in the creation and management of literature lists for university courses. Items
are added to these lists through a conventional Web interface; however, behind the scenes, the system
stores these records as RDF and makes the lists available as Linked Data. Aspire is used by various
universities in the UK, which, in turn, have become Linked Data providers. The Aspire application
is explored in more detail in Section 6.1.2.
High levels of ongoing activity in the library community will no doubt lead to further significant
Linked Data deployments in this area. Of particular note is the new Object Reuse and
Exchange (OAI-ORE) standard from the Open Archives Initiative [110], which is based on the
Linked Data principles. The OAI-ORE, Dublin Core, SKOS, and FOAF vocabularies form the
foundation of the new Europeana Data Model40 . The adoption of this model by libraries, muse-
ums and cultural institutions that participate in Europeana will further accelerate the availability of
Linked Data related to publications and cultural heritage artifacts.
In order to provide a forum and to coordinate the efforts to increase the global interoperability
of library data, W3C has started a Library Linked Data Incubator Group41 .
33 http://openlibrary.org/
34 http://dblp.l3s.de/
35 http://www4.wiwiss.fu-berlin.de/dblp/
36 http://dblp.rkbexplorer.com/
37 http://www.rkbexplorer.com/data/
38 http://data.semanticweb.org/
39 http://www.talis.com/aspire
40 http://version1.europeana.eu/c/document_library/get_file?uuid=9783319c-9049-436c-bdf9-
25f72e85e34c&groupId=10602
41 http://www.w3.org/2005/Incubator/lld/
42 http://esw.w3.org/HCLSIG/LODD
3.2.7 RETAIL AND COMMERCE
The RDF Book Mashup43 [29] provided an early example of publishing Linked Data related to
retail and commerce. The Book Mashup uses the Simple Commerce Vocabulary44 to represent and
republish data about book offers retrieved from the Amazon.com and Google Base Web APIs.
More recently, the GoodRelations ontology45 [63] has provided a richer ontology for de-
scribing many aspects of e-commerce, such as businesses, products and services, offerings, opening
hours, and prices. GoodRelations has seen significant uptake from retailers such as Best Buy46 and
Overstock.com47 seeking to increase their visibility in search engines such as Yahoo! and Google, which
recognise data published in RDFa using certain vocabularies and use this data to enhance search
results (see Section 6.1.1.2). The adoption of the GoodRelations ontology has even extended to the
publication of price lists for courses offered by The Open University48.
The ProductDB Web site and data set49 aggregates and links data about products from a range
of different sources and demonstrates the potential of Linked Data in the area of product data
integration.
3.3 CONCLUSIONS
The data sets described in this chapter demonstrate the diversity in the Web of Data. Recently
published data sets, such as Ordnance Survey, legislation.gov.uk, the BBC, and the New York Times
data sets, demonstrate how the Web of Data is evolving from data publication primarily by third
party enthusiasts and researchers, to data publication at source by large media and public sector
organisations. This trend is expected to gather significant momentum, with organisations in other
industry sectors publishing their own data according to the Linked Data principles.
Linked Data is made available on the Web using a wide variety of tools and publishing patterns.
In the following Chapters 4 and 5, we will examine the design decisions that must be taken to ensure
your Linked Data sits well in the Web, and the technical options available for publishing it.
56 http://www.imdb.com/
57 http://drupal.org/
CHAPTER 4
Linked Data Design Considerations
As discussed in Chapter 2, the first principle of Linked Data is that URIs should be used as names
for things that feature in your data set. These things might be concrete real-world entities such as
a person, a building, your dog, or more abstract notions such as a scientific concept. Each of these
things needs a name so that you and others can refer to it. Just as significant care should go into the
design of URIs for pages in a conventional Web site, so should careful decisions be made about the
design of URIs for a set of Linked Data. This section will explore these issues in detail.
2. Triples that describe the resource by linking to other resources (e.g., triples stating the resource's
creator, or its type)
3. Triples that describe the resource by linking from other resources (i.e., incoming links)
4. Triples describing related resources (e.g., the name and maybe affiliation of the resource's
creator)
5. Triples describing the description itself (i.e., data about the data, such as its provenance, date
of collection, or licensing terms)
6. Triples about the broader data set of which this description is a part.
Of particular importance are triples that provide human-readable labels for resources that
can be used within client applications. Predicates such as rdfs:label, foaf:name, skos:prefLabel
and dcterms:title should be used for this purpose as they are widely supported by Linked Data
applications. In cases where a comment or textual description of the resource is available, these
should be published using predicates such as dcterms:description or rdfs:comment.
46 4. LINKED DATA DESIGN CONSIDERATIONS
4.2.2 INCOMING LINKS
If an RDF triple links person a to person b, the document describing b should include this triple,
which can be thought of as an incoming link to b (item 3 on the list above). This helps ensure that
data about person a is discoverable from the description of b, even if a is not the object of any triples
in the description of b. For instance, when you use a Linked Data browser to navigate from resource
a to b, incoming links enable you to navigate backward to resource a. They also enable crawlers of
Linked Data search engines, which have entered a Linked Data site via an external link pointing
at resource b, to discover resource a and continue crawling the site.
The code sample below demonstrates this principle applied to the Big Lynx scenario. In this
case the code shows both an outgoing employerOf link from Dave Smith to Matt Briggs and the
inverse employedBy link from Matt Briggs to Dave Smith. Following this principle, the same two
triples would be published in the document describing Matt Briggs.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://purl.org/vocab/relationship/> .

<http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" ;
    foaf:based_near <http://sws.geonames.org/3333125/> ;
    foaf:based_near <http://dbpedia.org/resource/Birmingham> ;
    foaf:topic_interest <http://dbpedia.org/resource/Wildlife_photography> ;
    foaf:knows <http://dbpedia.org/resource/David_Attenborough> ;
    rel:employerOf <http://biglynx.co.uk/people/matt-briggs> ;
    foaf:isPrimaryTopicOf <http://biglynx.co.uk/people/dave-smith.rdf> .

<http://biglynx.co.uk/people/dave-smith.rdf>
    foaf:primaryTopic <http://biglynx.co.uk/people/dave-smith> .

<http://biglynx.co.uk/people/matt-briggs>
    rel:employedBy <http://biglynx.co.uk/people/dave-smith> .
4 http://n2.talis.com/wiki/Bounded_Descriptions_in_RDF
48 4. LINKED DATA DESIGN CONSIDERATIONS
4.3 PUBLISHING DATA ABOUT DATA
4.3.1 DESCRIBING A DATA SET
In addition to ensuring that resources featured in a data set are richly described, the same principles
should be applied to the data set itself, to include information about authorship of a data set, its
currency (i.e., how recently the data set was updated), and its licensing terms. This metadata gives
consumers of the data clarity about the provenance and timeliness of a data set and the terms under
which it can be reused, each of which is important in encouraging reuse of a data set.
Furthermore, descriptions of a data set can include pointers to example resources within the
data set, thereby aiding discovery and indexing of the data by crawlers. If the load created by crawlers
indexing a site is too great, descriptions of a data set can also include links to RDF data dumps,
which can be downloaded and indexed separately.
Two primary mechanisms are available for publishing descriptions of a data set: Semantic
Sitemaps [43] and voiD [5] descriptions.
5 http://sw.deri.org/2007/07/sitemapextension/
6 http://www.sitemaps.org/
4.3. PUBLISHING DATA ABOUT DATA 49
    <sc:datasetURI>
        http://biglynx.co.uk/datasets/people
    </sc:datasetURI>
    <sc:linkedDataPrefix>
        http://biglynx.co.uk/people/
    </sc:linkedDataPrefix>
    <sc:sampleURI>
        http://biglynx.co.uk/people/dave-smith
    </sc:sampleURI>
    <sc:sampleURI>
        http://biglynx.co.uk/people/matt-briggs
    </sc:sampleURI>
    <sc:sparqlEndpointLocation>
        http://biglynx.co.uk/sparql
    </sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>
        http://biglynx.co.uk/dumps/people.rdf.gz
    </sc:dataDumpLocation>
    <changefreq>monthly</changefreq>
  </sc:dataset>
</urlset>
If the data publisher wishes to convey to a consumer the shape of the graph that can be expected
when a URI is dereferenced (for example, a Concise Bounded Description or a Symmetric Concise
Bounded Description), the optional slicing attribute can be added to the sc:linkedDataPrefix.
<sc:linkedDataPrefix slicing="cbd">
    http://biglynx.co.uk/people/
</sc:linkedDataPrefix>
Acceptable values for this attribute include cbd and scbd. A full list of these values, with
explanatory notes, can be found at7.
Note that these triples could be included in a document published anywhere on the Web.
However, to ensure the online profile of Dave Smith is as self-describing as possible, they should
appear within the RDF document itself, at http://biglynx.co.uk/people/dave-smith.rdf. It should
also be noted that, given the convention regarding use of the .rdf extension, the online profile should
be published using the RDF/XML serialisation of RDF. However, for the sake of readability, this
and subsequent examples are shown in the Turtle serialisation.
This example is extended below to include the publisher of the staff profile document (in this
case Big Lynx, identified by the URI http://biglynx.co.uk/company.rdf#company) and its date of
publication. A triple has also been added stating that Dave Smith's staff profile document is part
of the broader People data set introduced in Section 4.3.1.1 above, which is identified by the URI
http://biglynx.co.uk/datasets/people:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://purl.org/vocab/relationship/> .
@prefix void: <http://rdfs.org/ns/void#> .

<http://biglynx.co.uk/people/dave-smith.rdf>
    foaf:primaryTopic <http://biglynx.co.uk/people/dave-smith> ;
    rdf:type foaf:PersonalProfileDocument ;
    rdfs:label "Dave Smith's Personal Profile in RDF" ;
    dcterms:creator <http://biglynx.co.uk/people/nelly-jones> ;
    dcterms:publisher <http://biglynx.co.uk/company.rdf#company> ;
    dcterms:date "2010-11-05"^^xsd:date ;
    dcterms:isPartOf <http://biglynx.co.uk/datasets/people> .

<http://biglynx.co.uk/people/dave-smith>
    foaf:isPrimaryTopicOf <http://biglynx.co.uk/people/dave-smith.rdf> ;
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" ;
    foaf:based_near <http://sws.geonames.org/3333125/> ;
    foaf:based_near <http://dbpedia.org/resource/Birmingham> ;
    foaf:topic_interest <http://dbpedia.org/resource/Wildlife_photography> ;
    foaf:knows <http://dbpedia.org/resource/David_Attenborough> ;
    rel:employerOf <http://biglynx.co.uk/people/matt-briggs> .

<http://biglynx.co.uk/people/matt-briggs>
    rel:employedBy <http://biglynx.co.uk/people/dave-smith> .
The People data set is itself a subset of the entire Big Lynx Master data set, which is briefly
mentioned at the bottom of the example above, to demonstrate use of the void:subset property to
create an incoming link. Note the directionality of this property, which points from a super data
set to one of its subsets, not the other way around. At present, no inverse of void:subset has been
defined, although a term such as void:inDataset has been discussed to address this issue. In the
meantime, the term dcterms:isPartOf makes a reasonable substitute, and therefore is used in the
examples above.
Applying this license to content published on the Web and described in RDF is simple,
involving the addition of just one RDF triple. The code sample below shows an RDF description
of the "Making Pacific Sharks" blog post. The post is described using the SIOC ontology [36]
of online communities. Note that a URI, http://biglynx.co.uk/blog/making-pacific-sharks, has
been minted to identify the post itself, which is reproduced and described in corresponding RDF
and HTML documents.
The cc:license triple in the code associates the CC-BY-SA license with the blog post, using
the vocabulary that the Creative Commons has defined for this purpose15 .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix cc: <http://creativecommons.org/ns#> .

<http://biglynx.co.uk/blog/making-pacific-sharks>
    rdf:type sioc:Post ;
    dcterms:title "Making Pacific Sharks" ;
    dcterms:date "2010-11-10T16:34:15Z"^^xsd:dateTime ;
    sioc:has_container <http://biglynx.co.uk/blog/> ;
    sioc:has_creator <http://biglynx.co.uk/people/matt-briggs> ;
    foaf:primaryTopic <http://biglynx.co.uk/productions/pacific-sharks> ;
    sioc:topic <http://biglynx.co.uk/productions/pacific-sharks> ;
    sioc:topic <http://biglynx.co.uk/locations/pacific-ocean> ;
    sioc:topic <http://biglynx.co.uk/species/shark> ;
    sioc:content "Day one of the expedition was a shocker: monumental swell, tropical storms, and not a shark in sight. The Pacific was up to its old tricks again. I wasn't holding out hope of filming in the following 48 hours, when the unexpected happened..." ;
    foaf:isPrimaryTopicOf <http://biglynx.co.uk/blog/making-pacific-sharks.rdf> ;
    foaf:isPrimaryTopicOf <http://biglynx.co.uk/blog/making-pacific-sharks.html> ;
    cc:license <http://creativecommons.org/licenses/by-sa/3.0/> .

<http://biglynx.co.uk/blog/making-pacific-sharks.rdf>
    rdf:type foaf:Document ;
    dcterms:title "Making Pacific Sharks (RDF version)" ;
    foaf:primaryTopic <http://biglynx.co.uk/blog/making-pacific-sharks> .

<http://biglynx.co.uk/blog/making-pacific-sharks.html>
    rdf:type foaf:Document ;
    dcterms:title "Making Pacific Sharks (HTML version)" ;
    foaf:primaryTopic <http://biglynx.co.uk/blog/making-pacific-sharks> .

14 http://creativecommons.org/licenses/by-sa/3.0/rdf
15 http://creativecommons.org/ns#
Note how the license has been applied to the blog post itself, not the documents in which
it is reproduced. This is because the documents in which the post is reproduced contain both
copyrightable and non-copyrightable material (e.g., the post content and the date of publication,
respectively) and therefore cannot be covered meaningfully by the license applied to the blog post
itself.
16 http://vocab.org/waiver/terms/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://purl.org/vocab/relationship/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix wv: <http://vocab.org/waiver/terms/> .

<http://biglynx.co.uk/people/dave-smith.rdf>
    foaf:primaryTopic <http://biglynx.co.uk/people/dave-smith> ;
    rdf:type foaf:PersonalProfileDocument ;
    rdfs:label "Dave Smith's Personal Profile in RDF" ;
    dcterms:creator <http://biglynx.co.uk/people/nelly-jones> ;
    dcterms:publisher <http://biglynx.co.uk/company.rdf#company> ;
    dcterms:date "2010-11-05"^^xsd:date ;
    dcterms:isPartOf <http://biglynx.co.uk/datasets/people> ;
    wv:waiver <http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/> ;
    wv:norms <http://www.opendatacommons.org/norms/odc-by-sa/> .

<http://biglynx.co.uk/people/dave-smith>
    foaf:isPrimaryTopicOf <http://biglynx.co.uk/people/dave-smith.rdf> ;
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" ;
    foaf:based_near <http://sws.geonames.org/3333125/> ;
    foaf:based_near <http://dbpedia.org/resource/Birmingham> ;
    foaf:topic_interest <http://dbpedia.org/resource/Wildlife_photography> ;
    foaf:knows <http://dbpedia.org/resource/David_Attenborough> ;
    rel:employerOf <http://biglynx.co.uk/people/matt-briggs> .

<http://biglynx.co.uk/people/matt-briggs>
    rel:employedBy <http://biglynx.co.uk/people/dave-smith> .
Note that in cases where an entire data set contains purely non-copyrightable material it would
also be prudent to explicitly apply the waiver to the entire data set, in addition to the individual RDF
documents that make up that data set.
The terms prod:Production and prod:Director are declared to be classes by typing them as
instances of rdfs:Class. The term prod:director is declared to be a property by typing it as an
instance of rdf:Property.
4.4. CHOOSING AND USING VOCABULARIES 59
The staff profile of Matt Briggs, the Lead Cameraman at Big Lynx, shows how these terms
can be used to describe Matt Briggs's role as a director (Line 17) and, specifically, as director of
Pacific Sharks (Line 22):
1  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
2  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
3  @prefix dcterms: <http://purl.org/dc/terms/> .
4  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
5  @prefix prod: <http://biglynx.co.uk/vocab/productions#> .
6
7  <http://biglynx.co.uk/people/matt-briggs.rdf>
8      foaf:primaryTopic <http://biglynx.co.uk/people/matt-briggs> ;
9      rdf:type foaf:PersonalProfileDocument ;
10     rdfs:label "Matt Briggs' Personal Profile in RDF" ;
11     dcterms:creator <http://biglynx.co.uk/people/nelly-jones> .
12
13 <http://biglynx.co.uk/people/matt-briggs>
14     foaf:isPrimaryTopicOf <http://biglynx.co.uk/people/matt-briggs.rdf> ;
15     rdf:type
16         foaf:Person ,
17         prod:Director ;
18     foaf:name "Matt Briggs" ;
19     foaf:based_near <http://sws.geonames.org/3333125/> ;
20     foaf:based_near <http://dbpedia.org/resource/Birmingham> ;
21     foaf:topic_interest <http://dbpedia.org/resource/Wildlife_photography> ;
22     prod:directed <http://biglynx.co.uk/productions/pacific-sharks> .
Use of both of these properties when defining RDFS vocabularies is recommended, as they
provide guidance to potential users of the vocabulary and are relied upon by many Linked Data
applications when displaying data. In addition to being used to annotate terms in RDFS vocabularies,
these properties are also commonly used to provide labels and descriptions for other types of RDF
resources.
rdfs:subClassOf is used to state that all the instances of one class are also instances of another.
In the example above, prod:Director is declared to be a subclass of the Person class from the
Friend of a Friend (FOAF) ontology. This has the implication that all instances of the class
prod:Director are also instances of the class foaf:Person.
rdfs:subPropertyOf is used to state that resources related by one property are also related
by another. In the example vocabulary above, the property prod:directed is a subproperty of
foaf:made, meaning that a director who directed a production also made that production.
rdfs:domain is used to state that any resource that has a given property is an instance of one
or more classes. The domain of the prod:director property defined above is declared as
prod:Production, meaning that all resources which are described using the prod:director
property are instances of the class prod:Production.
rdfs:range is used to state that all values of a property are instances of one or more classes.
In the example above, the range of the prod:director property is declared as prod:Director.
Therefore, a triple stating <a> prod:director <b> implies that <b> is an instance of the class
prod:Director.
By using these relational primitives, the author of an RDFS vocabulary implicitly defines rules
that allow additional information to be inferred from RDF graphs. For instance, the rule that all
directors are also people enables the triple <http://biglynx.co.uk/people/matt-briggs> rdf:type
foaf:Person to be inferred from the triple <http://biglynx.co.uk/people/matt-briggs> rdf:type
prod:Director. The result is that not all relations need to be created explicitly in the original data
set, as many can be inferred from axioms in the vocabulary. This can simplify the management
of data in Linked Data applications without compromising the comprehensiveness of a data set.
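To make the mechanics of this rule concrete, the following Python sketch computes the rdf:type closure entailed by rdfs:subClassOf axioms. It is an illustrative toy, not how production RDFS reasoners are implemented, and the string-based triple representation is invented for brevity:

```python
# Toy RDFS inference: repeatedly apply the rdfs:subClassOf rule
# "if ?s rdf:type ?sub and ?sub rdfs:subClassOf ?sup, then ?s rdf:type ?sup"
# until no new triples can be added (a fixpoint).

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def rdfs_type_closure(triples):
    """Return the input triples plus all rdf:type triples entailed
    by the rdfs:subClassOf axioms they contain."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        axioms = {(s, o) for (s, p, o) in inferred if p == SUBCLASS}
        for (s, p, o) in list(inferred):
            if p != RDF_TYPE:
                continue
            for (sub, sup) in axioms:
                if o == sub and (s, RDF_TYPE, sup) not in inferred:
                    inferred.add((s, RDF_TYPE, sup))
                    changed = True
    return inferred

triples = {
    ("prod:Director", SUBCLASS, "foaf:Person"),
    ("<http://biglynx.co.uk/people/matt-briggs>", RDF_TYPE, "prod:Director"),
}
closure = rdfs_type_closure(triples)
print(("<http://biglynx.co.uk/people/matt-briggs>", RDF_TYPE, "foaf:Person") in closure)
# prints True
```

Real reasoners implement these semantics far more efficiently, but the fixpoint idea is the same.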
The Dublin Core Metadata Initiative (DCMI) Metadata Terms vocabulary18 defines general
metadata attributes such as title, creator, date and subject.
The Friend-of-a-Friend (FOAF) vocabulary19 defines terms for describing persons, their
activities and their relations to other people and objects.
The Semantically-Interlinked Online Communities (SIOC) vocabulary20 defines terms for
describing online communities such as blogs, forums and mailing lists.
The Description of a Project (DOAP) vocabulary21 (pronounced "dope") defines terms for
describing software projects, particularly those that are Open Source.
The Music Ontology 22 defines terms for describing various aspects related to music, such as
artists, albums, tracks, performances and arrangements.
The Programmes Ontology 23 defines terms for describing programmes such as TV and radio
broadcasts.
The Good Relations Ontology 24 defines terms for describing products, services and other
aspects relevant to e-commerce applications.
The Creative Commons (CC) schema25 defines terms for describing copyright licenses in
RDF.
The Bibliographic Ontology (BIBO)26 provides concepts and properties for describing citations
and bibliographic references (i.e., quotes, books, articles, etc.).
18 http://dublincore.org/documents/dcmi-terms/
19 http://xmlns.com/foaf/spec/
20 http://rdfs.org/sioc/spec/
21 http://trac.usefulinc.com/doap
22 http://musicontology.com/
23 http://www.bbc.co.uk/ontologies/programmes/2009-09-07.shtml
24 http://purl.org/goodrelations/
25 http://creativecommons.org/ns#
26 http://bibliontology.com/
The OAI Object Reuse and Exchange vocabulary27 is used by various library and publication
data sources to represent resource aggregations such as different editions of a document or its
internal structure.
The Review Vocabulary 28 provides a vocabulary for representing reviews and ratings, as are
often applied to products and services.
The Basic Geo (WGS84) vocabulary29 defines terms such as lat and long for describing
geographically-located things.
There will always be cases where new terms need to be developed to describe aspects of
a particular data set [22], in which case these terms should be mapped to related terms in
well-established vocabularies, as discussed in 2.5.3.
Where newly defined terms are specialisations of existing terms, there is an argument for
using both terms in tandem when publishing data. For example, in the Big Lynx scenario, Nelly may
decide to explicitly add RDF triples to Matt Briggs's profile stating that he is a foaf:Person as well as
a prod:Director, even though this could be inferred by a reasoner based on the relationships defined
in the Big Lynx Productions vocabulary. This practice can be seen as an instance of the Materialize
Inferences pattern30, and while it introduces an element of redundancy, it also maximises the
accessibility of the data to Linked Data applications that do not employ reasoning engines.
1. Usage and uptake: is the vocabulary in widespread usage? Will using this vocabulary make
a data set more or less accessible to existing Linked Data applications?
4. Expressivity: is the degree of expressivity in the vocabulary appropriate to the data set and
application scenario? Is it too expressive, or not expressive enough?
3. Use terms from RDFS and OWL to relate new terms to those in existing vocabularies.
4. Apply the Linked Data principles equally rigorously to vocabularies as to data sets: URIs of
terms should be dereferenceable so that Linked Data applications can look up their definition [23].
5. Document each new term with human-friendly labels and comments: rdfs:label and
rdfs:comment are designed for this purpose.
6. Only define things that matter: for example, defining domains and ranges helps clarify how
properties should be used, but over-specifying a vocabulary can also produce unexpected inferences
when the data is consumed. You should therefore not overload vocabularies with ontological
axioms; it is better to define terms rather loosely (for instance, by using only the RDFS and OWL
terms introduced above).
A number of tools are available to assist with the vocabulary development process:
Neologism36 is a Web-based tool for creating, managing and publishing simple RDFS
vocabularies. It is open source and implemented in PHP on top of the Drupal platform.
2. Is the vocabulary well maintained and properly published with dereferenceable URIs?
A list of widely used vocabularies from which linking properties can be chosen is given in
Section 4.4.4, as well as in Section 2.4 of the State of the LOD Cloud document44. If very
specific or proprietary terms are used for linking, they should be linked to more generic terms using
rdfs:subPropertyOf mappings, as described in 2.5.3 and 4.4.4, as this enables client applications to
translate them into a recognised vocabulary.
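The kind of translation such a mapping enables can be sketched as follows. This is a hypothetical illustration (the string-based triple representation and helper function are invented, not taken from any library), using the prod:directed rdfs:subPropertyOf foaf:made mapping discussed earlier:

```python
# Toy illustration: rewrite triples that use specific linking properties
# into equivalents using their declared rdfs:subPropertyOf super-properties,
# so clients that only understand the generic vocabulary can use the data.

SUBPROPERTY_MAPPINGS = {
    # specific property -> generic super-property (as stated in the vocabulary)
    "prod:directed": "foaf:made",
}

def add_generic_triples(triples, mappings):
    """Return the input triples plus one generic triple for each triple
    whose predicate has a declared super-property."""
    result = set(triples)
    for (s, p, o) in triples:
        if p in mappings:
            result.add((s, mappings[p], o))
    return result

data = {("<http://biglynx.co.uk/people/matt-briggs>", "prod:directed",
         "<http://biglynx.co.uk/productions/pacific-sharks>")}
translated = add_generic_triples(data, SUBPROPERTY_MAPPINGS)
```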
Silk - Link Discovery Framework [111]. Silk provides a flexible, declarative language for
specifying matching heuristics. Matching heuristics may combine different string, numeric, and
geographic matchers. Silk enables data values to be transformed before they
are used in the matching process and allows similarity scores to be aggregated using various
aggregation functions. Silk can match local as well as remote data sets accessed via
the SPARQL protocol. Matching tasks that require a large number of comparisons can be
handled either by using different blocking features or by running Silk on a Hadoop cluster.
Silk is available under the Apache License and can be downloaded from the project website49.
LIMES - Link Discovery Framework for Metric Spaces [44]. LIMES implements a fast and
lossless approach to large-scale link discovery based on the characteristics of metric spaces, but
provides a less expressive language for specifying matching heuristics. Detailed information
about LIMES can be found on the project website50.
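Neither Silk's nor LIMES's configuration syntax is reproduced here, but the general shape of a matching heuristic can be illustrated with a self-contained Python sketch: compare labels from two data sets with a string-similarity measure and emit owl:sameAs candidates above a threshold. All URIs and labels below are invented for illustration:

```python
# Toy link discovery heuristic: normalised string similarity over labels.
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1] between two labels, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def discover_links(source, target, threshold=0.7):
    """Compare every source resource with every target resource and
    return candidate owl:sameAs links above the similarity threshold."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if similarity(s_label, t_label) >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links

source = {"http://example.org/a/birmingham": "Birmingham"}
target = {
    "http://example.org/b/birmingham-uk": "Birmingham (UK)",
    "http://example.org/b/bristol": "Bristol",
}
links = discover_links(source, target)
```

Frameworks such as Silk additionally support value transformations, score aggregation, and blocking to avoid the quadratic comparison shown here.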
In addition to the tools above, which rely on users explicitly specifying the matching heuristic,
there are also tools available which learn the matching heuristic directly from the data. Examples of
such tools include RiMOM51 , idMash [77], and ObjectCoref52 .
The advantage of learning matching heuristics is that the systems do not need to be manually
configured for each type of link to be created between data sets. The disadvantage is that
machine learning-based approaches typically have lower precision than approaches that
rely on domain knowledge provided by humans in the form of a matching description. The Instance
Matching Track within the Ontology Alignment Evaluation Initiative 201053 compared the quality of
links produced by different learning-based tools. The evaluation revealed precision values
between 0.6 and 0.97, and showed that the quality of the resulting links depends heavily on the
specific linking task.
A task related to link generation is the maintenance of links over time as data sources change.
There are various proposals for notification mechanisms to handle this task, an overview of which
is given in [109]. In [87], the authors propose DSNotify, a framework that monitors Linked Data
sources and informs consuming applications about changes. More information about link discovery
tools, and an up-to-date list of references, is maintained by the LOD community at54.
49 http://www4.wiwiss.fu-berlin.de/bizer/silk/
50 http://aksw.org/Projects/limes
51 http://keg.cs.tsinghua.edu.cn/project/RiMOM/
52 http://ws.nju.edu.cn/services/ObjectCoref
53 http://www.dit.unitn.it/p2p/OM-2010/oaei10_paper0.pdf
54 http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining
CHAPTER 5
Recipes for Publishing Linked Data
Where structured data exists in queryable form behind a custom API (such as the Flickr or
Amazon Web APIs, or a local application or operating system API), the situation is a little more
complex, as a custom wrapper will likely need to be developed according to the specifics of the API in
question. However, examples such as the RDF Book Mashup [29] demonstrate that such wrappers
can be implemented in relatively trivial amounts of code, much of which can likely be componentised
for reuse across wrappers. The wrapper pattern is described in more detail in Section 5.2.6.
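As a sketch of the wrapper pattern, a wrapper essentially maps each record returned by an API onto RDF triples with a minted subject URI. The JSON response shape, field names, and URIs below are all invented for illustration, not taken from any real API:

```python
import json

def record_to_ntriples(record_json, base_uri="http://example.org/books/"):
    """Convert one record from a hypothetical book-lookup API into
    N-Triples, minting a subject URI from the record's identifier."""
    record = json.loads(record_json)
    subject = "<%s%s>" % (base_uri, record["id"])
    triples = [
        '%s <http://purl.org/dc/terms/title> "%s" .' % (subject, record["title"]),
        '%s <http://purl.org/dc/terms/creator> "%s" .' % (subject, record["author"]),
    ]
    return "\n".join(triples)

# hypothetical API response
api_response = '{"id": "1234", "title": "An Example Book", "author": "A. N. Author"}'
print(record_to_ntriples(api_response))
```

A production wrapper would also handle pagination, errors, and proper escaping of literal values, but the core mapping step is no more complex than this.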
a person creates and maintains relatively small RDF files manually, e.g., when publishing
RDFS vocabularies or personal profiles in RDF
The majority of examples in this book are shown in the Turtle serialisation of RDF, for
readability; however, if data is published using just one serialisation format, this should be RDF/XML,
as it is widely supported by tools that consume Linked Data.
In the case of Big Lynx, serving a static RDF/XML file is the perfect recipe for publishing
the company profile as Linked Data. The code sample below shows what this company profile looks
like, converted to the Turtle serialisation of RDF.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sme: <http://biglynx.co.uk/vocab/sme#> .

<http://biglynx.co.uk/company.rdf#company>
    rdf:type sme:SmallMediumEnterprise ;
    foaf:name "Big Lynx Productions Ltd" ;
    rdfs:label "Big Lynx Productions Ltd" ;
    dcterms:description "Big Lynx Productions Ltd is an independent television production company based near Birmingham, UK, and recognised worldwide for its pioneering wildlife documentaries" ;
    foaf:based_near <http://sws.geonames.org/3333125/> ;
    sme:hasTeam <http://biglynx.co.uk/teams/management> ;
    sme:hasTeam <http://biglynx.co.uk/teams/production> ;
    sme:hasTeam <http://biglynx.co.uk/teams/web> .
5.2. THE RECIPES 73
17 < h t t p : / / b i g l y n x . co . uk / t e a m s / management > ;
18 r d f : t y p e sme:Team ;
19 r d f s : l a b e l " The Big Lynx Management Team " ;
20 s m e : l e a d e r < h t t p : / / b i g l y n x . co . uk / p e o p l e / dave s m i t h > ;
21 s m e :i sT e a m O f < h t t p : / / b i g l y n x . co . uk / company . r d f # company > .
22
23 < h t t p : / / b i g l y n x . co . uk / t e a m s / p r o d u c t i o n > ;
24 r d f : t y p e sme:Team ;
25 r d f s : l a b e l " The Big Lynx P r o d u c t i o n Team " ;
26 s m e : l e a d e r < h t t p : / / b i g l y n x . co . uk / p e o p l e / mattb r i g g s > ;
27 s m e :i sT e a m O f < h t t p : / / b i g l y n x . co . uk / company . r d f # company > .
28
29 < h t t p : / / b i g l y n x . co . uk / t e a m s / web > ;
30 r d f : t y p e sme:Team ;
31 r d f s : l a b e l " The Big Lynx Web Team " ;
32 s m e : l e a d e r < h t t p : / / b i g l y n x . co . uk / p e o p l e / n e l l y j o n e s > ;
33 s m e :i sT e a m O f < h t t p : / / b i g l y n x . co . uk / company . r d f # company > .
This tells Apache to serve files with an .rdf extension using the correct MIME type for
RDF/XML. This implies that files have to be named with the .rdf extension.
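The Apache directive referred to here is not reproduced in this excerpt; a typical form, assuming a standard mod_mime setup, is:

```apache
# Serve files ending in .rdf with the RDF/XML media type
AddType application/rdf+xml .rdf
```

With such a line in place, a request for company.rdf is answered with a Content-Type of application/rdf+xml.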
The following lines can also be added at the same time, to ensure the server is properly configured to serve RDF data in its N3 and Turtle serialisations6:
AddType text/n3;charset=utf-8 .n3
AddType text/turtle;charset=utf-8 .ttl
It should be noted that this technique can be applied in all publishing scenarios and should
be used throughout a Web site to aid discovery of data.
6 Guidance on correct media types for N3 and Turtle is taken from http://www.w3.org/2008/01/rdf-media-types
7 http://patterns.dataincubator.org/book/autodiscovery.html
<meta property="dcterms:creator" content="Nelly Jones" />
<link rel="rdf:type" href="foaf:Document" />
<link rel="foaf:topic" href="#company" />
</head>
<body>
<h1 about="#company" typeof="sme:SmallMediumEnterprise" property="foaf:name"
    rel="foaf:based_near" resource="http://sws.geonames.org/3333125/">Big Lynx Productions Ltd</h1>
<div about="#company" property="dcterms:description">Big Lynx Productions Ltd is an independent television production company based near Birmingham, UK, and recognised worldwide for its pioneering wildlife documentaries</div>
<h2>Teams</h2>
<ul about="#company">
  <li rel="sme:hasTeam">
    <div about="http://biglynx.co.uk/teams/management" typeof="sme:Team">
      <a href="http://biglynx.co.uk/teams/management"
         property="rdfs:label">The Big Lynx Management Team</a>
      <span rel="sme:isTeamOf" resource="#company"></span>
      <span rel="sme:leader"
            resource="http://biglynx.co.uk/people/dave-smith"></span>
    </div>
  </li>
  <li rel="sme:hasTeam">
    <div about="http://biglynx.co.uk/teams/production" typeof="sme:Team">
      <a href="http://biglynx.co.uk/teams/production"
         property="rdfs:label">The Big Lynx Production Team</a>
      <span rel="sme:isTeamOf" resource="#company"></span>
      <span rel="sme:leader"
            resource="http://biglynx.co.uk/people/matt-briggs"></span>
    </div>
  </li>
  <li rel="sme:hasTeam">
    <div about="http://biglynx.co.uk/teams/web" typeof="sme:Team">
      <a href="http://biglynx.co.uk/teams/web" property="rdfs:label">The Big
        Lynx Web Team</a>
      <span rel="sme:isTeamOf" resource="#company"></span>
      <span rel="sme:leader"
            resource="http://biglynx.co.uk/people/nelly-jones"></span>
    </div>
  </li>
</ul>
</body>
</html>
This RDFa produces the following Turtle output (reformatted slightly for readability) when passed through the RDFa Distiller and Parser8:
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sme: <http://biglynx.co.uk/vocab/sme#> .

<http://biglynx.co.uk/company.html>
    a foaf:Document ;
    dcterms:creator "Nelly Jones"@en ;
    dcterms:title "About Big Lynx Productions Ltd"@en ;
    foaf:topic <http://biglynx.co.uk/company.html#company> .

<http://biglynx.co.uk/company.html#company>
    a sme:SmallMediumEnterprise ;
    sme:hasTeam
        <http://biglynx.co.uk/teams/management> ,
        <http://biglynx.co.uk/teams/production> ,
        <http://biglynx.co.uk/teams/web> ;
    dcterms:description "Big Lynx Productions Ltd is an independent television production company based near Birmingham, UK, and recognised worldwide for its pioneering wildlife documentaries"@en ;
    foaf:based_near <http://sws.geonames.org/3333125/> ;
    foaf:name "Big Lynx Productions Ltd"@en .

<http://biglynx.co.uk/teams/management>
    a sme:Team ;
    rdfs:label "The Big Lynx Management Team"@en ;
    sme:isTeamOf <http://biglynx.co.uk/company.html#company> ;
    sme:leader <http://biglynx.co.uk/people/dave-smith> .

<http://biglynx.co.uk/teams/production>
    a sme:Team ;
    rdfs:label "The Big Lynx Production Team"@en ;
    sme:isTeamOf <http://biglynx.co.uk/company.html#company> ;
    sme:leader <http://biglynx.co.uk/people/matt-briggs> .

<http://biglynx.co.uk/teams/web>
    a sme:Team ;
    rdfs:label "The Big Lynx Web Team"@en ;
    sme:isTeamOf <http://biglynx.co.uk/company.html#company> ;
    sme:leader <http://biglynx.co.uk/people/nelly-jones> .

8 http://www.w3.org/2007/08/pyRdfa/
Note how the URI identifying Big Lynx has changed to http://biglynx.co.uk/company.html#company because the URI of the document in which it is defined has changed.
RDFa can be particularly useful in situations where publishing to the Web makes extensive use
of existing templates, as these can be extended to include RDFa output. This makes RDFa a com-
mon choice for adding Linked Data support to content management systems and Web publishing
frameworks such as Drupal9 , which includes RDFa publishing support in version 7.
Care should be taken when adding RDFa support to HTML documents and templates, to
ensure that the elements added produce the intended RDF triples. The complexity of this task
increases with the complexity of the HTML markup in the document. Frequent use of the RDFa
Distiller and Parser 10 and inspection of its output can help ensure the correct markup is added.
9 http://drupal.org/
10 http://www.w3.org/2007/08/pyRdfa/
5.2.3 SERVING RDF AND HTML WITH CUSTOM SERVER-SIDE SCRIPTS
In many Web publishing scenarios, the site owner or developer will have a series of custom server-
side scripts for generating HTML pages and may wish to add Linked Data support to the site.
This situation applies to the Big Lynx blogging software, which is powered by a series of custom
PHP scripts that query a relational database and output the blog posts in HTML, at URIs such as
http://biglynx.co.uk/blog/making-pacific-sharks.html.
Nelly considered enhancing these scripts to publish RDFa describing the blog posts, but was
concerned that invalid markup entered in the body of blog posts by Big Lynx staff may make this
data less consumable by RDFa-aware tools. Therefore, she decided to supplement the HTML-
generating scripts with equivalents publishing Linked Data in RDF/XML. These scripts run the
same database queries, and output the data in RDF/XML rather than HTML. This is achieved
with the help of the ARC library for working with RDF in PHP11 , which avoids Nelly having to
write an RDF/XML formatter herself. The resulting RDF documents are published at URIs such
as http://biglynx.co.uk/blog/making-pacific-sharks.rdf.
A key challenge for Nelly at this stage is to ensure the RDF output can be classed as Linked
Data, by including outgoing links to other resources within the Big Lynx data sets. This may involve
mapping data returned from the relational database (e.g., names of productions or blog post authors)
to known URI templates within the Big Lynx namespace.
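One way to sketch this mapping step in code (the slug rules and URI templates below are illustrative assumptions, not Big Lynx's actual scheme):

```python
# Map names from the relational database to URIs in the Big Lynx
# namespace. The slugification rules and URI templates are assumptions
# for illustration only.
import re

URI_TEMPLATES = {
    "person": "http://biglynx.co.uk/people/%s",
    "production": "http://biglynx.co.uk/productions/%s",
}

def slugify(name):
    """Lower-case a name and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def to_uri(kind, name):
    """Turn a database value into a URI using the template for its kind."""
    return URI_TEMPLATES[kind] % slugify(name)

print(to_uri("person", "Dave Smith"))
```

Applying the same deterministic rule on every script guarantees that a blog post and a team page referring to the same person produce the same URI.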
To complete the process of Linked Data-enabling the Big Lynx blog, Nelly must
make dereferenceable URIs for the blog posts themselves (as distinct from the
HTML and RDF documents that describe the post). These URIs will take the form
http://biglynx.co.uk/blog/making-pacific-sharks, as introduced in Section 4.3.3.
As these URIs follow the 303 URI pattern (see Section 2.3.1), Nelly must create a script that
responds to attempts to dereference these URIs by detecting the requested content type (specified
in the Accept: header of the HTTP request) and performing a 303 redirect to the appropriate
document. This is easily achieved using a scripting language such as PHP. Various code samples
demonstrating this process are available at 12 .
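The decision logic of such a script can be sketched as follows, here in Python rather than PHP and with a deliberately simplified Accept header check (real content negotiation would weigh q-values):

```python
# Sketch of the 303 redirect logic for a resource URI such as
# http://biglynx.co.uk/blog/making-pacific-sharks. Simplified: a real
# implementation would parse q-values rather than substring-match.
def redirect_target(resource_path, accept_header):
    """Return the document URI a 303 See Other response should point at."""
    if "application/rdf+xml" in accept_header:
        return resource_path + ".rdf"
    return resource_path + ".html"  # default to the HTML variant

print(redirect_target("/blog/making-pacific-sharks", "application/rdf+xml"))
```

The value returned here would be sent in the Location header of the 303 response.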
In fact, Nelly decides to use a mod_rewrite13 rule on her server to catch all requests for blog entries and related documents, and pass these to one central script. This script then detects the nature of the request and either performs content negotiation and a 303 redirect, or calls the scripts that serve the appropriate HTML or RDF documents. This final step is entirely optional, however, and can be omitted in favour of more PHP scripts if mod_rewrite is not available on the Web server.
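One possible shape for such a rewrite rule (the paths and dispatcher script name are assumptions for illustration):

```apache
RewriteEngine On
# Send all blog-related requests to one central dispatcher script,
# which content-negotiates and either 303-redirects or serves the
# requested HTML or RDF document itself.
RewriteRule ^blog/(.*)$ /dispatcher.php?request=$1 [L,QSA]
```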
Using D2R Server to publish a relational database as Linked Data typically involves the
following steps:
1. Download and install the server software as described in the Quick Start15 section of the D2R
Server homepage.
2. Have D2R Server auto-generate a D2RQ mapping from the schema of your database.
3. Customize the mapping by replacing auto-generated terms with terms from well-known and
publicly accessible RDF vocabularies (see Section 4.4.4).
4. Set RDF links pointing at external data sources as described in Section 4.5.
5. Set several RDF links from an existing interlinked data source (for instance, your FOAF
profile) to resources within the new data set, to ensure crawlers can discover the data.
14 http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/index.html
15 http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/index.html#quickstart
6. Add the new data source to the CKAN registry in the group LOD Cloud as described in
Section 3.1.
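Step 3 above amounts to editing the generated D2RQ mapping file. A fragment might look roughly like this; the table and column names are hypothetical:

```turtle
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix map:  <#> .

# Expose rows of a hypothetical "persons" table as foaf:Person
# resources instead of the auto-generated class.
map:Persons a d2rq:ClassMap ;
    d2rq:dataStorage map:database ;
    d2rq:uriPattern "people/@@persons.id@@" ;
    d2rq:class foaf:Person .

# Publish the name column using the widely deployed foaf:name term.
map:PersonsName a d2rq:PropertyBridge ;
    d2rq:belongsToClassMap map:Persons ;
    d2rq:property foaf:name ;
    d2rq:column "persons.name" .
```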
In addition to D2R Server, the following tools enable relational databases to be published as
Linked Data:
OpenLink Virtuoso16 provides the Virtuoso RDF Views 17 Linked Data wrapper.
Triplify18 is a small plugin for Web applications, which allows you to map the results of SQL
queries into RDF, JSON and Linked Data.
The W3C RDB2RDF Working Group19 is currently working on a standard language to
express relational database to RDF mappings. Once this language is finished, it might replace the
solutions described above.
1. They assign HTTP URIs to the resources about which the API provides data.
2. When one of these URIs is dereferenced asking for application/rdf+xml, the wrapper rewrites the client's request into a request against the underlying API.
3. The results of the API request are transformed to RDF and sent back to the client.
This can be a simple and effective mechanism for exposing new sources of Linked Data.
However, care should be taken to ensure adequate outgoing links are created from the wrapped data
set, and that the individual or organisation hosting the wrapper has the rights to republish data from
the API in this way.
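The three steps above can be sketched as follows; the API shape, example vocabulary, and URIs are hypothetical, and the network call is stubbed out:

```python
# Sketch of a Linked Data wrapper around a hypothetical JSON API.
# Step 1: the wrapper owns HTTP URIs such as /books/{id}.
# Step 2: a dereference for application/rdf+xml is rewritten into a
# call against the underlying API. Step 3: the result becomes RDF.

def fetch_from_api(book_id):
    """Stand-in for the underlying API call (normally an HTTP request)."""
    return {"id": book_id, "title": "Pacific Sharks", "year": 2010}

def to_rdf_triples(record):
    """Transform an API record into RDF triples (as plain tuples)."""
    subject = "http://example.org/books/%s" % record["id"]
    return [
        (subject, "http://purl.org/dc/terms/title", record["title"]),
        (subject, "http://purl.org/dc/terms/date", str(record["year"])),
    ]

def dereference(book_id):
    """Handle a dereference request for a book URI: steps 2 and 3."""
    return to_rdf_triples(fetch_from_api(book_id))

for triple in dereference("42"):
    print(triple)
```

A production wrapper would additionally serialise the triples as RDF/XML and set outgoing links into other data sets, as the text above cautions.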
RDF:Alerts dereferences not only given URIs but also retrieves the definitions of vocabulary terms and checks whether data complies with rdfs:range and rdfs:domain as well as datatype restrictions given in these definitions. The RDF:Alerts validator is available at http://swse.deri.org/RDFAlerts/
Sindice Inspector can be used to visualize and validate RDF files, HTML pages embedding microformats, and XHTML pages embedding RDFa. The validator performs reasoning and checks for common errors as observed in RDF data found on the Web. The Sindice Inspector is available at http://inspector.sindice.com/
Various other tools exist that enable more manual validation and debugging of Linked Data
publishing infrastructure. The command line tool cURL28 can be very useful in validating the
correct operation of 303 redirects used with 303 URIs (see Section 2.3.1), as described in the tutorial
Debugging Semantic Web sites with cURL29 .
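The commands involved might look like the following, to be run manually against a published URI (the blog URI is the example used earlier in this chapter):

```shell
# Request the RDF/XML variant and show only the response headers (-I);
# a correctly configured server should answer "303 See Other" with a
# Location header pointing at the RDF document:
curl -I -H "Accept: application/rdf+xml" http://biglynx.co.uk/blog/making-pacific-sharks

# Request HTML instead; the Location header should now point at the
# HTML document:
curl -I -H "Accept: text/html" http://biglynx.co.uk/blog/making-pacific-sharks
```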
The Firefox browser extensions LiveHTTPHeaders 30 and ModifyHeaders 31 provide convenient
GUIs for making HTTP requests with modified headers, and assessing the response from a server.
A more qualitative approach, which complements more technical debugging and validation, is to test whether a data set can be fully navigated using a Linked Data browser. For example, RDF links may be served that point from one resource to another, but without incoming links that connect the second resource back to the first. Consequently, using a Linked Data browser it may only be possible to navigate deeper into the data set but not to return to the starting point. Testing for this possibility with a Linked Data browser will also highlight whether Linked Data crawlers can reach the entirety of the data set for indexing.
The following Linked Data browsers are useful starting points for testing:
Tabulator32 . If Tabulator takes some time to display data, it may indicate that the RDF documents
being served are too large, and may benefit from splitting into smaller fragments, or from
omission of data that may be available elsewhere (e.g., triples describing resources which are
not the primary subject of the document). Tabulator also performs some basic inferencing
over data it consumes, without checking this for consistency. Therefore, unpredictable results
28 http://curl.haxx.se/
29 http://dowhatimean.net/2007/02/debugging-semantic-web-sites-with-curl
30 https://addons.mozilla.org/af/firefox/addon/3829/
31 https://addons.mozilla.org/af/firefox/addon/967/
32 http://www.w3.org/2005/ajar/tab
when using this browser may indicate issues with rdfs:subClassOf and rdfs:subPropertyOf
declarations in the RDFS and OWL schemas used in your data.
Marbles33 . This browser uses a two second time-out when retrieving data from the Web. Therefore,
if the browser does not display data correctly it may indicate that the host server is too slow in
responding to requests.
References to further Linked Data browsers that can be used to test publishing infrastructure are given in Section 6.1.1. Alternatively, the LOD Browser Switch34 can be used to dereference URIs from a data set within different Linked Data browsers.
Beside validating that Linked Data is published correctly from a technical perspective, the
data should be made as self-descriptive as possible, to maximise its accessibility and utility. The
following section presents a checklist that can be used to validate that a data set complies with the
various Linked Data best practices. An analysis of common errors and flaws of existing Linked Data
sources is presented in [64] and provides a valuable source of negative examples.
1. Does your data set link to other data sets? RDF links connect data from different sources into a single global RDF graph and enable Linked Data browsers and crawlers to navigate between data sources. Thus, your data set should set RDF links pointing at other data sources [16].
2. Do you provide provenance metadata? In order to enable applications to be sure about the origin of data, as well as to assess its quality, data sources should publish provenance metadata together with the primary data (see Section 4.3).
3. Do you provide licensing metadata? Web data should be self-descriptive concerning any restrictions that apply to its usage. A common way to express such restrictions is to attach a data license to published data, as described in Section 4.3.3. Doing so is essential to enable applications to use Web data on a secure legal basis.
4. Do you use terms from widely deployed vocabularies? In order to make it easier for appli-
cations to understand Linked Data, data providers should use terms from widely deployed
vocabularies to represent data wherever possible (see Section 4.4.4).
33 http://www5.wiwiss.fu-berlin.de/marbles/
34 http://browse.semanticweb.org/
5. Are the URIs of proprietary vocabulary terms dereferenceable? Data providers often define
proprietary terms that are used in addition to terms from widely deployed vocabularies. In order
to enable applications to automatically retrieve the definition of vocabulary terms from the
Web, the URIs identifying proprietary vocabulary terms should be made dereferenceable [23].
7. Do you provide data set-level metadata? In addition to making instance data self-descriptive, it is also desirable that data publishers provide metadata describing characteristics of complete data sets, for instance, the topic of a data set and more detailed statistics, as described in Section 4.3.1.
8. Do you refer to additional access methods? The primary way to publish Linked Data on the Web is to make the URIs that identify data items dereferenceable into RDF descriptions. In addition, various LOD data providers have chosen to provide two alternative means of access to their data: SPARQL endpoints and RDF dumps of the complete data set. If you do this, you should refer to these access points in your voiD description, as described in Section 5.3.
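Such a reference can be expressed in the voiD vocabulary roughly as follows; the dataset URI, endpoint, and dump location are illustrative:

```turtle
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://biglynx.co.uk/datasets/main> a void:Dataset ;
    dcterms:title "Big Lynx Linked Data" ;
    # Alternative access points alongside dereferenceable URIs:
    void:sparqlEndpoint <http://biglynx.co.uk/sparql> ;
    void:dataDump <http://biglynx.co.uk/dumps/all.nt.gz> .
```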
The State of the LOD Cloud document35 provides statistics about the extent to which deployed
Linked Data sources meet the guidelines given above.
35 http://lod-cloud.net/state
CHAPTER 6
Consuming Linked Data
3 http://www4.wiwiss.fu-berlin.de/bizer/ng4j/disco/
4 http://www.w3.org/2005/ajar/tab
5 http://marbles.sourceforge.net/
6 http://linksailor.com/
7 http://browse.semanticweb.org/
6.1. DEPLOYED LINKED DATA APPLICATIONS
Figure 6.1: The Marbles Linked Data browser displaying data about Tim Berners-Lee. The colored
dots indicate the data sources from which data was merged.
8 http://sig.ma/
9 http://iws.seu.edu.cn/services/falcons/documentsearch/
10 http://www.swse.org/
can enter keywords related to the item or topic in which they are interested, and the application
returns a list of results that may be relevant to the query.
However, rather than simply providing links from search results through to the source doc-
uments in which the queried keywords are mentioned, Linked Data search engines provide richer
interaction capabilities to the user which exploit the underlying structure of the data. For instance,
Falcons enables the user to filter search results by class and therefore limit the results to show, for ex-
ample, only persons or entities belonging to a specific subclass of person, such as athlete or politician.
Sig.ma, Falcons and SWSE provide summary views of the entity the user selects from the results
list, alongside additional structured data crawled from the Web and links to related entities.
The Sig.ma search engine applies vocabulary mappings to integrate Web data as well as
specific display templates to properly render data for human consumption. Figure 6.2 shows the
Sig.ma search engine displaying data about Richard Cyganiak that has been integrated from 20 data
sources. Another interesting aspect of the Sig.ma search engine is that it approaches the data quality challenges that arise in the open environment of the Web by enabling its users to choose the data sources from which the user's aggregated view is constructed. By removing low-quality data from their individual views, Sig.ma users collectively create ratings for data sources on the Web as a whole.
Figure 6.2: Sig.ma Linked Data search engine displaying data about Richard Cyganiak.
Figure 6.3: Google search results containing structured data in the form of Rich Snippets.
A search engine that focuses on answering complex queries over Web data is VisiNav 11 [54].
Queries are formulated by the user in an exploratory fashion and can be far more expressive than
queries that Google and Yahoo can currently answer. For instance, VisiNav answers the query "give
me the URLs of all blogs that are written by people that Tim Berners-Lee knows!" with a list of 54 correct
URLs. Google and Yahoo just return links to arbitrary web pages describing Tim Berners-Lee
himself.
While Sig.ma, VisiNav, SWSE and Falcons provide search capabilities oriented towards hu-
mans, another breed of services have been developed to serve the needs of applications built on top
of distributed Linked Data. These application-oriented indexes, such as Sindice 12 [108], Swoogle 13 ,
and Watson14 provide APIs through which Linked Data applications can discover RDF documents
on the Web that reference a certain URI or contain certain keywords.
The rationale for such services is that each new Linked Data application should not need
to implement its own infrastructure for crawling and indexing the complete Web of Data. Instead,
applications can query these indexes to receive pointers to potentially relevant RDF documents
which can then be retrieved and processed by the application itself. Despite this common theme,
these services have slightly different emphases. Sindice is oriented to providing access to documents
containing instance data, while the emphasis of Swoogle and Watson is on finding ontologies that
provide coverage of certain concepts relevant to a query.
A service that not only finds Web data but also helps developers to integrate it is uberblic15, which acts as a layer between data publishers and data consumers. The service
11 http://sw.deri.org/2009/01/visinav/
12 http://sindice.com/
13 http://swoogle.umbc.edu/
14 http://kmi-web05.open.ac.uk/Overview.html
15 http://uberblic.org/
consolidates and reconciles information into a central data repository, and provides access to this
repository through developer APIs.
It is interesting to note that traditional search engines like Google and Yahoo16 have also
started to use structured data from the Web within their applications. Google crawls RDFa and
microformat data describing people, products, businesses, organizations, reviews, recipes, and events.
It uses the crawled data to provide richer and more structured search results to its users in the form
of Rich Snippets17 . Figure 6.3 shows part of the Google search results for Fillmore San Francisco.
Below the title of the first result, it can be seen that Google knows about 752 reviews of the Fillmore
concert hall. The second Rich Snippet contains a listing of upcoming concerts at this concert hall.
Not only does Google use structured data from the Web to enrich search results, it has also
begun to use extracted data to directly answer simple factual questions18 . As is shown in Figure 6.4,
Google answers a query about the birth date of Catherine Zeta-Jones not with a list of links pointing
at Web pages, but provides the actual answer to the user: 25 September 1969. This highlights how
the major search engines have begun to evolve into answering engines which rely on structured data
from the Web.
Figure 6.4: Google result answering a query about the birth date of Catherine Zeta-Jones.
16 http://linkeddata.future-internet.eu/images/5/54/Mika_FI_Search_and_LOD.pdf
17 http://googlewebmastercentral.blogspot.com/2009/10/help-us-make-web-better-update-on-rich.html
18 http://googleblog.blogspot.com/2010/05/understanding-web-to-find-short-answers.html
19 http://www.data.gov/communities/node/116/apps
20 http://data.gov.uk/apps
21 http://data-gov.tw.rpi.edu/demo/USForeignAid/demo-1554.html
Figure 6.5: US Global Foreign Aid Mashup combining and visualizing data from different branches of
the US government.
Linked Data applications that aim to bring Linked Data into the user's daily work context include dayta.me [3] and paggr [90]. dayta.me22 is a personal information recommender that augments a person's online calendar with useful information pertaining to their upcoming activities.
paggr23 provides an environment for the personalized aggregation of Web data through dashboards
and widgets.
An application of Linked Data that helps educators to create and manage lists of learning
resources (e.g., books, journal articles, Web pages) is Talis Aspire 24 [41]. The application is written
in PHP, backed by the Talis Platform25 for storing, managing and accessing Linked Data, and used
by tens of thousands of students at numerous universities on a daily basis. Educators and learners
interact with the application through a conventional Web interface, while the data they create is
stored natively in RDF. An HTTP URI is assigned to each resource, resource list, author and
publisher. Use of the Linked Data principles and related technologies in Aspire enables individual
22 http://dayta.me/
23 http://paggr.com/
24 http://www.w3.org/2001/sw/sweo/public/UseCases/Talis/
25 http://www.talis.com/platform/
lists and resources to be connected to related data elsewhere on the Web, enriching the range of
material available to support the educational process.
Figure 6.6: The HTML view of a Talis Aspire List generated from the underlying RDF representation
of the data.
DBpedia Mobile26 [7] is a Linked Data application that helps tourists to explore a city.
The application runs on an iPhone or other smartphone and provides a location-centric mashup
of nearby locations from DBpedia, based on the current GPS position of the mobile device. Using
these locations as starting points, the user can then navigate along RDF links into other data sources.
Besides accessing Web data, DBpedia Mobile also enables users to publish their current location,
pictures and reviews to the Web as Linked Data, so that they can be used by other applications.
Instead of simply being tagged with geographical coordinates, published content is interlinked with
a nearby DBpedia resource and thus contributes to the overall richness of the Web of Data.
A Life Science application that relies on knowledge from more than 200 publicly available
ontologies in order to support its users in exploring biomedical resources is the NCBO Resource
Index 27 [69]. A second example of a Linked Data application from this domain is Diseasome Map28 .
26 http://wiki.dbpedia.org/DBpediaMobile
27 http://bioportal.bioontology.org/resources
28 http://diseasome.eu/map.html
The application combines data from various Life Science data sources in order to generate a "network
of disorders and disease genes linked by known disorder-gene associations, indicating the common
genetic origin of many diseases."29
A Linked Data mashup that demonstrates how specific FOAF profiles are discovered and
integrated is Researcher Map30. The application discovers the personal profiles of German database
professors by following RDF links and renders the retrieved data on an interactive map [59].
A social bookmarking tool that allows tagging of bookmarks with Linked Data URIs to
prevent ambiguities is Faviki31 . Identifiers are automatically suggested using the Zemanta API32 ,
and Linked Data sources such as DBpedia and Freebase are used as background knowledge to
organize tags by topics and to provide tag descriptions in different languages.
Applications that demonstrate how Linked Data is used within wiki-environments include
Shortipedia33 [112] and the Semantic MediaWiki - Linked Data Extension 34 [8].
The mashup performs three steps:

1. Discover data sources that provide data about a city by following RDF links from an initial
seed URI into other data sources.
2. Download data from the discovered data sources and store the data together with provenance
meta-information in a local RDF store.
3. Retrieve information to be displayed on the Big Lynx Web site from the local store, using the
SPARQL query language.
This simple example leaves out many important aspects that are involved in Linked Data
consumption. Therefore, after explaining how the simple example is realized, an overview will be
provided of the more complex tasks that need to be addressed by Linked Data applications (Section 6.3).
29 http://diseasome.eu/poster.html
30 http://researchersmap.informatik.hu-berlin.de/
31 http://www.faviki.com/
32 http://www.zemanta.com/
33 http://shortipedia.org/
34 http://smwforum.ontoprise.com/smwforum/index.php/SMW+LDE
6.2.1 SOFTWARE REQUIREMENTS
In this example mashup, two open-source tools will be used to access the Web of Data and to cache
retrieved data locally for further processing:
1. LDspider [65], a Linked Data crawler that can process a variety of Web data formats including
RDF/XML, Turtle, Notation 3, RDFa and many microformats. LDspider supports different
crawling strategies and allows crawled data to be stored either in files or in an RDF store (via
SPARQL/Update35).
2. Jena TDB, an RDF store which allows data to be added using SPARQL/Update and provides
for querying the data afterwards using the SPARQL query language.
LDspider can be downloaded from http://code.google.com/p/ldspider/. Use of LDspider
requires a Java runtime environment on the host machine and inclusion of the LDspider .jar file
in the machine's classpath.
Jena TDB can be downloaded from http://openjena.org/TDB/. The site also contains instructions
on how to install TDB. For the example mashup, the TDB standard configuration will be
used and the store will be located at localhost:2020.
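Both tools communicate via the SPARQL/Update protocol. As a rough, self-contained sketch of what such an update request looks like on the wire (the helper function and the /update endpoint path are illustrative assumptions, not part of LDspider's or TDB's documented API):

```python
from urllib.parse import urlencode

def build_insert_request(graph_uri, triples):
    """Build a form-encoded SPARQL/Update INSERT DATA operation that places
    the given N-Triples-style statements into a named graph.
    Hypothetical helper, for illustration only."""
    body = " .\n        ".join(triples)
    update = (
        "INSERT DATA {\n"
        f"    GRAPH <{graph_uri}> {{\n"
        f"        {body} .\n"
        "    }\n"
        "}"
    )
    # The protocol sends the operation as a form-encoded 'update' parameter
    # via HTTP POST, e.g. to http://localhost:2020/update (path assumed).
    return urlencode({"update": update})

payload = build_insert_request(
    "http://dbpedia.org/data/Birmingham.xml",
    ['<http://dbpedia.org/resource/Birmingham> '
     '<http://www.w3.org/2002/07/owl#sameAs> '
     '<http://sws.geonames.org/3333125/>'],
)
```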
The -u parameter provides LDspider with the DBpedia Birmingham URI as seed URI. The
-follow parameter instructs LDspider to follow only owl:sameAs links and to ignore other link types.
-b restricts the depth to which links are followed. The -oe parameter tells LDspider to put retrieved
data via SPARQL/Update into the RDF store available at the given URI.
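Assembled from the parameters just described, the invocation might look as follows; the jar file name, exact flag syntax, and crawl depth are assumptions made for illustration, since the original command listing is not reproduced in this extract:

```shell
java -jar ldspider.jar \
    -u http://dbpedia.org/resource/Birmingham \
    -follow http://www.w3.org/2002/07/owl#sameAs \
    -b 2 \
    -oe http://localhost:2020/update
```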
LDspider starts by dereferencing the DBpedia URI. Within the retrieved data, LDspider
discovers several owl:sameAs links pointing at further data about Birmingham provided by Geonames,
35 http://www.w3.org/TR/sparql11-update/
Freebase, and the New York Times. In the second step, LDspider dereferences these URIs and puts
the retrieved data into the local RDF store using SPARQL/Update.
LDspider uses the Named Graphs data model to store retrieved data. After LDspider has
finished its crawling job, the RDF store contains four Named Graphs. Each graph is named with
the URI from which LDspider retrieved the content of the graph. The listing below shows a subset
of the retrieved data from DBpedia, Geonames, and the New York Times. The graph from Freebase
is omitted due to space restrictions. The RDF store now contains a link to an image depicting Birm-
ingham (Line 5) provided by DBpedia, geo-coordinates for Birmingham provided by Geonames
(Lines 16 and 17), as well as a link that we can use to retrieve articles about Birmingham from the
New York Times archive (Lines 26 and 27).
1  <http://dbpedia.org/data/Birmingham.xml>
2  {
3    dbpedia:Birmingham rdfs:label "Birmingham"@en .
4    dbpedia:Birmingham rdf:type dbpediaont:City .
5    dbpedia:Birmingham dbpediaont:thumbnail
       <http://.../200px-Birmingham_UK_Skyline.jpg> .
6    dbpedia:Birmingham dbpediaont:elevation "140"^^xsd:double .
7    dbpedia:Birmingham owl:sameAs <http://data.nytimes.com/N35531941558043900331> .
8    dbpedia:Birmingham owl:sameAs <http://sws.geonames.org/3333125/> .
9    dbpedia:Birmingham owl:sameAs
       <http://rdf.freebase.com/ns/guid.9202...f8000000088c75> .
10 }
11
12 <http://sws.geonames.org/3333125/about.rdf>
13 {
14   <http://sws.geonames.org/3333125/> gnames:name "City and Borough of
       Birmingham" .
15   <http://sws.geonames.org/3333125/> rdf:type gnames:Feature .
16   <http://sws.geonames.org/3333125/> geo:long "-1.89823" .
17   <http://sws.geonames.org/3333125/> geo:lat "52.48048" .
18   <http://sws.geonames.org/3333125/> owl:sameAs
19     <http://www.ordnancesurvey.co.uk/...#birmingham_00cn> .
20 }
21
22 <http://data.nytimes.com/N35531941558043900331>
23 {
24   nyt:N35531941558043900331 skos:prefLabel "Birmingham (England)"@en .
25   nyt:N35531941558043900331 nyt:associated_article_count "3"^^xsd:integer .
26   nyt:N35531941558043900331 nyt:search_api_query
27     <http://api.nytimes.com/svc/search/...> .
28 }

36 http://www4.wiwiss.fu-berlin.de/bizer/TriG/
37 http://www.w3.org/TR/rdf-sparql-query/#rdfDataset
38 http://www.w3.org/TR/rdf-sparql-protocol/
6.3. ARCHITECTURE OF LINKED DATA APPLICATIONS 97
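The SPARQL query under discussion is not reproduced in this extract. A hedged reconstruction, consistent with the description that follows (prefix declarations omitted, and not a verbatim copy of the book's listing), might look like this:

```sparql
SELECT ?p ?o ?g
WHERE {
  {
    # Data about Birmingham itself, e.g. from the DBpedia graph
    GRAPH ?g { dbpedia:Birmingham ?p ?o }
  }
  UNION
  {
    # Triples in other graphs connected via owl:sameAs links
    dbpedia:Birmingham owl:sameAs ?same .
    GRAPH ?g { ?same ?p ?o }
  }
}
```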
The tokens starting with a question mark in the query are variables that are bound to values
from the different graphs during query execution. The first line of the query specifies that we want
to retrieve the predicates (?p) and the objects (?o) of all triples that describe
Birmingham. In addition, we want to retrieve the names of the graphs (?g) from which each triple
originates. The graph name is used to group triples on the web page and to display the URI
from which each triple was retrieved next to it. The graph pattern in Lines 3-5 matches all
data about Birmingham from DBpedia. The graph patterns in Lines 7-10 match all triples in other
graphs that are connected by owl:sameAs links to the DBpedia URI for Birmingham.
Jena TDB sends the query results back to the application as a SPARQL result set XML
document39, and the application renders them to fit the layout of the web page.
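The grouping of results by their graph of origin can be sketched in a few lines; the result-set rows below are mock data standing in for the parsed SPARQL XML results:

```python
from collections import defaultdict

# Mock rows as (?p, ?o, ?g) bindings, standing in for the parsed
# SPARQL result set returned by the store.
rows = [
    ("rdfs:label", '"Birmingham"@en', "http://dbpedia.org/data/Birmingham.xml"),
    ("geo:lat", '"52.48048"', "http://sws.geonames.org/3333125/about.rdf"),
    ("geo:long", '"-1.89823"', "http://sws.geonames.org/3333125/about.rdf"),
]

def group_by_graph(rows):
    """Group (?p, ?o) pairs by the graph ?g they were retrieved from, so the
    page can show the source URI next to each block of triples."""
    groups = defaultdict(list)
    for p, o, g in rows:
        groups[g].append((p, o))
    return dict(groups)

groups = group_by_graph(rows)
```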
The minimal Linked Data application described above leaves out many important aspects
that are involved in Linked Data consumption. These are discussed below.
[Figure 6.7 depicts a three-layer architecture: an Application Layer containing the application code, which accesses integrated Web data through a SPARQL or RDF API; a Data Access, Integration and Storage Layer comprising Web Data Access, Vocabulary Mapping, Identity Resolution and Quality Evaluation modules; and, beneath it, the Web of Data, accessed via HTTP.]
Figure 6.7: Architecture of a Linked Data application that implements the crawling pattern.
links. In addition, relevant data can also be discovered via Linked Data search engines and
might be accessed via SPARQL endpoints or in the form of RDF data dumps.
2. Vocabulary Mapping. Different Linked Data sources may use different RDF vocabularies to
represent the same type of information. In order to understand as much Web data as possible,
Linked Data applications translate terms from different vocabularies into a single target
schema. This translation may rely on vocabulary links that are published on the Web by vo-
cabulary maintainers, data providers or third parties. Linked Data applications which discover
data that is represented using terms that are unknown to the application may therefore search
the Web for mappings and apply the discovered mappings to translate data to their local
schemata.
3. Identity Resolution. Different Linked Data sources use different URIs to identify the same
entity, for instance, a person or a place. Data sources may provide owl:sameAs links pointing
at data about the same real-world entity provided by other data sources. In cases where data
sources do not provide such links, Linked Data applications may apply identity resolution
heuristics in order to discover additional links.
4. Provenance Tracking. Linked Data applications rely on data from open sets of data sources.
In order to process data more efficiently, they often cache data locally. For cached data, it is
important to keep track of data provenance in order to be able to assess the quality of the data
and to go back to the original source if required.
5. Data Quality Assessment. Due to the open nature of the Web, any Web data needs to be treated
with suspicion, and Linked Data applications should thus consider Web data as claims by
different sources rather than as facts. Data quality issues might not be too relevant if an
application integrates data from a relatively small set of known sources. However, in cases
where applications integrate data from the open Web, applications should employ data quality
assessment methods in order to determine which claims to accept and which to reject as
untrustworthy.
6. Using the Data in the Application Context. After completing tasks 1 to 5, the application has
integrated and cleansed Web data to the extent required for more sophisticated processing.
In the simplest case, such processing may involve displaying data to the user in
various forms (tables, diagrams, other interactive visualizations). More complex applications
may aggregate and/or mine the data, and they may employ logical reasoning in order to make
implicit relationships explicit.
The following sections describe these tasks in more detail and refer to
relevant papers and open-source tools that can be used to perform them.
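To make the vocabulary mapping task (2) concrete, here is a toy translation step; the mapping table and the target term are invented for illustration, whereas real correspondences would be published as, for example, owl:equivalentProperty links or R2R mappings:

```python
# Hypothetical mapping from source vocabulary terms to the application's
# target schema; in practice such correspondences are published on the Web.
MAPPINGS = {
    "http://xmlns.com/foaf/0.1/name": "http://example.org/schema#label",
    "http://www.geonames.org/ontology#name": "http://example.org/schema#label",
}

def translate(triples, mappings):
    """Rewrite predicates of known terms into the target schema; triples
    using unknown terms pass through unchanged."""
    return [(s, mappings.get(p, p), o) for s, p, o in triples]

data = [
    ("ex:Birmingham", "http://www.geonames.org/ontology#name", "Birmingham"),
    ("ex:Birmingham", "ex:unmapped", "x"),
]
translated = translate(data, MAPPINGS)
```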
54 http://www4.wiwiss.fu-berlin.de/bizer/r2r/
55 http://code.google.com/p/google-refine/
56 http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/
an application can store data about newly discovered instances in its repository or fuse data that is
already known about an entity with additional data from the Web.
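A deliberately crude sketch of an identity resolution heuristic based on label normalization; production matchers such as Silk combine many similarity measures and thresholds, so this only illustrates the basic idea:

```python
def normalize(label):
    """Crude label normalization for matching (illustrative only):
    lowercase and strip everything that is not a letter or digit."""
    return "".join(ch for ch in label.lower() if ch.isalnum())

def same_entity(label_a, label_b):
    """Guess that two descriptions denote the same entity if their
    normalized labels coincide; a real matcher would weigh more evidence."""
    return normalize(label_a) == normalize(label_b)

match = same_entity("Birmingham (England)", "birmingham england")
```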
1. Content-based Heuristics use the information to be assessed itself as a quality indicator. The metrics
analyze the information content or compare the information with related information. Examples
include outlier detection methods that, for instance, treat a sales offer with suspicion if its price is
more than 30% below the average price for the item, as well as classic spam detection methods
that rely on patterns of suspicious words.
3. Rating-based Heuristics rely on explicit ratings about information itself, information sources, or
information providers. Ratings may originate from the information consumer, other information
consumers (as, for instance, the ratings gathered by the Sig.ma search engine, see Section
6.1.1.2), or domain experts.
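The price-outlier rule mentioned under content-based heuristics can be sketched as follows; the 30% threshold comes from the text, while the sample prices are invented:

```python
def suspicious_offers(prices, threshold=0.30):
    """Flag offers priced more than `threshold` below the average price,
    mirroring the 30% rule of thumb mentioned in the text."""
    avg = sum(prices) / len(prices)
    return [p for p in prices if p < avg * (1 - threshold)]

# With an average price of 85.0, the cut-off is 59.5, so only the
# 40.0 offer is flagged as suspicious.
flagged = suspicious_offers([100.0, 95.0, 105.0, 40.0])
```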
Once the application has assessed the quality of a piece of information, it has several options
for handling data conflicts and low-quality data. Depending on the context, an application may choose to:
1. Rank Data. The simplest approach is to display all data, but rank data items according to their
quality score. This approach is currently used by the Linked Data search engines discussed
in Section 6.1.1.2. Inspired by Google, the search engines rely on variations of the PageRank
algorithm [91] to determine coarse-grained measures of the popularity or significance of a
particular data source, as a proxy for relevance or quality of the data.
2. Filter Data. In order to avoid overwhelming users with low-quality data, applications may decide
to display only data which successfully passed the quality evaluation. A prototypical software
framework that can be used to filter Web data using a wide range of different data quality
assessment policies is the WIQA framework57 .
3. Fuse Data. Data fusion is the process of integrating multiple data items representing the same
real-world object into a single, consistent, and clean representation. The main challenge in
data fusion is the resolution of data conflicts, i.e., choosing a value in situations where multiple
sources provide different values for the same property of an object. There is a large body of
work on data fusion in the database community [35]. Linked Data applications can build
on this work for choosing appropriate conflict resolution heuristics. Prototypical systems that
support fusing Linked Data from multiple sources include DERI Pipes [73] and the KnoFuss
architecture [89].
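A minimal conflict resolution heuristic for the fusion step is majority voting over the values asserted by different sources; the sample values are invented, and real systems such as KnoFuss support far richer resolution strategies:

```python
from collections import Counter

def resolve_by_vote(values):
    """Pick the value asserted by the most sources as the fused value;
    a sketch of one simple conflict resolution heuristic."""
    return Counter(values).most_common(1)[0][0]

# Three sources report a population figure; two agree, so their
# value wins the vote.
population = resolve_by_vote(["1036900", "1036900", "992000"])
```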
A list of criteria for assessing the quality of Linked Data sources is proposed online58. In [56], Hartig
presents an approach to handle trust (quality) values in SPARQL query processing. A method to
integrate data quality assessment into query planning for federated architectures is presented in [85].
There is a large body of related work on probabilistic databases on which Linked Data applications
can build. A survey of this work is presented in [45]. A well-known system that combines
uncertainty and data lineage is the Trio system [2]. Uncertainty applies not only to instance data but
to a similar extent to the vocabulary links that provide mappings between different terms. Existing work
from the database community that deals with the uncertainty of mappings in data integration processes
is surveyed in [75].
Users will only trust the quality assessment results if they understand how these results were
generated. Tim Berners-Lee proposed in [14] that Web browsers should be enhanced with an "Oh,
yeah?" button to support the user in assessing the reliability of information encountered on the Web.
Whenever a user encounters a piece of information that they would like to verify, pressing such
a button would produce an explanation of the trustworthiness of the displayed information. This
goal has yet to be realised; however, existing prototypes such as WIQA [28] and InferenceWeb [78]
57 http://www4.wiwiss.fu-berlin.de/bizer/wiqa/
58 http://sourceforge.net/apps/mediawiki/trdf/index.php?title=Quality_Criteria_for_Linked_Data_sources
6.4. EFFORT DISTRIBUTION BETWEEN PUBLISHERS, CONSUMERS AND THIRD PARTIES 105
provide explanations about information quality as well as inference processes and can be used as
inspiration for work in this area.
CHAPTER 7
Bibliography
[1] Ben Adida and Mark Birbeck. Rdfa primer - bridging the human and data webs - w3c
recommendation. http://www.w3.org/TR/xhtml-rdfa-primer/, 2008. 15, 18
[2] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe
Sugihara, and Jennifer Widom. Trio: A system for data, uncertainty, and lineage. In Proceedings
of the VLDB Conference, pages 1151-1154, 2006. 104
[3] Ali Al-Mahrubi et al. http://dayta.me - A Personal News + Data Recommender for Your
Day. Semantic Web Challenge 2010 Submission. http://www.cs.vu.nl/pmika/swc/submissions/swc2010_submission_17.pdf, 2010. 91
[4] Keith Alexander. Rdf in json. In Proceedings of the 4th Workshop on Scripting for the Semantic
Web, 2008. 20
[5] Keith Alexander, Richard Cyganiak, Michael Hausenblas, and Jun Zhao. Describing linked
datasets. In Proceedings of the WWW2009 Workshop on Linked Data on the Web, 2009. 48
[6] Dean Allemang and Jim Hendler. Semantic Web for the Working Ontologist: Effective Modeling
in RDFS and OWL. Morgan Kaufmann, 2008. 57, 63
[7] Christian Becker and Christian Bizer. Exploring the geospatial semantic web with dbpedia
mobile. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 7:278-286,
2009. DOI: 10.1016/j.websem.2009.09.004 86, 92
[8] Christian Becker, Christian Bizer, Michael Erdmann, and Mark Greaves. Extending smw+
with a linked data integration framework. In Proceedings of the ISWC 2010 Posters & Demonstrations Track, 2010. 93
[10] David Beckett and Tim Berners-Lee. Turtle - terse rdf triple language. http://www.w3.
org/TeamSubmission/turtle/, 2008. 19
[11] Belleau, F., Nolin, M., Tourigny, N., Rigault, P., Morissette, J. Bio2rdf: Towards a mashup
to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706-716,
2008. DOI: 10.1016/j.jbi.2008.03.004 37
[12] Michael Bergman. Advantages and myths of rdf. http://www.mkbergman.com/wp-content/themes/ai3/files/2009Posts/Advantages_Myths_RDF_090422.pdf, 2009. 15
[19] Tim Berners-Lee. Long live the web: A call for continued open standards and neutrality.
Scientific American, 32, 2010. 110
[20] Tim Berners-Lee, R. Fielding, and L. Masinter. RFC 2396 - Uniform Resource Identifiers
(URI): Generic Syntax. http://www.isi.edu/in-notes/rfc2396.txt, August 1998. 7
[21] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American,
284(5):34-44, May 2001. DOI: 10.1038/scientificamerican0501-34 5
[22] Tim Berners-Lee and Lalana Kagal. The fractal nature of the semantic web. AI Magazine,
Vol 29, No 3, 2008. 24, 62, 101, 107
[23] Diego Berrueta and Jon Phipps. Best practice recipes for publishing rdf vocabularies - w3c
note. http://www.w3.org/TR/swbp-vocab-pub/, 2008. 24, 58, 63, 83
[24] Alexander Bilke and Felix Naumann. Schema matching using duplicates. In Proceedings of
the International Conference on Data Engineering, 2005. DOI: 10.1109/ICDE.2005.126 102
[25] Mark Birbeck. Rdfa and linked data in uk government web-sites. Nodalities Magazine, 7,
2009. 36
[26] Paul Biron and Ashok Malhotra. Xml schema part 2: Datatypes second edition - w3c rec-
ommendation. http://www.w3.org/TR/xmlschema-2/, 2004. 16
[27] Christian Bizer. Pay-as-you-go data integration on the public web of linked data. Invited
talk at the 3rd Future Internet Symposium 2010. http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Bizer-FIS2010-Pay-As-You-Go-Talk.pdf, 2010. 107
[28] Christian Bizer and Richard Cyganiak. Quality-driven information filtering using the wiqa
policy framework. Journal of Web Semantics: Science, Services and Agents on the World Wide Web,
7(1):1-10, 2009. DOI: 10.1016/j.websem.2008.02.005 103, 104
[29] Christian Bizer, Richard Cyganiak, and Tobias Gauss. The rdf book mashup: From web apis
to a web of data. In Proceedings of the Workshop on Scripting for the Semantic Web, 2007. 38, 70
[30] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J.
Semantic Web Inf. Syst., 5(3):1-22, 2009. DOI: 10.4018/jswis.2009081901 5, 29
[31] Christian Bizer, Ralf Heese, Malgorzata Mochol, Radoslaw Oldakowski, Robert Tolksdorf,
and Rainer Eckstein. The impact of semantic web technologies on job recruitment pro-
cesses. In Proceedings of the 7. Internationale Tagung Wirtschaftsinformatik (WI2005), 2005.
DOI: 10.1007/3-7908-1624-8_72 36
[32] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard
Cyganiak, and Sebastian Hellmann. Dbpedia - a crystallization point for the web of data.
Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154-165,
2009. DOI: 10.1016/j.websem.2009.07.002 14, 33
[33] Christian Bizer and Andreas Schultz. The berlin sparql benchmark. International Journal on
Semantic Web and Information Systems, 5(2):1-24, 2009. 105
[34] Christian Bizer and Andreas Schultz. The r2r framework: Publishing and discovering
mappings on the web. In Proceedings of the 1st International Workshop on Consuming Linked Data,
2010. 25, 102
[35] Bleiholder, J., Naumann, F. Data fusion. ACM Computing Surveys, 41(1):1-41, 2008.
DOI: 10.1145/1456650.1456651 104
[36] John Breslin, Andreas Harth, Uldis Bojars, and Stefan Decker. Towards semantically-
interlinked online communities. In Proceedings of the 2nd European Semantic Web Conference,
Heraklion, Greece, 2005. DOI: 10.1007/11431053_34 54
[37] D. Brickley and R. V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema -
W3C Recommendation. http://www.w3.org/TR/rdf-schema/, 2004. 17, 24, 56
[38] Peter Buneman, Sanjeev Khanna, and Wang chiew Tan. Why and where: A characterization
of data provenance. In Proceedings of the International Conference on Database Theory, pages
316-330. Springer, 2001. DOI: 10.1007/3-540-44503-X_20 52
[39] Carroll, J., Bizer, C., Hayes, P., Stickler, P. Named graphs. Journal of Web Semantics:
Science, Services and Agents on the World Wide Web, 3(4):247-267, 2005.
DOI: 10.1016/j.websem.2005.09.001 95, 103
[40] Gong Cheng and Yuzhong Qu. Searching linked objects with falcons: Approach, implementation
and evaluation. International Journal on Semantic Web and Information Systems (IJSWIS),
5(3):49-70, 2009. 87
[41] Chris Clarke. A resource list management tool for undergraduate students based on linked
open data principles. In Proceedings of the 6th European Semantic Web Conference, Heraklion,
Greece, 2009. DOI: 10.1007/978-3-642-02121-3_51 91
[42] Gianluca Correndo, Manuel Salvadores, Ian Millard, Hugh Glaser, and Nigel Shadbolt.
Sparql query rewriting for implementing data integration over linked data. In Proceedings
of the 2010 EDBT/ICDT Workshops, pages 1-11, New York, NY, USA, 2010. ACM.
DOI: 10.1145/1754239.1754244 102
[43] Cyganiak, R., Delbru, R., Stenzhorn, H., Tummarello, G., Decker, S. Semantic sitemaps:
Efficient and flexible access to datasets on the semantic web. In Proceedings of the 5th European
Semantic Web Conference, 2008. DOI: 10.1007/978-3-540-68234-9_50 48
[44] Axel-Cyrille Ngonga Ngomo and Sören Auer. Limes - a time-efficient approach for
large-scale link discovery on the web of data. http://svn.aksw.org/papers/2011/WWW_LIMES/public.pdf, 2010. 68
[45] Nilesh Dalvi, Christopher Ré, and Dan Suciu. Probabilistic databases: diamonds in the dirt.
Commun. ACM, 52:86-94, July 2009. DOI: 10.1145/1538788.1538810 104
[46] Elmagarmid, A., Ipeirotis, P., Verykios, V. Duplicate record detection: A survey. IEEE
Transactions on Knowledge and Data Engineering, 19(1):1-16, 2007.
DOI: 10.1109/TKDE.2007.250581 66, 102
[47] Jim Ericson. Net expectations - what a web data service economy implies for business.
Information Management Magazine, Jan/Feb, 2010. 1
[48] Euzenat, J., Scharffe, F., Zimmermann, A. Expressive alignment language and implementation.
Knowledge Web project report, KWEB/2004/D2.2.10/1.0, 2007. 102
[49] Euzenat, J., Shvaiko, P. Ontology Matching. Springer, Heidelberg, 2007. 66, 102
[50] Roy Fielding. Hypertext transfer protocol http/1.1. request for comments: 2616. http://www.w3.org/Protocols/rfc2616/rfc2616.html, 1999. 7, 10
[51] Franklin, M.J., Halevy, A.Y., Maier, D. From databases to dataspaces: A new abstraction for
information management. SIGMOD Record, 34(4):27-33, 2005.
DOI: 10.1145/1107499.1107502 25, 107
[52] Alon Y. Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data.
IEEE Intelligent Systems, 24(2):8-12, 2009. DOI: 10.1109/MIS.2009.36 106
[53] Harry Halpin, Patrick Hayes, James McCusker, Deborah Mcguinness, and Henry Thompson.
When owl:sameas isn't the same: An analysis of identity in linked data. In Proceedings of the
9th International Semantic Web Conference, 2010. DOI: 10.1007/978-3-642-17746-0_20 23
[54] Andreas Harth. Visinav: A system for visual search and navigation on web data. Web
Semantics: Science, Services and Agents on the World Wide Web, 8(4):348-354, 2010.
DOI: 10.1007/978-3-642-03573-9_17 89
[55] Andreas Harth, Aidan Hogan, Jürgen Umbrich, and Stefan Decker. Swse: Objects before
documents! In Proceedings of the Semantic Web Challenge 2008, 2008. 87
[56] Olaf Hartig. Querying trust in rdf data with tsparql. In Proceedings of the 6th European
Semantic Web Conference, pages 5-20, 2009. DOI: 10.1007/978-3-642-02121-3_5 104
[57] Olaf Hartig, Christian Bizer, and Johann Christoph Freytag. Executing sparql queries over
the web of linked data. In Proceedings of the International Semantic Web Conference, pages
293-309, 2009. DOI: 10.1007/978-3-642-04930-9_19 97, 100
[58] Olaf Hartig and Andreas Langegger. A database perspective on consuming linked
data on the web. Datenbank-Spektrum, 10:57-66, 2010.
DOI: 10.1007/s13222-010-0021-7 98
[59] Olaf Hartig, Hannes Mühleisen, and Johann-Christoph Freytag. Linked data for building a
map of researchers. In Proceedings of the 5th Workshop on Scripting for the Semantic Web, 2009.
93
[60] Haslhofer, B. A Web-based Mapping Technique for Establishing Metadata Interoperability. PhD
thesis, Universitaet Wien, 2008. 102
[61] Tom Heath and Enrico Motta. Revyu: Linking reviews and ratings into the web of data.
Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 6(4), 2008.
DOI: 10.1016/j.websem.2008.09.003 38
[62] Heath, T. How will we interact with the web of data? IEEE Internet Computing, 12(5):88-91,
2008. DOI: 10.1109/MIC.2008.101 86
[63] Martin Hepp. Goodrelations: An ontology for describing products and services offers on
the web. In Proceedings of the 16th International Conference on Knowledge Engineering and
Knowledge Management, Acitrezza, Italy, 2008. 38
[64] Aidan Hogan, Andreas Harth, Alexandre Passant, Stefan Decker, and Axel Polleres. Weaving
the pedantic web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, 2010.
82
[65] Robert Isele, Andreas Harth, Jürgen Umbrich, and Christian Bizer. Ldspider: An open-source
crawling framework for the web of linked data. In ISWC 2010 Posters & Demonstrations Track:
Collected Abstracts Vol-658, 2010. 94, 100
[66] Robert Isele, Anja Jentzsch, and Christian Bizer. Silk server - adding missing links while
consuming linked data. In Proceedings of the 1st International Workshop on Consuming Linked
Data (COLD 2010), 2010. 102
[67] Ian Jacobs and Norman Walsh. Architecture of the World Wide Web, Volume One, 2004. http://www.w3.org/TR/webarch/. 7, 9
[68] Anja Jentzsch, Oktie Hassanzadeh, Christian Bizer, Bo Andersson, and Susie Stephens.
Enabling tailored therapeutics with linked data. In Proceedings of the WWW2009 Workshop on
Linked Data on the Web, 2009. 5, 37
[69] Clement Jonquet, Paea LePendu, Sean Falconer, Adrien Coulet, Natalya Noy, Mark Musen,
and Nigam Shah. Ncbo resource index: Ontology-based search and mining of biomedical
resources. Semantic Web Challenge 2010 Submission. http://www.cs.vu.nl/pmika/swc/submissions/swc2010_submission_4.pdf, 2010. 92
[70] Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Concepts and
Abstract Syntax - W3C Recommendation, 2004. http://www.w3.org/TR/rdf-concepts/.
8, 15, 17
[71] Georgi Kobilarov, Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael
Smethurst, Christian Bizer, and Robert Lee. Media meets semantic web - how the bbc
uses dbpedia and linked data to make connections. In The Semantic Web: Research and Applications,
6th European Semantic Web Conference, pages 723-737, 2009.
DOI: 10.1007/978-3-642-02121-3_53 34
[72] Donald Kossmann. The state of the art in distributed query processing. ACM Comput. Surv.,
32:422-469, December 2000. DOI: 10.1145/371578.371598 98
[73] Danh Le-Phuoc, Axel Polleres, Manfred Hauswirth, Giovanni Tummarello, and Christian
Morbidoni. Rapid prototyping of semantic mash-ups through semantic web pipes. In Proceedings
of the 18th international conference on World wide web, WWW '09, pages 581-590.
ACM, 2009. DOI: 10.1145/1526709.1526788 104
[74] Madhavan, J., Jeffery, S. R., Cohen, S., Dong, X., Ko, D., Yu, C., Halevy, A. Web-scale data
integration: You can only afford to pay as you go. Proceedings of the Conference on Innovative
Data Systems Research, 2007. 25, 106, 107
[75] Matteo Magnani and Danilo Montesi. A survey on uncertainty management in data integration.
J. Data and Information Quality, 2:5:1-5:33, July 2010. DOI: 10.1145/1805286.1805291
104
[76] Frank Manola and Eric Miller. RDF Primer. W3C, http://www.w3c.org/TR/rdf-primer/, February 2004. 15, 18
[77] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, and Hermann De Meer.
idMesh: graph-based disambiguation of linked data. In Proceedings of the 18th international
conference on World wide web, pages 591-600. ACM, 2009. DOI: 10.1145/1526709.1526789
68
[78] Deborah L. McGuinness and Paulo Pinheiro da Silva. Inference web: Portable and shareable
explanations for question answering. In Proceedings of the American Association for Artificial
Intelligence Spring Symposium Workshop on New Directions for Question Answering. Stanford
University, 2003. 104
[79] Deborah L. McGuinness and Frank van Harmelen. OWL Web Ontology Language
Overview - W3C Recommendation. http://www.w3.org/TR/2004/REC-owl-features-20040210/, 2004. 17, 24, 56
[80] Noah Mendelsohn. The self-describing web - tag finding. http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html, 2009. 24, 29, 106
[81] Alistair Miles and Sean Bechhofer. Skos simple knowledge organization system - reference.
http://www.w3.org/TR/skos-reference/, 2009. 24, 56
[82] Paul Miller, Rob Styles, and Tom Heath. Open data commons, a license for open data. In
Proceedings of the WWW2008 Workshop on Linked Data on the Web, 2008. 53
[83] R. Moats. Rfc 2141: Urn syntax. http://tools.ietf.org/html/rfc2141, 1997. 10
[84] Knud Möller, Tom Heath, Siegfried Handschuh, and John Domingue. Recipes for semantic
web dog food - the eswc and iswc metadata projects. In Proceedings of the 6th Interna-
tional Semantic Web Conference and 2nd Asian Semantic Web Conference, Busan, Korea, 2007.
DOI: 10.1007/978-3-540-76298-0_58 37
[85] Felix Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume
2261 / 2002 of Lecture Notes in Computer Science. Springer-Verlag GmbH, 2002. 104
[86] Joachim Neubert. Bringing the "thesaurus for economics" on to the web of linked data. In
Proceedings of the WWW2009 Workshop on Linked Data on the Web, 2009. 36
[87] Niko Popitsch and Bernhard Haslhofer. DSNotify: Handling broken links in the web of data. In Proceedings of the 19th International World Wide Web Conference, Raleigh, NC, USA, 2010. ACM. DOI: 10.1145/1772690.1772768 68
[88] Andriy Nikolov and Enrico Motta. Capturing emerging relations between schema ontologies
on the web of data. In Proceedings of the First International Workshop on Consuming Linked
Data, 2010. 102
[89] Andriy Nikolov, Victoria Uren, Enrico Motta, and Anne de Roeck. Integration of semantically annotated data by the KnoFuss architecture. In Proceedings of the 16th International Conference on Knowledge Engineering: Practice and Patterns, pages 265–274, 2008. DOI: 10.1007/978-3-540-87696-0_24 104
[90] Benjamin Nowack. Paggr: Linked data widgets and dashboards. Web Semantics: Science, Services and Agents on the World Wide Web, 7(4):272–277, 2009. Semantic Web challenge 2008. 91
[91] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998. 104
[93] P. F. Patel-Schneider, P. Hayes, and I. Horrocks. OWL Web Ontology Language Semantics and Abstract Syntax - W3C Recommendation. http://www.w3.org/TR/owl-semantics/, 2004. 23
[94] Axel Polleres, François Scharffe, and Roman Schindlauer. SPARQL++ for mapping between RDF vocabularies. In Proceedings of the 6th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 2007), 2007. 102
[95] Eric Prud'hommeaux and Andy Seaborne. SPARQL Query Language for RDF - W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/. 17, 96
[96] Bastian Quilitz and Ulf Leser. Querying distributed RDF data sources with SPARQL. In Proceedings of the 5th European Semantic Web Conference, 2008. DOI: 10.1007/978-3-540-68234-9_39 101
[97] Dave Raggett, Arnaud Le Hors, and Ian Jacobs. HTML 4.01 Specification - W3C Recommendation. http://www.w3.org/TR/html401/, 1999. 7
[98] Leo Sauermann and Richard Cyganiak. Cool URIs for the Semantic Web - W3C Interest Group Note. http://www.w3.org/TR/cooluris/, 2008. 11, 13, 14, 15, 43, 46
[99] Simon Schenk, Carsten Saathoff, Steffen Staab, and Ansgar Scherp. SemaPlorer - interactive semantic exploration of data and media based on a federated cloud infrastructure. Web Semantics: Science, Services and Agents on the World Wide Web, 7(4):298–304, 2009. Semantic Web challenge 2008. 101
[100] Len Seligman, Peter Mork, Alon Y. Halevy, Ken Smith, Michael J. Carey, Kuang Chen, Chris Wolf, Jayant Madhavan, Akshay Kannan, and Doug Burdick. OpenII: an open source information integration toolkit. In Proceedings of the SIGMOD Conference, pages 1057–1060, 2010. DOI: 10.1145/1807167.1807285 102
[101] John Sheridan and Jeni Tennison. Linking UK government data. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, 2010. 36
[102] Sören Auer, Jens Lehmann, and Sebastian Hellmann. LinkedGeoData - adding a spatial dimension to the web of data. In Proceedings of the International Semantic Web Conference, 2009. DOI: 10.1007/978-3-642-04930-9_46 34
[103] Rob Styles, Danny Ayers, and Nadeem Shabir. Semantic MARC, MARC21 and the Semantic Web. In Proceedings of the WWW2008 Workshop on Linked Data on the Web, 2008. 43
[104] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: a core of semantic knowledge. In Carey L. Williamson, Mary Ellen Zurko, Peter F. Patel-Schneider, and Prashant J. Shenoy, editors, Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 697–706. ACM, 2007. 34
[105] T. Berners-Lee et al. Tabulator: Exploring and analyzing linked data on the semantic web.
In Proceedings of the 3rd International Semantic Web User Interaction Workshop, 2006. 86
[106] Henry Thompson and David Orchard. URNs, Namespaces and Registries. http://www.w3.org/2001/tag/doc/URNsAndRegistries-50, 2006. 10
[107] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Renaud Delbru, and Stefan Decker. Sig.ma: Live views on the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 8(4):355–364, 2010. DOI: 10.1145/1772690.1772907 87
[108] Giovanni Tummarello, Renaud Delbru, and Eyal Oren. Sindice.com: Weaving the Open
Linked Data. In Proceedings of the 6th International Semantic Web Conference, 2007. 89
[109] Jürgen Umbrich, Boris Villazón-Terrazas, and Michael Hausenblas. Dataset dynamics compendium: A comparative study. In Proceedings of the First International Workshop on Consuming Linked Data (COLD2010), 2010. 68
[110] H. Van de Sompel, C. Lagoze, M. Nelson, S. Warner, R. Sanderson, and P. Johnston. Adding eScience assets to the data web. In Proceedings of the 2nd Workshop on Linked Data on the Web (LDOW2009), 2009. 37
[111] Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. Discovering and maintaining links on the web of data. In Proceedings of the International Semantic Web Conference, pages 650–665, 2009. DOI: 10.1007/978-3-642-04930-9_41 68
[112] Denny Vrandecic, Varun Ratnakar, Markus Krötzsch, and Yolanda Gil. Shortipedia - aggregating and curating semantic web data. Semantic Web Challenge 2010 Submission. http://www.cs.vu.nl/pmika/swc/submissions/swc2010_submission_18.pdf, 2010. 93
Authors' Biographies