Citation For Published Version
Heery, R, Powell, A & Day, M 1997, 'Metadata', Library & Information Briefings, vol. 75, pp. 1-19.
Publication date:
1997
University of Bath
General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners
and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.
Metadata
services which are now using metadata for resource discovery in a networked
environment.
Within this Briefing we will be examining metadata in the context of what is traditionally called
bibliographic control but might more widely be understood as network information management.
We will use [...]

Decisions about formats will be influenced by which of the above functions the metadata will
perform. Thus within a system it will sometimes be appropriate to have a simple metadata format,
for example to allow for interoperability in searching across subject domains, while on occasion
a richer format will be required to enable selection of resources in a specialized domain.

Much of this Briefing will concentrate on metadata as it relates to networked resources and in
particular to World Wide Web resources. The opportunities provided by the Web for new services
and new publishing processes require new forms of resource description. The volatile nature of
Web documents and the continuing increase in the amount of information being made available are
driving services to seek alternatives to high cost traditional cataloguing. Services looking at
incorporating the advantages of an automated approach to indexing are tending towards the use of
simple resource description formats.

Resource discovery service models

In the context of the Web, users are offered alternative options for discovering resources, all
of which are based more or less on structured metadata. These include:

Lists: lists of pointers to useful resources.
Searching: by keyword or controlled vocabulary.
Browsing: alphabetically by subject keyword, or using more formal subject classification schemes.
Visualization: navigation of the Web site by spatial browsing techniques.

At present the predominant service for discovery of Web resources is the search engine or search
service which may use one or more of these techniques. Search engines can be categorized by their
coverage and selection policy, and by the method by which their indexes are created. A number of
search services have been evaluated although the lack of information on policies available from
the larger services makes comparisons difficult.4,5

Coverage of search engines can be characterized as:

Global: these would attempt to cover all Web sites, although in reality this will be limited in
terms of granularity and frequency of update.
Geographical: covering all Web sites in a particular area, country or region.
Sectoral: this might be a subject area, a user community like higher education, or a curatorial
tradition like museums, libraries or archives.
Selective: typically sectoral services will select resources for description on the basis of
quality criteria.
Organizational/Intranet: organizations or individuals may want to allow searching of their own
resources.

The indexes on which these services are based may be derived from the automatically harvested
full text of Web resources, or they may be based on records created manually. Pilot
implementations are now beginning to make use of metadata embedded in resources, in particular
Dublin Core embedded in HTML. In the future it seems likely that more metadata will be held on
Web sites independently of the HTML, or on third party databases linked to the Web resource.

Range of formats

When examining the issues surrounding the use of metadata within the Web environment, it is
helpful to consider the wider context of resource discovery. Metadata formats vary according to
a number of criteria and there is increasing awareness of the strengths and weaknesses of these
various diverse formats. Metadata ranges from generic simple Internet resource descriptions to
highly structured records relating to complex objects such as databases. On the one hand there is
the full text indexing of the global search services (Excite, Lycos, etc.) where the complete
text of Web documents is indexed, there is no fielded record, and the display record is an
extract from the full text, typically the first few lines. On the other hand there are the
complex tagged records of MARC formats, or the analytical mark-up of SGML-based formats.

Detailed reviews of current metadata formats have been carried out elsewhere.6,7 Here we will
present a
[Figure: metadata formats ranging from Simple to Rich]
The workshops represent a consensus building effort which has included participants from a range
of backgrounds (IETF, SGML, digital library research), domains (text, image, geographic
information systems) and professions (librarians, computer scientists, content specialists). This
consensus and the international acceptance of Dublin Core are probably the most significant
outcomes of the workshops, and have largely been achieved through the leadership of OCLC.

The objectives for Dublin Core set by the first workshop were firstly to define a simple set of
data elements so that authors and publishers of Internet documents could create their own
metadata with no extensive training, the Dublin Core approach being mid-way between the detailed
tagging of MARC or structured TEI headers and the automatic indexing of locator services such as
Alta Vista. Secondly, Dublin Core aimed to provide a basis for semantic interoperability between
other, more complicated, formats. By means of mapping from more complex formats, and by filtering
more complex formats, Dublin Core facilitates searching across other disparate record formats.

An initial element set was agreed upon and certain principles were established for further
development of the set, these being:

Extensibility: the core set can be extended with further elements as it is acknowledged that many
publishers or metadata producers may wish to augment this simple set with more specialized data.
Optionality: all elements are optional.
Repeatability: all elements are repeatable.

During the first workshop there was an explicit decision not to define syntax at that stage, but
first to reach consensus on the semantics of a minimum element set. To tie Dublin Core semantics
to any one particular syntax (as in the MARC family of record formats) was seen as unhelpful. The
second workshop, which took place in the UK at the University of Warwick in April 1996 sponsored
by UKOLN and OCLC, went on to consider possible syntaxes. Embedding metadata in resources using
HTML was the obvious choice to fulfil the immediate need of pilot implementations.

The Warwick workshop also looked at the implementation of Dublin Core and requirements for
extensibility, change control and implementation. The Warwick framework emerged as a concept from
the second workshop. This is a model for a container architecture for packages of metadata, each
package being metadata of a different type.9,10

The third workshop, the CNI/OCLC Image metadata workshop, considered use of the Dublin Core
element set for describing images, in particular those images which could be defined as
document-like objects. Perhaps surprisingly the workshop reached the conclusion that images could
be described using the minimal Dublin Core elements with some minor adjustments.

Discussion prior to the fourth workshop in Canberra resulted in agreement to extend the thirteen
elements agreed in Dublin to fifteen, and these fifteen have been defined and documented in an
Internet-Draft.11 Note that all Dublin Core elements are optional, so you do not have to embed all
fifteen elements into each Web page. They can also be repeated if necessary, for example to
indicate that a page has more than one author.

The semantics of Dublin Core elements can be modified using qualifiers, and use of qualifiers was
central to discussions at the Canberra workshop.12 There are three kinds of qualifier: the TYPE
qualifier which refines the meaning of an element; the SCHEME qualifier which indicates that the
element value conforms to some external and widely recognized scheme; and the LANGUAGE qualifier
which indicates the language of the element value. It has been agreed that the use of qualifiers
should refine the element rather than extend it. In general, the intention is that a Web robot
should be able to take the embedded Dublin Core metadata, throw away all of the qualifiers and
still have something meaningful to add to its index. However, the widespread use of qualifiers
could cause severe problems with interoperability.

The marked confidence in Dublin Core has had significant impact on standards-making activities
such as USMARC discussions, Z39.50, and W3C initiatives;
it has also been chosen as the solution for early implementations within projects in Australia,
Scandinavia, Europe and the US.

Dublin Core creation and management

By embedding Dublin Core metadata into Web pages and then gathering it into searchable databases
using Web robots it will be possible to provide Web-based search services with improved precision
over those currently available.

In order for Web page authors and Web-site administrators to be able to embed Dublin Core
metadata into Web pages there need to be tools available.13 As an aid to creating Dublin Core
META tags several Web based Dublin Core generators have been made available on the Web. One of
these is DC-dot, available from the UKOLN Web-site.14 DC-dot first prompts for the URL of the Web
page that you want to describe. It then retrieves that page from the Web and automatically
generates Dublin Core META tags to describe it. The Dublin Core META tags are then displayed in
such a way that they can be updated and extended manually using a Web form. Once editing is
complete the tags can be copied into a Web page using cut-and-paste to a text editor.
Alternatively, DC-dot will convert the Dublin Core into other formats, including USMARC, SOIF,
XML, IAFA/ROADS, and send these formats back to you via your Web browser or e-mail.

However, the last few years have seen a general move away from using simple text editors to
create and maintain HTML pages towards the use of more sophisticated authoring tools. These tools
do not, in general, make it easy to add META tags to Web pages. Even where tools do allow for the
creation of META tags there are longer term issues associated with embedding metadata by hand
that must be considered. What happens if the syntax for embedding metadata in HTML changes in the
future? How easy will it be to move embedded metadata into alternative metadata formats that are
likely to become more commonly used in the future, for example in PICS-NG?

More recently Web-site management tools have become available which hold all the pages for a site
in a database. A publish button causes the information in the database to be written out as a set
of HTML Web pages. These tools have the immediate advantage of standardizing the style of Web
pages across a site, and in future may become metadata aware. In the meantime the use of these
tools for managing metadata may be possible using available macro facilities.

Sites interested in home grown solutions to the issues of managing metadata may choose to hold
the metadata separately, in a neutral format, and then convert it and embed it into Web pages
using server-side include scripts. A more detailed description about one such system being
implemented at UKOLN is available elsewhere.15

WEB INDEXES

Harvesting

Once Dublin Core metadata is embedded into significant numbers of HTML Web pages it needs to be
collected into a Web index so that it can be made available using a search engine. This may be
done on a site-wide basis, to form a local site search engine, or it may be done across a group
of Web servers to form a more comprehensive search engine encompassing, for example, all the Web
pages in a geographical region or subject area. The collection of metadata from Web pages is
usually done using a Web robot. A Web robot can be thought of as an automated Web browser.
Starting from a given URL or set of URLs it visits each page in turn extracting the embedded
metadata and adding it into a database (Web index). For each page visited, the robot also
extracts all the embedded links in the page and adds them into a list of URLs still to be
visited. The robot needs to maintain this list of URLs in such a way that it does not visit the
same server too often in quick succession, thus overloading it, but also needs to ensure that
pages are revisited fairly regularly so that information in the
METADATA IN HTML
HTML allows arbitrary metadata to be embedded into the head section using the META tag. To make things
clearer, here is an example:
<HTML>
<HEAD>
<TITLE>UKOLN: UK Office for Library and Information Networking</TITLE>
<META NAME="Keywords" CONTENT="national centre, network information support, library
community, awareness, research, information services, public library networking,
bibliographic management, distributed library systems, metadata, resource discovery,
conferences, lectures, workshops">
<META NAME="Description" CONTENT="UKOLN is a national centre for support in
network information management in the library and information communities. It provides
awareness, research and information services and is based at the University of Bath">
</HEAD>
<BODY>
...
</BODY>
</HTML>
In this example, the TITLE tag and the two META tags give the title, some keywords and a short description
for the page. Note that the HTML specification does not say anything about what type of metadata should be
placed into the META tags. However, the Web robots used by some of the big Internet search engines (for
example Alta Vista) look for the two META tags shown in this example and use them to improve the
effectiveness of their searches. Words found in these tags are given extra weight when they match user
queries and pages with these tags tend to appear higher up in search results than pages without them. Because
of this, these two META tags are in fairly common usage.
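As an illustration only, the following Python sketch shows one way a harvesting program might read the TITLE and the Keywords and Description META tags from a page like the example. The names MetaExtractor and extract_metadata are hypothetical, the sample page is an abbreviated copy of the example, and only the standard library html.parser module is assumed. It is not the method used by any particular search engine.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the TITLE text and NAME/CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}          # lower-cased NAME -> list of CONTENT values
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                # A name may appear more than once, so keep every value.
                self.meta.setdefault(name.lower(), []).append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so concatenate them.
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return the title and META name/content pairs found in html_text."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


# Abbreviated version of the example page (assumption: shortened for brevity).
SAMPLE_PAGE = """<HTML>
<HEAD>
<TITLE>UKOLN: UK Office for Library and Information Networking</TITLE>
<META NAME="Keywords" CONTENT="metadata, resource discovery">
<META NAME="Description" CONTENT="UKOLN is a national centre for support in network information management.">
</HEAD>
<BODY>...</BODY>
</HTML>"""

record = extract_metadata(SAMPLE_PAGE)
print(record["title"])
print(record["meta"]["keywords"])
```

Names are stored in lower case so that NAME="Keywords" and NAME="keywords" are treated alike, and values are kept in lists because a META name may legitimately be repeated.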
database does not become out of date. For large search engines covering many Web sites it may be
necessary to run several Web robots on several machines, all feeding metadata into the same
database, in order to increase the rate at which Web pages can be indexed.

This is exactly how the big search engines, like Alta Vista, function. However, their Web robots
do not currently look for embedded Dublin Core and thus have to extract the available metadata in
the form of Keywords and Description META tags or try to automatically generate metadata based on
the text of the HTML page or simply build a full-text index. In many cases a combination of these
three approaches is taken.

In the case of building a search engine for a single Web-site it may not be necessary to run a
Web robot to collect metadata. The Web index can be built directly from the files on the Web
server filestore. This is the approach taken by the public domain CNIDR Isite software.16 Isite is
an integrated Internet publishing software package including a text indexer, a search engine and
Z39.50 communication tools to access databases.17 It is worth noting that there are a couple of
problems in building an index based directly on files rather than by using a Web robot. Firstly,
a filestore view of a Web server may include many pages that are not visible on the Web (because
they are not linked to any other pages). It may well be undesirable to include such pages in a
Web index. Secondly, metadata that is embedded using server side includes (SSI) will not be
available to a program that simply reads a file from the Web filestore.

Although none of the big search engines looks for embedded Dublin Core metadata, there are some
projects that are developing robots that do. The European DESIRE project is building a partial
European Web index, covering the Nordic countries, using a Web robot that is being enhanced to
extract embedded Dublin Core metadata.18 Similarly, the UK Electronic Libraries Programme (eLib)
NewsAgent for Libraries project will obtain information content for the service by the use of a
Web robot that will look for embedded Dublin Core and other (NewsAgent specific) metadata in
pages.19 The eLib ROADS project, which provides the tools used by the other eLib subject services
to construct databases of Internet resource descriptions, will also use this software to
construct robot-generated ROADS databases. There are other projects around the world looking at
similar areas.20 Some of these projects are described in more detail later in this Briefing.

Distributed searching

Once metadata has been collected using a Web robot, it needs to be made available for searching.
There are several approaches to this. A fundamental concept is that of centralized versus
distributed searching. A centralized search engine pulls all the metadata into a single database.
Although this database may be mirrored in several places, users only have the opportunity of
searching one database at a time. Alta Vista is an example of a centralized Web index. A
distributed Web index is made up of a group of databases that may well be physically distributed
across the Internet. In addition to sharing the load across multiple servers this approach also
allows for localized management of server databases. Searches may be sent in parallel to all the
databases and the results merged, or may be routed to appropriate databases in some way.

There are various protocols available to facilitate distributed searching, including Z39.50,
WHOIS++, and LDAP (described below). These protocols enable a client to send a search request to
a server and obtain results from several databases. Depending on the protocol and the contents of
the underlying database, the client may be able to request more detailed information about the
search results (which may initially be returned as a simple list of hits) and may also be able to
request that the full text of the object be returned. In some cases the client may be a dedicated
piece of software, for example a Java applet or a Web browser plug-in, running on the end user's
local computer. Often, however, the search client will be a CGI based gateway running on a Web
server and accessed by the end user as a Web based form.21,22

The DESIRE European Web Index, following the
distributed model, is made available using several GILS compliant Z39.50 servers, one per
country. Users indicate which of the servers they would like their search sent to as part of
specifying the search. Results from multiple servers are merged before being displayed to the
user.

In the ROADS project, distributed ROADS databases are made available using the WHOIS++ protocol.
Searches across several ROADS databases (both robot-generated and manually constructed) are
possible with searches currently being sent to each server in parallel. Future versions of the
ROADS software will support the Common Indexing Protocol, which allows servers to share knowledge
about their databases, and thus route queries between different servers in a more efficient
manner.23 It should be noted that the Common Indexing Protocol is not specific to WHOIS++ and
could be used, in theory, to route queries between multiple LDAP servers or multiple Z39.50
servers.

PROJECTS AND SERVICES USING METADATA

There are a growing number of projects and services currently using metadata for resource
discovery in a networked environment. The following section comprises a brief description of some
of these projects.

Projects funded by the Electronic Libraries Programme

Access to Network Resources projects

The UK Electronic Libraries programme (eLib), a series of projects, demonstrators and services
funded by the Joint Information Systems Committee (JISC) of the UK Higher Education Funding
Councils, was formed in 1995 in response to recommendations made by the authors of the Report of
the Joint Funding Councils' Libraries Review Group in December 1993 (the Follett Report). Amongst
other things, the Report recommended that JISC should fund the development of a limited number of
top level networking navigation tools in the UK to encourage the growth of local subject based
tools and information servers.24 Once eLib was in place, it funded several Access to Network
Resources (ANR) projects and services.25 These include:

ADAM: Art, Design, Architecture & Media Information Gateway;
Biz/ed: Business Education on the Internet;
EEVL: Edinburgh Engineering Virtual Library;
IHR-Info: Institute of Historical Research;
OMNI: Organizing Medical Networked Information;
RUDI: Resources for Urban Design Information;
SOSIG: Social Science Information Gateway.

These projects are creating large amounts of metadata for network resources in their specialist
areas. These subject services, sometimes called subject-based information gateways, are one
solution to the problem of resource discovery on the Internet. The services use specialist staff
to select Internet resources ensuring quality control, and these are then described using
human-created metadata. The subject service approach to resource discovery is based to some
extent on the traditional library model. Resources are chosen according to defined selection
criteria and they will then be manually catalogued for inclusion in a database. This process
ensures that only good quality resources are made available through the service and that
sufficient metadata is available to enable the adequate searching and retrieval of these
resources. The resulting service often provides access both by searching and by browsing, either
by a list of subject terms or by a particular subject-classification. Several of the eLib subject
services are based on the software tools developed by the ROADS project.

ROADS: Resource Organization and Discovery in Subject-based services

ROADS is an eLib project, also under the ANR strand, and is a collaboration between the Institute
of Learning and Research Technology (ILRT) at the University of Bristol, the UK Office for
Library and Information
Networking (UKOLN) at the University of Bath and the Department of Computer Studies at
Loughborough University.26 Its aim is to develop and implement a user-orientated resource
discovery system enabling users to find and access networked resources. In short, ROADS is
developing discovery software for a networked discovery framework primarily with regard to the
requirements of the eLib ANR services.

ROADS is very much concerned with metadata: its creation, organization and also how it can be
searched and presented to users. ROADS templates, the metadata format chosen for use by the ROADS
project, are based on IAFA/WHOIS++ templates, a format originally designed for anonymous FTP
archives. They are based on simple (text based and human readable) attribute/value pairs of
variable length. One major advantage of using ROADS templates is the possibility of searching
across multiple subject services using the WHOIS++ protocol.27

The nature of the ROADS project has resulted in its participation in wider discussions of
metadata and Internet resource discovery. For this reason, ROADS partners have been involved with
the Dublin Core initiative and with deployment of WHOIS++. There is also a strong focus on the
semantic interoperability of metadata formats: producing metadata mappings or crosswalks, looking
at potential interaction with the Z39.50 protocol; the development of template registries,
cataloguing rules, etc.

NewsAgent

NewsAgent for Libraries is another eLib project, this time in the Electronic Journals programme
area.28 The aim of the project is to create a user-configurable electronic news and current
awareness service for library and information professionals, the information content being taken
from selected UK library and information science journals and briefing materials from five
organizations. The service will obtain information content from a Web robot designed to look for
embedded Dublin Core and other (NewsAgent specific) metadata. As part of the project, UKOLN have
developed a replacement for the HTML summarizer that is available as part of the Harvest suite of
resource discovery tools. This work is intended to make the Harvest Web robot Dublin Core aware
and will eventually be made available with the public domain version of the Harvest software.29

European Union funded projects

DESIRE: Development of a European Service for Information on Research and Education

The DESIRE Project is an extremely large project funded by the EU Telematics for Research Sector
of the Fourth Framework Programme.30 The project is investigating Web technology and the
implementation of pilot information services on behalf of European researchers and is divided
into ten work packages. The one with the most relevance to metadata issues is work package 3
(WP3), Resource discovery and indexing,31 which has the general aim of supporting research users
of the Internet to locate information relevant to their research. The work package partners
include all of the ROADS project partners, together with NetLab (University of Lund, Sweden) and
the National Library of the Netherlands. It has two main strands:

Subject services (subject-based information gateways). Building on the subject service approach
to Internet subject services in conjunction with work done at NetLab on engineering (EELS:
Engineering Electronic Library, Sweden) and the National Library of the Netherlands (NBW:
Nederlandse Basisclassificatie Web), WP3 has looked at quality-controlled subject-based
information gateways based on library-type selection and cataloguing skills. A demonstrator is
planned for European social science information, together with further services for engineering
and fine art.

Automated indexing of WWW information sources. WP3's work on providing tools and methods for the
automatic indexing of the WWW information is an extension of work carried out at NetLab and the
National Technological Library of Denmark (DTV) on the Nordic Web Index (NWI). A European Web
Index (EWI) will be developed as part of WP3 to provide a harvesting and indexing
service for the academic sector in Europe and to establish a single uniform service with the aim
of indexing all European Internet documents relevant to the academic area.

Several reports have been produced as part of the project. NetLab have produced a
state-of-the-art review of indexing and data collection methods used in robot-based Internet
search services32 and a functional specification for a European Web Index.33 WP3 has also resulted
in a three-part report on a Specification for resource description methods which included a
survey of current metadata formats, a study of quality selection criteria for Internet subject
services, and an evaluation of the use of subject classification schemes for providing access to
Internet resources.34

BIBLINK: Linking Publishers and National Bibliographic Services

The BIBLINK project is funded by the Telematics Applications Programme of the European Commission
and aims to create an electronic link between publishers of electronic material and national
bibliographic agencies.35 The project is led by the British Library, and its partners include the
national libraries of France, The Netherlands, Norway and Spain, the Universitat Oberta de
Catalunya in Barcelona, and UKOLN. The intention of the project is that the bibliographic experts
of the national libraries of Europe, with cooperation of partners in the book industry, will be
able to examine what type of descriptive metadata would be required for catalogues of electronic
publications and to investigate the possibility of establishing electronic links for the transfer
of this metadata from publishers to national bibliographic agencies. BIBLINK intends to produce
an interactive demonstration system which would enable selected electronic publishers to transmit
metadata to national bibliographic agencies, where this data would then be enriched and converted
to specific MARC formats (primarily UNIMARC and UKMARC) for use by national libraries. The level
of data required is the minimum amount sufficient to support traditional Cataloguing in
Publication (CIP) type functions.

There are two distinct phases in BIBLINK, the second one involving the development and
installation of the demonstration system at the sites of the project partners and participating
publishers. The first phase, however, consists of a series of seven work packages investigating
background issues for BIBLINK. Work package 1, for example, made recommendations regarding what
particular formats should be accepted from publishers, deciding to look at SGML DTDs like
Simplified SGML for Serial Headers (SSSH) for complex records and the use of Dublin Core as a
minimum element set for data exchange.36 Work package 2 reviewed the important area of unique
identifiers for electronic publications, including the Uniform Resource Name (URN), the Serial
Item and Contribution Identifier (SICI) and the Digital Object Identifier (DOI).37 Other work
packages have looked at the transmission of data between libraries and publishers, conversion
processes to investigate interoperability between publishers' metadata and MARC formats, and the
important area of authentication.

Other metadata-related projects and services

Nordic Metadata Project

The Nordic Metadata Project is funded by NORDINFO, the Nordic Council for Scientific Information,
and has six participating organizations.38 The Nordic countries are used to sharing information
about printed materials, but there is an awareness that sharing information about electronic
documents has been complicated by the inadequacy of current resource discovery mechanisms. The
project is using Dublin Core, and amongst other things, is investigating the following:

The production of conversion tables and programs to convert Dublin Core to Nordic MARC formats.
An experimental converter can currently produce NORMARC, FINMARC and USMARC records. Other Nordic
formats will be added to the converter, together with a MARC to DC converter, if required. It is
intended that the software should also be easily adaptable to convert DC to non-Nordic MARC
formats.
The production of tools for the creation of Dublin Core metadata to encourage an improvement in the quality and quantity of metadata that is made available. A Nordic Metadata DC production template was published at the start of 1997 and has since been modified to conform with the changes to HTML syntax agreed at the DC 4 Workshop in Canberra.
Working with the DESIRE project to make the Nordic Web Index robot metadata aware so that it can recognize and extract embedded Dublin Core.

The range of activities being carried out by the Nordic Metadata Project (metadata creation, harvesting and interoperability) will be of great interest to others who are considering the implementation of metadata-based systems.

Arts and Humanities Data Service

The Arts and Humanities Data Service (AHDS) is funded by JISC for the collection, description and preservation of the electronic resources that result from and are used by research and teaching in the humanities.39 It consists of an executive based at King's College London, and five service providers, located throughout the UK:

Archaeology Data Service (A consortium, led by the University of York);
History Data Service (The Data Archive, University of Essex);
Oxford Text Archive (Oxford University Computing Services);
Performing Arts Data Service (Glasgow University);
Visual Arts Data Service (Surrey Institute of Art and Design).

AHDS will provide a unified catalogue giving access to its service providers' holdings and possibly to other scholarly collections. For this reason, the AHDS has examined the needs of arts and humanities scholars with regard to information discovery and resource description with the intention of identifying shared metadata requirements which could be used in a distributed catalogue.40 AHDS, in conjunction with UKOLN, initiated Resource Discovery Workshops in early 1997 so that specific requirements in all relevant disciplines could be integrated into a system giving access to a distributed, interdisciplinary and mixed-media collection of digital resources.41 It is recognized that each service provider may have its own preferred formats for storing metadata; for example, the Oxford Text Archive will be using TEI headers. The AHDS is looking at a solution where a core set of metadata, based on Dublin Core, could be used to provide top-level access to the distributed AHDS resource, while individual service providers maintain their own specific metadata for their own collections. It is possible that the subject-specific metadata created by service providers could be used to generate automatically (through metadata mappings/crosswalks) a subset of core metadata which could then be used in a top-level catalogue.

The MathN Broker

A service currently using metadata is the MathN Broker, a mathematical pre-print service based at the University of Osnabrück, Germany.42 The service grew out of a Fachinformation project run by the DMV, the German Mathematical Society. The service gives electronic access to PostScript versions of pre-prints stored on about 40 departmental Web servers in Germany.43 The Harvest software is used for indexing, but this has limitations when used with PostScript. For this reason, the pre-print service indexes metadata which was originally stored in what Roland Schwänzl has described as a "preliminary Warwick Container for HTML coded MetaData", using a format known as the MathDMV-Preprint Core.44 Since the beginning of 1997 the service has used Dublin Core elements embedded in HTML META tags. The metadata can include subject classifications from the Mathematics Subject Classification (MSC), the Physics and Astronomy Classification Scheme (PACS) and the ACM Computing Classification System (CCS), together with subject keywords and abstracts. The metadata is provided by authors using a Web page with a FORMS interface called the Mathematics Metadata Markup editor (MMM).
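
To make the idea of Dublin Core embedded in HTML META tags more concrete, the following is a minimal sketch of how an indexing robot might pull such elements out of a page. It is not the MathN Broker's or the Nordic Web Index's actual software. The "DC." prefix on the NAME attribute, the class and function names, and the sample page with its values are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect META elements whose NAME starts with 'DC.' (assumed convention)."""

    def __init__(self):
        super().__init__()
        # Element name -> list of values, because an element may be repeated.
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()
            self.elements.setdefault(element, []).append(content)


def extract_dublin_core(html_text):
    """Return a dict of Dublin Core element names to lists of values."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements


# Hypothetical page: the titles, names and classification code are sample data.
SAMPLE_PAGE = """
<HTML><HEAD>
<TITLE>A hypothetical pre-print</TITLE>
<META NAME="DC.title" CONTENT="A hypothetical pre-print">
<META NAME="DC.creator" CONTENT="Example, A.">
<META NAME="DC.subject" CONTENT="55P15">
<META NAME="DC.subject" CONTENT="homotopy theory">
<META NAME="keywords" CONTENT="not Dublin Core">
</HEAD><BODY>...</BODY></HTML>
"""

if __name__ == "__main__":
    for element, values in extract_dublin_core(SAMPLE_PAGE).items():
        print(element, values)
```

Repeated elements are kept as lists so that a record can carry both a classification code and free-text keywords under the same element, as the MathN Broker description suggests. META tags without the assumed prefix are ignored.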
Protocols
Z39.50
old URL are likely to get a failure indicating that it is no longer available. There are significant political and commercial interests which act as barriers to establishing services which will resolve identifiers to URLs.

ISSN (International Standard Serial Number)

The ISSN is a standardized international numeric code which enables the identification of serial publications, for example periodicals, newspapers, annuals or series. Serials can be in printed form, on other media (microform, floppy disk, CD-ROM or CD-i), or can be accessible online. An ISSN is normally represented as the string "ISSN" followed by two sets of four digits: for example, ISSN 0374-0536.

ISBN (International Standard Book Number)

The ISBN system is an international standard numbering system for monographs. It has traditionally been used for books, but has been expanded to include other new media such as videocassettes and electronic media. An ISBN is normally represented as the string "ISBN" followed by ten digits separated into four parts: for example, ISBN 82-7111-124-8.

SICI (Serial Item and Contribution Identifier)

The SICI is a variable length code that uniquely identifies serial issues (items) and articles within a serial (contributions). The SICI is a complex identifier split into three parts: the item segment (based on the ISSN of the serial); the contribution segment (which identifies an article or other contribution within the serial); and the control segment. For example: 0730-9295(199206)11:2<168:CRFAOC>2.0.TX;2-#.

PII (Publisher Item Identifier)

Elsevier Science developed the PII to identify journal articles independently from their packaging unit, because they may be published in different ways (database, CD-ROM, paper, World Wide Web, etc.). It is primarily intended for document items of interest to scientific publishers. For example: S0165-3806(96)00403-8.

URN (Uniform Resource Name)

Uniform Resource Names (URNs) are intended to serve as persistent, globally unique resource identifiers that fit into the larger Internet information architecture composed of, additionally, Uniform Resource Characteristics (URCs) and Uniform Resource Locators (URLs). URNs are for identification, URCs for including metadata and URLs for locating resources. URNs are designed to make it easy to map other identification schemes into URN-space. The exact format of URNs is still under discussion but it is likely that, for example, an ISBN may be represented as a URN as follows: urn:isbn:0-395-36341-1.

DOI (Digital Object Identifier)

The Digital Object Identifier (DOI) system is being developed on behalf of the Association of American Publishers (AAP). The DOI system is based around a directory, which stores an object's DOI and its associated location (URL). Queries sent to the directory result in the DOI being looked up and the location returned to the client. In Web terminology, this is a standard Hypertext Transfer Protocol (HTTP) redirect. A DOI has two parts, a globally unique part called the Publisher ID and a publisher assigned part called the Item ID. For example: 10.153/34571.

PURL (Persistent Uniform Resource Locator)

PURLs have been developed and deployed by OCLC as a naming and resolution service for general Internet resources. Functionally, a PURL is a URL. However, instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. The PURL Resolution Service associates the PURL with the actual URL and returns that URL to the client. The client can then complete the URL transaction in the normal fashion. As with the DOI this is achieved using an HTTP redirect. For example: http://purl.oclc.org/OCLC/PURL/INET96.
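
Several of the identifiers described above have enough internal structure to be checked or transformed mechanically. The sketch below illustrates this with the Briefing's own ISSN, ISBN and DOI examples. The modulus-11 check-digit rules for the ISSN and the ten-digit ISBN are not described in this Briefing and are supplied here as background. The URN form simply follows the tentative urn:isbn: example given above, and all function names are hypothetical.

```python
DIGITS = "0123456789"


def _compact(identifier):
    """Remove hyphens and spaces and upper-case any check character."""
    return identifier.replace("-", "").replace(" ", "").upper()


def is_valid_issn(issn):
    """Check an eight-character ISSN using weights 8..2 and modulus 11."""
    chars = _compact(issn)
    if len(chars) != 8 or any(c not in DIGITS for c in chars[:7]):
        return False
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return chars[7] == ("X" if check == 10 else str(check))


def is_valid_isbn10(isbn):
    """Check a ten-character ISBN using weights 10..1 and modulus 11."""
    chars = _compact(isbn)
    if len(chars) != 10 or any(c not in DIGITS for c in chars[:9]):
        return False
    if chars[9] not in DIGITS + "X":
        return False
    values = [int(c) for c in chars[:9]]
    values.append(10 if chars[9] == "X" else int(chars[9]))
    return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0


def isbn_to_urn(isbn):
    """Map an ISBN into URN-space, following the tentative form in the text."""
    if not is_valid_isbn10(isbn):
        raise ValueError("not a valid ten-digit ISBN: %r" % isbn)
    return "urn:isbn:" + isbn.strip()


def split_doi(doi):
    """Split a DOI into (Publisher ID, Item ID) at the first slash."""
    publisher_id, separator, item_id = doi.partition("/")
    if not separator or not publisher_id or not item_id:
        raise ValueError("expected PublisherID/ItemID, got %r" % doi)
    return publisher_id, item_id


if __name__ == "__main__":
    print(is_valid_issn("0374-0536"))
    print(is_valid_isbn10("82-7111-124-8"))
    print(isbn_to_urn("0-395-36341-1"))
    print(split_doi("10.153/34571"))
```

Validating before mapping means that a mistyped ISBN is rejected rather than turned into a URN that can never be resolved. The SICI and PII are left out because their internal rules are not given in enough detail here to check them reliably.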
Dublin Core: Dublin Core Metadata Element Set. A metadata format defined on the basis of international consensus which has defined a minimal information resource description, generally for use in a Web environment.
EAD: Encoded Archival Description. An SGML-based metadata format developed for the description of archives.
GILS: Government Information Locator Service. Metadata format created by the US Federal Government in order to provide a means of locating information generated by government agencies.
Granularity: The level of detail at which indexing takes place.
Harvest: A system providing a set of software tools for the gathering, indexing and accessing of Internet information. Uses SOIF.
IAFA templates: Internet Anonymous FTP Archive templates. Metadata format designed for anonymous FTP archives, now adapted for use in the ROADS project.
ROADS: Resource Organization and Discovery in Subject-based services. eLib funded project developing software for use by Internet subject services.
SGML: Standard Generalized Markup Language. An international standard (ISO 8879) for the description of marked-up electronic text.
SOIF: Summary Object Interchange Format. A metadata format developed for use with the Harvest architecture.
SSI: Server Side Includes. A mechanism for dynamically generating parts of Web pages.
TEI: Text Encoding Initiative. An attempt to define, using SGML, the encoding of literary and linguistic texts in electronic form. TEI headers are an SGML-based metadata format used for the documentation of these texts.
Warwick Framework: An architecture for the exchange of distinct metadata packages involving the aggregation of metadata packages into containers.
23. Allen, J. and Mealling, M., The Architecture of the Common Indexing Protocol (CIP). Internet-Draft, 9 June 1997. Available from: <URL:ftp://ds.internic.net/internet-drafts/draft-ietf-find-cip-arch-00.txt>
24. Joint Funding Councils' Libraries Review Group, Report [The Follett Report]. Bristol: Higher Education Funding Council for England, December 1993, Section 265.
25. Electronic Libraries Programme, Project details. Available from: <URL:http://www.ukoln.ac.uk/services/elib/projects/>
26. ROADS. Available from: <URL:http://www.ukoln.ac.uk/roads/>
27. Knight, J. and Hamilton, M., Overview of the ROADS software. LUT CS-TR 1010. Loughborough: Loughborough University of Technology, March 1996. Available from: <URL:http://www.roads.lut.ac.uk/Reports/arch/arch.html>
28. NewsAgent for Libraries. Available from: <URL:http://www.sbu.ac.uk/~litc/newsagent/>
29. Harvest Web Indexing. Available from: <URL:http://www.tardis.ed.ac.uk/harvest/>
30. DESIRE. Available from: <URL:http://www.nic.surfnet.nl/surfnet/projects/desire/desire.html>
31. DESIRE WP3 Resource Discovery and Indexing. Available from: <URL:http://www.ub2.lu.se/desire/>
32. Koch, T., Ardö, A., Brümmer, A. and Lundberg, S., The building and maintenance of robot based Internet search services: a review of current indexing and data collection methods. Draft D3.11 (version 3) for Work Package 3 of Telematics for Research project DESIRE, September 1996. Available from: <URL:http://www.ub2.lu.se/desire/radar/reports/D3.11/>
33. Lundberg, S., Ardö, A., Brümmer, A. and Koch, T., The European Web Index: an Internet search service for the European higher education, research and development communities. Deliverable 3.1 for Work Package 3 of Telematics for Research project DESIRE, 1996. Available from: <URL:http://www.nic.surfnet.nl/surfnet/projects/desire/deliver/WP3/D3-1.html>
34. Specification for resource description methods. Deliverable for Work Package 3 of Telematics for Research project DESIRE, February-May 1997. Available from: <URL:http://www.ukoln.ac.uk/metadata/DESIRE/specification.html>
35. BIBLINK. Available from: <URL:http://www.ukoln.ac.uk/metadata/BIBLINK/>
36. Heery, R., Metadata formats. Work Package 1 of Telematics for Libraries project BIBLINK (LB 4034), November 1996. Available from: <URL:http://www.ukoln.ac.uk/metadata/BIBLINK/wp1/d1.1/>
37. Høgås, H., van der Werf, T. and Powell, A., Identification. Work Package 2 of Telematics for Libraries project BIBLINK (LB 4034), May 1997. Available from: <URL:http://www.ukoln.ac.uk/metadata/BIBLINK/wp2/d2.1/>
38. Nordic Metadata Project. Available from: <URL:http://linnea.helsinki.fi/meta/>
39. Arts and Humanities Data Service. Available from: <URL:http://ahds.ac.uk/>
40. Dempsey, L. and Greenstein, D., Proposal to identify shared metadata requirements. 15 January 1997. Available from: <URL:http://www.kcl.ac.uk/projects/ahds/jobs/proposal.html>
41. Miller, P., Resource Discovery Workshops: a guide to implementation and participation. 23 May 1997. Available from: <URL:http://www.york.ac.uk/~apm9/focus01.html>
42. MathN Broker. Available from: <URL:http://www.mathematik.uni-osnabrueck.de/harvest/brokers/MathN/>
43. Plümer, J. and Schwänzl, R., A mathematics preprint index: DC in an application. [Paper for: 4th Dublin Core Metadata Workshop, Canberra, 3-5 March 1997]. Available from: <URL:http://www.dstc.edu.au/DC4/roland/>
44. Scheme Definition: DMV MetaData for Mathematical Papers, Version 1.2. Available from: <URL:http://www.mathematik.uni-osnabrueck.de/ak-technik/DMVPreprint-Core.html>
45. Dempsey, L., Distributed library and information systems: the significance of Z39.50. Managing Information, 1(6), June 1994, 41-43.
46. Turner, F., An overview of the Z39.50 Information Retrieval standard. IFLA Universal Dataflow and
Telecommunications Core Programme, Occasional Paper, 3, July 1995, rev. January 1997. Available from: <URL:http://www.nlc-bnc.ca/ifla/VI/5/op/udtop3.htm>
47. Raggett, D., Le Hors, A. and Jacobs, I. (eds.), HTML 4.0 Specification. W3C Working Draft. 18 July 1997. Available from: <URL:http://www.w3.org/TR/WD-html40-970708/>
48. XML White Paper. Microsoft Corporation, June 23, 1997. Available from: <URL:http://www.microsoft.com/standards/xml/xmlwhite.htm>
49. Resnick, P. and Miller, J., PICS: Internet access controls without censorship. Communications of the ACM, 39(10), October 1996, 87-93.

FURTHER READING

Metadata is one of those subjects that has a rapidly growing literature and is also an area which has regular changes of focus and emphasis. As can be seen by the references in this Briefing, a large amount of information on metadata topics is available on the Internet and specifically through the World Wide Web. For these reasons it may be useful to note the following Web sites devoted to keeping up-to-date with the subject:

Dempsey, L., ROADS to Desire: some UK and other European metadata and resource discovery projects. D-Lib Magazine, July/August 1996. Available from: <URL:http://www.dlib.org/dlib/july96/07dempsey.html>
Dempsey, L. and Heery, R., Metadata: a current view of practice and issues. Journal of Documentation (forthcoming).
Dempsey, L. and Weibel, S., The Warwick Metadata Workshop: a framework for the deployment of resource description. D-Lib Magazine, July/August 1996. Available from: <URL:http://www.dlib.org/dlib/july96/07weibel.html>
Heery, R., Review of metadata formats. Program, 30(4), October 1996, 345-373.
Lynch, C., Searching the Internet. Scientific American, 276(3), March 1997, 44-48. Also available from: <URL:http://www.sciam.com/0397issue/0397lynch.html>
Weibel, S., The World Wide Web and emerging Internet resource discovery standards for scholarly literature. Library Trends, 43(4), Spring 1995, 627-644.
Weibel, S., Metadata: The Foundations of Resource Description. D-Lib Magazine, July 1995. Available from: <URL:http://www.dlib.org/dlib/July95/07weibel.html>

The authors would like to thank Lorcan Dempsey for commenting on a draft version of this Briefing.