Introduction To XML and Its Applications
Laura Papaleo
Department of Informatics and Computer Science, University of Genova
Via Dodecaneso, 35 16100 Genova, Italy
1. Introduction
XML is hugely important. It has been described as “the holy grail of computing,
solving the problem of universal data interchange between dissimilar systems”
(Dr. Charles Goldfarb). XML is basically a handy format for everything from
configuration files to data and documents of almost any type. The first version of
XML became a W3C Recommendation in 1998, and its fifth edition became a
Recommendation in 2008 [1].
XML is significant, but it is a hard subject to describe briefly in a chapter,
because the name covers a whole family of technologies and specifications. In
its first ten years it has been remarkably successful and has become the
foundation of a multitude of applications.
The goal of this chapter is to present the XML meta-language and to give an
overview of its most significant parts. We will describe the syntax for creating
XML documents and how we can structure them by defining specific grammars
(DTDs and XML Schemas). We will also show how to render XML documents
using CSS style sheets and how to transform and render them with a family of
XML-based languages (XSL, XSLT and XPath). The end of the chapter is
dedicated to a snapshot of the “life” around XML, to let the reader
understand the immense impact of XML on today's technological world.
standard since 1986, ISO 8879:1986. SGML has been defined as a powerful and
extensible markup language whose main goal is the semantic markup of any type
of content. This functionality is particularly useful for cataloging and indexing
data [7]. SGML can be used to create an unlimited number of markup languages
and has a host of other resources as well.
Historically it has been used by experts and scientific communities. However,
SGML is complex and expensive: adding SGML capability to an application could
double its price. As a consequence, commercial Web browsers chose not to
support SGML.
Both SGML and XML are widely used for the definition of device-independent,
system-independent methods of storing and processing texts in electronic form.
Comparing the two languages, XML is basically a simplification or derivation
of SGML, developed with the emerging Web technologies in mind [7].
Compared with HTML, XML has some important characteristics. First of all,
XML is extensible, so it does not contain a fixed set of tags. Additionally, XML
documents must be well-formed according to a strict set of rules, and may be
formally validated (using DTDs or XML Schemas), while HTML documents can
contain errors and browsers still render the pages as well as possible. Also,
XML focuses on the meaning of data, not its presentation.
It is important to understand that XML is not a replacement for HTML. In
most web applications, XML is used to transport data, while HTML is used to
format and display the data for the Web. Additionally, thanks to XML, HTML
evolved into XHTML [9] which is basically a reformulation of HTML (version
4) in XML 1.0.
completely the planned goals. In 2008, the fifth edition of XML was
approved as a W3C Recommendation, and the working groups on XML are still
active.
Fig. 1 – an example of an XML document (left) and the associated tree structure (right)
The first line of the prolog is the declaration (see Fig. 2). It tells the
processor that what follows is XML and gives additional information such as
the encoding. Other components can be inserted in the prolog of an XML
document, for example the associated schemas (either DTDs or XML Schema – see
Section 6 and Section 7) or the attached style sheets (in CSS or XSL – see
Section 8 and Section 9).
Fig. 2 – an example of an XML document. The prolog, the body and the root element are outlined.
The image also shows the syntax for opening and closing a tag and the syntax for an attribute.
The body of an XML document is made up of elements (see Fig. 2). Each element
is defined using two basic components: data and markup. Data is the actual
content, that is, the information to be structured. Markup, instead, is
meta-information that describes the data. What follows is an example of an
element, structuring the information regarding a message body:
For a reader who is familiar with HTML, XML elements will be easy to
understand, since the syntax is very similar. The markup consists of tags in
the form <tagName>…</tagName>. Elements can also contain other elements and can
have attributes. For the attributes, again, the syntax is very simple. Each
attribute can be specified only in the element start tag; it has a name and a
value, and the value must always be enclosed in quotation marks (single or
double). Fig. 2 shows an example of an XML document outlining the prolog, the
body, the elements and the attributes.
Empty elements do not have the closing tag </tagName>; instead, they have
a “/” at the end of the tag. Code (4.1) represents an empty tag with attributes.
<message>
  From Robert
  <to>Mario</to>
</message>                                                        (4.2)
Over the years, discussions have arisen within scientific and technical
communities on when and why to encode information in attributes rather than as
content of elements. There is no fixed rule, and the choice depends on the
designer. However, XML attributes are normally used to describe elements, or to
provide additional information about elements. So, basically, metadata (data
about data) should be stored as attributes, and the data itself should be
stored as elements.
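The distinction between attributes and element content is also visible to a program that reads the document. The following is a minimal sketch in Python, using the standard xml.etree.ElementTree module; the message document, its reply attribute and the body text are illustrative assumptions modeled on the message examples of this chapter.

```python
import xml.etree.ElementTree as ET

# Illustrative document: "reply" is metadata about the message (attribute),
# while sender, recipient and body are the data itself (elements).
message = ET.fromstring(
    '<message reply="no">'
    "<from>Robert</from><to>Mario</to><body>See you at five.</body>"
    "</message>"
)

reply_flag = message.get("reply")                      # attribute value
sender = message.find("from").text                     # element content
recipients = [to.text for to in message.findall("to")]

print(reply_flag, sender, recipients)
```

Attributes are read by name from the element that carries them, whereas element content is reached by navigating to the child elements.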
When necessary, comments can be inserted into an XML document. Their
syntax is the same as for comments in HTML: <!-- text of the comment -->.
Once defined, an entity can be referenced in the content of the document using
the syntax “&myname;”. The following is a piece of XML code showing how
to use the entity myname.
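As a hedged illustration of how an entity reference is resolved, the sketch below declares the entity myname in an internal DTD subset and lets a parser expand it. It uses Python's standard xml.etree.ElementTree module; the replacement text "Robert" and the message element are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# The entity "myname" is declared in the internal DTD subset and then
# referenced with &myname; inside the content of an element.
document = """<?xml version="1.0"?>
<!DOCTYPE message [
  <!ENTITY myname "Robert">
]>
<message><from>&myname;</from></message>"""

root = ET.fromstring(document)
expanded = root.find("from").text  # the parser replaces &myname; with its value

print(expanded)
```

A reference to an entity that has not been declared is reported by the parser as an error instead of being left in the text.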
Unlike HTML, where documents with structural errors are still rendered by a
browser, XML has strict rules, and an XML document must be correctly
structured in order to be machine-understandable.
The XML specification prohibits XML parsers from trying to fix and understand
malformed documents. All a conforming parser is allowed to do is report the
error.
Thus, an XML document must be well-formed. According to W3C, a well-formed
XML document:
• contains a unique opening and closing tag that enclose the whole document,
called the root element
• has all the elements with the closing tag, or empty elements correctly written
• has all the tag and attribute names written according to the case-sensitive
rule, that is, for example, the tag <name> cannot be closed with
</Name>. In other words, element and attribute names may use any case
chosen, as long as they are consistent.
• has all the elements properly nested, i.e. there must be an opening and a
closing tag and the tags cannot overlap. For example if the tag <name> has […]
These are the most important constraints for well-formedness, but they are
far from a complete list: the XML Specification [1] provides all the necessary
details.
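The well-formedness constraints listed above can be observed with any conforming parser. The following is a minimal sketch in Python (standard xml.etree.ElementTree module); the helper name is_well_formed and the sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "correct": "<person><name>Mario</name></person>",
    "case mismatch": "<person><name>Mario</Name></person>",
    "overlapping tags": "<person><name>Mario</person></name>",
    "two root elements": "<person/><person/>",
    "unclosed element": "<person><name>Mario</person>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "rejected")
```

Only the first sample is accepted; in every other case the parser does not attempt a repair and simply reports the error, as the specification requires.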
Well-formed XML documents simply mark up content with descriptive tags.
This means that there is no need to describe or explain what the chosen
tags mean. We will see in Section 6 and Section 7 how DTDs and XML Schemas
can define the meaning of the tags and can enforce the structure.
5. Namespaces
In XML, element names are defined by developers. This means that different
organizations can use the same tag to mark up content with different semantics.
But XML was also designed to allow interoperability and data exchange
among different organizations, so there must be a way to combine several XML
sources without ambiguity.
XML namespaces are used for providing uniquely named elements and
attributes in an XML instance [13]. They are defined by a W3C recommendation
called Namespaces in XML. As defined by the W3C, an XML namespace is a
collection of XML elements and attributes identified by an Internationalized
Resource Identifier (IRI); this collection is often referred to as an XML
vocabulary [14].
Using namespaces, name conflicts can be solved, thus allowing the correct
integration of data. This means that, if each vocabulary is given a
namespace, then the ambiguity between identically named elements or attributes
can be resolved.
Namespaces are declared as an attribute of an element by using the xmlns
attribute in the start tag of the element. It is not mandatory to declare
namespaces only at the root element; they can be declared at any element in
the XML document. A namespace has a scope which begins at the element where
it has been declared and applies to the entire content of that element, unless
overridden by another namespace declaration with the same prefix name [15]. A
namespace is declared as follows, which can be read as binding the prefix
"myname" to the namespace http://www.whatever.com:
collects names of properties (e.g. FOAF [18]) or can describe a set of functions,
as it is the case for the XPath 2.0 Data Model [19].
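To make the role of namespaces concrete, the sketch below combines two vocabularies that both define a title element. It is written in Python with the standard xml.etree.ElementTree module; the element names, the prefixes bk and ppl, and the example.org URIs are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both use "title"; the namespace URIs keep them apart.
document = """<catalog xmlns:bk="http://www.example.org/books"
         xmlns:ppl="http://www.example.org/people">
  <bk:title>Introduction to XML</bk:title>
  <ppl:title>Dr.</ppl:title>
</catalog>"""

root = ET.fromstring(document)

# The parser replaces each prefix with the URI it is bound to.
qualified_names = [child.tag for child in root]

# Lookups use the URI; the prefix used in the query ("b") is local to the query.
book_title = root.find("b:title", {"b": "http://www.example.org/books"}).text

print(qualified_names)
print(book_title)
```

The prefix is only a shorthand: what identifies the vocabulary is the URI, so the query can use a different prefix from the one that appears in the document.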
Element declarations describe the allowable set of elements within the document,
and specify whether and how declared elements (and character data) may be
contained within each element.
Recall that (Section 4) elements in XML documents can enclose other
elements, can be empty, can contain content or can be mixed (containing content
and other elements). In a DTD the possible declarations for elements are the
following:
• […] a mixed element
• <!ELEMENT element-name ANY>, for defining an element for which no
further details are provided
Term   Meaning
,      Separates members of a sequence list and indicates sequential use of all members
|      Separates members of a choice list and requires use of one and only one member
+      Indicates a required and repeatable occurrence
*      Indicates an optional and repeatable occurrence
?      Indicates an optional occurrence
Table 1 – The special characters used to formalize repetition and order when defining
elements in a DTD
where we define that the element book can contain only other elements (not
directly content) and, specifically a title (title), then one or more authors
(author+) and successively one or more chapters (chapter+).
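The occurrence indicators of Table 1 behave much like the operators of regular expressions. The sketch below exploits this analogy to check the children of a book element against the content model (title, author+, chapter+). It is not a DTD validator: it is a plain Python illustration (standard re and xml.etree.ElementTree modules), it ignores character data, and the names BOOK_MODEL and children_match_model are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Regular-expression analogue of the content model (title, author+, chapter+).
# Every child name is followed by one space so that names cannot run together.
BOOK_MODEL = re.compile(r"(title )(author )+(chapter )+")

def children_match_model(element):
    """Return True if the child element names follow BOOK_MODEL in order."""
    names = "".join(child.tag + " " for child in element)
    return BOOK_MODEL.fullmatch(names) is not None

valid_book = ET.fromstring(
    "<book><title/><author/><author/><chapter/></book>")
wrong_order = ET.fromstring(
    "<book><author/><title/><chapter/></book>")
missing_chapter = ET.fromstring(
    "<book><title/><author/></book>")

for book in (valid_book, wrong_order, missing_chapter):
    print(children_match_model(book))
```

A validating parser performs this kind of check for every declared element, using the declarations found in the DTD.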
In the DTD, XML element attributes are declared with an ATTLIST declaration.
Attribute-list declarations name the allowable set of attributes for each declared
element, including the type of each attribute value or, alternatively, an explicit
set of valid values [22]. An attribute declaration has the following syntax:
The attribute types are the following: CDATA (character data), ID,
IDREF and IDREFS, NMTOKEN and NMTOKENS, ENTITY and ENTITIES,
NOTATION, enumerations and NOTATION enumerations. These data types
are listed in Table 2.
Value            Explanation
CDATA            The value is character data
(eval|eval|..)   The value must be an enumerated value
ID               The value is a unique id
IDREF            The value is the id of another element
IDREFS           The value is a list of other ids
NMTOKEN          The value is a valid XML name token
NMTOKENS         The value is a list of valid XML name tokens
ENTITY           The value is an entity
ENTITIES         The value is a list of entities
NOTATION         The value is a name of a notation
xml:             The value is predefined
Table 2 – The possible types of an attribute in a DTD declaration.
The document type declaration, which is situated after the XML declaration, is a
mechanism for naming the document type with which a document complies and for
including its definition. Valid XML documents must declare the document type
they follow, so that editors, browsers or converters can read the DTD to understand
the template structure.
Well-formed documents can also include a document type declaration and
include markup declarations in their external subset, but are not required to do so.
The document type declaration names the document type by making reference to
the root element of the document. It can make reference to an external DTD,
called the external DTD subset, include the DTD internally in the internal DTD
subset or use both. Document type declarations take the general form [22]:
An XML document that depends on declarations in an external DTD has the
attribute standalone in the XML declaration set to no (the value yes states
that no external markup declarations affect the document). This means that the
very first line of a document which follows a specific external DTD will be the
following:
DTD
<!ELEMENT message (from,to+,body) >
<!ELEMENT from (#PCDATA) >
<!ELEMENT to (#PCDATA) >
<!ELEMENT body (#PCDATA) >
<!ATTLIST message reply (yes|no) "no" >
[…] the XML Schema facilities and of the language
• XML Schema Part 1: Structures [24] and XML Schema Part 2: Datatypes
[25], which provide the complete normative description of the XML Schema
language.
Describing all the characteristics of the XML Schema language is beyond the
scope of this chapter, since it would require a book of its own. Here, we will
outline the properties of the language with explanatory examples. The reader
can refer to the online specifications [26] or to the available books on this topic
[6,22] for more details.
An XML schema (also called XSD) is an XML document. It starts with the
XML declaration and continues by opening the root element <schema> and
by defining the specific namespace. Within this root element all the
specifications are defined. The schema ends by closing the root element
</schema>, as in any well-formed XML document. Thus, in an XSD file (a simple
text file with extension “.xsd”), the skeleton is the following:
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
[… body of the schema …]
</xsd:schema>
The body of the schema contains element declarations. There exist four main
<xsd:element name="book">
<xsd:complexType>
[… complex type definition …]
</xsd:complexType>
</xsd:element>
If necessary, XSD documents allow new simple types to be derived from existing
types, by using the xsd:simpleType element. It basically defines a subtype.
The name attribute assigns a name to the new type, by which it can be referred to
in the type attribute of an xsd:element. Different types of elements can be used to
<xsd:simpleType name="animal">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="dog"/>
<xsd:enumeration value="cat"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:element name="webby" type="animal" />
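Since a schema is itself an XML document, it can be read with an ordinary XML parser. The sketch below, in Python with the standard xml.etree.ElementTree module, extracts the values allowed by the animal type defined above and uses them for a hand-written membership test. This is not schema validation (the standard library has no XSD validator); the helper is_valid_animal is a hypothetical name, and the schema wrapper around the type definition is added only to make the fragment parseable.

```python
import xml.etree.ElementTree as ET

XSD_NS = {"xsd": "http://www.w3.org/2001/XMLSchema"}

schema_text = """<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="animal">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="dog"/>
      <xsd:enumeration value="cat"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:element name="webby" type="animal"/>
</xsd:schema>"""

schema = ET.fromstring(schema_text)

# Collect the values listed by the enumeration facets of the "animal" type.
allowed = [
    facet.get("value")
    for facet in schema.findall(
        "xsd:simpleType[@name='animal']/xsd:restriction/xsd:enumeration",
        XSD_NS,
    )
]

def is_valid_animal(text):
    """Hand-written check: is the text one of the enumerated values?"""
    return text in allowed

print(allowed, is_valid_animal("dog"), is_valid_animal("horse"))
```

With this type, an element <webby>dog</webby> would be accepted by a validating processor, while <webby>horse</webby> would be rejected.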
XML Schema
<?xml version="1.0"?>
<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="message" type="messageType"/>
  <xsd:complexType name="messageType">
    <xsd:sequence>
      <xsd:element name="from" type="xsd:string"
                   minOccurs="1" maxOccurs="1"/>
      <xsd:element name="to" type="xsd:string"
                   minOccurs="1" maxOccurs="unbounded"/>
      <xsd:element name="author" type="xsd:string" />
      <xsd:element name="body" type="xsd:string"
                   minOccurs="1" maxOccurs="1"/>
    </xsd:sequence>
    <!-- xsd:boolean accepts true/false/1/0, so the DTD default "no" becomes "false" -->
    <xsd:attribute
        name="reply" type="xsd:boolean" default="false"/>
  </xsd:complexType>
</xsd:schema>
that must be applied to elements in the reference file. A rule consists of two parts:
a selector and a declaration, with the syntax:
The selector is the reference to the element in the file that must be rendered,
and the declaration is the part of the rule that states what the effect will be.
The following is an example of a rule applied to an element H1 in an HTML page.
H1 {color:black;}
CSS2 (Cascading Style Sheets Level 2) defines around 120 properties, and
different values can be assigned to each of them. For a tutorial on CSS/CSS2 see, for
example, [28] or the latest specification at [29].
CSS/CSS2 style sheets can be easily added to HTML documents using the
link element, which creates a link to the external style sheet. In XML it is
possible to attach external style sheets by means of the xml-stylesheet
processing instruction, which must be placed in the prolog of the XML
document. The syntax is the following:
Just as with the link element of HTML, there can be multiple xml-stylesheet
processing instructions, meaning that it is possible to attach
multiple style sheets to an XML document. Besides href, which locates the style
sheet, the possible attributes include type, media and title, so that each
style sheet can have a local name (title), can be applied if the display medium
is of a given kind (print, screen..) and has a specific type (usually text/css).
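As a hedged illustration, the sketch below builds an xml-stylesheet processing instruction with Python's standard xml.etree.ElementTree module and places it in the prolog of a small document. The style sheet name style1.css and the element names exercise, title and body follow the example discussed in this section; the text content of the elements is an assumption.

```python
import xml.etree.ElementTree as ET

# Build the processing instruction that attaches the CSS style sheet.
pi = ET.ProcessingInstruction(
    "xml-stylesheet", 'type="text/css" href="style1.css"')
pi_text = ET.tostring(pi, encoding="unicode")

# A small document whose element names match the CSS rules of the example.
root = ET.Element("exercise")
ET.SubElement(root, "title").text = "Exercise 1"
ET.SubElement(root, "body").text = "Write a well-formed XML document."

# The processing instruction goes in the prolog, before the root element.
document = (
    '<?xml version="1.0"?>\n'
    + pi_text + "\n"
    + ET.tostring(root, encoding="unicode")
)

print(document)
```

The resulting text is still a well-formed XML document: a parser that does not handle style sheets simply ignores the processing instruction.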
To show how to attach a CSS style sheet to an XML document, we provide here
a very simple example. Given the following XML code, it will be rendered in a
browser as shown in Fig. 4-(a).
Fig. 4 – (a) an example of an XML document and how it is rendered in a browser. No specific
formatting is applied. (b) the same XML document with a CSS style sheet applied.
By adding the processing instruction for including a CSS style sheet (named
style1.css), the resulting XML code will be
The CSS file contains the following simple rules: one for the element
exercise (thus it applies to the element and all its children), one for the element
title and another for the element body. The result of applying style1.css to
the XML document is shown in Fig. 4-(b).
exercise { font-family:Arial }
title { display:block; color:red;
font-size:14pt;
font-weight:bold }
body { color:black; font-size:12px }
Note that, even if the tags are no longer visible in the browser window, the
entire XML document remains freely readable in the source of the page. Also,
the information is presented strictly in the order in which it has been modeled
in the XML document (as in the case of simple HTML pages). Suppose, for
example, that the initial XML document is a list of books for a library: it is
impossible to show them in alphabetical order if the books have not been
inserted in that order. Additionally, if part of the content has been modeled
inside attributes, there is only very limited support for accessing the attribute
values and showing them in the rendered page. These are some of the limitations
of CSS (CSS2) in the context of XML documents. An immediate observation is that
XML is not a replacement for HTML: for creating web pages, HTML (or, better,
XHTML) is more than enough. XML exists for structuring data, and
means for modifying, transforming and querying this data are necessary.
In the following section we will show how XSL and XSLT support this kind
of functionality.
• the XML Path Language (XPath), [19] which is an expression language used
transforming XML documents into other XML documents (even XHTML)
9.1 XSL Transformations (XSLT) and the XML Path Language (XPath)
defines the rules for transforming an XML document and the chosen XSLT
processor does the work and produces the output.
XSLT relies on a technology called XPath. The XPath language allows XSLT to
identify nodes (elements, attributes, and other objects) in XML documents, and
it also provides functions for performing calculations [33].
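To give an idea of how XPath expressions identify nodes, the sketch below evaluates a few paths with Python's standard xml.etree.ElementTree module. This module implements only a limited subset of XPath (no functions, for instance), so it is an approximation of what a full XSLT processor offers; the library document and its content are hypothetical.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring("""
<library>
  <book year="2006"><title>Book B</title></book>
  <book year="1999"><title>Book A</title></book>
</library>""")

# Select all title elements that are children of a book element.
titles = [t.text for t in library.findall("./book/title")]

# Select only the books whose year attribute has a given value.
books_2006 = [b.find("title").text
              for b in library.findall("./book[@year='2006']")]

# Attribute values are reachable too, unlike with the CSS rules seen before.
years = [b.get("year") for b in library.findall("./book")]

# Once selected, nodes can be reordered independently of document order.
sorted_titles = sorted(titles)

print(titles, books_2006, years, sorted_titles)
```

Selection by path, access to attributes and reordering are exactly the operations that CSS alone could not provide in the previous section.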
To understand how XSLT works, we start from an XML document and we
apply an XSLT template to transform the content of this document into an HTML
page. The input XML document is the following:
It is a very simple document, with only the root element TitleBook, which
directly contains the content, with no sub-elements. The objective of the
XSLT transformation we are going to produce is to take the content of the
element TitleBook and to put it inside an H1 tag of an HTML page.
Recall that an XSLT transformation file is, first of all, an XML document,
thus it follows the same syntax as any other XML document. Also, in order to
“use” the XSLT language we have to define the appropriate namespace. The
skeleton of the transformation file will be:
The first line is the XML declaration, the second defines the root element
<xsl:stylesheet> and the XSLT namespace (prefix xsl:) with the official
W3C URI http://www.w3.org/1999/XSL/Transform. The third line,
instead, provides a directive on the output method, namely HTML. As any well-
formed XML, the style sheet ends with the root element closing tag, in this case,
</xsl:stylesheet>.
The other necessary directives are few and simple. First of all we need to
match the root element and apply a style sheet template to it, taking
the associated content and rewriting it as content of the HTML tag H1.
For doing this, we use the following code:
Once the root element (node in the source tree) has been selected, we
“extract” the content using another XSLT directive <xsl:value-of…>. It has
an attribute select which contains another XPath expression. In the case of the
example, we extract the content inside the element TitleBook. Finally, by
putting the necessary HTML tags and the h1 tag “around” the extracted content,
we create the web page.
An XSLT processor (even modern browsers support this feature) takes both the
XML and the XSLT documents as input, interprets the XSLT directives and creates
the new HTML page. At the time of writing, XML, XSLT, and
XPath are supported by the following browsers: Mozilla Firefox 3,
Internet Explorer 6+, Google Chrome, Opera 9 and Apple Safari 3+.
The operational schema of an XSLT transformation is the one presented in
Fig. 5.
Fig. 5 – one or more XML documents with one or more XSLT transformations are passed to the
XSLT processor which builds the output document.
<xsl:template match="chapter">
  <xsl:for-each select="paragraph">
    <xsl:value-of select="."/>
  </xsl:for-each>
</xsl:template>
Table 4 presents a set of elements for XSLT to provide an idea of the main
functionalities
<?xml version="1.0"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
<fo:layout-master-set>
<fo:simple-page-master master-name="A4">
<!-- Page template goes here -->
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="A4">
<!-- Page content goes here -->
</fo:page-sequence>
</fo:root>
[…] page geometries and page selection patterns
• fo:page-sequence, which contains the definition of information for a
sequence of pages with common static information
The interpretation of the two main elements above is the following: when the
formatter reads the XSL-FO document, it creates a page based on the first
template in the fo:layout-master-set. Then it fills it with content from the
fo:page-sequence. When it has filled the first page, it instantiates a second page
based on a template, and fills it with content. The process continues until the
formatter runs out of content [22].
The page templates are called page masters. Each defines a general layout for
a page including its margins, the sizes of the header, footer, and body area of the
page, and so forth. XSL-FO 1.0 defines exactly one kind of page master, the
fo:simple-page-master, which represents a rectangular page. The
fo:layout-master-set contains one or more fo:simple-page-master
elements that define master pages.
For example, we present in the following portion of XSL-FO code a
fo:layout-master-set containing one fo:simple-page-master. It
contains a single region, the body, into which all content will be placed.
<fo:layout-master-set>
<fo:simple-page-master master-name="…"
page-height=".." page-width=".." […]>
<fo:region-body/>
</fo:simple-page-master>
</fo:layout-master-set>
(i) An optional fo:title element containing inline content that can be used as
the title of the document.
(ii) Zero or more fo:static-content elements containing the content repeated on every page
(iii) One fo:flow element containing data to be placed on each page in turn (in
case of pagination)
The following is an example of code for defining the page sequence:
<fo:page-sequence master-reference="chaps">
<fo:static-content flow-name="…">
<fo:block text-align="outside" …>
Chapter
<fo:retrieve-marker
retrieve-class-name="chapNum"/>
<fo:leader leader-pattern="space" />
<fo:retrieve-marker retrieve-class-name="chap"/>
<fo:leader leader-pattern="space" />
Page
<fo:page-number font-style="normal" />
of
<fo:page-number-citation ref-id='end'/>
</fo:block>
</fo:static-content>
<fo:flow flow-name="…">
<fo:block>
<!-- Output goes here -->
</fo:block>
</fo:flow>
</fo:page-sequence>
In the example, the sequence of pages is defined for chapters in a book and
the portion of document gives directives for the rendering of the chapter
numbers, the page number and other information.
DTDs and XML Schemas. We also reviewed how XML documents can be
rendered using CSS style sheets and how they can be transformed and rendered
using XSL/XSLT, which are powerful XML-based languages for creating
directives to deal with XML documents.
A simple web search for the word “XML” is enough to appreciate the
enormous impact of XML in scientific and industrial scenarios.
It has proved to be a powerful means to allow interoperability and to
improve communication among business entities, which has emerged as a
real necessity thanks also to the evolution of the Web. Looking at the W3C home
page, it is clear how many different technologies have been developed upon or
around XML. Several working groups and activities have been defined and are
active in many different topics related to XML.
In the context of this book, the Semantic Web Activity is maybe the most
interesting [38]. The Semantic Web is a web of data, as the reader will discover
in the other chapters of this book. This activity includes different
[…] represented in XSLT), for extracting this data from the document.
• The Web Ontology Language OWL [39] is a semantic markup language for
[…] Framework) and is derived from the DAML+OIL Web Ontology Language.
• SPARQL query language for RDF, which can be used to express queries […]
• Simple Object Access Protocol (SOAP): A protocol that is object based and […]
[…] management [10].
[…] write interactive multimedia presentations.
• Scalable Vector Graphics (SVG), a language for describing two-
dimensional graphics and graphical applications in XML.
• XML Query (XQuery), a standardized language for combining documents,
databases, Web pages and almost anything else.
• WSDL: Web Service Description Language. An XML format for describing
XML web services, including the type definitions, messages and actions used
by that service. The WSDL document should tell applications all they need to
know to invoke a particular web service.
Acknowledgments
This work was supported by the University of Genova. I would like to thank all
the authors of existing books and online tutorials on XML who, by making their
work available on the web, help to spread knowledge of this powerful technology.
11. References
1. Extensible Markup Language (XML) 1.0 Fifth Edition (2008). W3C Recommendation, Eds.
T. Bray, J.Paoli, C. M. Sperberg-McQueen, E.Maler, F.Yergeau, www.w3.org/TR/REC-xml/
2. XML, Wikipedia, the free encyclopedia (last access 2009), http://en.wikipedia.org/wiki/XML
3. St. Laurent, Simon. (1999). XML: A Primer. Foster City: M&T Books.
4. Bryan, Martin. (1998). "An Introduction to the Extensible Markup Language (XML)."
Bulletin of the American Society for Information Science 25, 1. 11-14.
5. Introduction to XML, W3Schools (last access 2009), www.w3schools.com/XML/
6. Møller Anders and Schwartzbach Michael I, An Introduction to XML and Web Technologies,
Addison-Wesley, ISBN: 0321269667 January 2006
7. A.Attipoe and P.Vijghen (1999). “XML/SGML: On the Web and Behind the Web."
InterChange: Newsletter of the International SGML/XML Users' Group Volume 5, Issue 3,
pages 25-29
8. Jon Bosak, (2003), The Birth of XML: A Personal Recollection, http://java.sun.com/xml/
9. XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition) A
Reformulation of HTML 4 in XML 1.0 (2002) W3C Recommendation www.w3.org/TR/html/
10. P.M. Rust and H.S. Rzepa (1999), Chemical Markup, XML, and the World Wide Web. Basic
Principles, J. Chem. Inf. Comput. Sci., 39, 928-942
11. SOAP Version 1.2 Part 0: Primer (Second Edition) (2007), W3C Recommendation
www.w3.org/TR/soap/
12. Resource Description Framework (RDF), (2004) W3C Recommendation, www.w3.org/RDF
13. XML namespace, Wikipedia, the free encyclopedia (last access 2009)
http://en.wikipedia.org/wiki/XML_namespace
40. Gleaning Resource Descriptions from Dialects of Languages (GRDDL), (2007) W3C
Recommendation www.w3.org/TR/grddl/
41. SPARQL Query Language for RDF, (2008), W3C Recommendation
www.w3.org/TR/rdf-sparql-query/