WT Unit 2
Working with XML: Document type Definition, XML schemas, Document object model, XSLT, DOM
and SAX.
XML is the acronym of eXtensible Markup Language. The XML standard was developed by the W3C.
The primary purpose of this standard was to provide a way to store self-describing data easily, and the
motivation for the development of XML was the deficiencies of HTML.
The problem with HTML is that it was defined to describe the layout of information without
considering its meaning. To describe a particular kind of information, it is necessary to have tags
indicating the meaning of the element content.
XML is used to describe structured data or information. HTML documents describe how data
should appear on the browser’s screen. They carry no information about the data. XML documents on the
other hand, describe the meaning of data. XML documents may also refer to presentation information.
XML is used as a primary means to manipulate and transfer structured data over the web. The content
and structure of XML documents are accessed by a software module called an XML processor. Like
HTML documents, data are marked up by tags in XML documents. HTML supports a predefined set of
tags, whereas XML allows us to define new tags and use them in XML documents to satisfy application
requirements. Because new tags can be defined, XML is said to be extensible.
XML is:
• easy to understand;
• non-proprietary plain text:
  o human readable,
  o software independent,
  o hardware independent;
• (relatively) easy to write a parser for;
• widespread: very well supported by both commercial and open source software.
Processing instructions start with <? and end with ?>. They contain special instructions that pass
parameters to the application. These parameters tell the application how to interpret the XML
document.
Eg: <?xml-stylesheet href="simple.xsl" type="text/xsl" ?>
This processing instruction states that the XML document should be transformed using the style sheet
simple.xsl.
Comments in XML start with <!-- and end with -->, as in HTML. Everything within these character
sequences is ignored by the parser. Any character sequence, including markup, is allowed inside a
comment, except "--".
A Document Type Definition (DTD) is used to specify the logical structure of the XML document.
The body portion contains textual data marked up by tags. It must have one element called the document
element or root element. The root element must be the top-level element in the document hierarchy, and
there can be only one root element. The root element contains other elements which, in turn, contain
other elements and so on.
<?xml version="1.0" encoding="utf-8" ?>
<contact>
  <person>
    <name>B S Roy</name>
    <number>9998765234</number>
  </person>
  <person>
    <name>sairam</name>
    <number>9998995212</number>
  </person>
</contact>
Element names can contain letters, digits, and some other characters (hyphens, underscores, periods).
They cannot start with a digit, a hyphen, or a period, they must not start with the string xml (in any
case), and they cannot contain white space.
Attributes are used for describing and providing more information about elements. They appear in
the start tag of the element. An element can have multiple attributes.
Syntax: <element-name attr-name="attr-value" …… > …………. </element-name>
Example: <employee gender="male"> ……. </employee>
Predefined entities:
Some characters are reserved by the XML syntax itself, so they cannot be used directly in content. To
use them, the following replacement entities are used:
Not allowed character    Replacement entity    Character description
<                        &lt;                  less than
>                        &gt;                  greater than
&                        &amp;                 ampersand
'                        &apos;                apostrophe
"                        &quot;                quotation mark
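The effect of the predefined entities can be seen by parsing a small document: the markup carries the entity references, and the application receives the real characters. This is a minimal sketch using the JDK's built-in DOM parser; the class name EntityDemo and the sample text are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityDemo {
    /** Parses the XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The markup uses entity references; the parser hands back the real characters.
        String xml = "<rule>marks &lt; 90 &amp;&amp; grade = &quot;A&quot;</rule>";
        System.out.println(rootText(xml));
    }
}
```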
Well-formed XML
An XML document is said to be well-formed if it adheres to the following syntax rules.
1. An XML document should begin with an XML declaration; if the declaration is present, it must be
the first thing in the document.
2. XML comments must be enclosed between <!-- and -->. Comment text cannot contain two
adjacent dashes.
3. An XML element name must begin with a letter or an underscore and can include digits, hyphens,
and periods.
4. XML names are case-sensitive. There is no length limitation for XML names.
5. Every XML document defines a single root element, whose opening tag must appear before any
other element.
6. Every XML element must have a closing tag.
7. All tags must be properly nested.
8. Attribute values must always be quoted.
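Several of these rules can be checked mechanically, because a conforming parser reports a fatal error on any document that is not well-formed. The following is a minimal sketch using the JDK's SAX parser; the class name WellFormedCheck and the sample strings are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** Returns true when the parser reads the whole string without a fatal error. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false; // the parser rejected the markup
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b>text</b></a>")); // properly nested
        System.out.println(isWellFormed("<a><b>text</a></b>")); // rule 7: bad nesting
        System.out.println(isWellFormed("<a id=1></a>"));       // rule 8: unquoted value
        System.out.println(isWellFormed("<a></A>"));            // rule 4: case mismatch
        System.out.println(isWellFormed("<a/><b/>"));           // rule 5: two root elements
    }
}
```

Only the first call is expected to report true; each of the others breaks the rule named in its comment.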
Student.xml
<?xml version="1.0" encoding="utf-8"?>
<class>
  <student regid="501">
    <name>Vamsika</name>
    <contactno>9885409528</contactno>
    <email>[email protected]</email>
    <address>
      <street>Bank Colony</street>
      <city>Bhimavaram</city>
      <state>AndhraPradesh</state>
      <zip>534201</zip>
    </address>
  </student>
  <student regid="502">
    <name>Srithan</name>
    <contactno>9886756452</contactno>
    <email>[email protected]</email>
    <address>
      <street>Kukatpally</street>
      <city>Hyderabad</city>
      <state>Telangana</state>
      <zip>500021</zip>
    </address>
  </student>
</class>
This document effectively defines an XML tag set. This example shows that an XML-based markup
language can be defined without a DTD or XML schema, but the above XML document is an informal
definition with no structure rules.
XML documents
An XML document is said to be valid if it is well-formed and complies with the rules specified in a
DTD or schema.
2. External DTD: The DTD is stored in an external file. To include an external DTD, the syntax is
<!DOCTYPE rootname SYSTEM "external dtd file" >
Example : <!DOCTYPE class SYSTEM "student.dtd">
Syntactically, a DTD is a sequence of declarations. Each declaration has a form of a markup declaration.
<!keyword ……. >
Declaring Elements
Each element declaration in a DTD specifies the structure of one category of elements. The
declaration provides name of the element along with specification of the structure of that element.
XML document is like a general tree. An element is a node in that tree. It can be either a leaf node
or an internal node.
If the element is a leaf node, its syntactic description is its character pattern.
If the element is an internal node, its syntactic description is a list of its child elements, each of
which can be a leaf node or an internal node.
The form of an internal node declaration, i.e., an element declaration for elements that contain
elements, is
<!ELEMENT element-name (list of names of child elements)>
Example : <!ELEMENT memo (from, to, date, re, body)>
A modifier is added to the child element specification to specify the number of times that a child
element may appear. Child element specification modifiers are
o + - one or more occurrences
o * - zero or more occurrences
o ? - zero or one occurrence
Example : <!ELEMENT person (parent+, age, spouse?, sibling*)>
Leaf nodes of a DTD specify the data types of the content of their parent nodes. These data types
can be
o PCDATA – stands for parsable character data. PCDATA is a string of any printable
characters except less-than (<) and ampersand (&).
o CDATA – character data that is not parsed for markup. In a DTD it is used as an attribute type,
not as element content.
o EMPTY – used to specify that the element has no content.
o ANY – used to specify that the element may contain literally any content.
Example for leaf node : <!ELEMENT element-name (#PCDATA)>
The default value in an attribute declaration can specify either an actual value or a requirement for the
value of the attribute in the XML document. Possible default values for an attribute are
• A value – The quoted value, which is used if none is specified in an element.
• #FIXED value – The quoted value, which every element will have and which cannot be changed.
• #REQUIRED – No default value is given; every instance of the element must specify a value.
• #IMPLIED – No default value is given; the value may or may not be specified in an element.
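The four kinds of default can be observed through a parser, which fills in declared defaults when it builds the document. The sketch below assumes a hypothetical staff/employee DTD placed in the internal subset and uses the JDK's DOM parser without validation; the class name AttributeDefaults and all element and attribute names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaults {
    // Hypothetical document: the employee element only writes the required id attribute.
    static final String XML =
          "<!DOCTYPE staff [\n"
        + "  <!ELEMENT staff (employee+)>\n"
        + "  <!ELEMENT employee (#PCDATA)>\n"
        + "  <!ATTLIST employee\n"
        + "      id      CDATA #REQUIRED\n"
        + "      gender  CDATA \"male\"\n"
        + "      company CDATA #FIXED \"VIT\"\n"
        + "      nick    CDATA #IMPLIED>\n"
        + "]>\n"
        + "<staff><employee id=\"e1\">Roy</employee></staff>";

    /** Parses the sample document and returns its first employee element. */
    static Element firstEmployee() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return (Element) doc.getElementsByTagName("employee").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element e = firstEmployee();
        System.out.println("gender  = " + e.getAttribute("gender"));  // default value from the DTD
        System.out.println("company = " + e.getAttribute("company")); // #FIXED value from the DTD
        System.out.println("nick?   = " + e.hasAttribute("nick"));    // #IMPLIED: simply absent
    }
}
```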
Declaring Entities:
Entities must be declared before they can be referenced. Entities that are referenced in the XML
document itself are called general entities. Entities that are referenced only in DTDs are called
parameter entities.
The form of an entity declaration is <!ENTITY [%] entity_name "entity_value">
The optional % sign specifies that the entity is a parameter entity rather than a general entity.
Example : If a document includes a large number of references to the full name "Nara Chandra Babu
Naidu", then an entity can be defined to represent his complete name.
<!ENTITY ncbn "Nara Chandra Babu Naidu">
Then the reference &ncbn; stands for the complete name in the XML document.
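A short sketch of how a general entity is expanded at parse time, using the JDK's DOM parser. The ncbn entity is the one from the example above; the note element, the internal DOCTYPE wrapper, and the class name EntityExpansion are assumptions made for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityExpansion {
    // Internal DTD subset that declares the general entity ncbn.
    static final String DOCTYPE =
          "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ENTITY ncbn \"Nara Chandra Babu Naidu\">\n"
        + "]>\n";

    /** Wraps the content in a note element, parses it, and returns the expanded text. */
    static String expand(String content) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOCTYPE + "<note>" + content + "</note>")));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The reference &ncbn; is replaced by the declared entity value.
        System.out.println(expand("Speech by &ncbn;"));
    }
}
```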
A Sample DTD for the student
Student.dtd
<?xml version="1.0" encoding="utf-8"?>
<!ELEMENT class (student+)>
<!ELEMENT student (name, contactno, email, address)>
<!ATTLIST student regid CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT contactno (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT address (street, city, state, zip)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
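Checking a document against a DTD can be sketched with the JDK's validating DOM parser. To stay self-contained, the sketch embeds a cut-down, hypothetical class/student DTD as an internal subset instead of reading student.dtd; the class name DtdValidation and the sample bodies are also assumptions. Validity errors are only reported to the error handler, so the handler rethrows them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidation {
    // Cut-down DTD for the sketch: email is optional, regid is required.
    static final String DOCTYPE =
          "<!DOCTYPE class [\n"
        + "  <!ELEMENT class (student+)>\n"
        + "  <!ELEMENT student (name, contactno, email?)>\n"
        + "  <!ATTLIST student regid CDATA #REQUIRED>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT contactno (#PCDATA)>\n"
        + "  <!ELEMENT email (#PCDATA)>\n"
        + "]>\n";

    /** Returns true when the body is well-formed and valid against the embedded DTD. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat validity errors as failures
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        String ok = "<class><student regid='501'><name>Vamsika</name>"
                  + "<contactno>9885409528</contactno></student></class>";
        System.out.println(isValid(ok));                // conforms to the DTD
        System.out.println(isValid("<class></class>")); // student+ needs at least one student
    }
}
```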
Namespaces
It is often convenient to construct XML documents that include tagsets that are defined for and used
by other documents. For this purpose W3C has developed a standard for XML namespaces.
An XML namespace is a collection of element and attribute names used in XML documents. The name
of a namespace usually has the form of Uniform Resource Identifier (URI). A namespace for the elements
and attributes of the hierarchy rooted at a particular element is declared as the value of the attribute xmlns.
The form of a namespace declaration for an element follows:
<element-name xmlns[:prefix]="URI">
The optional prefix is the name that must be attached to the names in the declared namespace. A
prefix is used for two reasons.
1. The URI is too long to be typed on every occurrence of every name from the namespace.
2. A URI includes characters that are illegal in XML names.
Note that the element for which a namespace is declared is usually the root of a document.
Eg: <html xmlns="http://www.w3.org/1999/xhtml">
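How a parser resolves default and prefixed namespaces can be sketched with the JDK's DOM parser in namespace-aware mode. The XHTML URI is the one from the example above; the prefix st, the URI urn:example:student, and the class name NamespaceDemo are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class NamespaceDemo {
    // Default namespace on the root, plus a second namespace bound to the prefix st.
    static final String XML =
          "<html xmlns='http://www.w3.org/1999/xhtml' xmlns:st='urn:example:student'>"
        + "<body><st:name>Vamsika</st:name></body>"
        + "</html>";

    /** Returns the namespace URI of the first element with the given local name. */
    static String namespaceOf(String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // DOM factories are not namespace-aware by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Element e = (Element) doc.getElementsByTagNameNS("*", localName).item(0);
            return e.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("body -> " + namespaceOf("body")); // inherits the default namespace
        System.out.println("name -> " + namespaceOf("name")); // uses the st prefix
    }
}
```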
XML Schemas
XML Schema, recommended by W3C is an XML based schema language. It is a language used to
create XML-based languages and data models. XML schema document defines elements and attribute
names for a class of XML documents.
XML Schema Definition is a specific XML schema document written using XML Schema, and its filename
extension is ".xsd".
XML Schema Instance : The XML documents that try to follow the rules specified by the XML schema
document are said to be instances of that schema. If they strictly conform to the schema, they are valid
instances.
Limitations of DTD : Although DTDs are well accepted and used very frequently, they have several
limitations. Some of them are
• There is no built-in data type in DTDs.
• No new data type can be created in DTDs.
• No support for namespaces.
• DTDs provide very limited support for modularity and reuse.
• Not possible to put restrictions on text content.
• Little control over mixed content (text + elements).
• DTDs are written in a strange format and are difficult to validate.
• DTDs are written in a syntax unrelated to XML, so they cannot be analyzed with an XML processor.
Strengths of Schema : Schemas are XML-based alternatives to DTDs in that they are used to create classes
of XML documents that conform to the schema.
• XML Schema is a much more powerful language than DTD.
• Supports a large number of built-in data types.
• XML schemas are namespace centric.
• Extensible to future additions.
Element Declaration:
The primary building blocks of any XML document are elements. In a schema, elements are
declared by the element tag. The general form of an element declaration is
<xs:element name="element-name" type="element-type" />
Elements must have a name, and the value of this attribute is the element name that will appear in
the XML document. For element types, XML schema supports a number of built-in data types, such as
string, integer, boolean and date. Users may also create their own custom types using the simpleType and
complexType tags.
The declaration method is different depending on whether the element has a child element or not.
When no child element is present, the element name is designated with the name attribute, and the data
type is designated using the type attribute.
An element is limited by its type. Schema authors can use their own defined types or use the
built-in types. Depending on the content model, elements are categorized as simple type or complex type.
Attributes
default – specifies the default content to be used when no content is supplied.
<xs:element name="passed" type="xs:boolean" default="false" />
fixed – used to ensure that the element's content is always set to a particular value.
<xs:element name="institution" type="xs:string" fixed="VIT" />
minOccurs – specifies the minimum number of times an element can occur. Default value is 1.
<xs:element name="middlename" type="xs:string" minOccurs="0" />
maxOccurs – specifies the maximum number of times an element can occur. Default value is 1.
<xs:element name="option" type="xs:string" maxOccurs="6" />
Declaring Complex Elements: these elements can contain child elements, text or both and can also have
attributes. Complex types can be limited to having no content, meaning they are empty, but they may
still have attributes. For complex elements, element-type is a (user-defined) complex type.
A complex type is defined using the complexType schema element. The general form of a complexType
definition is
<xs:complexType>
  Skeleton of the complex type
</xs:complexType>
Example: <xs:complexType name="personType">
  <xs:sequence>
    <xs:element name="firstName" type="xs:string" />
    <xs:element name="lastName" type="xs:string" />
  </xs:sequence>
</xs:complexType>
A declaration of an element of such a type will then look like this
<xs:element name="employee" type="personType" />
In the above example, sequence specifies a model group. A model group specifies how the child
elements may occur. Use the sequence element when the child elements must appear in the order
written, and use the choice element when exactly one of the listed elements may appear.
Example for choice group:
<xs:choice>
  <xs:element name="dob" type="xs:date" />
  <xs:element name="age" type="xs:integer" />
</xs:choice>
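The sequence and choice model groups can be tried out with the JDK's javax.xml.validation API. This is a sketch under assumptions: the schema below combines the personType sequence with the dob/age choice in one hypothetical type, and the class name PersonSchemaCheck and the instance documents are made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class PersonSchemaCheck {
    // Hypothetical schema: firstName, lastName in order, then exactly one of dob or age.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:complexType name='personType'>"
        + "<xs:sequence>"
        + "<xs:element name='firstName' type='xs:string'/>"
        + "<xs:element name='lastName' type='xs:string'/>"
        + "<xs:choice>"
        + "<xs:element name='dob' type='xs:date'/>"
        + "<xs:element name='age' type='xs:integer'/>"
        + "</xs:choice>"
        + "</xs:sequence>"
        + "</xs:complexType>"
        + "<xs:element name='employee' type='personType'/>"
        + "</xs:schema>";

    /** Returns true when the instance document conforms to the schema above. */
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the instance does not conform
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String head = "<employee><firstName>B S</firstName><lastName>Roy</lastName>";
        System.out.println(isValid(head + "<age>30</age></employee>"));                         // one choice
        System.out.println(isValid(head + "<dob>1990-05-17</dob><age>30</age></employee>"));    // both choices
    }
}
```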
Defining Attributes: the attribute element is used in XML Schema to define attributes. Attributes are
themselves always declared with simple types.
Referencing:
Once an element is defined with a unique name, the ref attribute can be used to reference it.
Eg: <xs:element ref="dob" maxOccurs="1" />
This declaration references an existing element, dob, which was declared elsewhere in the schema.
Eg: declare a user-defined type firstName, for strings of fewer than 11 characters.
<xs:simpleType name="firstName">
  <xs:restriction base="xs:string">
    <xs:maxLength value="10" />
  </xs:restriction>
</xs:simpleType>
Restrictions on numbers:
• minInclusive -- number must be ≥ the given value
• minExclusive -- number must be > the given value
• maxInclusive -- number must be ≤ the given value
• maxExclusive -- number must be < the given value
• totalDigits -- number must have at most value digits in total
• fractionDigits -- number must have at most value digits after the decimal point
Restrictions on strings:
• length -- the string must contain exactly value characters
• minLength -- the string must contain at least value characters
• maxLength -- the string must contain no more than value characters
• pattern -- the value is a regular expression that the string must match
• whiteSpace -- not really a "restriction"; it tells what to do with whitespace
– whiteSpace="preserve" Keep all whitespace
– whiteSpace="replace" Change all whitespace characters to spaces
– whiteSpace="collapse" Remove leading and trailing whitespace, and
replace all sequences of whitespace with a single space
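The effect of restriction facets can be checked with a small validation run. This sketch assumes a hypothetical student element whose firstName is limited by maxLength and whose marks are limited by minInclusive and maxInclusive; the type names, the class name FacetCheck, and the sample values are all made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class FacetCheck {
    // Hypothetical schema: firstName has at most 10 characters, marks lie in 0..100.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:simpleType name='firstNameType'>"
        + "<xs:restriction base='xs:string'><xs:maxLength value='10'/></xs:restriction>"
        + "</xs:simpleType>"
        + "<xs:simpleType name='marksType'>"
        + "<xs:restriction base='xs:integer'>"
        + "<xs:minInclusive value='0'/><xs:maxInclusive value='100'/>"
        + "</xs:restriction>"
        + "</xs:simpleType>"
        + "<xs:element name='student'>"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name='firstName' type='firstNameType'/>"
        + "<xs:element name='marks' type='marksType'/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Builds a student instance from the two values and validates it against the schema. */
    static boolean isValid(String firstName, String marks) {
        String xml = "<student><firstName>" + firstName + "</firstName>"
                   + "<marks>" + marks + "</marks></student>";
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a facet was violated
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("Vamsika", "95"));       // within both limits
        System.out.println(isValid("Venkateswarlu", "95")); // 13 characters, over maxLength
        System.out.println(isValid("Vamsika", "101"));      // above maxInclusive
    }
}
```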
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" >
CSS files: The form of a CSS style sheet for an XML document is simple. It is just a list of element
names, each followed by a brace-delimited set of CSS property declarations. A style sheet of this kind
can be written for the students XML document.
The connection of an XML document to a CSS style sheet is established with the processing
instruction xml-stylesheet, which specifies the type of the style sheet via its type attribute and the
name of the file that stores the style sheet via its href attribute.
Example: <?xml-stylesheet type="text/css" href="student.css" ?>
XSLT is a functional-style programming language. XSLT includes functions, parameters, names to which
values can be bound, selection constructs, and conditional expressions for multiple selections.
To apply an XSLT document to an XML document, add a link to the XML document which points
to the actual XSLT file and lets the browsers do the transformation. This linking is placed after
XML declaration.
Example: <?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="file.xsl" ?>
<root>
</root>
An XSLT document mainly consists of one or more templates. Each template is associated with a
section of XSLT code, which is executed when a match to the template is found in the XML
document. Each template therefore describes a function, which is executed whenever the XSLT
processor finds a match to the template's pattern.
An XSLT processor sequentially examines the input XML document, searching for parts that match
one of the templates in the XSLT document.
<xsl:template> : A style sheet document must include at least one template element. A template defines
the output to generate for nodes of a particular type/context, and it can be reused. Templates
can occur any number of times.
Attributes
Name     Description
name     Name of the template, used when the template is invoked by name.
match    Pattern which signifies the element(s) to which the template is to be applied.
A template that matches the root node of the XML document : <xsl:template match="/" >
Style sheets also have templates for descendants of the root node, such as : <xsl:template match="year" >
<xsl:value-of> : this tag puts the value of the node selected by an XPath expression into the output
document being generated. It uses the select attribute to specify the element of the XML document whose
contents are to be copied.
Example : <xsl:value-of select="author" />
The select attribute can specify any node of the XML document.
<xsl:for-each> : an XML document often includes a collection of similar elements. The template content
written for one such element can be applied repeatedly with the for-each element, which uses a select
attribute to specify the set of elements in the XML data.
<xsl:sort> : this element specifies a simple way to sort the elements of the XML document before sending
them or their content to the output document.
The select attribute specifies the node that is used as the key of the sort.
The data-type attribute specifies whether the key is to be sorted as text or numerically.
The order attribute specifies the sorting order (ascending or descending).
Example : <xsl:sort select="year" data-type="number" />
<xsl:if> : this element specifies a conditional test on the content of nodes. It has a test attribute
which specifies the condition to test in the XML data.
Example : <xsl:if test="marks > 90">
<xsl:choose> : This tag specifies multiple conditional tests on the content of nodes, in conjunction
with the <xsl:when> and <xsl:otherwise> elements.
<xsl:choose>
<xsl:when test="marks > 90"> High </xsl:when>
<xsl:when test="marks > 85"> Medium </xsl:when>
<xsl:otherwise> Low </xsl:otherwise>
</xsl:choose>
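The choose/when/otherwise logic above can be run outside a browser with the JDK's javax.xml.transform API. This is a sketch under assumptions: the student/marks input document, the text output method, and the class name GradeTransform are hypothetical; the three branches are the ones shown above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class GradeTransform {
    // Style sheet that emits plain text: High, Medium, or Low depending on marks.
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/student'>"
        + "<xsl:choose>"
        + "<xsl:when test='marks > 90'>High</xsl:when>"
        + "<xsl:when test='marks > 85'>Medium</xsl:when>"
        + "<xsl:otherwise>Low</xsl:otherwise>"
        + "</xsl:choose>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms a one-student document holding the given marks and returns the output text. */
    static String grade(int marks) {
        String xml = "<student><marks>" + marks + "</marks></student>";
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The first matching xsl:when wins; xsl:otherwise covers everything else.
        for (int marks : new int[] {95, 88, 60}) {
            System.out.println(marks + " -> " + grade(marks));
        }
    }
}
```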
<xsl:apply-templates> : this element applies the appropriate templates to the descendant nodes of the
current node.
Examples:
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h2>Students</h2>
        <xsl:for-each select="class/student">
          Regid : <xsl:value-of select="@regid" /> <br />
          Name : <xsl:value-of select="name"/> <br />
          Contact No: <xsl:value-of select="contactno"/> <br />
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h2>Students</h2>
        <table border="1">
          <tr bgcolor="#9acd32">
            <th>rollno</th>
            <th>Name</th>
            <th>ContactNo</th>
            <th>E-mail address</th>
          </tr>
          <xsl:for-each select="class/student">
            <tr>
              <td><xsl:value-of select="@regid" /></td>
              <td><xsl:value-of select="name"/></td>
              <td><xsl:value-of select="contactno"/></td>
              <td><xsl:value-of select="email"/></td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
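A style sheet like this can also be applied from a program instead of a browser. The sketch below uses the JDK's javax.xml.transform API with a reduced, hypothetical style sheet that combines for-each, sort, and value-of and writes plain text rather than an HTML table; the class name StudentReport and the shortened student data are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class StudentReport {
    // One output line per student, sorted by name: "regid: name".
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='class/student'>"
        + "<xsl:sort select='name'/>"
        + "<xsl:value-of select='@regid'/><xsl:text>: </xsl:text>"
        + "<xsl:value-of select='name'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Shortened version of the class/student document.
    static final String XML =
          "<class>"
        + "<student regid='501'><name>Vamsika</name></student>"
        + "<student regid='502'><name>Srithan</name></student>"
        + "</class>";

    /** Applies the style sheet to the given XML and returns the generated text. */
    static String render(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(XML)); // Srithan is listed before Vamsika because of xsl:sort
    }
}
```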
XML Processors
The word parser comes from compilers. In a compiler, a parser is the module that reads and
interprets the programming language. In XML, a parser is a software component that sits between the
application and the XML files. It reads a text-formatted XML file or stream and converts it to a document
to be manipulated by the application. As we know, well-formed documents respect the syntactic rules,
whereas valid documents not only respect the syntactic rules but also conform to a structure as
described in a DTD.
DOM Parsers :
• A DOM document is an object containing all the information of an XML document.
• It is composed of a tree (the DOM tree) of nodes, plus various nodes that are associated with
nodes in the tree but are not themselves part of it.
• A DOM parser is tree-based (or DOM object-based). The idea is to build a hierarchical syntactic
structure of the document. The nodes of the tree are represented as objects that can be accessed and
processed or modified by the application.
• There are 12 types of nodes in a DOM Document object, including
  o Document node
  o Element node
  o Text node
  o Attribute node
  o Processing instruction node, etc.
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="test.css"?>
<!-- It's an xml-stylesheet processing instruction. -->
<!DOCTYPE shapes SYSTEM "shapes.dtd">
<shapes>
  ……
  <square color="BLUE">
    <length> 20 </length>
  </square>
  ……
</shapes>
• A DOM parser creates an internal structure in memory, which is a DOM Document object.
• When parsing is complete, the complete DOM representation of the document is in memory and
can be accessed in a number of different ways, including tree traversals of various kinds as well as
random accesses.
• Client applications get the information of the original XML document by invoking methods on this
Document object or on other objects it contains.
• From the data-flow point of view, the client application pulls the data actively.
Advantages:
1. Access to random parts of the document is possible. This is good when random access to widely
separated parts of a document is required.
2. It supports both read and write operations.
3. If the application must perform any rearrangement of the document, that can most easily be done
if the whole document is accessible at the same time.
4. Because the parser sees the whole document before any processing takes place, this approach
avoids any processing of a document that is later found to be invalid.
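// Build Document (classes from javax.xml.parsers and org.w3c.dom)
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("employees.xml"));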
Employee id : 222
First Name : Alex
Last Name : Gussin
Location : Russia
Employee id : 333
First Name : David
Last Name : Feezor
Location : USA
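A listing like the one above can be produced by walking the DOM tree. The following is a minimal sketch, not the original program: the structure of the employees document (employee elements with an id attribute and firstName, lastName, location children) is assumed from the output, the XML is parsed from a string instead of employees.xml, and the class name DomEmployees is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomEmployees {
    // Assumed layout of the employees document.
    static final String XML =
          "<employees>"
        + "<employee id='222'><firstName>Alex</firstName><lastName>Gussin</lastName>"
        + "<location>Russia</location></employee>"
        + "<employee id='333'><firstName>David</firstName><lastName>Feezor</lastName>"
        + "<location>USA</location></employee>"
        + "</employees>";

    /** Returns the text of the first child element with the given tag name. */
    static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    /** Builds the DOM tree, then pulls the data out of it one employee at a time. */
    static List<String> describe(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            List<String> lines = new ArrayList<>();
            NodeList employees = document.getElementsByTagName("employee");
            for (int i = 0; i < employees.getLength(); i++) {
                Element employee = (Element) employees.item(i);
                lines.add("Employee id : " + employee.getAttribute("id"));
                lines.add("First Name : " + childText(employee, "firstName"));
                lines.add("Last Name : " + childText(employee, "lastName"));
                lines.add("Location : " + childText(employee, "location"));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        describe(XML).forEach(System.out::println);
    }
}
```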
SAX parsers:
• It does not first create any internal structure.
• The client does not specify which methods to call; the parser calls the client's methods as it reads
the document.
• The client just overrides the methods of the API and places its own code inside them.
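The event-based style can be sketched with the JDK's SAX API: the client overrides callback methods of DefaultHandler, and the parser calls them as it meets start tags, character data, and end tags. The contact/person data is the sample from earlier in these notes; the class name SaxNames and the choice to collect only name elements are assumptions for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxNames {
    static final String XML =
          "<contact>"
        + "<person><name>B S Roy</name><number>9998765234</number></person>"
        + "<person><name>sairam</name><number>9998995212</number></person>"
        + "</contact>";

    /** Streams through the document and collects the text of every name element. */
    static List<String> names(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inName;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                inName = qName.equals("name");
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inName) {
                    text.append(ch, start, length); // text may arrive in several chunks
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("name")) {
                    names.add(text.toString());
                }
                inName = false;
            }
        };
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(names(XML));
    }
}
```

No tree is kept in memory here; only the names list grows, which is why this style suits large files.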
1) A DOM parser loads the whole XML document in memory, while SAX only loads a small part of the XML
file in memory.
2) Once the document is loaded, DOM gives faster repeated and random access because the whole XML
document is in memory; SAX is usually faster for a single sequential pass.
3) A SAX parser in Java is better suited to large XML files than a DOM parser because it doesn't require
much memory.
4) A DOM parser works on the Document Object Model, while SAX is an event-based XML parser.
Important Questions
1. Design & Develop an XML schema for student information management. Include every feature
available with schema.
2. Write about Document Type Definition.
3. Differentiate between DTD & XML Schema with an example.
4. Write about SAX Parser in detail.
5. Write about DOM Parser in detail.
6. Design & Develop an XML DTD for Employee Database. Include every feature available with
DTD.
7. Write about XML Schema.
8. What is XML? Explain the differences between XML and HTML.
9. XML is not a replacement for HTML. Discuss.
10. Discuss how XML simplifies data sharing and data transport.
11. Discuss how XML separates data from HTML.
12. Write an example XML document and explain it.
13. Explain XML document object model.
14. Discuss about DOM and SAX.