O'Reilly - SAX2

David Brownell has been involved with SAX since shortly after the XML 1.0 specification went final. He's currently involved in maintaining the SAX APIs and the GNUJAXP implementation. Brownell is a software engineer.


SAX2

SAX2
David Brownell
Beijing Cambridge Farnham Köln Paris Sebastopol Taipei Tokyo
SAX2
by David Brownell
Copyright 2002 O'Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O'Reilly & Associates books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles (safari.oreilly.com).
For more information, contact our corporate/institutional sales department:
(800) 998-9938 or [email protected].
Editor: Simon St.Laurent
Production Editor: Mary Brady
Cover Designer: Ellie Volckhausen
Printing History:
January 2002: First Edition.
Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered
trademarks of O'Reilly & Associates, Inc. The association between the image of a
pampas cat and the topic of SAX2 is a trademark of O'Reilly & Associates, Inc.
Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Where those designations appear in this book,
and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations
have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the author
and publisher assume no responsibility for errors or omissions, or for damages
resulting from the use of the information contained herein.
ISBN: 0-596-00237-8
[M]
About the Author
David Brownell is a software engineer. He's been involved with SAX since
shortly after the XML 1.0 specification went final, and is currently involved
in maintaining the SAX APIs and the GNUJAXP implementation. When he
worked at Sun, he started the Java XML engineering effort, including SAX
support, as a natural follow-up to the servlet-based web software
infrastructure.
Colophon
Our look is the result of reader comments, our own experimentation, and
feedback from distribution channels. Distinctive covers complement our
distinctive approach to technical topics, breathing personality and life
into potentially dry subjects.
The animal on the cover of SAX2 is a pampas cat. Not much is known about
this cat, as extensive studies on it have never been done. For instance, the
pampas cat is believed to be mainly nocturnal and to live primarily on the
ground. However, some pampas cats that reside in zoos have been
observed as being somewhat active in daylight, as well as comfortable
spending time in trees.
The pampas cat is native to South America. Its natural habitat allows for
quite a range, as some species are found in grasslands, some in mountain
regions, and some in swampy areas. It is not a large cat, weighing only
between 7 and 8 pounds when fully grown. Its features are marked by a
wide face and pointed ears. The color and markings on its fur are
determined by the area in which it lives. For example, in the Andes
Mountains, the cat is found with gray fur and red stripes; Brazilian cats
are reddish-brown with black stripes; and cats living in Argentina are a
light brown shade and have faint markings. The stripes always appear on
the cat's legs and torso, and the cat has very long hair, which can grow up
to three inches long. When the cat is frightened, these hairs stand on end,
which makes the cat seem larger, as well as more menacing. This no doubt
serves as a deterrent to predators.
The pampas cat population was dwindling for a time, as the cat was hunted
for its skin. However, in 1980 laws were passed against this, so that
particular threat to the species seems to have passed. Now the main danger
is the growing human population, which is infringing on the pampas cat's
home in the plains and forests.
Mary Brady was the production editor and proofreader and Melanie Wang
was the copyeditor for SAX2. Colleen Gorman, Matt Hutchinson, and Claire
Cloutier provided quality control. Derek Di Matteo and Philip Dangler
provided production support. Joe Wizda wrote the index.
Ellie Volckhausen designed the cover of this book, based on a series design
by Edie Freedman. The cover image is a 19th-century engraving from
Mammalia. Emma Colby produced the cover layout with QuarkXPress 4.1,
using Adobe's ITC Garamond font.
Melanie Wang designed the interior layout based on a series design by
Nancy Priest. The print version of this book was created by translating the
DocBook XML markup of its source files into a set of gtroff macros, using a
filter developed at O'Reilly & Associates by Norman Walsh. Steve Talbott
designed and wrote the underlying macro set on the basis of the GNU troff
gs macros; Lenny Muellner adapted them to XML and implemented the
book design. The GNU groff text formatter Version 1.11.1 was used to
generate PostScript output. The text and heading fonts are ITC Garamond
Light and Garamond Book. The illustrations that appear in the book were
produced by Robert Romano and Jessamyn Read, using Macromedia FreeHand
9 and Adobe Photoshop 6. This colophon was written by Mary Brady.
Whenever possible, our books use a durable and flexible lay-flat binding.
Table of Contents
Preface
1. The Simple API for XML
    Types of XML APIs
    Why Choose SAX?
    Why Not to Choose SAX?
    A Short History of SAX
    Packages in the SAX2 API
    Some Popular SAX2 Parser Distributions
    Installing a SAX2 Parser
    What XML Are We Talking About?
2. Introducing SAX2
    Producers and Consumers
    Beginning SAX
    Basic ContentHandler Events
    Producer-Side Validation
    Exception Handling
    Namespaces and SAX2
3. Producing SAX2 Events
    Pull Mode Event Production with XMLReader
    Bootstrapping an XMLReader
    Configuring XMLReader Behavior
    The EntityResolver Interface
    Other Kinds of SAX2 Event Producers
4. Consuming SAX2 Events
    More About ContentHandler
    The LexicalHandler Interface
    Exposing DTD Information
    Turning SAX Events into Data Structures
    XML Pipelines
5. Other SAX Classes
    Helper Classes
    SAX1 Support
6. Putting It All Together
    Rich Site Summary: RSS
    XML and Messaging
    Including Subdocuments
A. SAX2 API Summary
B. SAX2 and the XML Infoset
Index
Preface
Think of this book as if it were really called Everything You Wanted to
Know About SAX. It provides a quick tutorial, while also serving as a
complete reference that explains how to use this popular XML API
effectively and efficiently. You'll find motivations for every programming
interface and see how to build components for your application (or
specialized environment) on top of SAX.
The information in this book is based on the current version of the Java
language support for SAX2. For any further updates to SAX2, see the SAX
web site at http://www.saxproject.org.
Who Should Read This Book?
If you are programming with XML in Java, or starting to do that, and you
want to learn how to use SAX2 to its fullest, this book is for you. It
assumes that you are familiar with Java programming and have a basic
understanding of XML, including DTDs. You may have some exposure to
DOM, an alternative parser API, but you need more efficient, or more
complete, access to XML than you can get with such a generic tree
structure API. Although there's a lot of interest in XML from server-side
programmers, and this book includes some examples targeted at servlet-based
systems, SAX2 is addressed to Java developers working on all scales, from
embedded systems to enterprise applications.
Although versions of the SAX API have been provided for developers who
use C/C++, Pascal, Perl, and Python, this book is not addressed to such
developers except in the broad sense that good SAX programming idioms
transcend the particular language used to express them.
This book is for Java programmers working with XML who need an
efficient way of reading or generating XML documents. The event-based
approach of the Simple API for XML (SAX) provides an extremely
streamlined set of tools for Java programmers.
Organization of This Book
This book is divided into six chapters and two appendixes.
Chapter 1, The Simple API for XML, orients you in terms of API
alternatives, SAX history, software choices, and basic SAX functionality.
Chapter 2, Introducing SAX2, introduces the core technical details of SAX,
showing the basic event producer and consumer APIs needed by almost
all code that uses SAX.
Chapter 3, Producing SAX2 Events, focuses on producing SAX events,
showing the rest of the parser APIs as well as several nonparser models
for using SAX.
Chapter 4, Consuming SAX2 Events, concentrates on consuming SAX
events, showing the other event consumer interfaces (both core and
extension), and exploring the SAX pipeline model in a bit more detail.
Chapter 5, Other SAX Classes, presents SAX helper classes that weren't yet
presented, including legacy SAX1 support.
Chapter 6, Putting It All Together, provides more extensive examples than
the earlier chapters.
The two appendixes are reference material. Appendix A, SAX2 API
Summary, summarizes every class or interface in the API and should be a
useful quick reference. Appendix B, SAX2 and the XML Infoset, shows how
the XML Infoset concepts map to SAX APIs and should be a useful
reference when you need to determine which APIs to use to access particular
structural data.
Conventions Used in This Book
method()
Method names sometimes include their interfaces or classes.
method parameter or variable
Values bound to parameter or variable names are presented in the
same way.
class, interface, or package.name
Names identifying interfaces and classes are identified in the same
way as package names.
XML attribute or element
All XML markup in the body of the text is presented consistently.
How to Contact Us
Please address any comments and questions concerning this book to the
publisher:
O'Reilly & Associates, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international/local)
(707) 829-0104 (fax)
We have a web page for this book, where we list errata, examples, or any
additional information. You can access this page at:
http://www.oreilly.com/catalog/sax2/
To comment or ask technical questions about this book, send email to:
bookquestions@oreilly.com
For more information about our books, conferences, software, Resource
Centers, and the O'Reilly Network, see our web site at:
http://www.oreilly.com
Acknowledgments
This book would not have been possible without the contribution and
support of many individuals. David Megginson, who led the original SAX
and SAX2 development processes, deserves particular thanks from
everyone who uses SAX. Numerous other people have contributed to SAX
through the xml-dev mailing list, far too many to mention here by name.
Simon St.Laurent at O'Reilly provided good feedback and regular
encouragement during the creation of this book. Along with review comments
from Murray Altheim and David Megginson, he helped identify rough
spots and improve the content of this book in many ways.
1
The Simple API for XML
In this chapter:
Types of XML APIs
Why Choose SAX?
Why Not to Choose SAX?
A Short History of SAX
Packages in the SAX2 API
Some Popular SAX2 Parser Distributions
Installing a SAX2 Parser
What XML Are We Talking About?
When XML started, Java was best known as a fun new language that made
developing programs for the World Wide Web easy. XML was intended to
be the data foundation for the next generation of web infrastructure tools,
and it clearly needed the same kind of support that Java offered. The Java
programming environment included ways to fetch data over the Web with
URLs, which was a novel notion at the time. It even had support for Unicode,
so working with languages used anywhere on the Web would be easy.
Since both those capabilities were important for working with XML, there
was already a very active community of XML developers using Java when
the XML 1.0 Recommendation was finalized in early 1998. More XML
parsers were available at that time for Java than for the more widely
adopted C programming language!
Those parsers quickly came to share one feature: applications weren't
restricted to some particular product's API. The Simple API for XML, SAX,
was well under way; it was the first API usable with all the popular Java
parsers. SAX helped make Java a premier language for developing
XML-based applications.
Since then the adoption of XML has exploded, as has the use of Java in
web-oriented (and other) applications. Today's Java programmer has an
embarrassingly large selection of XML-related APIs to choose from, and
SAX has retained its role as a premier XML API. In this chapter we look at
why this is true, and learn more about where SAX came from and its
current state.
Types of XML APIs
An Application Programming Interface (API) is a set of interfaces and
classes used to expose particular functionality to a variety of applications.
Some APIs are specific to particular products. Better Java APIs, like SAX,
use the interface facility to work with multiple products: they are defined
so that multiple implementations can coexist. Such implementations
behave the same except for differences allowed by the API. For example,
one might be faster, while another might leverage private interfaces to
some subsystem. (In that case an application could use the fast
implementation most of the time and the slower one only when those added
features are needed.) APIs differ in how they expose functionality, which
affects how well applications work.
For the purposes of this book, there are two kinds of APIs to XML. We'll
call one a parser-level API and the other a high-level API. Parser-level APIs
model documents in terms of XML notions such as elements, attributes,
and character data, and hide all the details of actually turning XML text
into information that applications can use. High-level APIs generally focus
on non-XML notions, usually to make XML itself seem only an
implementation artifact that might be easily replaced by other data
interchange technology, or more rarely, by another document technology.
This spectrum is not as wide as you might expect. Parser-level APIs are
well suited for working with XML-centric applications, higher-level APIs
try to focus on particular visions of the data being encoded with XML, and
some APIs hop between levels.
SAX is a parser-level API that is rather unique in flavor because it leaves
all data structure choices up to higher levels. This feature helps scale XML
applications, since it lets SAX be an unobtrusive and extremely effective
building block. SAX models documents in terms of a stream of event
callbacks; this has been called an active API. The events are more structured
than the stream of mouse and keyboard events you may know from AWT
or Swing. Events are sent to application handlers for basic XML content
(such as elements and characters) in exactly the order they're found in the
document, as shown in Figure 1-1. That's the same order in which you'd
traverse a tree model of that markup: you'd start an element, look at its
children, and then end the element.
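The event order described above can be observed with a small handler. The following is a minimal sketch, not code from this book: it assumes the JAXP parser bundled with the JDK (obtained through SAXParserFactory), and the class name EventLogger and its log() method are hypothetical names introduced only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records each SAX callback as a string, in arrival order.
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement(" + qName + ")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several characters() calls.
        events.add("characters(" + new String(ch, start, length) + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement(" + qName + ")");
    }

    // Parses the XML text and returns the callbacks in the order they arrived.
    static List<String> log(String xml) {
        EventLogger handler = new EventLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        log("<A>text<B/></A>").forEach(System.out::println);
    }
}
```

Running main() on the markup used in Figure 1-1 prints the callbacks in document order: the start of A, its character data, the start and end of the empty element B, and finally the end of A.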
Most other parser-level APIs provide a generic, object tree data structure
mirroring the parse tree; these have been called passive APIs. Figure 1-2
shows some XML text, and its transformation to such a data structure.
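Figure 1-1. A streaming parser API
[The parser reads the XML text <A>text<B/></A> and delivers these events
in order: startElement(A), characters(text), startElement(B),
endElement(B), endElement(A).]
Figure 1-2. An object tree parser API
[The XML text <A>text<B/></A> becomes a tree whose root element A has
two children: the character data "text" and the element B.]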
Examples of such generic data structures include W3C's DOM and more
Java-friendly variants such as DOM4J and jDOM. Such generic tree
structures are usually built with the output of an underlying SAX parser,
and so are slightly higher level than SAX. However, these generic structures
place significant constraints on application structure and scalability,
particularly for documents larger than some fraction of available virtual
memory. A popular alternative approach involves building custom data
structures from that same SAX output, instead of generic ones. Because the
data structures are custom built, they tend to be faster and more
task-appropriate than structures built by generic APIs.
Why Choose SAX?
SAX gives you the flexibility to approach application design with your
own trade-offs and goals in mind. High-level APIs often make many of
those trade-offs for you, but not necessarily in ways that are best for your
problems. In particular, SAX lets you design lightweight, task-oriented
XML solutions, which can fit into small systems or scale up to large ones.
Just having such options can be an important reason to choose SAX over
generic APIs that work only at a high level. While initial deployment
platforms might be richly featured, this won't necessarily be true for all the
systems you need to support, or for the ones your customers want you to
support.
Compared to other parser-level APIs, SAX has two unique structural
features: its efficient event-stream processing model and its data structure
flexibility. These give you more control over the results of your parse.
Stream-Based Processing
SAX is the API to use when you need to stream-process XML to conserve
memory and, in most cases, CPU time. In SAX, the parser calls
application (or library) code through handler interfaces for each significant
chunk of XML information as it's parsed. These chunks include character
data, elements, and attributes. Each event passes information to your code,
which can save it or ignore it as appropriate. These handlers see document
information as a stream of such event calls, in document order. Applications
can process data incrementally, rather than in one big chunk, and they can
discard information as soon as it's not needed.
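As an illustration of incremental processing, here is a minimal sketch that keeps only a running total while the document streams past. It is an assumption-laden example rather than code from this book: the element name amount, the sample markup, and the class AmountTotaler are all hypothetical, and the parser is the JAXP one bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: sums the text of every <amount> element and
// retains nothing else from the document.
class AmountTotaler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inAmount;
    long total;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("amount".equals(qName)) {
            inAmount = true;
            text.setLength(0);          // start collecting a fresh value
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inAmount) {
            text.append(ch, start, length);   // text may arrive in pieces
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("amount".equals(qName)) {
            total += Long.parseLong(text.toString().trim());
            inAmount = false;           // everything else is simply ignored
        }
    }

    static long total(String xml) {
        AmountTotaler handler = new AmountTotaler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.total;
    }

    public static void main(String[] args) {
        System.out.println(total(
            "<orders><order><id>a</id><amount>5</amount></order>"
            + "<order><amount> 7 </amount></order></orders>"));
    }
}
```

The handler's memory use depends on the longest single amount value, not on the size of the document, which is the property the paragraph above describes.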
SAX parsers have several key advantages:
• SAX parsers can be small and fast because they are minimal. SAX
  provides the most essential XML data, and no more.
• SAX parsers are well suited for use in resource-constrained
  environments. This includes not just small systems or classic embedded
  ones (where cost prevents using much memory or fast CPUs), but also
  servers such as security gateways (which may have huge amounts of
  memory and fast CPUs, but need good scaling properties to share them
  with many clients). Good security practice avoids large bodies of code,
  since assurance is so hard to achieve.
• Because SAX is a streaming API, it promotes pipelined processing,
  where I/O occurs while you use the CPU to do work. You will
  naturally structure applications (or at least their SAX components) to use
  efficient single-pass algorithms and incremental processing.
  As soon as XML data starts to become available (perhaps over a
  network), SAX parsers start to provide it to applications. While your
  application processes element or character data, the network or the
  filesystem prefetches the next data. Such overlapped processing lowers
  latencies and makes good use of limited CPU cycles. With most other
  APIs, your application won't even see data until the whole document
  has been fetched and parsed; you can't process documents larger than
  available memory. This causes major trouble when you work with
  large documents, as discussed in the next section.
• SAX gives you flexible control over how you handle errors and faults.
  Fatal errors aren't the only kind of reportable fault, and diagnostic
  information is readily accessible.
  You can easily provide application-specific error reports with the
  standard mechanism. It's also easy to terminate parsing early: just throw
  an appropriate exception when you find the <great:widget> element you
  need, or when some unrecoverable error turns up.
• It's easy to define custom SAX event producers.
  That is, you can use SAX when your inputs aren't literal XML text.
  This is a powerful technique that helps you work with data at the
  level of parsed XML information (the XML Infoset), and postprocess
  SAX events or late-bind data into XML text format. Such
  early/late-binding flexibility is a powerful architectural tool.
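The early termination mentioned in the list above can be sketched as follows. This is an illustrative example under stated assumptions, not the book's code: WidgetFound and WidgetFinder are hypothetical names, the parser is the JAXP one bundled with the JDK, and the sketch relies on that parser passing a SAXException thrown by a handler back to the caller of parse().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical marker exception: it means "found it", not "something broke".
class WidgetFound extends SAXException {
    WidgetFound() {
        super("found <great:widget>");
    }
}

class WidgetFinder extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("great:widget".equals(qName)) {
            throw new WidgetFound();    // abandons the rest of the parse
        }
    }

    // Returns true as soon as the element turns up; false if the whole
    // document was parsed without seeing it.
    static boolean containsWidget(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new WidgetFinder());
            return false;
        } catch (WidgetFound stop) {
            return true;
        } catch (Exception e) {
            throw new IllegalStateException(e);   // a genuine parsing fault
        }
    }

    public static void main(String[] args) {
        System.out.println(containsWidget(
            "<doc xmlns:great='urn:example'><a/><great:widget/><b/></doc>"));
    }
}
```

Using a dedicated exception type keeps the "stop, I have what I need" case separate from real errors, which still surface through the second catch clause.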
You may be fortunate enough to be able to design the XML
representations of your application tasks to facilitate such work-flow
streams. When you do this, you may see substantial performance and
scalability gains over alternative design approaches. You might even be able
to pull the SAX event stream model up into higher-level work flows in your
system so that more processing can be stream-based.
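A custom event producer, as described earlier in this section, can be quite small. The sketch below is an assumption-based illustration, not code from this book: it turns a plain Java list into SAX events by calling ContentHandler methods directly, then late-binds those events to XML text with the JDK's TransformerHandler (part of javax.xml.transform, chosen here only because it ships with the JDK). The class ListProducer and the element names list and item are hypothetical.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class ListProducer {
    // The producer: no XML text is ever parsed, yet any ContentHandler
    // sees the same kind of event stream a parser would deliver.
    static void produce(List<String> items, ContentHandler out) throws SAXException {
        AttributesImpl noAtts = new AttributesImpl();
        out.startDocument();
        out.startElement("", "list", "list", noAtts);
        for (String item : items) {
            out.startElement("", "item", "item", noAtts);
            out.characters(item.toCharArray(), 0, item.length());
            out.endElement("", "item", "item");
        }
        out.endElement("", "list", "list");
        out.endDocument();
    }

    // One possible consumer: serialize the events as XML text.
    static String toXml(List<String> items) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter text = new StringWriter();
            serializer.setResult(new StreamResult(text));
            produce(items, serializer);
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(Arrays.asList("first", "second")));
    }
}
```

Because produce() depends only on the ContentHandler interface, the same producer could feed a handler that builds data structures instead of one that writes text.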
For example, you could structure your XML as a sequential list of
reasonably sized tasks. Several kinds of data import/export problems are well
suited to this approach, although you may find you need to be aware of
the I/O costs of random access as you transform data to and from
interchange formats.
Data Structure Flexibility
In contrast to higher-level APIs, or most design tools, SAX allows you to
populate whatever data structures you choose. It lets you use custom data
APIs, optimized for your application, or more general-purpose APIs. This
flexibility operates at two broad system levels: architecture and design.
Such flexibility is required to scale applications up (or down) and to
update applications as systems evolve.
Application architecture components affect how systems interact with each
other and with external systems. SAX doesn't constrain these components,
which include data interchange formats and messaging paradigms,
because it lets you use XML in any way you (or your system's partners)
need. In contrast, settling early on higher-level XML APIs will constrain
application architectures in many ways, often affecting XML structures
used for interoperability. For example, many SOAP toolkits expect an RPC
paradigm using W3C-style XML schemas, and many data-binding
approaches demand a particular schema system and API toolset. The hope
is that if you accept those system constraints, you win more than they
cost. When that doesn't work, perhaps because the constraints don't suit
your application, you'll appreciate the flexibility of SAX.
The design level affects application internals rather than the broader
interfaces, which relate to architecture. Design constraints affect runtime and
implementation costs. If you're adding XML support to an existing system,
design-level concerns may dominate your planning. SAX lets you use your
current optimized data structures or define new ones. Since such design
issues will often dominate performance measurements (given reasonable
architectures), preserving flexibility can be very important.
With SAX, you don't need to use generic (and largely untyped) data
structures. You will normally store data directly into specialized data structures
as SAX delivers it from its XML representation. This facilitates important
architecture-level optimizations. Being able to use custom data structures
means you can leverage the strong data-typing facilities in Java and detect
many kinds of bugs early, while recovery is possible and cheap. Custom
data decisions are the ideal way to work with large documents, for other
cases where scale is a major concern, and anywhere that data structure
decisions need to be driven by application issues rather than
"one-size-fits-all" generic tools.
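To make the point about strong typing concrete, here is a minimal sketch that stores SAX data directly into a typed structure and rejects bad values during the parse. Everything specific in it is an assumption made for illustration: the point element, its x and y attributes, and the PointReader class are hypothetical, and the parser is the JAXP one bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PointReader extends DefaultHandler {
    // Hypothetical application type: coordinates are ints, not strings.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    final List<Point> points = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!"point".equals(qName)) {
            return;
        }
        try {
            // A missing attribute yields null, which parseInt also rejects.
            points.add(new Point(Integer.parseInt(atts.getValue("x")),
                                 Integer.parseInt(atts.getValue("y"))));
        } catch (NumberFormatException e) {
            // Report the data problem at the point where it is detected.
            throw new SAXException("bad coordinate in <point>", e);
        }
    }

    static List<Point> read(String xml) {
        PointReader handler = new PointReader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.points;
    }

    public static void main(String[] args) {
        List<Point> pts = read("<points><point x='1' y='2'/><point x='3' y='4'/></points>");
        System.out.println(pts.size() + " points, last y = " + pts.get(1).y);
    }
}
```

No generic tree is built along the way; the only objects allocated for the data are the Point instances the application actually wants.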
Memory Consumption with SAX and DOM
To illustrate this design impact, we'll pick on DOM as a representative
design choice for an API with a generic XML data structure. You'll often
have reasons to use both SAX and DOM, even in the same application, so
you'll need to know when to use each API. The strength of DOM is that
it's a widely understood and available generic model; it can be good for
"proof of concept" solutions. However, it has a high price in terms of
flexibility and resource consumption. Later, we look at ways to reduce those
DOM costs with help from SAX and ways that DOM and SAX
representations of XML data can be interconverted.
For documents with a typical markup density, many DOM
implementations in Java use about 10 bytes of memory to represent each byte
of XML text. (Few take less, some take more.) Yes, that midsize
three-megabyte document can easily balloon up to 30 megabytes of memory on
your server!* When using DOM with large documents, memory shortages are
common, both for virtual memory and for space in the Java heap.
Shortages are made worse if you then need to convert data from a generic
DOM representation into custom structures, because you need an extra
copy of the data while you build the more appropriate data structure. This
clearly limits application scalability.
On the other hand, with SAX you don't pay for any memory unless you
choose to do so. You can ignore most of that three-megabyte document
right up front; the API structure makes it natural to capture only significant
data (whatever that may be in your application). This reduces memory
allocation pressure, as well as overhead from garbage collection. Best,
SAX parsers let you use data structures that are appropriate for your
application from the very beginning. In fact, they all but require you to do that!
Other Reasons to Prefer SAX
SAX has always defined its concurrency behaviors, making it safe to use
SAX in multithreaded applications. Since DOM does not specify those
behaviors, multithreaded applications (such as most web services) accept
implementation dependencies if they choose to use DOM.
SAX2 provides almost complete support for the XML Infoset, exposing the
logical structure of XML data. (See Appendix B.) This means it's
substantially more complete than most other XML APIs, and certainly more
complete than any other widely available API. You are unlikely to need
important information from an XML document that SAX can't provide. This
contrasts with DOM, which doesn't have standard APIs to expose much of
this information. SAX is a great way to turn a stream of such Infoset data
into other kinds of data.
At its core, SAX is indeed a very simple API for XML processing; such
simplicity is a key virtue. You can write useful XML applications code with
only a handful of method calls and still know that the rest of the XML
Infoset data is available when you need it. It's not like DOM, in which
syntax artifacts that mask the core data model of XML are common. DOM
takes a more monolithic approach than SAX. A book that covers DOM as
completely as this book covers SAX would need to be several times larger
even if it didn't cover the latest version (Level 3).
* Some applications certainly revolve around large documents. One translation of the Old
Testament is over 3 megabytes in size; one dictionary is over 50 megabytes. Dumps of
databases can be gigabytes in size.
On top of that, because SAX makes you actually think about the best way
to represent your data, it's more fun to work with than tools that claim to
solve those issues for you! (They usually can't.) It's also a great way to
learn your way around XML and Java.
Why Not to Choose SAX?
No API solves problems by itself, and SAX avoided the "kitchen sink"
syndrome better than many others. So there are times it will be clear SAX
isn't the whole answer for some particular application-processing stage, even
when you have the option to choose it. It will often still be the right way
to get data into or out of another processing stage, particularly since many
other APIs can interface with SAX. Also, building custom data
import/export tools with SAX is fairly easy.
Probably the biggest single issue with SAX is that by itself it doesn't
provide random access to XML data. Its event stream is forward-only: you
can't go backwards or reorder it without your own record of the events.
Such data structure policy would be handled by application layers on top
of SAX, and you'll need such layers if you use random access models such
as XPath. Typically, applications use SAX to construct data structures that
are either customized for their particular random access requirements or
generic (typically DOM-like). You might create Person objects and index
them by name, perhaps in some sort of hash table or using some kind of
database as a backing store. In some applications it's acceptable to just
rescan small to midsize XML documents on demand; it can be inexpensive
when modern operating systems have already cached the data.
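The Person index mentioned above might look something like the following sketch. The person element, its name and email attributes, and the class names are all hypothetical; a real application would use its own vocabulary. The code uses java.util.HashMap and generics, so it assumes a much newer Java than the JDK 1.1 baseline the book targets.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical application object; the field names are illustrative.
final class Person {
    final String name;
    final String email;

    Person(String name, String email) {
        this.name = name;
        this.email = email;
    }
}

// Builds a name -> Person hash table as <person> events stream past.
class PersonIndexer extends DefaultHandler {
    final Map<String, Person> byName = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // With namespace processing on (the SAX2 default), localName is reliable.
        if ("person".equals(localName)) {
            String name = atts.getValue("", "name");
            if (name != null) {
                byName.put(name, new Person(name, atts.getValue("", "email")));
            }
        }
    }

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static Map<String, Person> index(String xml) throws SAXException, IOException {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        PersonIndexer indexer = new PersonIndexer();
        reader.setContentHandler(indexer);
        reader.parse(new InputSource(new StringReader(xml)));
        return indexer.byName;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Person> people = index(
            "<people>"
            + "<person name='Ada' email='ada@example.org'/>"
            + "<person name='Grace' email='grace@example.org'/>"
            + "</people>");
        // Random access by key, which the event stream alone cannot offer.
        System.out.println(people.get("Grace").email);
    }
}
```

The parse is still a single forward pass; the random access comes entirely from the map the handler leaves behind.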
If you're looking for an API that helps you write a low-level XML text
editor and lets you work with malformed XML while it preserves semantically
meaningless information,* then SAX isn't what you want. Similarly, parsing
less than an entire XML document isn't standardized by SAX (or by the
XML specification). Such processing requires an API that works at the
level of potentially malformed tokens. SAX (and any other application
programming interface not targeted at text editors) makes hiding such
details a primary goal. SAX works well for structural editors, which
prevent creation of malformed XML and hide semantically meaningless
information.
* For example, whitespace outside element content, attribute order, or singly versus doubly
quoted strings.
It's important to note that SAX is intentionally limited. It's the core of a
library of XML support, and that "S" in its name really does mean "simple";
complex functionality is for layers on top of SAX and is not part of SAX
itself. Even basic facilities like XML text output (printing) are layered over
SAX. While open source code to handle such functions is often available
on the Internet, you may still need to find and choose between such
libraries. SAX is somewhat of a "close to the metal" low-level API, though
it's more flexible than most such APIs.
A Short History of SAX
The official SAX web site is at http://www.saxproject.org. You will find a
more complete history there, with updates for anything that happened
after this book went to print, as well as the current software release and
its documentation.
SAX1
SAX 1.0 development started in late December 1997, shortly after
publication of the last review draft of the XML 1.0 specification. The
initial impetus was to permit Java applications to be independent of which
parser they used, and to promote uniformity in the data models available to
applications without imposing some particular data representation. At that
time, several such Java parsers existed (notably Ælfred, Lark, MSXML, and
XP), each with its own APIs and feature sets. That approach would
clearly be counterproductive and had already caused complications for
one early XML browser, Jumbo, used with a Chemical Markup Language
built with XML. (See http://www.xml-cml.org for more information about
CML and Jumbo.)
Discussion proceeded quickly. The development primarily took place on
the open Internet xml-dev mailing list. There was no bureaucracy since it
was organized and run by one person, David Megginson, the original
author of Ælfred. Essential contributions were made by developers of
other Java XML parsers, including Tim Bray, editor of the XML 1.0
Recommendation and author of Lark, and James Clark, Technical Lead of the
XML 1.0 Recommendation and author of XP. In terms of openness, the
process was similar to those used historically by the Internet Engineering
Task Force (IETF) and by many current open source development projects.
Unlike the process for most recent Java/XML API standards, helping to
define SAX required no nondisclosure agreements or reassignment of
intellectual property rights, and the process was transparent. Public list
archives are available, if you want to see how (or why!) some things
turned out the way that they did.
The initial draft API was published in January 1998, less than a month
after initial discussions started. It featured key characteristics still seen
today: it was event based, and distinguished interface and implementation
without insisting that implementations commit to the overhead of a
provider glue layer. To improve its coolness factor, it used the
org.xml.sax package name, since Jon Bosak owned the xml.org DNS
domain name and gave approval for that use.* Best, it was indeed a
Simple API for XML. Discussions continued fast and furious. More developers
helped improve these early proposals, including the author of this book.
* This domain was subsequently transferred to the OASIS group, which later took over
operations for the xml-dev mailing list. However, SAX still remains independent of OASIS.
SAX is currently maintained using SourceForge.net project resources.
The SAX1 API was finalized in May 1998, just three months after XML itself
was finalized, and was generally well received. Most Java XML parsers
quickly adapted to it, and new ones quickly adopted it. At one point, it
was possible to find no fewer than a dozen open source SAX1 parsers.
Today, new XML projects tend to build on top of the standard APIs such
as SAX, rather than underneath them, since most widely used parsers do
support SAX.
SAX2
When SAX1 was finished, there were features it did not address. That was
to be expected because of the 80/20 rule. Satisfying the 80% of application
requirements that involved only simple functionality meant that only a
small handful of applications needed more complex functionality; that
handful was much less than 20% of the application space. Notably,
anyone who tried to use SAX to round-trip XML data found that important
parts were omitted. (Round-tripping a SAX event stream means turning it
back into XML text and parsing the result, without losing any data.)
Similarly, anyone using SAX to construct a DOM tree found that there was a
mismatch: DOM also expected more information to be provided. Although
many applications were happy not to see that additional data, it was still a
conformance issue. Moreover, since DTD declarations were not available,
it wasn't practical to maintain arbitrary valid documents with a SAX1-only
parser. On top of all that, it wasn't possible to tell if a parser could
validate, nor could you change whether or not it was validating; this all
but prevented parser-neutral application configuration and setup. As
developers learned their way around XML, the 80/20 line shifted so more
functionality was needed.
So discussions continued, but at a much slower pace. In late 1998 some
draft interfaces were posted, which later became the basis of the two
standard SAX2 extensions. (Not many parsers worked with those interfaces
until they were fleshed out later in the SAX2 process.) Discussions later in
the next year focused on ways to let such additional extension handlers
and other new features be added without changing core APIs, by
supporting parser configurability.
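In the SAX2 that eventually shipped, this configurability takes the form of feature flags and properties named by URIs. The sketch below shows the validation case that SAX1 could not express: it asks a parser to validate and copes with one that cannot. The feature URI is the standard SAX2 identifier; the class name ValidationProbe, the -1 return convention, and the sample documents are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

class ValidationProbe extends DefaultHandler {
    // Standard SAX2 feature identifier for DTD validation.
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    int validityErrors;

    @Override
    public void error(SAXParseException e) {
        // Validity errors are recoverable; count them and keep parsing.
        validityErrors++;
    }

    /** Returns the number of validity errors, or -1 if this parser cannot validate. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static int countValidityErrors(String xml) throws SAXException, IOException {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        try {
            reader.setFeature(VALIDATION, true);
        } catch (SAXException e) {
            // SAXNotRecognizedException or SAXNotSupportedException:
            // the parser is telling us it cannot be switched to validating.
            return -1;
        }
        ValidationProbe probe = new ValidationProbe();
        reader.setErrorHandler(probe);
        reader.parse(new InputSource(new StringReader(xml)));
        return probe.validityErrors;
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
        System.out.println("valid document:   " + countValidityErrors(dtd + "<a/>"));
        System.out.println("invalid document: " + countValidityErrors(dtd + "<a><b/></a>"));
    }
}
```

Because the same calls work against any SAX2 parser, an application can make this decision at run time instead of hard-coding a particular parser's configuration API.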
The final catalyst for SAX2 was probably the realization that without
parser-level API support, the XML Namespaces specification would
probably not be adopted soon with any really standard semantics.
Application-specific implementations tended to have bugs in their
interpretation of the namespaces specification. (That specification has
turned out to cause a surprising amount of confusion.) To make a long story
short, further discussions happened, and SAX2 was finalized in May 2000.
SAX1 parsers were initially wrapped in adapters that layered the namespace
processing, making it easy to convert to use the core SAX2 APIs. The first
parser to natively support the full set of SAX2 APIs, including the
extension interfaces, was Ælfred2, in the second half of 1999. By the second
half of 2000, such support was available in the current releases of most
other widely used parsers.
This book focuses on the current SAX2 release, which includes minor bug
fixes as well as more robust bootstrapping, plus clarifications and
explanations in the API documentation.
SAX2 Extensions
As mentioned earlier, one of the original reasons to extend SAX1 was that
the SAX core didn't expose information needed by various applications
and, of course, DOM. Not everyone needs or wants that information. A
cautionary example is exposing comments, which were never intended to
be used (or seen) by applications; they were grandfathered into XML APIs
through horrible accidents involving old HTML browsers and DOM.
However, lack of such data was a problem for some applications. That
80/20 rule kept such features at a relatively low priority. The fact that
exposing this information called for changes to parser internals ensured
that it couldn't be part of the SAX2 core. (Because information such as a
comment was discarded in SAX1, this information couldn't be layered in
the same way that org.xml.sax.helpers.ParserAdapter does for namespace
support.)
The resolution was to decouple development of the SAX2 declaration and
lexical handlers from the SAX core and to make them optional. The
SAX-extension interfaces were not finalized until December 2000, well after
the SAX2 core was finalized; at this writing, many of the deployed SAX2
parsers still only support the beta test versions of those interfaces. In
practice, most SAX2 parsers do support these two handlers, which are mostly
used to develop infrastructure tools. Applications value the simple
nature of SAX, which lets them focus primarily on a single event handler
interface included in the core of SAX.
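Because the extension handlers are optional, they are attached through a parser property rather than a dedicated setter. The following sketch registers a lexical handler to see comments, which the core ContentHandler never reports. The property URI is the standard SAX2 identifier; the class name CommentCollector and the sample document are illustrative assumptions, and a parser without the extension will throw from setProperty.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical infrastructure tool: gathers the text of every comment.
class CommentCollector extends DefaultHandler implements LexicalHandler {
    // Standard SAX2 property identifier for the lexical handler.
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // The remaining LexicalHandler callbacks are not needed for this sketch.
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static List<String> collect(String xml) throws SAXException, IOException {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        CommentCollector collector = new CommentCollector();
        reader.setContentHandler(collector);
        // Throws SAXNotRecognizedException if the parser lacks the extension.
        reader.setProperty(LEXICAL_HANDLER, collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<doc><!-- checked by hand --><p/></doc>"));
    }
}
```

An application that only needs elements and text never touches this property, which is how the extensions stay out of the way of the simple case.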
In the future, most SAX2 extensions will be able to be layered
independently of SAX2 parsers. Only very few additional kinds of information
appear to need standardized support from inside such parsers.* Today,
most new XML technologies are defined as layers above the XML Infoset,
so they can (and should!) be implemented as layers above SAX2-based
parsers rather than within them.
* See the SAX web site for more information about these. Also, some probable new
extensions are noted in Appendix B.
Is SAX2 a Standard?
In a word, yes: SAX is a shining example of a de facto standard API. You
will have a hard time finding an XML parser written in Java that doesn't
support SAX. In contrast to the recent spate of standards originated by a
formal de jure standards body (notably the International Organization for
Standardization, or ISO), or to specifications pushed by vendors or a
vendor-dominated consortium, SAX2 is a standard in the more classic sense.
It was hammered into shape by users, quenched in the fire of real-world
use, and adopted as a tool after it proved its worth. This partially
motivates its small size and clear focus: it had a clear mission, and little of
the "mission creep" pressure often caused by standards organization politics.
This also explains why its legal status may seem to be unique; SAX is in
the public domain, not copyrighted or controlled by any corporation or
consortium.
You may be familiar with other examples of technology standards that
were developed similarly. For example, the sockets network API widely
used for TCP was popularized at the University of California at Berkeley;
from there it migrated into other Unix systems and then into Microsoft
Windows. Similar processes occurred for other core Unix APIs and the
standard C Library functions. (Some have entered de jure standardization
processes through the ANSI or IEEE POSIX processes, or have been
adopted by vendor consortiums like the one that produced the UNIX98
API set.)
The same sort of process has been happening with SAX. From its initial
base in Java, it's been imported into many XML-programming tool sets in
Python, Perl, Pascal, and JavaScript. There are several different C/C++
versions, and Microsoft has even provided SAX-like COM interfaces. Each
new environment has made changes and adaptations. Some have
remained truer to the original API (in Java) than others, but it looks as if
this growth will only continue. In the best and most classic sense, SAX is a
standard.
Sun's Java API for XML Processing (JAXP)
As of JDK 1.4, SAX2 has been incorporated into the Java2 Standard Edition
(J2SE) through Sun's Java Community Process. It's part of Version 1.1 of
Sun's Java API for XML Processing (JAXP). A JAXP implementation is
bundled with JDK 1.4 releases, and is available separately for use with other
Java-compliant platforms (JDK 1.1 and later). The Java2 Enterprise Edition
(J2EE) has recognized JAXP for some time, and web applications that use
servlets have long been using XML and SAX.
From the perspective of this book, JAXP is just a vehicle to get SAX2
interfaces (and the Crimson parser) into the hands of more Java developers.
Sun incorporated these standard APIs directly into their API set, exactly as
one would desire. In this case, the real community process had completed
before Sun's own process started. The stamp of recognition provided by
Sun facilitated further adoption; some organizations are uncomfortable
with software that has no such recognition.
JAXP 1.1 also incorporates DOM Level 2 and some other APIs, including
TRAX, a wrapper for XSLT-based transformations. TRAX offers limited SAX
support; it supports producing partial SAX event streams as output or,
sometimes, as input. It's worth noting that if you use DOM, JAXP solves a
critical portability problem for you. JAXP has the first standard Java
solution for vendor-independent bootstrapping with DOM. DOM Level 3 plans
to address that problem, but JAXP will have solved it years before Level 3
becomes widely available. If you don't use JAXP's DOM bootstrap APIs,
you must use vendor-specific APIs to get a document object that's
populated with the content of any XML text. Starting down the path of such
vendor-specific APIs quickly leads to nonportable code. SAX has never
had that problem, because it has always included vendor-neutral
bootstrapping APIs. (Although JAXP defines additional SAX bootstrapping
APIs, this book discourages their use.)
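The vendor-independent DOM bootstrapping described above can be sketched as follows. DocumentBuilderFactory and DocumentBuilder are the JAXP classes in javax.xml.parsers; the class name DomBootstrap, the helper method, and the sample document are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomBootstrap {
    /** Builds a DOM tree without naming any vendor's parser class. */
    static Document load(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        // JAXP locates whichever implementation is configured for this JVM.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Note that the input and error types still come from SAX.
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = load("<doc><p/></doc>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints "doc"
    }
}
```

The SAX counterpart is org.xml.sax.helpers.XMLReaderFactory, which has been part of SAX2 itself from the start.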
Packages in the SAX2 API
SAX2 consists of three packages:
org.xml.sax
This package contains the core SAX interfaces and exceptions, plus
two concrete classes. The package has SAX1 as well as SAX2 support.
org.xml.sax.helpers
As suggested by its name, this package exists to help applications use
the SAX core. Most of its classes are utility implementations of core
interfaces (except the parser interfaces), but the portable bootstrap
APIs are also part of this package. SAX2 added to what SAX1 started.
org.xml.sax.ext
This package holds the SAX2 extension handlers, described earlier,
and will probably grow to hold other SAX2 extensions in the future.
Only the org.xml.sax package is guaranteed to be part of every SAX
distribution. The helpers don't depend on the extensions, to preserve this
notion of "core plus options." In practice, it's a rare distribution that
doesn't include all of these packages, and SAX2 parsers without support
for the extension handlers are likely to support only XML subsets. JAXP
1.1 (and hence JDK 1.4) includes all three packages.
Some Popular SAX2 Parser Distributions
Today a variety of high-quality SAX2 parsers are available. Increasingly,
they are packaged with Java programming environments, so you may not
need to fetch one yourself unless you need upgrades (or bug fixes), or are
constructing such a programming environment yourself (perhaps
packaging an embedded system or a standalone application). You should be
able to bootstrap any SAX parser. As a rule, if an XML parser is part of your
Java programming environment, it already supports SAX and probably
SAX2. The documentation should say whether SAX2 is supported. If it
only mentions SAX1, you can upgrade to get most of the core SAX2
features; see the section "SAX1 Support," in Chapter 5, for more information.
If your programming environment doesn't include a SAX parser, you'll
need to get and install one. This section provides a brief summary of
some of the most widely available open source SAX2 parsers.* These
packages all include SAX2, DOM Level 2, and JAXP 1.1 support, and can
validate XML for you. They also have full support for the standard SAX2
extensions. If you don't happen to download documentation that includes
the SAX2 documentation, it'll be available from the same site as the parser.
All of these perform well in most applications, as long as you avoid the
memory penalties of DOM.
* Proprietary SAX2 parsers exist, such as one from Oracle that is commonly used in
Oracle-hosted server-side applications. More information is available on the Oracle web site,
http://www.oracle.com/xml/.
Current versions of all these parsers do quite well on the open source
SAX/XML conformance tests, available at
http://xmlconf.sourceforge.net/java/. Those tests verify that these
processors report essential information required of a SAX1 processor, and
evaluate how well they support the XML 1.0 specification. SAX2
conformance testing isn't yet as well advanced, though some tests are now
available.
In addition to a SAX2 parser, you will likely want to have some SAX2/XML
utilities that are layered on top of that parser. The packages described
here include a DOM implementation, which is normally provided as a
clean layer over SAX2. You might also consider other more Java-friendly
packages such as DOM4J (http://www.dom4j.org) or JDOM
(http://www.jdom.org), both of which are layered over SAX2, as well as
other APIs that provide more data-structure options. When you're learning
SAX, having access to the source code of tools and applications built with
SAX can help you learn the API, at least if it's high-quality source that uses
the SAX APIs correctly.
Ælfred2
One of the original XML parsers mentioned earlier, Ælfred, has long been
recognized for its simplicity, small size, and good performance. As XML
parsers go, it is easy to read and understand. With a different maintainer
(your humble author), this parser was updated to be the first with full
native SAX2 support, and to substantially improve its conformance to the
XML specification. This updated version is called Ælfred2, and versions
have been incorporated in a variety of applications where its simplicity,
size, and conformance are compelling features. It is now part of the GNU
Classpath Extensions project and forms the core of the GNU JAXP library.
The updated version has taken SAX2 further than most other parsers. It
has a highly modular structure; the reference distribution is able to use an
optional stream validator that uses the SAX2 events. The model of an
XML pipeline of such events is a natural and powerful way to think about
SAX; the SAX2 pipeline package in this distribution lets applications
compose arbitrary processing modules in series or parallel. This style of SAX2
processing is emphasized in this book, and some of the examples show
how to use these advanced components. Validation and DOM support
remain completely modular, and use SAX event pipelines, so Ælfred can
still be distributed as a lightweight nonvalidating parser without those
components. Likewise, the validation and DOM support don't need Ælfred
to work.
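The pipeline idea can be tried with nothing beyond the core SAX2 helpers. The sketch below is not the Ælfred2 pipeline package described above; it swaps in org.xml.sax.helpers.XMLFilterImpl, which chains stages in series only. The ElementDropper stage, the element names, and the sample document are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical stage: removes one element type and everything inside it.
// A fuller filter would also suppress other events inside the dropped subtree.
class ElementDropper extends XMLFilterImpl {
    private final String dropped;
    private int depth; // > 0 while inside a dropped subtree

    ElementDropper(XMLReader parent, String dropped) {
        super(parent);
        this.dropped = dropped;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || dropped.equals(localName)) {
            depth++;
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }
}

class PipelineDemo {
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static List<String> survivingElements(String xml) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        // parser -> drop <draft> -> drop <secret> -> application handler
        XMLReader pipeline = new ElementDropper(new ElementDropper(parser, "draft"), "secret");
        final List<String> names = new ArrayList<>();
        pipeline.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes a) {
                names.add(localName);
            }
        });
        pipeline.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        // prints [doc, keep]
        System.out.println(survivingElements(
            "<doc><keep/><draft><keep/></draft><secret/></doc>"));
    }
}
```

Each stage sees only SAX events, so a stage can be inserted, removed, or reordered without the parser or the final handler knowing about it.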
The current version of Ælfred is licensed under the GNU General Public
License (GPL), with the library exception clause to ensure that it can be
used in proprietary applications (notably, embedded systems) that aren't
themselves licensed under the GPL. That license is used with many GNU
libraries, such as the GCC Java (GCJ) runtime libraries. Ælfred includes a
gnujaxp.jar file that needs installation.
See http://www.gnu.org/software/classpathx/jaxp/ for information about
the current distribution of Ælfred.
Crimson
Sun, through Java Project X in its Java division, was one of the earliest
major Java vendors to support SAX and XML namespaces. This parser was
the first to demonstrate that XML could be validated without a significant
penalty. It was dozens of times faster than its competitors and offered
more XML conformance. History buffs may like to know that its validation
was based on some of the SGML/HTML validation code from the HotJava
web browser, the original Java-and-the-Web showpiece software package.
This XML code ties directly to some of the earliest Java software seen
outside of JavaSoft.
Crimson is a version of the Java Project X software, updated to support
SAX2, DOM Level 2, and JAXP 1.1 (for which it is the reference
implementation). It was submitted to the Apache XML project to help trigger a
"best of breed" XML parser.
Crimson is licensed under the Apache Software License. The Crimson
parser has been incorporated into Sun's JDK 1.4 release as its standard
XML parser. It is separately distributed as the reference parser for JAXP, so
most JAXP distributions include it. This book describes Crimson Version
1.1.3 (matching JDK 1.4), dated October 2001, which includes jaxp.jar and
crimson.jar files that need installation.
See http://java.sun.com/xml/ for information about this distribution.
Xerces
Xerces is a family of XML parsers in the Apache XML project; in this book,
we refer only to the Java version, not the C/C++ version. It has evolved
from the second generation of IBM's XML for Java (XML4J) parser, and
much of its development and maintenance is still handled by IBM. It is
relatively large, and is monolithic rather than modular. It also supports
many nonstandard extensions. For example, validation against W3C's XML
schemas is part of the parser, rather than a layered feature.
Xerces v2 is a third-generation project. Goals of that project include a
more maintainable and modular design. It includes an internal XML event
pipeline model, which is strikingly similar to that used in Ælfred to layer
validation and DOM support, except that it doesn't use SAX2 to represent
the XML Infoset data.
Xerces is licensed under the Apache Software License. This book
describes Xerces Version 1.4.3, dated August 2001, which includes a
xerces.jar file that needs installation.
See http://xml.apache.org/xerces for information about this distribution.
Installing a SAX2 Parser
Unless you use JDK 1.4 (which bundles SAX2 and the Crimson parser) or
some other environment that's already set up with SAX2 support (such as
any up-to-date web application server), you will need to update your Java
programming environment so that you can use SAX. Consult the
documentation that comes with your parser and Java Virtual Machine for
specific details. Assuming the SAX interfaces and your SAX parser are
distributed in a single JAR file called xml.jar (you'll need to know and use
the correct full pathname, including the directory), you'll probably use
one of the approaches shown in the following list.
Add to extensions directory
If you use JDK 1.2 or later for your runtime environment,* you can
install the JAR file into the jre/lib/ext subdirectory of your Java
distribution. This is the preferred solution during development, since it's
the simplest and least error-prone.
On Windows, you may need to add this to two different locations:
one for the development environment as well as one for the runtime
environment.
Update class path on command line
This solution works with JDK 1.2 and later. Whenever you invoke a
program that needs the SAX support (such as java, javac, or javadoc)
pass the -cp xml.jar parameter to add SAX to the class path.
Add to CLASSPATH in environment
This is the original way to add software to your Java environment,
and it works on a JDK 1.1-based system and on many Java
implementations that aren't derived from Sun's JDK. You may prefer this
technique if you have to make several different Java execution
environments cooperate, perhaps one for each IDE and test environment
used for application development. You could also have your
application use its bundled JVM when it's deployed, rather than whatever the
end user happened to have around.
The details vary from operating system to operating system, and from
installation to installation, because you may need to ensure that your
CLASSPATH includes libraries internal to the JVM. Put the CLASSPATH
assignment into your login script (autoexec.bat or your environment
variables, .profile, .login, or other file). On Windows, you'll likely
need to reboot after you modify autoexec.bat, to ensure that all new
JVM instances see the new configuration.
* Most current graphical development tools, called IDEs, bundle this software for Java.
You may end up with a variety of SAX2 parsers in your environment.
Sometimes which parser you use will be important, but you should avoid
creating such problems. See the section "Bootstrapping an XMLReader" in
Chapter 3 for information about making sure you're using a particular
parser; there are several mechanisms, including setting system properties
and adding META-INF/services/ resources to your class path. If you work
within some application environment (perhaps a web server), you may
want to look for specialized configuration mechanisms. Also, if you have
SAX1 support in your environment, you can easily upgrade it; see the
section "SAX1 Support" in Chapter 5.
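One quick way to check an installation, and to see which parser the class path actually yields, is to bootstrap a reader and print its class. This is only a sketch: the class name ParserReport is an assumption, while org.xml.sax.driver is the system property that XMLReaderFactory.createXMLReader() is documented to consult.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

class ParserReport {
    // System property consulted by XMLReaderFactory.createXMLReader().
    static final String DRIVER_PROPERTY = "org.xml.sax.driver";

    /** Returns the class name of the XMLReader this environment provides by default. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static String defaultReaderClass() throws SAXException {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        return reader.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        // The property is null unless it was set, for example with -D on the command line.
        System.out.println(DRIVER_PROPERTY + " = " + System.getProperty(DRIVER_PROPERTY));
        System.out.println("XMLReader class = " + defaultReaderClass());
    }
}
```

Running it as java -Dorg.xml.sax.driver=your.parser.ClassName ParserReport (with a real parser class substituted for the placeholder) shows whether the property is being honored. If the class cannot be loaded at all, the SAX packages or the parser are missing from the class path.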
Note that because SAX lets applications hand character streams to parsers
with java.io.Reader, you can't use JDK 1.0 with SAX. You need JDK 1.1,
which is a more complete and stable release in any case. Since the Java
environments that aren't based on Sun's code generally treat JDK 1.1 as
the conformance target,* that should cause no real trouble. SAX itself
doesn't require more recent APIs, but some of the tools you use with SAX
might have such requirements. For portability, the example code in this
book avoids use of APIs added in JDK 1.2 and later. The main impact of
this restriction is that in a few cases you'll be able to get minor
performance improvements by using the collections APIs.
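The java.io.Reader dependency shows up in org.xml.sax.InputSource, which can wrap either a character stream or a byte stream. The sketch below feeds the same document both ways. The class and method names are illustrative assumptions, and StandardCharsets is a later-JDK convenience that the book's own examples would not use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

class InputSourceDemo {
    /** Character stream: already decoded, so the parser never guesses an encoding. */
    static InputSource charSource(String body) {
        return new InputSource(new StringReader(body));
    }

    /** Byte stream: the parser decodes it using the encoding declaration. */
    static InputSource latin1Source(String body) {
        String text = "<?xml version='1.0' encoding='ISO-8859-1'?>" + body;
        return new InputSource(
            new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1)));
    }

    /** Parses the source and returns all character data concatenated. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static String textOf(InputSource source) throws SAXException, IOException {
        final StringBuilder text = new StringBuilder();
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(source);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String body = "<doc>caf\u00e9</doc>";
        // Both routes deliver the same characters to the handler; prints "true".
        System.out.println(textOf(charSource(body)).equals(textOf(latin1Source(body))));
    }
}
```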
What XML Are We Talking About?
Over the past years, there has been an explosive growth in the number of
XML-related standards. Talking about XML has become confusing, because
those three letters can mean so many different things. Some people
actually mean what I've called "Greater XML." Think of it this way: Boston is
a significant city, but people who don't live there may often name Boston to
refer to other nearby towns (Arlington, Cambridge, and so on). What
they're really talking about is the Greater Boston Metropolitan Area, or
sometimes even just Eastern Massachusetts.
In much the same way, many people now talk about XML when they
really mean one of dozens of related technologies built around the
nucleus of XML. Some of these may even be part of the original XML
vision as "SGML for the Web." Using XML to develop documents using a
DTD like DocBook (http://www.docbook.org) is clearly part of that
original open systems vision. However, it's also been trendy to market "new
and improved!" software as based on XML. Such ambiguities can be
confusing and can even implicitly promote vendor lock-in, rather than liberate
customer data from vendor control. The simplicity at the core of XML isn't
friendly to lock-in strategies, but complex application layers on top of
XML can certainly cause closed systems.
* Many Macintosh developers can't use JDK 1.2 yet. The Microsoft JVM also does not
support JDK 1.2 APIs.
So when someone says that SAX is a great API for XML processing, exactly
what part of Greater XML does that mean? Briefly, parts built with the
core XML specifications. The following list shows the parts that this
book uses in most of its examples.
XML 1.0 (Second Edition)
http://www.w3.org/TR/REC-xml
This text document format is the core of XML. SAX2 parsers work
with this format and turn it into a stream of events that present the
XML Infoset. However, as we'll see, SAX can be quite useful without
even parsing XML text. (The second edition incorporates a variety of
bug fixes and a few functional changes, which were previously
published as a separate list of errata.)
XML includes Document Type Definitions, or DTDs. These provide
several processing facilities, most of which you can rely on even
when you don't use a validating parser. All XML parsers must support
DTDs; they're what schema technologies attempt to improve on.
Unicode support has been part of XML from the earliest days. Java
programmers may tend to overlook the significance of that fact, since
it's always been part of Java too. But it's actually a big deal that XML
moves web technologies firmly away from ASCII toward Unicode, in
all programming environments (not just Java); not everyone needs to
be a native English speaker to make best use of Internet technologies.
XML has even been called a "virus for Unicode."
XML Infoset
http://www.w3.org/TR/xml-infoset/
The Infoset is best explained as an abstract model for what XML
represents: information like elements, attributes, and character data. The
Infoset exposes XML structure, not meaningful data. Applications
transform Infoset data into forms that are suited to their particular
tasks, normally behind a veil of application objects, unless they
manipulate the text like a text editor.
The SAX2 event APIs present Infoset-level data; the lower-level
alternative is to work directly with text. (See Appendix B for details about
Infoset support in SAX2.) Other XML infrastructure, such as XInclude,
generally transforms or augments Infoset data. Higher-level APIs
generally hide such XML structures.
XML Namespaces
http://www.w3.org/TR/REC-xml-names/
Namespaces are an optional convention for XML 1.0 documents.
Namespaces distinguish elements and attributes so that names can be
reused when necessary. For example, in document markup a <table>
probably refers to a tabular presentation of data, but in a furniture
catalog it might also refer to something rather different. XML
namespaces distinguish those cases with name prefixes; unlike straight
XML with DTDs, those prefixes are expected to change in different
contexts (such as different parts of that furniture catalog). This makes
combining namespaces and DTDs complicated.
One of the most visible differences between SAX1 and SAX2 is that
SAX2 has integrated support for XML namespaces to promote their
widespread adoption.
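The two meanings of table in the namespaces entry above can be told apart in a SAX2 handler by namespace URI rather than by prefix. In this sketch the furniture namespace URI, the class name TableSorter, and the sample catalog are hypothetical; the namespaces feature URI is the standard SAX2 identifier and is on by default.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

class TableSorter extends DefaultHandler {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    // Hypothetical namespace URI for the furniture vocabulary.
    static final String FURNITURE = "http://example.com/furniture";

    int markupTables;
    int furnitureTables;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"table".equals(localName)) {
            return;
        }
        // Compare the namespace URI; the prefix in qName may differ from place to place.
        if (XHTML.equals(uri)) {
            markupTables++;
        } else if (FURNITURE.equals(uri)) {
            furnitureTables++;
        }
    }

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static TableSorter sort(String xml) throws SAXException, IOException {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        // Already the SAX2 default; set explicitly to make the dependency visible.
        reader.setFeature("http://xml.org/sax/features/namespaces", true);
        TableSorter sorter = new TableSorter();
        reader.setContentHandler(sorter);
        reader.parse(new InputSource(new StringReader(xml)));
        return sorter;
    }

    public static void main(String[] args) throws Exception {
        TableSorter s = sort(
            "<catalog xmlns:h='http://www.w3.org/1999/xhtml'"
            + " xmlns:f='http://example.com/furniture'>"
            + "<h:table/><f:table/>"
            + "<section xmlns:x='http://example.com/furniture'><x:table/></section>"
            + "</catalog>");
        // prints "1 markup, 2 furniture"
        System.out.println(s.markupTables + " markup, " + s.furnitureTables + " furniture");
    }
}
```

The x:table element inside section is counted as furniture even though its prefix changed, which is the behavior that is awkward to express with DTDs alone.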
Over time, some other simple layers (and conventions) may become
appropriate to view as part of the core of XML. The XML Base
specification (http://www.w3.org/TR/xml-base/) might be an example of
such a facility; it explains how to use an xml:base attribute to augment
normal processing of relative URIs found in text.* Various
internationalization rules and policies are also likely to fit into that
core. One example is W3C work on the Character Model for the World Wide
Web (http://www.w3.org/TR/charmod/), which promotes uniform handling of
sequences used to represent some non-ASCII characters. Another is
currently called XML Blueberry, which will modify XML 1.0 to allow use
of new Unicode characters in element and attribute names. Those
characters support languages not previously supported (before Unicode
3.1) and also improve support for languages such as Japanese.
Many of the increasingly substantial layers over XML, such as schemas
(there are many schema approaches, with one from W3C), schema APIs
and tools (which may focus on non-XML data models, distant from
downtown XML), Remote Procedure Calls (RPCs; again, many
approaches including one from W3C), XPath (and its outgrowths), and
XSLT are prime examples of technologies that deserve to be viewed as
technology choices in their own right. They are other cities in the
metropolis of Greater XML, satellites of the original village that
leverage the original civic infrastructure. Some of those layers may even
reflect different fundamental goals and requirements from those that
originally drove the creation and adoption of XML. That doesn't mean
that you won't put SAX interfaces on them (or at least SAX-friendly
ones), but because they are data layers over the core of XML, they may
involve API layers too.
* In fact, since this list includes the XML Infoset in the core, documents with the xml:base
attribute implicitly need XML Base in their core view of XML to augment normal
interpretation of URIs in document content. Example 5-1 shows one way to implement such
processing in SAX.
If you look at Java implementations of other technologies in Greater XML,
you'll probably find SAX not far from the surface. This book identifies a
number of such SAX-based tools and shows SAX events used as a
framework to efficiently integrate these different technologies.
2
Introducing SAX2
In this chapter:
Producers and Consumers
Beginning SAX
Basic ContentHandler Events
Producer-Side Validation
Exception Handling
Namespaces and SAX2
SAX gets its power from the unifying notion that sequences of event
callbacks are powerful and lightweight ways to represent the information
in XML documents. Building on that notion, you can create many powerful
tools. Most of the essential SAX calls use interfaces, so the interesting
behavior comes from how you combine implementations of those
interfaces to assemble tools and what those implementations do.
This chapter shows the basic structure of SAX and of several classic SAX
applications using an XML parser, the simple core of SAX. It starts by
showing the essential components and the framework through which they
relate. Then it shows how to customize the most important features and
concepts in that framework and how to work with the core XML data
model of elements, attributes, and text. You'll also see how to handle
errors and learn how SAX exposes XML namespaces.
This chapter focuses on the parts of SAX that essentially every application
needs to know. It doesn't provide full information about every interface.
Later chapters elaborate on these structures and concepts, showing addi-
tional parts of these APIs, ways to combine SAX components, and how to
work with additional parts of the XML data model. Depending on what
your application needs to do, you may not need to know much more of
SAX than is explained in this chapter.
Producers and Consumers
The first thing to learn is that there are really two kinds of roles in
this API (or three, if you include your role as director, configuring
sets of components to serve those roles and provide your application's
functionality). Complete SAX applications integrate all these roles.
The first role is an event producer, which is typically an XML parser
packaged as an instance of some library class. The producer is in charge
of pushing parsing events to objects that serve the second role: an event
consumer. Most SAX applications will only have one event producer,
though we'll look at some cases where you need more than one. This
chapter touches on several of the ways to configure (or customize) event
producers.
Consumers normally do most of the real work for any given SAX-based
application: they make sense of the parsing events and often create some
specialized data structures. Without a consumer to handle events, nothing
happens! SAX2 defines several kinds of handlers to consume different
parts of the XML content. Later chapters look at each kind of handler in
detail, but in this chapter we look only at the most important methods and
handlers.
When we show SAX components connecting, we'll use diagrams like
Figure 2-1, with the dashed lines indicating individual event handlers.
There are four of them because there are four handlers used to deliver
content to consumers. The producer uses a big arrow, which should remind
you in which direction it pushes events.
[Figure: a Producer box pushes events to a Consumer box]
Figure 2-1. Producer and consumer
When you're using SAX, any or all of these components can be provided
by your application or can be library components. Often you'll use an
XML parser from a library to produce events, but in other cases
applications produce such events directly.
Beginning SAX
This chapter explores SAX through some progressively more functional
examples, which build on each other to present the key concepts that are
discussed later in more detail. Essential producer and consumer interfaces
are presented together to show how they interact, and you'll see how to
customize classic SAX configurations. We'll focus first on the producer
side, saving most details about consumer-side APIs for a bit later.
How Do the Parts Fit Together?
In the simplest possible example, you (in your role as director) will get an
XML parser, which will later produce parsing events. Then you will get a
consumer and connect it to the producer for processing the most
important events. Finally, you'll ask that parser to produce events,
pushing them through to the consumer.
To start, focus on what the different parts are, and how they relate to each
other. Example 2-1 is a simple SAX program, which you can compile and
run if you like.
Example 2-1. SAX2 application skeleton
import java.io.IOException;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Skeleton {
    // argv[0] must be the absolute URL of an XML document
    public static void main (String argv [])
    {
        XMLReader producer;
        DefaultHandler consumer;

        // Get an instance of the default XML parser class
        try {
            producer = XMLReaderFactory.createXMLReader ();
        } catch (SAXException e) {
            System.err.println (
                "Can't get parser, check configuration: "
                + e.getMessage ());
            return;
        }

        // Set up the consumer
        try {
            // Get a consumer for all the parser events
            consumer = new DefaultHandler ();

            // Connect the most important standard handler
            producer.setContentHandler (consumer);

            // Arrange error handling
            producer.setErrorHandler (consumer);
        } catch (Exception e) {
            // Consumer setup can uncover errors,
            // though this simple one shouldn't
            System.err.println (
                "Can't set up consumers: "
                + e.getMessage ());
            return;
        }

        // Do the parse!
        try {
            producer.parse (argv [0]);
        } catch (IOException e) {
            System.err.println ("I/O error: ");
            e.printStackTrace ();
        } catch (SAXException e) {
            System.err.println ("Parsing error: ");
            e.printStackTrace ();
        }
    }
}
This is a complete SAX application, though it's sort of boring since it
throws away all the data the parser delivers. The only reason this program
would print anything at all is if you didn't pass it an argument that was
the URL for a well-formed XML file. Other than that, it's fairly typical of
how you'll be using SAX2, at least in terms of the basic structure. You can
make real programs from this skeleton if you substitute smarter
components for the simple ones shown here.
We introduced a few SAX classes and interfaces, so we can add some
details to our earlier producer/consumer picture to get Figure 2-2. This
producer is an XMLReader, and we're listening to one consumer interface
and the ErrorHandler. The whole thing is driven by an application which
is pulling the whole document through the reader.
[Figure: the application calls parse() on an XMLReader, which pushes
events to a DefaultHandler acting as content consumer and ErrorHandler]
Figure 2-2. Basic SAX roles and components
XMLReader producer;
    The most common type of SAX2 event producer is an XML parser.
    Like most parsers, XML parsers implement the XMLReader interface.
    Whether or not they parse actual XML (instead of HTML or something
    else), they are required to produce events as if they did.
    Don't confuse this interface with the java.io.Reader from which you
    can pull a stream of character data. SAX parsers produce streams of
    SAX events, which they push to event consumers. Those are rather
    different models for how to deliver data.
producer = XMLReaderFactory.createXMLReader ();
    This is the best all-around SAX2 bootstrap API when you need an
    XML parser. The only time it should produce any kind of exception is
    when your environment is misconfigured. For example, you might
    need to set the org.xml.sax.driver system property to the class name
    for your parser (see the section "The XMLReaderFactory Class" in
    Chapter 3).
    You can (and should!) keep reusing this XMLReader, but you should
    only have one thread touch a parser at a time. That is, parsing is not
    re-entrant. Parsers are perfectly safe to use with multiple threads,
    except that two threads can't use the same parser at the same time.
    (That's a good rule of thumb for most objects in multithreaded code,
    in all programming languages; it should feel natural to apply that rule
    to SAX parsers.)
consumer = new DefaultHandler ();
    The DefaultHandler class is particularly handy when you're just
    starting to use SAX. It implements most of the event consumer
    interfaces, providing stubbed out (no-op) implementations for each
    method that's not part of an extension handler. That means it's easy
    to subclass this class if you need a place to start: just override each
    stub method to provide real code when you need it. We'll use
    DefaultHandler to avoid presenting extra callback methods.
producer.setContentHandler (consumer);
    In this chapter, we're only showing the most commonly used
    consumer interfaces. ContentHandler is used to report elements,
    attributes, and characters; that's enough to get almost all serious XML
    work done.
producer.setErrorHandler (consumer);
    ErrorHandler lets applications control handling of various kinds of
    errors, and we'll need it in later examples. We'll usually look at error
    handling as a specialized kind of task, different from other consumer
    roles. Even though "handler" is part of its name, it's a different kind
    of object.
producer.parse (argv [0]);
    This call tells a parser to read the XML text found at a particular fully
    qualified URL. There's another call you'll use when you don't have a
    URL for that text, but most of the time this is the call you ought to
    use. If you're tempted to pass filenames or relative URIs, just say no!
    Filenames need to be converted to URLs first (see the section
    "Filenames Versus URIs" in Chapter 3), and relative URIs must be
    converted to absolute ones.
    Parsing can report exceptions. This is important, and not just because
    it's the only way that a chunk of code like this (using just an
    XMLReader) could seem to do anything. Normally, those exceptions will
    be thrown only for fatal errors, such as well-formedness errors in an
    XML document, or for document I/O problems.
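The reuse advice above can be sketched as follows. This is only an illustration under stated assumptions: the class name ReusedReader, the nested Counter handler, and the countElements() helper are hypothetical names invented here, and the documents are parsed from in-memory strings through the standard parse(InputSource) overload rather than from URLs. One XMLReader is created once and then driven through several parses, one at a time; the synchronized keyword is just one way to keep two threads from touching the same parser at once.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: one parser instance, reused for many documents.
class ReusedReader {
    // Counts startElement() events; the count is reset for each document.
    static class Counter extends DefaultHandler {
        int elements;

        @Override
        public void startDocument() {
            elements = 0;
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            elements++;
        }
    }

    private final XMLReader producer;
    private final Counter consumer = new Counter();

    ReusedReader() throws SAXException {
        producer = XMLReaderFactory.createXMLReader();
        producer.setContentHandler(consumer);
        // DefaultHandler rethrows fatal errors, so parse() fails loudly
        producer.setErrorHandler(consumer);
    }

    // synchronized: only one thread may drive this parser at a time
    synchronized int countElements(String xml)
            throws IOException, SAXException {
        producer.parse(new InputSource(new StringReader(xml)));
        return consumer.elements;
    }

    public static void main(String[] args) throws Exception {
        ReusedReader reader = new ReusedReader();
        System.out.println(reader.countElements("<a><b/><b/></a>")); // prints 3
        System.out.println(reader.countElements("<stanza><line>x</line></stanza>")); // prints 2
    }
}
```

Code that needs real parallelism would normally give each thread its own XMLReader instead of sharing one behind a lock.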
The application thread is pulling the XML text through the
XMLReader-style producer: the parse() call won't return until the whole
document is parsed, or until parsing is aborted by throwing an exception.
Until it returns, the thread that called the XMLReader is either blocking
on I/O, parsing data that it just read, or pushing data into one of the
consumer interfaces. That is, from the perspective of event consumers
SAX2 is a push API: handlers do nothing until they're asked.
What Are the SAX2 Event Handlers?
SAX2 events are grouped into several interfaces, which we explore later in
more detail. All except two are implemented by DefaultHandler. Each
interface encapsulates a set of events; to see those events, applications
give parsers objects that implement the handler interfaces they're
interested in.
org.xml.sax.ContentHandler
    Essentially every significant use of SAX2 involves this handler. The
    element and character data callbacks (discussed later in this chapter)
    are defined in this interface, as are callbacks for most other SAX2
    events for general-purpose data. Many SAX2 applications will focus
    primarily on this interface. If you only need the core XML data model
    (elements, attributes, and text), this could be the only handler you
    use.
org.xml.sax.ext.DeclHandler
    This handler reports DTD declarations that aren't exposed through
    DTDHandler (or in one case LexicalHandler) callbacks: declarations
    for elements, attributes, and parsed entities.
    Because it is an extension handler, it won't necessarily be recognized
    by all SAX2 parsers, and DefaultHandler doesn't provide no-op
    implementations for its callbacks.
org.xml.sax.DTDHandler
    This handler reports DTD declarations that the XML 1.0 specification
    requires all processors to expose: declarations for notations and for
    unparsed entities. Most applications won't use this interface unless
    they're connected to SGML-based infrastructure that depends on such
    tools. This is probably the most exotic SAX handler interface;
    web-oriented XML applications will use MIME types instead of notations
    and URIs instead of unparsed entities.
org.xml.sax.ErrorHandler
    The events reported by this interface are errors and warnings. These
    behaviors are part of XML, but not part of the data model, so they
    don't show up in the Infoset. Grouping these events in one interface
    lets application code centralize treatment of XML or application data
    errors. After ContentHandler, it's probably the most important SAX2
    handler. It's also usefully managed apart from other handlers, so in
    this book it's usually not lumped with "real" handlers. (This interface
    is discussed later in this chapter.)
org.xml.sax.ext.LexicalHandler
    This interface mostly exposes information that is intended to be
    semantically meaningless, such as comments and CDATA section
    boundaries, as well as entity and DTD boundaries.
    Because it is an extension handler, it won't necessarily be recognized
    by all SAX2 parsers, and DefaultHandler doesn't provide no-op
    implementations for its callbacks.
With the exception of ErrorHandler, you'll normally want to work with all
of these interfaces as a single group: four interfaces, two for content in
the document body and two for DTD content. That way, you will work with
all the XML data from a document (its Infoset) as part of a cohesive
whole. There are SAX2 helper classes (like DefaultHandler and
XMLFilterImpl) that group most of these interfaces into classes, but they
ignore the two extension handlers (Decl and Lexical handlers in the
org.xml.sax.ext package). SAX2 application layers often handle such
grouping; for example, you can subclass those helper classes in a
different package, adding extension interface support.
The logic behind keeping these interfaces separate, rather than merging all
of their methods into one huge interface, is that it's more appropriate for
simple applications. You must explicitly ask for bells and whistles; they
aren't thrust upon you by default. You can easily prune out certain data by
ignoring the interfaces that report it. Most code only uses ContentHandler
and ErrorHandler implementations, so the methods in other interfaces are
easy to ignore. Plus, from the application perspective, parser recognition
of the extension handlers isn't guaranteed. There's a slight awkwardness
associated with needing to bind each type of handler separately, but that's
a small trade-off for the benefit of having a modular API extension model
already in place.
SAX2 defines another important interface beyond these handlers and the
XMLReader: parsers use EntityResolver to retrieve external entity text they
must parse. That interface is also stubbed out by DefaultHandler. If you
want the parser to use local copies of DTDs rather than DTDs accessed
from a server that might not be available, you'll want to become familiar
with EntityResolver. However, it isn't really a consumer API since it
doesn't deal directly with parsed XML data (the Infoset); it deals with
accessing raw unparsed text, the same stuff that's given to
XMLReader.parse() methods. This book presents it as a producer-side
helper for parsers, in the section "The EntityResolver Interface" in
Chapter 3.
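As a rough sketch of the "local copies of DTDs" idea, the resolver below hands the parser DTD text held in memory whenever a particular system identifier is requested. Everything specific here is an assumption made for illustration: the class name LocalDtdResolver, the URL in DTD_URL, the DTD text itself, and the parseText() helper are hypothetical. Only the EntityResolver, InputSource, and XMLReader calls are standard SAX2.

```java
import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical resolver that serves one DTD from memory.
class LocalDtdResolver implements EntityResolver {
    // Assumed system ID and DTD text, used only for illustration
    static final String DTD_URL = "http://www.example.com/dtds/verse.dtd";
    static final String DTD_TEXT =
        "<!ELEMENT stanza (line+)>\n"
        + "<!ELEMENT line (#PCDATA)>\n"
        + "<!ENTITY who \"miner\">\n";

    int hits; // how many times the local copy was served

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (DTD_URL.equals(systemId)) {
            hits++;
            InputSource local = new InputSource(new StringReader(DTD_TEXT));
            local.setSystemId(systemId);
            return local;
        }
        return null; // anything else: let the parser fetch it itself
    }

    // Parses the XML text and returns all of its character data.
    static String parseText(String xml, EntityResolver resolver)
            throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader producer = XMLReaderFactory.createXMLReader();
        producer.setEntityResolver(resolver);
        producer.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int length) {
                text.append(buf, offset, length);
            }
        });
        producer.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE stanza SYSTEM \"" + DTD_URL + "\">"
            + "<stanza><line>Dwelt a &who;</line></stanza>";
        System.out.println(parseText(xml, new LocalDtdResolver()));
    }
}
```

Returning null from resolveEntity() is the standard way to tell the parser to use its default behavior for that entity.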
XMLWriter: an Event Consumer
The next part of SAX we show in this overview is really not a part of SAX,
except that it uses SAX to do something you'll likely need to do fairly
often. (Pretty much everyone does!) As you've seen, SAX2 includes an
XMLReader interface, used to turn XML text into a stream of SAX events.
But it does not include the corresponding XMLWriter to reverse the
process: turning such events back into text and supporting XML for
program outputs as well as inputs. SAX isn't only for reading XML. The
same APIs are used to write XML too.
It's almost a tradition to show how to write most of such a class as an
example when explaining SAX. We avoid that in this book because getting
all the XML details right is tricky, and because this class is a clear example
of something that should be treated as a reusable SAX library component.
There are lots of ways the data needs to be escaped, and sometimes you
need to use output encodings (like ASCII) that have problems
representing some XML characters.
There's a better solution: use one of several such classes, which are
widely available. This book uses the gnu.xml.util.XMLWriter class
(bundled with gnujaxp.jar and Ælfred) when it needs XML generation
functionality, because it doesn't force applications to discard as much of
the XML data. It supports all of the SAX2 handlers, including the extension
handlers LexicalHandler and DeclHandler, so it can round-trip almost all
XML data. To use such classes, at least in their simple low-fidelity modes,
you can modify the skeleton program shown earlier to something like this:
import java.io.FileOutputStream;
import gnu.xml.util.XMLWriter;

public class ... {
    ...
    setContentHandler (
        new XMLWriter (new FileOutputStream ("out.xml"))
    );
    ...
}
In addition to the GNU class used in this book, other versions are
available. One is provided with DOM4J (org.dom4j.io.XMLWriter), which
supports Content and Lexical handlers and evolved from the
com.megginson.sax.XMLWriter class, which supports only
ContentHandler. Curiously, neither Crimson nor Xerces includes such
SAX-to-text functionality at this time.
Event pipelines
Of course, just parsing and echoing data is not very useful. Such classes
are best used to output XML data that you've massaged a bit. We'll look at
two ways to do this later. One way is to use an XML pipeline, where
consumers produce data for other consumers, as illustrated in Figure 2-3.
For example, one stage could filter the event stream from a parser to
remove various uninteresting elements, or otherwise transform the data,
and then feed the result to an XMLWriter. You can combine several such
stages into a pipeline and debug them using an XMLWriter to watch data as
it flows through particular stages. Remember that XMLReader isn't the
only kind of SAX event producer: programs can write events and feed the
result to an XMLWriter. Also, the consumer doesn't need to be an
XMLWriter; it could construct any kind of useful data structure. In fact,
we'll look later at doing this with DOM.
[Figure: three stages in sequence: Read, Transform, Write]
Figure 2-3. Simple SAX2 event pipeline
This kind of processing pipeline is a fundamental model for more
advanced uses of SAX and for structuring components that are SAX-aware.
We look at pipelines again in the section "XML Pipelines" in Chapter 4.
For now, keep in mind that sometimes event consumers will be producing
events for later processing components.
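A filtering stage of the kind just described might be sketched like this, using the standard XMLFilterImpl helper class as the base for the middle stage. The class name DropElementFilter, the element name "note", and the run() helper are hypothetical, and the final stage here only builds a rough event trace in a string (it does no escaping), standing in for a real XMLWriter.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical pipeline stage: drops one kind of element and its content.
class DropElementFilter extends XMLFilterImpl {
    private final String unwanted;
    private int depth; // greater than zero while inside an unwanted element

    DropElementFilter(XMLReader parent, String unwanted) {
        super(parent);
        this.unwanted = unwanted;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if (depth > 0 || unwanted.equals(qName)) {
            depth++; // swallow this element and everything nested in it
            return;
        }
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] buf, int offset, int length)
            throws SAXException {
        if (depth == 0)
            super.characters(buf, offset, length);
    }

    // Read -> Transform -> (trace): returns a rough trace of surviving events.
    static String run(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        XMLReader stage = new DropElementFilter(
            XMLReaderFactory.createXMLReader(), "note");
        stage.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] buf, int offset, int length) {
                out.append(buf, offset, length);
            }
        });
        stage.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints <doc><p>keep</p></doc>
        System.out.println(
            run("<doc><p>keep</p><note>drop <b>this</b></note></doc>"));
    }
}
```

Because the filter is itself an XMLReader, the application treats the whole chain as a single producer: it registers handlers on the last stage and calls parse() there.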
Concerns when writing XML text
There are several important issues to consider when writing XML output,
which should be mentioned in the documentation for the XMLWriter you
use. You may even be able to use your XMLWriter to canonicalize output,
so you can safely compare processor output or create digital signatures.
The GNU class shown earlier handles most of these directly, but that's not
true for all such classes.
• You need the flexibility to choose different line endings, such as
  Macintosh style (CR only), DOS style (CRLF), and Unix style (LF only).
  The default should be right for the host operating system, but
  sometimes that's not right for the destination.
• The SAX2 event stream might discard essential namespace prefix
  information. If you're using documents with namespaces, you need
  to provide a sanitized event stream, making sure either that such data
  is not discarded (using the "mixed mode" namespace handling
  discussed later in this chapter) or that corresponding data gets
  synthesized (maybe in some pipeline stage).
• You might be sending XML to applications that don't handle DTDs or
  external entities very well. For example, many web browsers won't
  read DTDs. To talk robustly to such applications, you might need to
  send standalone documents.
• If your application just uses ContentHandler events, you'll have
  discarded information needed to re-create high-fidelity output
  reflecting DTD content, comments, entity references, and CDATA section
  boundaries. More handlers are detailed in Chapter 4 and briefly
  summarized earlier in this section; most of the writers implement
  many such interfaces.
• If you don't want to use UTF-8 as your character encoding (or
  UTF-16), you'll have to be sure the names used by your markup can
  be expressed using that character encoding. That's because while
  numeric character references can be used inside text, they can't be
  used inside markup components like element and attribute names.
  ASCII, for example, is hopeless at handling element names that use
  Japanese ideographic characters, but it can handle Japanese text if you
  don't mind that every character in the document text is cryptically
  expressed as a numeric character reference.
• The first time you try to debug XML output where a single line is even
  just a few kilobytes in length, you'll want your XMLWriter to be
  pretty printing. Minimally it should add line breaks; ideally it should
  be able to indent to show document structure.
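To make the encoding point in the list above concrete, here is a minimal sketch of how a writer might escape character data for an ASCII-only output encoding. The class name XmlEscape and the escapeText() method are hypothetical and are not taken from any of the XMLWriter classes mentioned here. The sketch covers only text content: it escapes the markup characters and turns every non-ASCII character into a numeric character reference. It does not handle attribute values or element names, and it assumes a surrogate pair is never split across two calls.

```java
// Hypothetical helper, not part of SAX or of any XMLWriter named above.
class XmlEscape {
    // Escapes character data so it can be written in an ASCII-only encoding.
    static String escapeText(char[] buf, int offset, int length) {
        String s = new String(buf, offset, length);
        StringBuilder out = new StringBuilder(length + 16);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // handles surrogate pairs
            i += Character.charCount(cp);
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:
                    if (cp < 0x80) {
                        out.append((char) cp);
                    } else {
                        // numeric character reference for non-ASCII text
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase())
                           .append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] text = "Tom & Jerry <3 caf\u00e9".toCharArray();
        // prints Tom &amp; Jerry &lt;3 caf&#xE9;
        System.out.println(escapeText(text, 0, text.length));
    }
}
```

The signature mirrors the characters() callback so that a handler could pass its arguments straight through.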
Such an XMLWriter is part of almost every developer's SAX toolkit, even
though it isn't part of SAX itself. As you work with SAX, you'll probably
start to collect and develop your own library of such reusable event
consumer code.
Basic ContentHandler Events
You've just seen how the parts of a SAX2 application fit together, so now
you're ready to see how the data is actually handled as it arrives. Here we
focus on the events that deal with the core XML data model of elements,
attributes, and text. To work with that model, you need to use only a
handful of methods from the ContentHandler interface.
The DefaultHandler Class
As mentioned earlier, this class is a convenient way to start using SAX2
because it provides stubs for many of the handler methods. You can just
override those stubs with methods to do real work. Using DefaultHandler
as a base class is just an implementation option. It's often just as
convenient not to use such a base class. The class is used in this chapter
to avoid explaining handler methods that you don't really need.
In some scenarios, Sun's JAXP requires you to use DefaultHandler as a
base class. That's much more of a restriction than SAX itself makes. If you
stick to using the SAX XMLReader API, as recommended in this book,
you'll still have the option of using DefaultHandler as a base class, but
this policy won't be imposed on your application code. For example, you
can have separate objects to encapsulate policies such as error handling,
so you won't need to hardwire all such policies into a single class.
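One way to keep an error policy in its own object is sketched below. The class name StrictErrors, its policy (count warnings, abort on any error), and the isWellFormed() helper are assumptions made for illustration; only the ErrorHandler interface and the XMLReader calls come from SAX2 itself.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical policy object: warnings are counted, errors stop the parse.
class StrictErrors implements ErrorHandler {
    int warnings;

    @Override
    public void warning(SAXParseException e) {
        warnings++;
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        throw e; // treat recoverable errors as fatal too
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        throw e;
    }

    // The content handler and the error policy are separate objects here.
    static boolean isWellFormed(String xml) throws IOException {
        try {
            XMLReader producer = XMLReaderFactory.createXMLReader();
            producer.setContentHandler(new DefaultHandler());
            producer.setErrorHandler(new StrictErrors());
            producer.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isWellFormed("<stanza><line/></stanza>")); // prints true
        System.out.println(isWellFormed("<stanza><line></stanza>"));  // prints false
    }
}
```

The same StrictErrors object could be shared by different content handlers, which is the point of not hardwiring the policy into a DefaultHandler subclass.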
Example: Elements and Text
Let's use this simple XML document to learn the most essential SAX
callbacks:
<stanza>
    <line>In a cavern, in a canyon,</line>
    <line>Excavating for a mine,</line>
    <line>Dwelt a miner, forty-niner,</line>
    <line>And his daughter Clementine.</line>
</stanza>
This is a simple document, only elements and text, with no attributes,
DTD, or namespaces to complicate the code we're going to write. When
SAX2 parses the document, our ContentHandler implementation will see
events reported for those elements and for the text. The calls will be more
or less as follows; they're indented here to correspond to the XML text,
and the characters() calls show strings since slices of character arrays are
awkward:
startElement ("", "", "stanza", empty)
    characters ("\n    ")
    startElement ("", "", "line", empty)
        characters ("In a cavern, i");
        characters ("n a canyon,");
    endElement ("", "", "line")
    characters ("\n    ")
    startElement ("", "", "line", empty)
        characters ("Excavating for a mine,");
    endElement ("", "", "line")
    characters ("\n    ")
    startElement ("", "", "line", empty)
        characters ("Dwelt a miner, forty-niner,");
    endElement ("", "", "line")
    characters ("\n    ")
    startElement ("", "", "line", empty)
        characters ("And his daughter");
        characters (" Clementine.");
    endElement ("", "", "line")
    characters ("\n")
endElement ("", "", "stanza")
Notice that SAX does not guarantee that all logically consecutive
characters will appear in a single characters() event callback. With this
simple text, most parsers would deliver it in one chunk, but your
application code can't rely on that always being done. Also, notice that
the first two parameters of startElement() are empty strings; they hold
namespace information, which we explain toward the end of this chapter.
For now, ignore them and the last parameter, which is for the element's
attributes.
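Because text can arrive in several pieces, a handler that needs the complete text of an element usually buffers it until the matching endElement() call. The following is a minimal sketch of that idea; the class name LineCollector and its lines field are hypothetical, and it assumes <line> elements contain only text, as in the document above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: joins split characters() chunks per <line> element.
class LineCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if ("line".equals(qName))
            text.setLength(0); // start a fresh buffer for this element
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        text.append(buf, offset, length); // may be called more than once
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("line".equals(qName))
            lines.add(text.toString()); // the text is complete only now
    }

    public static void main(String[] args) throws Exception {
        LineCollector consumer = new LineCollector();
        XMLReader producer = XMLReaderFactory.createXMLReader();
        producer.setContentHandler(consumer);
        producer.parse(new InputSource(new StringReader(
            "<stanza><line>In a cavern, in a canyon,</line></stanza>")));
        System.out.println(consumer.lines); // prints [In a cavern, in a canyon,]
    }
}
```

Whitespace between the <line> elements also lands in the buffer, but it is thrown away when the next <line> starts, so only text inside those elements is kept.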
For our first real work with XML, let's write code that prints only the
lyrics of that song, stripping out the element markup. We'll start with the
characters() method, which delivers characters in part of a character
buffer with a method signature like the analogous java.io.Reader.read()
method. This looks like Example 2-2.
Example 2-2. Printing only character content (a simple example)
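import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Example extends DefaultHandler {
    public void characters (char buf [], int offset, int length)
    throws SAXException
    {
        // print() accepts a String; PrintStream.write() does not
        System.out.print (new String (buf, offset, length));
    }
}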
If you create an instance of this Example class instead of DefaultHandler
in Example 2-1 and then run the resulting program* with a URL for the
XML text shown earlier, you'll see the output.
$ java Skeleton file:///db/sax2/verse.xml

    In a cavern, in a canyon,
    Excavating for a mine,
    Dwelt a miner, forty-niner,
    And his daughter Clementine.

$
* On some systems, the user will need to provide a system property on the command line,
passing -Dorg.xml.sax.driver=..., as shown in the section "Bootstrapping an XMLReader"
in Chapter 3.
You'll notice some extra space. It came from the whitespace used to
indent the markup! If we had a DTD, the SAX parser might well report
this as "ignorable whitespace." (See the section "Other ContentHandler
Methods" in Chapter 4 for information about this callback.) But we don't
have one, so to get rid of that whitespace we should really print only text
that's found inside of <line> elements. In this case, we can use code like
Example 2-3 to avoid printing that extra whitespace; however, we'll have
to add our own line ends since the input lines won't have any.
Example 2-3. Printing only character content (a better example)
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Example extends DefaultHandler {
    // true whenever we are outside of a <line> element
    private boolean ignore = true;

    public void startElement (String uri, String local, String qName,
        Attributes atts)
    throws SAXException
    {
        if ("line".equals (qName))
            ignore = false;
    }

    public void endElement (String uri, String local, String qName)
    throws SAXException
    {
        if ("line".equals (qName)) {
            System.out.println ();
            ignore = true;
        }
    }

    public void characters (char buf [], int offset, int length)
    throws SAXException
    {
        if (ignore)
            return;
        System.out.print (new String (buf, offset, length));
    }
}
With a more complicated content model, this particular algorithm
probably wouldn't work. SAX content handlers are often written to
understand particular content models and to carefully track application
state within parses. They often keep a stack of open element names and
attributes, along with other state that's specific to the particular task
the content handler performs (such as the ignore flag in this example). A
full example of an element/attribute stack is shown later, in Example 5-1.*
In simple cases like this, where namespaces aren't involved, you could
use a particularly simple stack, as shown in Example 2-4. You can use
such an element stack for many purposes. The depth of the stack
corresponds to the depth of element nesting. This feature can help you
debug by allowing you to structurally indent diagnostics. You can also use
the stack contents to make decisions: maybe you want to print line
elements that are from some stanza of a song, but not lines spoken by a
character in a play. To do that, you might verify that the parent element
of the line was a stanza. Make sure you understand how this example
works; once you understand how startElement() and endElement() always
match, as well as how they represent the document structure, you'll
understand an essential part of how SAX works.
Example 2-4. Printing only character content (element stack)
import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Example extends DefaultHandler {
    // names of the currently open elements, innermost on top
    private Stack<String> stack = new Stack<String> ();

    public void startElement (String uri, String local, String qName,
        Attributes atts)
    throws SAXException
    {
        stack.push (qName);
    }

    public void endElement (String uri, String local, String qName)
    throws SAXException
    {
        if ("line".equals (qName))
            System.out.println ();
        stack.pop ();
    }

    public void characters (char buf [], int offset, int length)
    throws SAXException
    {
        if (!"line".equals (stack.peek ()))
            return;
        System.out.print (new String (buf, offset, length));
    }
}
* Whitespace handling in text can get quite messy. XML defines an xml:space attribute that
may have either of two values in a document: "default", signifying that whatever your
application wants to do with whitespace is fine, and "preserve", which suggests that
whitespace such as line breaks and indentation should be preserved. W3C XML Schemas
replace "default" with two other options to provide a partial match for the whitespace
normalization rules that apply to attribute values.
Although they didn't appear in this simple scenario, most startElement()
callbacks will have if/then/else decision trees that compare element
names. Or if you're the kind of developer who likes to generalize such
techniques, you can store per-element handlers in some sort of table and
look them up by name. In both cases, you need to have some way to
handle unexpected elements, and because of XML namespaces, the qName
parameter isn't always what you should check first. One policy is just to
ignore unexpected elements, which is what most HTML browsers do with
unexpected tags. Another policy is to treat them as some kind of
document validity error.
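The table-driven approach might be sketched as follows. This is only an illustration under stated assumptions: the ElementAction interface, the class name, and the run() helper are hypothetical (none of them is part of SAX), and the parser is obtained through JAXP's SAXParserFactory with namespace processing enabled. The lookup key combines the namespace URI and the local name, so the prefix used in the document never matters, and unexpected elements are simply ignored.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DispatchSketch extends DefaultHandler {
    // Hypothetical per-element callback type; not part of SAX.
    interface ElementAction {
        void start(Attributes atts, StringBuilder out);
    }

    private final Map<String, ElementAction> table = new HashMap<>();
    final StringBuilder log = new StringBuilder();

    DispatchSketch() {
        // Elements without a namespace are registered under the empty URI.
        table.put(key("", "song"), (atts, out) -> out.append("song;"));
        table.put(key("", "line"), (atts, out) -> out.append("line;"));
    }

    static String key(String uri, String local) {
        return "{" + uri + "}" + local;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        ElementAction action = table.get(key(uri, local));
        if (action == null) {
            // Policy chosen for this sketch: note and ignore unexpected elements.
            log.append("ignored ").append(local).append(';');
            return;
        }
        action.start(atts, log);
    }

    // Parses the given XML text and returns the log of dispatched elements.
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            DispatchSketch handler = new DispatchSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.log.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Switching to the other policy (treating unexpected elements as errors) would only change the branch taken when the table lookup fails.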
The Attributes Interface
In the previous section, we skipped over the attributes provided with each
element. Let's look at them in a bit more detail.
SAX2 wraps the attributes of an element into a single Attributes object. For
any attribute, there are three things to know: its name, its value, and its
type. There are two basic ways to get at the attributes: by an integer index
(think "array") or by names. The only real complication is that there are two
kinds of attribute name, courtesy of the XML Namespaces specification.
Attribute lookup by name
You often need to write handler code that uses the value of a specific
attribute. To do this, use code that accesses attribute values directly, using
the appropriate type of name as arguments to a getValue() call. If the
attribute name has a namespace URI, you'll pass the URI and the local
name (as discussed later in this chapter). Otherwise you'll just pass a
single argument. A value that is an empty string would be a real attribute
value, but if a null value is returned, no value was known. In such a case,
your application might need to infer some nonempty attribute value. (This
is common for #IMPLIED attributes.)
Consider this XML element:
<billable label="finance"
    xmlns:units="http://www.example.com/ns/units"
    units:currency="NLG"
    >
    25000
</billable>
Application code might need to enforce a policy that it won't present
documents with such data to users that aren't permitted to see
"finance"-labeled data. That might be a meaningful policy for code running in
application servers where users could only access data through the server.
Code to enforce that policy might look like this:
public void
startElement (String uri, String local, String qName, Attributes atts)
throws SAXException
{
    String value;

    // no namespace URI on this attribute, so look it up by its plain name
    value = atts.getValue ("label");
    if ("finance".equals (value) && !userClearedForFinanceData (
            getUser ()))
        throw new SAXException ("you can't see this data");
    ... process the element
}
Other application code might need to know the currency in which the
billable amount was expressed. In this example, this information is provided
using namespace-style naming, so you would use the other kind of
accessor to ensure that you see the data no matter what prefix is used to
identify that namespace:
String currency;

currency = atts.getValue ("http://www.example.com/ns/units",
        "currency");
// what's the best exchange rate today?
There are corresponding getType() accessors, which accept both types of
attribute names, but you shouldn't want to use those. After all, if you
know enough about the attribute to access it by name and to process it,
you should certainly know its type already!
Accessing attribute values or types using an index is faster than looking
up their names. If you need to access attribute values or types more than
once, consider using the appropriate one of the two getIndex() calls to
get and save the index, as well as using the third syntax of the getValue()
or getType() calls (shown in the next section).
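A minimal sketch of that pattern follows, assuming a JAXP-provided parser with namespace processing enabled; the class name and the run() helper are hypothetical. It looks up the units:currency attribute from the earlier example once, by namespace URI and local name, and then uses the saved index for both the value and the type.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class IndexLookupSketch extends DefaultHandler {
    static final String UNITS = "http://www.example.com/ns/units";
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // One lookup by name; getIndex() returns -1 if the attribute is absent.
        int i = atts.getIndex(UNITS, "currency");
        if (i < 0) {
            log.append("no currency");
            return;
        }
        // Both accessors reuse the saved index instead of repeating the lookup.
        log.append(atts.getValue(i)).append('/').append(atts.getType(i));
    }

    // Parses the given XML text and returns what the handler recorded.
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            IndexLookupSketch handler = new IndexLookupSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.log.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because the document in this sketch has no DTD, the reported type is CDATA.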
Attribute lookup by index
You might need to look at all the attributes provided with an element,
particularly when you're building infrastructure components. Here's how you
might use an index to iterate over all the attributes you were given in a
startElement() callback and print all the important information. This code
uses a few methods that we'll explain later when we discuss namespace
support. getLength() works like the length attribute on an array.
Attributes atts = ...;
int length = atts.getLength ();

for (int i = 0; i < length; i++) {
    String uri = atts.getURI (i);

    // Does this have a namespace-style name?
    if (uri.length () > 0) {
        System.out.print ("{ " + uri);
        System.out.print (" " + atts.getLocalName (i) + " }");

    // no namespace
    } else
        System.out.print (atts.getQName (i));

    // value comes from document, or is defaulted from DTD
    System.out.print (", value = " + atts.getValue (i));

    // type is CDATA unless it comes from <!ATTLIST ...> in DTD
    // (println ends the report for this attribute)
    System.out.println (", type = " + atts.getType (i));
}
You'll notice that accommodating input documents that use XML
namespaces has complicated this code. It's important to remember that from the
SAX perspective, attributes can have either of two kinds of names, and
you must not use the wrong kind of name. (The same is true for
elements.) Application code that handles arbitrary input documents will
usually need to handle both types of names, using the logic shown earlier.
It's rarely safe to assume your input documents will only use one kind of
name.
It's often good practice to scan through all the attributes for an element
and report some kind of validity error if a document has unexpected
attributes. (These might include xmlns or xmlns:* attributes, but often it's
best to just ignore those.) This can serve as a sanity check or a kind of
procedural validation. For example, if you validated the input against its
own DTD, that DTD might have been modified (using the internal subset
or some other mechanism) so that it no longer meets your program's
expectations. Such a scan over attribute values can be a good time to
make sure your application does the right thing with any attributes that
need to be #IMPLIED, or have type ID.
Other attributes issues
Attribute values will always be whitespace-normalized as required by the
XML specification. This means that the only whitespace in an attribute will
be space characters or whitespace provided by character references to a
tab, newline, or carriage return. If the type isn't reported as CDATA,
additional normalization is done: leading and trailing spaces are stripped, and
consecutive space characters are replaced by a single space.
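The effect of the declared type on normalization can be observed with a small experiment. This is a sketch, assuming a JAXP-provided parser that reads the internal DTD subset even when it is not validating (the XML specification requires that); the class name and run() helper are hypothetical. It reports the type and value of the first attribute on the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NormalizationSketch extends DefaultHandler {
    String seen = "no attributes";

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if (atts.getLength() > 0) {
            // Brackets make leading and trailing spaces visible.
            seen = atts.getType(0) + ":[" + atts.getValue(0) + "]";
        }
    }

    // Parses the given XML text and returns "TYPE:[value]" for the
    // first attribute of the last element that had any attributes.
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            NormalizationSketch handler = new NormalizationSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.seen;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Feeding it the same attribute text twice, once with an ATTLIST declaring the attribute as NMTOKENS and once with no DTD at all, shows the extra stripping and collapsing only in the declared case.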
If the parser read the DTD, you are able to see the XML attribute type it
declared. The best way to see this type is to use the
DeclHandler.attributeDecl() event, which needs a bit of advance planning. (This
callback is discussed later in the section "The DeclHandler Interface" in
Chapter 4.) Or you can use the Attributes.getType() methods if you can
deal with incomplete reporting for enumerated types. (You won't see the
possible values, and the type will either be NOTATION or NMTOKEN.)
The Attributes object passed to startElement() is only usable during that
callback. If you need access to information found there, you must copy it.
A utility AttributesImpl class is available, with a copy constructor, and is
discussed in Chapter 5 in the section "The AttributesImpl Class."
The methods in the Attributes interface are summarized in Appendix A.
For more information, consult the SAX javadoc.
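Copying might look like the following sketch, which assumes a JAXP-provided parser with namespace processing enabled; the class name and the run() helper are hypothetical. Each startElement() call snapshots its attributes with the AttributesImpl copy constructor, and the snapshots are read only after parsing has finished, when the parser's own Attributes objects can no longer be trusted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SavedAttributesSketch extends DefaultHandler {
    final List<Attributes> saved = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // Snapshot the attributes; the parser may reuse "atts" afterwards.
        saved.add(new AttributesImpl(atts));
    }

    // Parses the given XML text, then lists every saved attribute
    // as "name=value;" in document order.
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            SavedAttributesSketch handler = new SavedAttributesSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));

            StringBuilder out = new StringBuilder();
            for (Attributes atts : handler.saved) {
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append(atts.getLocalName(i)).append('=')
                       .append(atts.getValue(i)).append(';');
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```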
Essential ContentHandler Callbacks
In the earlier code example, we used some callbacks without really
explaining what they did and what their parameters were. This section
provides more details.
In the summaries of handler callbacks presented in this book, the event
signatures are omitted. This is just for simplicity: with a single exception
(ContentHandler.setDocumentLocator()), the event signature is always the
same. Every handler can throw a SAXException to terminate parsing, as
well as java.lang.RuntimeException and java.lang.Error, which any Java
method can throw. Handlers can throw such exceptions directly, or as a
slightly more advanced technique, they can delegate the error-handling
policies to an ErrorHandler and recover cleanly if those calls return
instead of throwing exceptions. (ErrorHandler is discussed later in this
chapter.)
The ContentHandler callbacks include:
void startElement(uri,local,qName,Attributes atts)
void endElement(uri,local,qName)
    These two callbacks bracket element content, starting with
    startElement() to identify the element and provide its attributes.
    Typically, startElement() will be followed by a series of other event
    callbacks to report child content, such as character data and other
    elements. After all children of the element have been reported,
    endElement() reports the end of the element.
    String uri
        For elements associated with a namespace URI, this is the URI.
        For other kinds of elements, this is the empty string.
    String local
        For elements associated with a namespace URI, this is the element
        name with any prefix removed. For other kinds of elements, this
        is the empty string.
    String qName
        This is the element name as found in the XML text, but for
        elements associated with a namespace URI, this might be the empty
        string. (Don't rely on it being nonempty unless the URI is empty,
        or you've configured the parser in "mixed" namespace reporting
        mode as described later in this chapter, in the section "Namespace
        Feature Flags.")
    Attributes atts
        An element's attributes are only provided in the startElement()
        call. The atts object is owned by the parser and is only on
        short-term loan to the event callback. If your application code
        needs to save attribute data, it must make a copy. (The
        AttributesImpl helper class may help.)
    These callbacks appear in pairs unless an exception is thrown to abort
    parsing. Even empty elements (like <this/>) cause two calls.
    Most applications do a lot of work in startElement() callbacks to set
    up further processing, but endElement() work varies. Sometimes
    endElement() does nothing, sometimes it's just a quick state cleanup
    (popping stacks), and sometimes it's where all the work queued
    during an element's processing is finally performed.
void characters (buf, offset, length)
    Text content is provided as a range from a character array.
    Applications will often need to make a copy of this data, appending it
    either to another character array or to a StringBuffer. (Use strings if
    their extra cost is not a problem.) Then the real action to process
    character data would be taken when this callback learns that all the
    relevant characters have been provided, often because of a
    startElement() or endElement() call.
    char buf[]
        A character array holding the text being provided. You must
        ignore characters in this buffer that are outside of the specified
        range.
    int offset
        The index of the first character from the buffer that is in range.
    int length
        The number of text characters in the range, beginning at the
        specified offset.
    Application code must expect multiple sequential calls to this method.
    For example, it would be legal (but slow) for a parser to issue one
    callback per character. Content found in different external entities
    will be reported in different characters() invocations so location
    information is reported correctly. (This is described in the section
    "The Locator Interface" in Chapter 4.) Most parsers have only a
    limited amount of buffer space and will flush characters whenever the
    buffer fills; flushing can improve performance because it eliminates a
    need for extra buffer copies. Excess buffer copying is a classic
    performance killer in all I/O-intensive software.
    The XML specification guarantees that you won't see CRLF- or CR-style
    line ends here. All the line ends from the document will use
    single newline characters (\n). However, some perverse documents
    might have placed character references to carriage returns into their
    text; if you see them, be aware that they're not real line ends!
There are many other methods in the ContentHandler interface, discussed
later in the section "Other ContentHandler Methods" in Chapter 4.
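The buffering advice given for characters() can be sketched like this, assuming a JAXP-provided parser with namespace processing enabled. The class name and run() helper are hypothetical, and StringBuilder stands in for the StringBuffer mentioned in the text. Text is accumulated across however many characters() calls the parser chooses to make, and is only acted on when endElement() shows that a line is complete.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TextCollectorSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if ("line".equals(local)) {
            text.setLength(0);          // start a fresh buffer for this line
        }
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        // Copy only the range we were given; the rest of buf is off limits.
        text.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("line".equals(local)) {
            lines.add(text.toString()); // all text for the line has arrived
        }
    }

    // Parses the given XML text and returns the collected lines joined by "|".
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            TextCollectorSketch handler = new TextCollectorSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return String.join("|", handler.lines);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The sketch gives the same result whether a parser delivers a line's text in one call or splits it, for instance around an entity reference such as &amp;.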
Producer-Side Validation
All uses of SAX2 parsers will involve extending and customizing the basic
scenario we saw earlier. Our next example illustrates two basic
configuration mechanisms: error handling options, which let you use the
appropriate policy when you see errors, and parser configuration through
feature flags, which let you control some details of how the parser works.
(Some event handlers are managed with a configuration mechanism that is
quite similar to the feature flag mechanism.) The example also shows how
SAX2 parsers expose the core XML notion of DTD-based validation.
You will often tell XML parsers to validate XML as they produce events.
Because SAX2 provides access to most of the data in XML documents,
including declarations from DTDs, it also supports performing such
validation on the event consumer side, possibly with a cached DTD or schema.
(The consumer side is the only place to perform procedural validation.)
Such consumer-side validation can be important when you're trying to
make your program output meet the constraints of a particular information
interchange agreement; just add a streaming validation stage to your
output processing. This approach can also be used for DOM revalidation and
similar purposes. Here, we look at how to validate data that is already in
the form of XML text.
Keep in mind that some important DTD-related processing does not
involve validation. Documents with DTDs can use entity substitution for
document modularity and text portability, and can have attributes
defaulted and normalized. Validation with DTDs only involves checking a
set of rules. Disabling DTD validation turns off only the rule checks, not
the processing for entities and attributes.
SAX2 Feature Flags
SAX2 exposes many parser behaviors, including DTD validation, using a
feature flag mechanism. These flags are Boolean settings, which may
have values or be unspecified. Parsers can have up to four different
modes for any feature flag. For example, with the validation flag SAX2
implies four kinds of XML parsers:
Optionally validating parsers
    The feature flag is read/write and can be either true or false. If it's set
    to false, few nonfatal errors will be reported and parsing will be a bit
    faster (maybe 5 or 10 percent of the cost of parsing XML, which is
    usually negligible to start with).
Nonvalidating parsers
    The feature flag is read-only and always false. Some nonfatal errors
    might be reported (the XML specification demands them in some
    cases).
Always validating parsers
    The feature flag is read-only and always true. Validity errors are
    always reported as nonfatal. (By default, such errors are ignored; see
    the section "Handling Validity Errors" later in this chapter.)
Unknown validation behavior
    The feature flag is not recognized, so its value can't be determined.
    (This mode is uncommon for the SAX2 validation flag, but you'll see it
    with other feature flags.)
Validity and XML
Validation is particularly important when you are interchanging
documents that have been wholly or partially authored by hand, but it
can also be helpful when working with XML that's generated by
custom code. When you validate an XML document, you ensure that it
meets certain rules needed to process it, such as requiring a
<title> element as the first child of every <chapter> element or
prohibiting dangling internal cross-references.
Validation is done at several levels in most applications. Lower levels
tend to use rule-based logic, such as the DTD validation that's
defined by XML 1.0. The various types of XML schema provide
different kinds of rule-based sanity checks, which are usually done
before applications see the data. (W3C's schemas also extract
additional information items, beyond the XML data model of elements,
attributes, and text. This information is called the Post-Schema-Validation
Infoset or PSVI.) Higher-level validation processes tend to
involve richer notions of data validity and tend to be expressed as
"procedural logic." For example, "business logic" often involves ad
hoc relationships, policies, and heuristics; it relies on information not
normally expressible by DTD or schema-style rules. Such logic is
often captured in application-level methods. As a rule, no single
data validation technology is sufficient for all purposes.
Your development process should try to ensure that you create only
valid documents; you will likely send XML to applications that don't
handle invalid data very well. Safe operational practice involves
validating all documents received from other parties and accepting the
small costs involved. (Use local copies of DTDs, or schemas, to
avoid depending on remote files that might disappear. Techniques to
achieve this are discussed in the section "The EntityResolver
Interface" in Chapter 3.) The cost of rule-based validation is usually
smaller than routine system load variations for real applications;
even in parsing speed benchmarks it's rarely high. It's usually worth
the cost since it can prevent someone else's data from accidentally
breaking your software. Validation against a good DTD (or schema)
provides a useful base level of input data checking, but it will rarely
be sufficient.
Later in this chapter, we look at the feature flags used to characterize
namespace processing. Those flags are not optional, so fewer potential parser
modes are possible. All the standardized feature flags are detailed in the
section "XMLReader Feature Flags" in Chapter 3.
In SAX, URIs identify feature flags. These are used purely as unique
identifiers. This is the same approach used in XML namespaces: don't use these
URIs to retrieve data, even if they do look like URLs you could type into a
browser. The URI http://xml.org/sax/features/validation identifies the flag
controlling validation.
To check how a given XML parser handles validation, use code similar to
Example 2-5. Code for any other kind of parser feature will look much the
same, as long as you use the correct ID for the feature flag; you'll see the
same exception types working in the same way. (The same is true for
parser properties, which you'll see in the section "XMLReader
Properties" in Chapter 3.)
Example 2-5. Checking for validation support
XMLReader producer;
String uri = "http://xml.org/sax/features/validation";

// ... get the parser

// Try getting and setting the flag
try {
    System.out.println ("Initial validation setting: "
        + producer.getFeature (uri));

    // if we get here, validation behavior is known
    producer.setFeature (uri, true);

    // if we get here, the parser either validates by
    // default or is optionally validating

} catch (SAXNotSupportedException e) {
    // value not supported; parser is nonvalidating
    System.out.println ("Can't enable validation: "
        + e.getMessage ());
    System.exit (1);

} catch (SAXNotRecognizedException e) {
    // feature not understood; parser has weak SAX2 support.
    // maybe it's a SAX1 parser inside a ParserAdapter
    System.out.println ("Doesn't understand validation: "
        + e.getMessage ());
    System.exit (1);
}
URIs = URLs + URNs
The use of URIs in XML namespaces has been confusing, and since
SAX2 also uses URIs to identify parser feature flags and properties,
the same sort of confusion can show up. Think of URIs as names:
you can talk about "Fred" even if he's not there, or about "Godot"
even if he may not exist, and "the third house on the left" probably
makes sense to someone standing at your side.
Classically, a Universal Resource Identifier (URI) is either a Universal
Resource Locator (URL) or a Universal Resource Name (URN). Both
types of URIs are represented as strings. You're used to seeing URLs
in web browsers; they serve as detailed addresses. They often look
like http://www.example.com/ but they may use other URI schemes;
for example, they may use https:, ftp:, and file:. The scheme indicates
the way to access the resource. URNs use URI schemes that start
with urn:. You probably have not seen many URNs; one example is
urn:uuid:221ffe10-ae3c-11d1-b66c-00805f8a2676. URN schemes
(like uuid in this example) describe what the resource is, more than
how to access it.
Filenames are never URIs, but you can convert a filename into a URL
(hence URI) that works on systems where the original filename was
legal. Just to be confusing, there are also "relative URIs," which
often look like POSIX-style filenames. Like filenames, relative URIs
should never be handed directly to a SAX parser or be used as
namespace identifiers.
With XML namespaces and SAX2, the term URI is used to emphasize
that the string is being used as a pure identifier: it's more like a URN
than a URL, even when the URI is syntactically a URL. It's explicitly
irrelevant whether any resource is actually associated with the URI.
Don't assume you can fetch resources using those URIs.
As a rule, programs will probably set the validation flag to true only when
they really need reports of validity errors. (Why? As we'll see in a moment,
it's natural to ignore reports of validity errors when they're not important,
so it doesn't much matter if you validate when you don't need to.) The
skeleton program in Example 2-1 really just needs a setFeature() call and
a small update to the diagnostic message, to be sure it's always validating.
(The diagnostics could be more precise using some more-specialized
exceptions that we haven't discussed yet.)
// Get an instance of the default XML parser class
try {
    producer = XMLReaderFactory.createXMLReader ();
    producer.setFeature (
        "http://xml.org/sax/features/validation",
        true);
} catch (SAXException e) {
    System.err.println (
        "Can't get validating parser, check configuration: "
        + e.getMessage ());
    return;
}
The validation feature flag is probably the most widely used, with the
possible exception of the flags controlling namespace handling. Most parsers
leave validation off by default to save some minor parsing overhead.
Handling Validity Errors
If you modify the skeleton program to set the parser's validation flag and
then run it on a well-formed but invalid document (perhaps one without a
DTD), you will probably be surprised to discover that it doesn't seem to
report any errors. That's exactly what should happen since it's the default
behavior specified by SAX. To make validity errors cause anything
interesting to happen, you have to change how they're handled. If you don't
change this handling, you won't be able to tell a validating parser apart from
a nonvalidating one!
The simplest way to change the handling of validity errors is to make
them work just like well-formedness errors: by aborting the parse. This
uses the ErrorHandler interface that we look at later in this chapter, in the
section "ErrorHandler Interface," but for now it's simpler to focus on one
method. In terms of the skeleton program shown earlier, such a change
can be an update to just one line, using an anonymous inner class to
make the code look simple. (Of course, avoid using anonymous classes
for anything complex; they can make code hard to maintain.)
// Get a consumer for all the parser events
consumer = new DefaultHandler () {
public void error (SAXParseException e)
throws SAXException
{ throw e; }
};
XML parsers call ErrorHandler.error() whenever they find a validity
error, or when they see certain other nonfatal errors. In this case, our
custom handler adopts a policy that whenever it sees such an error, it will
abort the parse by throwing the exception reported to it. Later in this
chapter we look at some alternative policies.
When your callback detects serious application-level errors, you can throw
a SAXException from any SAX event handler callback to abort parsing.
That doesn't have to be done only from an ErrorHandler. For example,
when input data is valid XML but doesn't meet essential semantic
requirements of the application, report it using some kind of SAXException. If
your code only knows how to process shipping invoices, then greeting
cards should be rejected immediately.
Exception Handling
Exceptions are the primary way that SAX event consumers communicate
to event producers; this is the reverse of the typical communication
pattern (from producer to consumer). We'll look at SAX exceptions before we
delve more deeply into either producers or consumers. We'll look at the
several types of exceptions that might be thrown, the error handler
interface that lets your code decide how to handle errors, and then how these
normally fit together.
Keep this rule of thumb in mind: when a SAX handler throws any
exception, including a java.lang.RuntimeException or a java.lang.Error,
parsing stops immediately. The exception passes through the parser and is
thrown by XMLReader.parse(). Beyond some possible additional error
reports, the only additional event callback should be
ContentHandler.endDocument(). This method is always called before parsing
finishes, even after errors, to ensure it can be used for cleaning up. (That
callback is presented in Chapter 4, in the section "Other ContentHandler
Methods.")
SAX2 Exception Classes
There are four standard exception classes, with a common base class used
in the signature for all handler methods. The parse() methods, as well as
the EntityResolver interface presented in the section "The EntityResolver
Interface" in Chapter 3, can also throw java.io.IOException to indicate
problems unrelated to XML text content. You will find that many XML APIs are
declared the same way; for example, JAXP parser methods may throw
such exceptions even if they don't expose SAX events directly. See
Appendix A for method summaries for these exception classes.
org.xml.sax.SAXException
    This is the base exception class. Typically you will see its subclasses.
    These exceptions have messages and may wrap other exceptions for
    diagnostic purposes. When an application's event callback catches an
    exception it's not permitted to throw, it can wrap it in one of these
    exceptions and then throw that exception. Every SAX2 event callback
    can throw a SAXException, although most callback examples in this
    book won't demonstrate this.
org.xml.sax.SAXNotRecognizedException
    This exception is thrown when the parser does not understand the
    URI identifying a feature or property you tried to access. Most
    processors recognize the standard IDs, so if you're trying to use those and
    you get this exception, make sure you're using the correct URI.
org.xml.sax.SAXNotSupportedException
    These exceptions are typically used to indicate that an XMLReader
    property or feature value you tried to change was recognized, but the
    value you requested isn't supported. Reasons this might be reported
    include setting a property to an illegal value (such as the wrong type
    of handler) and trying to set a feature or property that is read-only in
    a given implementation (or when the request is made). For instance,
    it's not possible to ask a parser to stop validating in mid-parse, but for
    some parsers it's reasonable to do so before starting to parse a
    document.
org.xml.sax.SAXParseException
    This is the most commonly seen exception class; instances provide
    detailed diagnostic information, such as the base URI of a file with
    bad XML content, and the line and column number of such content.
    XML parsers provide such exceptions when they report errors to
    ErrorHandler implementations.
    Applications can also construct this information when reporting
    application-level errors through SAX callbacks. In fact, they probably
    should do so, providing a Locator object to the constructor (and
    perhaps wrapping an exception to identify a root cause) in order to
    provide good diagnostics. (See the section "The Locator Interface" in
    Chapter 4 for information about Locator objects.)
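As a rough sketch of that advice, the handler below rejects elements it does not know how to process and attaches the parser's Locator to the exception, so the diagnostic carries a line number. It assumes a JAXP-provided parser that supplies a Locator (parsers are encouraged, not required, to do so); the class name, the accepted element names, and the run() helper are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocatedErrorSketch extends DefaultHandler {
    private Locator locator;            // stays null if the parser gives none

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        // This sketch only knows how to process invoices and their items.
        if (!"invoice".equals(local) && !"item".equals(local)) {
            throw new SAXParseException("unexpected element: " + local, locator);
        }
    }

    // Returns the line number of the rejected element, or -1 if the
    // whole document was accepted.
    static int run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new LocatedErrorSketch());
            reader.parse(new InputSource(new StringReader(xml)));
            return -1;
        } catch (SAXParseException e) {
            return e.getLineNumber();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```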
The wrapped exception is a powerful tool. You might be familiar with
this mechanism from the new JDK 1.4 "Chained Exception" facility or the
older java.lang.reflect.InvocationTargetException exception mechanism.
(The JDK 1.4 getCause() method exposes essentially the same
functionality as the SAX getException(), though it builds on new JVM features to
add intelligence to exception printing.) While parsers may use it internally,
you'll likely want to use it to ensure higher-level software will see the root
cause of some SAXException your handler reported:
// in some SAX event handler:
try {
... application specific stuff ...
} catch (MyApplicationException cause) {
throw new SAXException ("it broke!", cause);
// or better yet: throw new SAXParseException
// ("broke", locator, cause)
}
If you print the stack backtrace of such a SAXException, you'll see two
stacks, starting with the root cause. Being able to see that root cause
information can be a real lifesaver when debugging. And some application
error recovery strategies will use the SAXException.getException()
method to find the root cause and then determine how to recover from it.
For example, if the application exception identified some resource that
was unavailable, higher levels in the application might be able to use that
information to choose an alternative resource and restart processing.
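A recovery path of that kind might start out like the sketch below. Everything application-specific here is hypothetical: ResourceUnavailableException, lookUp(), and the run() helper are invented for illustration, and the parser is assumed to come from JAXP. The handler wraps the application exception in a SAXException, and the caller of parse() digs the root cause back out with getException().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class RootCauseSketch extends DefaultHandler {
    // Hypothetical application-level exception.
    static class ResourceUnavailableException extends Exception {
        ResourceUnavailableException(String message) {
            super(message);
        }
    }

    // Hypothetical application work that always fails in this sketch.
    private void lookUp(String name) throws ResourceUnavailableException {
        throw new ResourceUnavailableException("no resource for " + name);
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        try {
            lookUp(local);
        } catch (ResourceUnavailableException cause) {
            // Callbacks may only throw SAXException, so wrap the cause.
            throw new SAXException("it broke!", cause);
        }
    }

    // Returns the root cause's message, or "ok" if parsing succeeded.
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new RootCauseSketch());
            reader.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXException e) {
            Exception root = e.getException();   // null if nothing was wrapped
            return root == null ? "no root cause" : root.getMessage();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

A real application would inspect the type of the root cause at this point and decide whether to retry with an alternative resource.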
ErrorHandler Interface
Normally, you will configure SAX event-processing code to use a
specialized implementation of ErrorHandler to process faults that are uncovered
during parsing. This is done with the XMLReader.setErrorHandler() call.
This interface has three methods; you saw one of them in an earlier
example. The interface is used to encapsulate an error-handling strategy. The
primary choices you have to make are whether to ignore an error or to
abort parsing, and whether to emit diagnostics. Those strategies are driven
by the severity of the problem, as exposed by which method is used to
report it, though sometimes exception-typing may give programs
information about exactly what error was detected.
void error (SAXParseException e)
This method is used to report errors that arent expected to be fatal.
The best-known example is violation of XML validity constraints, but
some other XML errors are nonfatal too. Many kinds of application-
level errors (as reported by event-consumer logic, not XML parsers)
will fall into this category, and most parsers use this callback to report
violations of namespace constraints (such as referring to an unde-
clar ed namespace prex).
When validating, applications often adopt a policy of treating these
err ors as if they were fatal, or generating a diagnostic for every such
err or. By default, all nonfatal errors are ignor ed. That default will be
a big surprise, if you expect a validating parser to stop parsing when
it sees validation errors. You have to override the default error-han-
dling policy if you want such behavior.
void fatalError (SAXParseException e)
This method is used to report errors, typically violations of well
for medness, that are fatal. Some XML parsers may be able to continue
pr ocessing after reporting such errors, but only to report additional
err ors. The XML specication itself requir es that no more data will be
reported after a fatal error.
By default, fatal errors cause parsing to stop; the parse() method will
retur n. This method is often used to provide a diagnostic or to log
the exception. After it does that, it has two main choices: throw the
parameter to terminate processing or retur n. Most parsers will treat a
retur n as equivalent to throwing the parameter to terminate parsing.
Some XML parsers continue checking for errors; in such cases, they
ar ent allowed to call any handlers other than the Err orHandler.
void warning (SAXParseException e)
This method is used to report problems that aren't errors. Such
situations are specific to the software that reports the warning; unlike
fatal and nonfatal errors, the XML specification doesn't place
requirements on reporting such situations. XML infrastructure software
may generate warnings for any reason at all (much like many pet dogs I
have known) and yet be fully compliant with the XML specification.
By default, warnings are ignored. Applications typically ignore them,
or print low-priority diagnostics. Because there is such variability in
what generates a warning, it is probably not useful to put a "no
warnings allowed" policy into software (by treating this like a fatal
error); users have to decide on a warning-by-warning basis whether to
ignore it or treat it as significant.
Event consumers can also use this API to provide a standard way to report
faults uncovered in layers above pure XML, for instance, when data in
element content or an attribute value is invalid or corrupt. When both the
application and the SAX-related components use the same ErrorHandler
instance to handle error-reporting policy issues, maintaining that policy
is easier. For example, developers like being able to collect lots of error
reports with one test run rather than getting only one error per run; it
can be more effective to resolve problems in groups, with shorter test
cycles. You can do that with SAX by saving the exceptions (or their
associated diagnostics) as they're reported. The same flexibility can be
important in production systems.
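One way to save reports for later is sketched here. This is only an illustration: the class name CollectingErrorHandler and its getReports() accessor are hypothetical, not part of SAX. It keeps every warning and nonfatal error, and still terminates parsing on a fatal error.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

/** Hypothetical handler that saves reports instead of stopping at the first. */
public class CollectingErrorHandler implements ErrorHandler
{
    private final List<SAXParseException> reports = new ArrayList<> ();

    // warnings and nonfatal errors are recorded, and parsing continues
    public void warning (SAXParseException e) { reports.add (e); }
    public void error (SAXParseException e) { reports.add (e); }

    // fatal errors are recorded too, then rethrown to end the parse
    public void fatalError (SAXParseException e)
    throws SAXParseException
    {
        reports.add (e);
        throw e;
    }

    /** Read-only view of everything reported so far, in order. */
    public List<SAXParseException> getReports ()
        { return Collections.unmodifiableList (reports); }
}
```

After parse() returns or throws, the application can walk getReports() and present all the diagnostics from the run together.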
An ErrorHandler can throw any SAXException it wants; it doesn't have to
be the SAXParseException passed as its argument. Don't throw a different
exception unless you find a certifiably excellent reason to do so; to
discard that original exception just makes problems become harder to
troubleshoot. One such reason might be to report a "double fault," in which
you triggered another exception while handling the first one. (Operating
systems sometimes panic in such cases, so there's no reason applications
shouldn't do so too!)
JAXP also uses this handler to report errors when building DOM
documents; SAXException objects may be thrown to terminate parsing after a
DOM parser finds a problem, if the application chooses to handle those
errors. Most DOM implementations in Java use SAX parsers to populate
their DOM tree, so this is natural behavior. (JAXP only specifies a
SAX-compatible way to present and report such errors. They might be
reported from a non-SAX parser.)
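A minimal sketch of that JAXP usage follows. It assumes the DOM builder supplied by the Java runtime; the class DomErrorDemo and the method firstFatal() are hypothetical names chosen for the illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DomErrorDemo
{
    /** Returns the first fatal-error message seen while building, or null. */
    public static String firstFatal (String xml) throws Exception
    {
        final String[] seen = new String [1];
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance ().newDocumentBuilder ();

        // the same SAX ErrorHandler interface is used for DOM building
        builder.setErrorHandler (new ErrorHandler () {
            public void warning (SAXParseException e) { }
            public void error (SAXParseException e) { }
            public void fatalError (SAXParseException e)
            throws SAXParseException
            {
                if (seen [0] == null)
                    seen [0] = e.getMessage ();
                throw e;        // terminate the DOM build
            }
        });
        try {
            builder.parse (new InputSource (new StringReader (xml)));
        } catch (SAXException e) {
            // parsing was terminated; no document is available
        }
        return seen [0];
    }

    public static void main (String[] args) throws Exception
    {
        System.out.println (firstFatal ("<a><b></a>"));
    }
}
```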
Errors and Diagnostics
When you see a SAXException, it'll normally have a message you'll use for
diagnostics, like any exception. It'll also have a stack backtrace, which
will help when you're debugging, like any exception; in some cases you
might even see a nested "root cause" exception. At this time, standard
methods only tell an error's severity; there's no way to distinguish
different validity errors from each other, for example.
You can get better diagnostics when the exception is really a
SAXParseException, and give accurate information about exactly where the
error appeared. SAX parsers normally provide such data when reporting
parsing errors, and applications can do the same thing by avoiding the more
generic SAXException. With non-GUI applications, I often use code like
that shown in Example 2-6 to present the most important diagnostic data.
Example 2-6. Getting diagnostics from a SAXParseException
static private String printParseException (
    String label,
    SAXParseException e
) {
    StringBuffer buf = new StringBuffer ();
    int temp;

    buf.append ("** ");
    buf.append (label);
    buf.append (": ");
    buf.append (e.getMessage ());
    buf.append ("\n");
    // most such exceptions include the (absolute) URI for the text
    if (e.getSystemId () != null) {
        buf.append (" URI: ");
        buf.append (e.getSystemId ());
        buf.append ("\n");
    }
    // many include approximate line and column numbers
    if ((temp = e.getLineNumber ()) != -1) {
        buf.append (" line: ");
        buf.append (temp);
        buf.append ("\n");
    }
    if ((temp = e.getColumnNumber ()) != -1) {
        buf.append (" char: ");
        buf.append (temp);
        buf.append ("\n");
    }
    // public ID might be available, but is seldom useful
    return buf.toString ();
}
It's natural to call such code in two places. One place is after you've
caught an exception of this type, in a try block. That's a bit awkward
and error prone; you'll need to have two different catch clauses, first for
SAXParseException and then for SAXException, or else use a cast. The
more natural place is centralized in an ErrorHandler that can treat
generating diagnostics as one of several options for processing errors, as
shown in Example 2-7. In fact, it's the only way to generate diagnostics
for nonfatal errors, or for warnings, without treating them as fatal
errors; or to centralize your error-handling policy to make it easily
configurable.
Example 2-7. Customizable diagnostic error handler
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

public class MyErrorHandler implements ErrorHandler
{
    int flags;

    // bit mask values for flags
    public static final int ERR_PRINT = 1;
    public static final int ERR_IGNORE = 2;
    public static final int WARN_PRINT = 4;
    public static final int FATAL_PRINT = 8;
    public static final int FATAL_IGNORE = 16;

    // default: every flag set (print everything, ignore everything)
    MyErrorHandler () { flags = ~0; }
    MyErrorHandler (int flags) { this.flags = flags; }

    public void error (SAXParseException e)
    throws SAXParseException
    {
        if ((flags & ERR_PRINT) != 0)
            System.err.print (printParseException ("Error", e));
        if ((flags & ERR_IGNORE) == 0)
            throw e;
    }

    public void fatalError (SAXParseException e)
    throws SAXParseException
    {
        if ((flags & FATAL_PRINT) != 0)
            System.err.print (printParseException ("FATAL", e));
        if ((flags & FATAL_IGNORE) == 0)
            throw e;
    }

    public void warning (SAXParseException e)
    throws SAXParseException
    {
        if ((flags & WARN_PRINT) != 0)
            System.err.print (printParseException ("Warning", e));
        // always ignored
    }

    // printParseException() method (above) is part of this class
}
Such an error handler gives you flexibility about which errors to report
and how to handle the various types that show up. A silent mode of
operation might never print diagnostics, a verbose one might print all of
them, and a different default could be somewhere in between. A defensive
operational mode might terminate XML processing when it sees any error;
a permissive one might try to continue after every error. The default
shown is verbose and permissive.
To use such an error handler for handling application-specific
SAXExceptions, you'll need to adopt the same classifications that SAX
derives from XML: fatal errors, nonfatal errors, and warnings. That's
usually pretty natural, particularly if application configuration flags
control which potential error cases are tested.
Namespaces and SAX2
However you use XML namespaces with SAX, you need to understand the
core concepts discussed in this section. Namespaces can be confusing;
they're more complex than perhaps they ought to be. In part this is
because of how they interact (or don't interact) with other parts of
"Greater XML"; in part it's because everyone has different ways to
determine what words mean, and XML names are kinds of words. We'll look at
some of those complexities first, and then at the mechanisms SAX2 has to
help you deal with them.
But first, just what are namespaces supposed to do? Usually, they identify
some particular technical vocabulary. People often reuse words rather
than create new ones, and they acquire context-specific meanings and
nuances that can be extremely important. A namespace can distinguish
whether a word like "bill" refers to part of a bird, a now-archaic weapon,
part of a hat, legislative acts, or a number of other things. So a <bill
length="45cm"/> element might be associated with a namespace, which
provides context that should help applications interpret the element. A
processor for "Birder's Markup Language" could know to reject (or ignore)
markup intended for legislative or financial uses, even if they all use
bill elements.
XML defines a way to declare namespaces as needed, using attributes.
Namespaces are usually indicated by a prefix, which can serve as a
qualifying adjective: the bird's bill might be bird:bill while the
consultant's bill might be consultant:bill. You can also set up a default
element namespace so that an unadorned bill element might indicate, for
example, a weapon.
What Namespaces Do to XML
XML namespaces are a convention for using attributes to associate URIs
with some element and attribute names. Since not all legal XML
documents follow this convention, the XML Namespaces specification
effectively specifies a dialect of XML. SAX2 supports both dialects: strict
XML and XML plus namespaces. By default, SAX2 parsers expect the
namespaces dialect. In most cases you'll be able to ignore the difference
between those two XML dialects, since documents that use XML in
namespace-incompatible ways aren't common.
Even apart from the two-dialects issue, the use of namespaces with XML
complicates XML programming. There are two models for using element
and attribute names in XML:
In one model, names have a single role (or type, or meaning)
throughout the document. This is the model most XML DTDs use,
except they allow different attributes to use the same name with
different meanings if the attributes are attached to different element
types. (But even for attributes, well-designed DTDs share attribute
definitions between elements.) The startElement() callback parameters
give you all the information you need, even when those names
are globalized using namespace URIs.
In the other model, a name's role is dependent on context. For example,
the same name used in two different enclosing elements might
mean two different things. It gets confusing, just like names in the
real world. This model is used with elements in some schema systems,
such as local elements in W3C-style XML schemas.
If you're working with or designing XML structures with
context-dependent names, then namespaces add new kinds of context and hence
new ways to cause confusion. SAX2 gives you the tools to track all the
context, but you'll have to record it yourself (probably with some kind of
stack) since startElement() parameters will no longer give all the context
you need.
There are also some conflicts between the element-naming approach of
the XML Namespaces specification and DTD validity as defined in the
XML specification. They may not affect your SAX2 programs, but can
affect the systems you're implementing with XML and SAX2. The issue is
basically that DTDs expect everything to be declared once up front (like
import statements in Java), while the namespace mechanism provides a
lexical scoping mechanism (like declaring variables that live on the
execution stack) that's flexible about what a given prefix indicates. You
can make namespace-correct documents that are DTD-valid, but then you
can't change the prefixes bound to namespaces.* Namespace-aware DTDs
will often define default element namespaces for element names.
If you are designing a namespace and want to use the URI to publish
information describing the namespace, rather than just use it as a unique
identifier, then RDDL (http://www.rddl.org) is probably a good resource.
RDDL defines an XHTML-based document syntax that can be viewed or
mechanically processed. It lets you find some of the resources that might
be important when working with the namespace: for example, different
stylesheets and schemas and documentation in various languages. The
RDDL web site includes SAX support for accessing this data.
Element and Attribute Naming with Namespaces
The direct impact of XML namespaces on your SAX2 application code is
to give you a second way to identify elements and attributes. Documents
will normally use only one identification style for a given element or
attribute. These identification styles are distinct from the two models for
using such names, described earlier:
Qualified names
These are exactly as found in the XML text. Examples include para
and, with a prefix, xhtml:p. (XML documents that don't use namespaces,
and some namespace-style documents, won't use colons.)
Universal names
These consist of two separate strings: a "local name" from the XML
text (removing any namespace prefix) and a "namespace name"
(always a URI) from namespace declarations. For the qualified name
xhtml:p, the local name is p, and the namespace name is the URI
associated with the prefix xhtml, which is found in the namespace
declaration. Such names are in a sense "universalized" by addition of
a suitable URI.
* If you want any flexibility in those prefixes, and have a deep understanding of how to use
parameter entities, look at the approach to DTD modularization found in the XHTML 1.1
specification.
Note that the XML Namespaces specification only standardizes the
"qualified name" (qName) terminology; it doesn't standardize terminology
for universal names. Because of this, you will also see other terms, such as
"expanded names" (the term used by XPath) or "namespace-style names"
(used to talk about that style of naming).
Since ContentHandler.startElement() callbacks now have to deal with
three different kinds of name strings, the code can get rather complicated.
Plus, even if you're expecting only universal names, you'll need to notice
when elements or attributes don't have universal names and use qualified
names to work with them. Element names are identified in method
parameters (the same as in ContentHandler.endElement()), while attribute
names show up in accessor methods for Attributes objects. We'll use the
following XML text to illustrate these different types of names:
<big:animals xmlns="http://www.example.com/dog"
        xmlns:big="http://www.example.com/big">
    <wolfhound cat="no" big:dog="yes"/>
    <greyhound big:dog="yes" xmlns=""/>
</big:animals>
SAX2 calls names in XML text "Qualified Names." These are the same
thing as XML 1.0 names except that XML 1.0 names have no restrictions
on the use of colons. When you disable namespace processing in a SAX2
parser, it will deliver qualified names that are really XML 1.0 names,
without those restrictions. With namespace processing enabled, many
qualified names (including every name with a prefix) will correspond to a
namespace-style name.
Element names without a prefix might not have a corresponding universal
name. Unprefixed attribute names will never have a universal name. In
those cases, applications must use the qualified name along with
non-namespace context, such as the enclosing element, to figure out what the
name is supposed to mean. There are no universally accepted policies for
such cases. Yes, all that confuses other people as well.
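To see these name styles side by side, a small sketch like the following can parse the sample text and record how each element is identified. It assumes the JAXP parser bundled with the Java runtime and turns on the namespace-prefixes flag so that a qName is always available; the class NameDump and the method elementNames() are hypothetical names for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NameDump
{
    static final String SAMPLE =
        "<big:animals xmlns='http://www.example.com/dog'"
        + " xmlns:big='http://www.example.com/big'>"
        + "<wolfhound cat='no' big:dog='yes'/>"
        + "<greyhound big:dog='yes' xmlns=''/>"
        + "</big:animals>";

    /** One entry per element: "{uri}localName", or the qName if no URI. */
    public static List<String> elementNames (String xml) throws Exception
    {
        final List<String> names = new ArrayList<> ();
        SAXParserFactory factory = SAXParserFactory.newInstance ();
        factory.setNamespaceAware (true);   // JAXP factories default to false
        XMLReader reader = factory.newSAXParser ().getXMLReader ();

        // guarantee that qualified names are reported
        reader.setFeature (
            "http://xml.org/sax/features/namespace-prefixes", true);
        reader.setContentHandler (new DefaultHandler () {
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
            {
                if ("".equals (uri))
                    names.add (qName);      // no universal name
                else
                    names.add ("{" + uri + "}" + localName);
            }
        });
        reader.parse (new InputSource (new StringReader (xml)));
        return names;
    }

    public static void main (String[] args) throws Exception
    {
        for (String name : elementNames (SAMPLE))
            System.out.println (name);
    }
}
```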
Element naming
The identifiers for the element names are the first three parameters of
void startElement(String namespaceURI, String localName, String qName,
Attributes atts). Table 2-1 shows the values of the element names for
the previous example, as reported by a SAX2 parser in its default mode.
Notice particularly that the namespace URI is empty except when a
namespace declaration applies to that element name, and that if there's a
nonempty namespace URI, there might not be a value for qName. That's
not just for element names using namespace prefixes; for element names,
a default element namespace declaration will apply if it's within scope.
(Remember that empty strings aren't the same as nulls.)
Table 2-1. ContentHandler.startElement() parameters for element names
namespaceURI                 localName    qName
http://www.example.com/big   animals      empty or big:animals
http://www.example.com/dog   wolfhound    empty or wolfhound
empty                        empty        greyhound
Example 2-8 shows a decision tree for startElement(). You could end up
with lots of code like that in your SAX event handlers. Or, you may prefer
to factor it as a table lookup (maybe using application-specific types of
handler objects) rather than as a tree of comparisons. Notice that for
elements without a namespace URI, the qName is checked, but if there's a
namespace URI, then localName is used. Also all unrecognized elements are
reported as a kind of validity error. You may well need to have more
context-dependent logic too, if elements may only show up in appropriate
contexts. Such contexts often need different decision trees.
Example 2-8. Decision tree in startElement( )
// errorHandler and locator are assumed to be fields of the enclosing class
public void
startElement (String uri, String localName, String qName, Attributes atts)
throws SAXException
{
    // elements outside of any namespace?
    if ("".equals (uri)) {
        if ("greyhound".equals (qName)) {
            // ... handle
            return;
        }
        // ... else handle N other elements; return on success

        // no recognized element: a validity error
        errorHandler.error (new SAXParseException (
            "Unrecognized element: " + qName,
            locator
            ));
        // if that doesn't abort the parse:
        return;

    // in the "big" namespace?
    } else if ("http://www.example.com/big".equals (uri)) {
        if ("animals".equals (localName)) {
            // ... handle
            return;
        }
        // ... handle "islands" and N other big things; return on success
        // FALLTHROUGH for unrecognized elements

    // in the "dog" namespace?
    } else if ("http://www.example.com/dog".equals (uri)) {
        if ("wolfhound".equals (localName)) {
            // ... handle
            return;
        }
        // ... handle "terrier", "collie" and so on; return on success
        // FALLTHROUGH for unrecognized elements
    }
    // ... and so on for other namespaces

    // element not in a namespace we recognize: a validity error
    errorHandler.error (new SAXParseException (
        "Unrecognized element: " + uri + " (" + localName + ")",
        locator
        ));
    // returns if that doesn't abort the parse
}
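The table-lookup alternative mentioned earlier might be sketched as follows. Everything here is an assumption made for illustration: the TableDrivenHandler class, its ElementAction callback type, and the "{uri}name" key format are hypothetical, not part of SAX.

```java
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class TableDrivenHandler extends DefaultHandler
{
    /** Hypothetical per-element callback type. */
    public interface ElementAction {
        void start (Attributes atts) throws SAXException;
    }

    private final Map<String, ElementAction> actions = new HashMap<> ();
    private final ErrorHandler errorHandler;
    private Locator locator;

    public TableDrivenHandler (ErrorHandler errorHandler)
        { this.errorHandler = errorHandler; }

    /** Use "" as the URI for elements outside of any namespace. */
    public void register (String uri, String name, ElementAction action)
        { actions.put (key (uri, name), action); }

    private static String key (String uri, String name)
        { return "{" + uri + "}" + name; }

    public void setDocumentLocator (Locator locator)
        { this.locator = locator; }

    public void
    startElement (String uri, String localName, String qName, Attributes atts)
    throws SAXException
    {
        // same rule as the decision tree: qName only when there's no URI
        String name = "".equals (uri) ? qName : localName;
        ElementAction action = actions.get (key (uri, name));

        if (action == null) {
            // unrecognized element: a validity error
            errorHandler.error (new SAXParseException (
                "Unrecognized element: " + key (uri, name),
                locator
                ));
            return;
        }
        action.start (atts);
    }
}
```

Registration replaces the nested comparisons; one map like this per context would cover the case in which elements may only appear inside particular parents.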
Most SAX2 parsers provide qualified names in all cases, but you shouldn't
rely on their availability unless the parser is configured to provide
namespace prefix information (which also causes namespace-declaration
attributes to be un-hidden). You should probably avoid using the qName,
even for diagnostics, when there's a nonempty namespaceURI.
Attribute naming
The identifiers for the attribute names are accessed using Attributes
methods such as getQName(), getLocalName(), and getURI() when you iterate
over an element's attributes with a for loop. You can access attribute
values directly if you use either XML 1.0-style names (qName) or XML
Namespaces-style names (namespaceURI and localName).
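Both kinds of access can be sketched without a parser by filling in an AttributesImpl by hand. The class AttributeNames and the method describe() are hypothetical names for this illustration; the accessor methods themselves belong to the standard Attributes interface.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class AttributeNames
{
    /** Iterates by index, choosing the name style each attribute supports. */
    public static List<String> describe (Attributes atts)
    {
        List<String> out = new ArrayList<> ();

        for (int i = 0; i < atts.getLength (); i++) {
            String uri = atts.getURI (i);
            String name;

            if ("".equals (uri))
                name = atts.getQName (i);
            else
                name = "{" + uri + "}" + atts.getLocalName (i);
            out.add (name + "=" + atts.getValue (i));
        }
        return out;
    }

    public static void main (String[] args)
    {
        // the wolfhound attributes, as a parser might report them
        AttributesImpl atts = new AttributesImpl ();
        atts.addAttribute ("", "", "cat", "CDATA", "no");
        atts.addAttribute ("http://www.example.com/big", "dog",
                "big:dog", "CDATA", "yes");

        System.out.println (describe (atts));
        // direct access by XML 1.0-style name
        System.out.println (atts.getValue ("cat"));
        // direct access by namespace-style name
        System.out.println (atts.getValue ("http://www.example.com/big", "dog"));
    }
}
```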
SAX2 parsers handle attribute names from the example text as shown in
Table 2-2. This table shows the "mixed" mode behavior, described later; in
the default SAX2 parser mode, the xmlns and xmlns:big attributes won't
appear. You'd have to set the namespace-prefixes feature flag (as
described later in this chapter, in the section "Namespace Feature Flags")
to see these attributes. Note that according to the namespaces
specification there is no such thing as a default namespace for attribute
names, so that namespace declaration attributes don't go into any namespace.
Table 2-2. Attributes methods to access attribute names
getURI()                     getLocalName()   getQName()
empty                        empty            xmlns
empty                        empty            xmlns:big
empty                        empty            cat
http://www.example.com/big   dog              empty or big:dog
So if you wanted to write some code that ignored elements without a
big:dog attribute (that is, the URI is http://www.example.com/big and the
local name is dog) with value yes, it might look like this:
public void startElement (String uri, String local, String qName,
        Attributes atts)
throws SAXException
{
    String value;

    value = atts.getValue ("http://www.example.com/big", "dog");
    if (!"yes".equals (value)) {
        // arrange to ignore text and elements until this finishes
        return;
    }
    // ... process the element
}
Things to keep in mind
To avoid confusing things, the previous code didn't illustrate two
somewhat perverse cases. First, if the big prefix were redefined for some
element, the same qualified name could correspond to a different universal
name, with the same local name but different namespace URIs. That's one
reason the previous code doesn't check for a qName of big:dog. Using a
qName of big:dog might make sense if you were working with XML 1.0
without using XML namespaces. Second, if the URI used with the big
prefix were associated with a second prefix, different qualified names could
correspond to the same universal names. That's another reason the
previous code doesn't check for a qName of big:dog. If you are writing
namespace-aware code, use only namespace-style name testing in your code to
avoid such problems. That makes your code work correctly even when it
deals with documents that use namespace declarations in ways you didn't
expect.
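Both cases can be checked with a small sketch that looks the attribute up by its namespace-style name only. It assumes the JAXP parser bundled with the Java runtime; the class PrefixIndependence and the method dogValue() are hypothetical names for this illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class PrefixIndependence
{
    static final String BIG = "http://www.example.com/big";

    /** Value of the {BIG}dog attribute on the root element, or null. */
    public static String dogValue (String xml) throws Exception
    {
        final String[] value = new String [1];
        final boolean[] seenRoot = new boolean [1];
        SAXParserFactory factory = SAXParserFactory.newInstance ();
        factory.setNamespaceAware (true);
        XMLReader reader = factory.newSAXParser ().getXMLReader ();

        reader.setContentHandler (new DefaultHandler () {
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
            {
                if (seenRoot [0])
                    return;
                seenRoot [0] = true;
                // never tests for a qName of "big:dog"
                value [0] = atts.getValue (BIG, "dog");
            }
        });
        reader.parse (new InputSource (new StringReader (xml)));
        return value [0];
    }

    public static void main (String[] args) throws Exception
    {
        // a second prefix bound to the same URI still matches
        System.out.println (dogValue (
            "<a xmlns:huge='" + BIG + "' huge:dog='yes'/>"));
        // the "big" prefix bound to some other URI does not
        System.out.println (dogValue (
            "<a xmlns:big='http://www.example.com/other' big:dog='yes'/>"));
    }
}
```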
By default, SAX2 XML parsers provide universal names for elements and
attributes that have namespaces (they'll have nonempty localName and
namespaceURI strings) or qualified names for elements and attributes that
don't, and will remove the namespace declaration attributes from the
Attributes object provided in the ContentHandler.startElement() event.
Unless a default element namespace declaration is in scope, an element
whose XML 1.0-style name has no prefix won't have a namespace-style
identifier. Attributes with unprefixed names work differently, since
default element namespace declarations never apply to attribute names.
If you work with both SAX2 and DOM Level 2, you need to be aware of
the differences in how these APIs expose namespaces. The terminology is
similar but not identical; SAX2 talks about "URI" while DOM Level 2 talks
about "NamespaceURI", and SAX2 uses "QName" not "Name"; but both
APIs talk about the "LocalName". When using element or attribute
construction methods in the org.w3c.dom.Document class, you will notice
that DOM uses two different APIs in places in which SAX2 provides just
one callback (in three different modes, as discussed in the next section).
You are most likely to trip over different ways to tell whether an element
or attribute has no namespace URI: SAX2 uses an empty string (length
zero), while DOM Level 2 uses a null string. You may also notice that
while SAX2 follows the XML Namespaces specification with regards to the
attributes that define namespaces, DOM does not. In SAX2, those
attributes have no URIs, but DOM assigns http://www.w3.org/2000/xmlns/
as their namespace URI.
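The empty-string versus null difference is easy to check with a sketch like this one. It assumes the JAXP SAX and DOM parsers bundled with the Java runtime; the class NoNamespaceCheck and its two methods are hypothetical names for the illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NoNamespaceCheck
{
    // a root element that is not in any namespace
    static final String XML = "<plain/>";

    /** What DOM Level 2 reports for the root element's namespace. */
    public static String domNamespace () throws Exception
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance ();
        factory.setNamespaceAware (true);
        Document doc = factory.newDocumentBuilder ().parse (
            new InputSource (new StringReader (XML)));
        return doc.getDocumentElement ().getNamespaceURI ();
    }

    /** What SAX2 reports for the same element. */
    public static String saxNamespace () throws Exception
    {
        final String[] seen = new String [1];
        SAXParserFactory factory = SAXParserFactory.newInstance ();
        factory.setNamespaceAware (true);
        XMLReader reader = factory.newSAXParser ().getXMLReader ();

        reader.setContentHandler (new DefaultHandler () {
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
                { seen [0] = uri; }
        });
        reader.parse (new InputSource (new StringReader (XML)));
        return seen [0];
    }

    public static void main (String[] args) throws Exception
    {
        System.out.println ("DOM null? " + (domNamespace () == null));
        System.out.println ("SAX empty? " + "".equals (saxNamespace ()));
    }
}
```

Code that moves names between the two APIs needs to convert between null and the empty string at the boundary.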
Namespace Feature Flags
SAX2 controls its namespace-processing support through two feature flags,
which can be tested and changed using the setFeature() and getFeature()
methods described earlier in this chapter in the section "SAX2 Feature
Flags." The two flags are http://xml.org/sax/features/namespaces
(namespaces), which controls whether parsers handle namespace
declarations, and http://xml.org/sax/features/namespace-prefixes
(namespace-prefixes), which controls whether applications can see the
underlying XML syntax. All SAX2 parsers support both flags, although their
values might be read-only.
Given two flags, there are four possible combinations. Only three are
legal. It's easiest to understand what the flags do by considering them as
each controlling a small processing task layered over a core that just
parses XML text. The SAX2 defaults are set so both tasks are performed.
XML 1.0 mode
Only XML 1.0-style names are reported for elements and attributes,
using the qName. The namespaces flag is false, and the
namespace-prefixes flag is true; those values are exactly the opposite of
the SAX2 defaults.
This mode passes xmlns and xmlns:* attributes without looking at
them. Namespace-style names (with URIs) might be provided with
element or attribute names, but you must not rely on this; few parsers
will do the extra work of processing the namespace declarations. If
you enable this mode, your SAX2 parser will be doing what a SAX1
parser did, but the information will flow through APIs with slots for
holding namespace-style names.
Mixed mode
Both XML 1.0 and XML plus Namespaces-style names are reported
for elements and attributes. The namespaces flag is true (like the
default SAX2 mode), and the namespace-prefixes flag is true (like XML
1.0 mode).
This mode is much like XML 1.0 mode, but setting the namespaces
flag causes startPrefixMapping() and endPrefixMapping() events
(discussed in the next section) to match xmlns and xmlns:* attributes, and
processes those declarations so the parser always provides namespace
URIs for element and attribute names when they're defined. The
qName is always provided, even when a namespace URI is defined.
Parsers running in this mode should generate some kind of error
report for legal XML 1.0 documents that don't meet all the rules of the
"XML plus namespaces" dialect. (Most parsers use
ErrorHandler.error() although the namespace specification doesn't say what
class of error to report.) One example is to use colons in names for
things that aren't elements or attributes, and not declare namespace
prefixes. Similarly, you might get warnings about using relative URIs
in namespace declarations. There is a performance impact to this
additional processing, often five percent of the usually negligible
overhead for XML parsing.
XML plus namespaces mode (SAX2 default)
The difference between this and mixed mode is that some information
is discarded. The namespaces flag is true, and the namespace-prefixes
flag is false.
Clearing the namespace-prefixes flag tells parsers they must filter out
xmlns and xmlns:* attributes, and they may report empty strings
instead of providing the qName (as found in the document) whenever
a namespace URI is reported. In practice, most current SAX2 parsers
always report qualified names, since there's little benefit to filtering
them out.
The fourth combination of flags, disabling both namespace support and
namespace prefix reporting, would be meaningless, and so it is an illegal
parser state. Don't set this mode; parsers might not detect that you've put
them into an illegal mode and may react unintelligently (such as by
entering XML 1.0 mode). Unfortunately it's easy to set this mode if you
just set the namespaces flag to false without first setting the
namespace-prefixes flag to true (entering mixed mode).
I tend to prefer the mixed mode over the SAX2 default mode. Enabling it
is simple: just set the namespace-prefixes flag to true, after setting up a
parser for the SAX2 defaults. This mode provides better support for the
XML Infoset, since it doesn't discard information about the prefixes. You
won't see implementation-dependent behaviors in exposing either type of
name. Certain kinds of XML processing will work better. In particular,
algorithms working near the XML syntax level (such as writing out XML
text or performing consumer-side DTD validation) will then work without
needing to guard against discarded prefixes and without re-creating
namespace declaration attributes. Discarding or changing prefixes, in
particular, can cause confusion when people need to look at the XML output.
The only real impact on applications is having to ignore xmlns and
xmlns:* attributes, which isn't hard.
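Here is a minimal sketch of switching between the default and mixed modes, and of what that does to the attribute list. It assumes the JAXP parser bundled with the Java runtime, obtained from a namespace-aware factory so that it starts out matching the SAX2 defaults; the class MixedModeDemo and the method rootAttributes() are hypothetical names for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MixedModeDemo
{
    static final String PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    /** Qualified names of the attributes reported for the root element. */
    public static List<String> rootAttributes (boolean mixedMode)
    throws Exception
    {
        final List<String> names = new ArrayList<> ();
        SAXParserFactory factory = SAXParserFactory.newInstance ();
        factory.setNamespaceAware (true);   // namespaces flag is true
        XMLReader reader = factory.newSAXParser ().getXMLReader ();

        // true: mixed mode; false: the SAX2 default mode
        reader.setFeature (PREFIXES, mixedMode);
        reader.setContentHandler (new DefaultHandler () {
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
            {
                for (int i = 0; i < atts.getLength (); i++)
                    names.add (atts.getQName (i));
            }
        });
        reader.parse (new InputSource (new StringReader (
            "<big:animals xmlns='http://www.example.com/dog'"
            + " xmlns:big='http://www.example.com/big'/>")));
        return names;
    }

    public static void main (String[] args) throws Exception
    {
        System.out.println ("default: " + rootAttributes (false));
        System.out.println ("mixed:   " + rootAttributes (true));
    }
}
```

An application running in mixed mode can skip the declarations by ignoring any attribute whose qName is xmlns or starts with xmlns: while iterating.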
Few, if any, applications really need to work with documents that use
colons in ways other than the XML namespaces specification, leaving a
small performance impact as the primary reason to care about the pure
XML 1.0 mode. Even applications that don't use namespaces usually won't
see colons used in interesting ways (like nested:contexts:for:names).
While most SAX2 XML parsers support all three of these modes, they are
only required to support the SAX2 default mode.
ContentHandler and Prefix Mappings
Sometimes XML needs to handle meta-level processing, in which XML
talks about XML. In such processing, namespace URIs are sometimes
implicitly called by prefixes found in places no XML parser will look:
CDATA attributes (which can contain anything) and character content
found within elements. For example, XPath expressions include prefixes,
and they are found in XSLT template attributes. The W3C XML Schema
Datatypes (XSD) defines a QName datatype that formalizes such usage.
When you need to work with those types of XML text, you'll find two
particular ContentHandler event callbacks helpful. They provide the same
information found in xmlns and xmlns:* attributes, relieving your
application code of the responsibility of correctly applying the XML
Namespaces specification. For example, your code won't need to know how a
default element namespace declaration can be explicitly undone by xmlns=""
attributes or by ending the lexical scope of that attribute.
void startPrefixMapping(String prefix, String uri)
Each namespace declaration causes one of these calls. Each call
corresponds to an attribute in the next startElement() callback to be
made; you probably won't see other callbacks intervening. (This
method has to appear before the element; the mapping will be used
to interpret names of the element or its attributes.) If the prefix is the
empty string, then the declaration is for the default element
namespace. This is the only time the URI may be specified as the empty
string (indicating that there is no longer a default element namespace
in effect).
void endPrefixMapping(String prefix)
Each call to startPrefixMapping() is paired with a matching event to
declare that the mapping has gone out of scope. These calls
correspond to the most recent endElement() callback. However, the
mapping start calls and the mapping end calls won't necessarily be
perfectly nested. For example, two prefix mappings found in one
element might be started in the order xlink then MyApp, but either
mapping could end first.
You'd normally ignore these two calls, unless you use them to maintain
some data structure that tracks active namespace prefixes. It would have
to be a stacklike data structure, since one mapping for a prefix only
temporarily hides a previous mapping for the same prefix. This is the notion
of lexical scope, which you are familiar with from most programming
languages. SAX2 includes a helper class to handle this for you:
NamespaceSupport, discussed in the section "The NamespaceSupport Class" in
Chapter 5. Then when you parse the meta-level content, you can use
those data structures to interpret prefix references and handle other
namespace-related work.
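As a sketch of that bookkeeping, the handler below feeds the prefix-mapping events into a NamespaceSupport and uses it to resolve a prefix found inside an attribute value. Several things are assumptions made for the illustration: the QNameResolver class, the unprefixed "type" attribute holding a QName-like value, and the use of mixed mode so that the attribute can be looked up by its qName.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.NamespaceSupport;

public class QNameResolver extends DefaultHandler
{
    private final NamespaceSupport scopes = new NamespaceSupport ();
    private boolean needNewContext = true;
    private String resolved;

    public void startPrefixMapping (String prefix, String uri)
    {
        // mappings arrive before their element; open its scope on demand
        if (needNewContext) {
            scopes.pushContext ();
            needNewContext = false;
        }
        scopes.declarePrefix (prefix, uri);
    }

    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    {
        // elements without declarations still get their own scope
        if (needNewContext)
            scopes.pushContext ();
        needNewContext = true;

        String value = atts.getValue ("type");
        if (value != null) {
            int colon = value.indexOf (':');
            String prefix = (colon < 0) ? "" : value.substring (0, colon);
            resolved = scopes.getURI (prefix);  // null if not in scope
        }
    }

    public void endElement (String uri, String localName, String qName)
        { scopes.popContext (); }

    // endPrefixMapping() is ignored; popContext() handles scope exit

    /** URI bound to the prefix in the last "type" value seen, or null. */
    public static String resolve (String xml) throws Exception
    {
        QNameResolver handler = new QNameResolver ();
        SAXParserFactory factory = SAXParserFactory.newInstance ();
        factory.setNamespaceAware (true);
        XMLReader reader = factory.newSAXParser ().getXMLReader ();

        reader.setFeature (
            "http://xml.org/sax/features/namespace-prefixes", true);
        reader.setContentHandler (handler);
        reader.parse (new InputSource (new StringReader (xml)));
        return handler.resolved;
    }
}
```

Pushing the context lazily is what lets the declarations for an element, which are reported before its startElement() call, land in that element's own scope.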
3
Producing SAX2 Events
In this chapter:
Pull Mode Event Production with XMLReader
Bootstrapping an XMLReader
Configuring XMLReader Behavior
The EntityResolver Interface
Other Kinds of SAX2 Event Producers
The preceding chapter provided an overview of the most widely used SAX
classes and showed how they relate to each other. This chapter drills
more deeply into how to produce XML events with SAX, including further
customization of SAX parsers.
Pull Mode Event Production with XMLReader
Most of the time you work with SAX2, you'll be using some kind of
org.xml.sax.XMLReader implementation that turns XML text into a SAX
event stream. Such a class is loosely called a "SAX parser." Don't confuse
this with the older SAX1 org.xml.sax.Parser class. New code should not
be using that class!
This interface works in a kind of "pull" mode: when a thread makes an
XMLReader.parse() request, it blocks until the XML document has been
fully read and processed. Inside the parser there's a lot of work going on,
including a "pull-to-push" adapter: the parser pulls data out of the input
source provided to parse() and converts it to events that it pushes to
event consumers. This model is different from the model of a
java.io.Reader, from which applications can only get individual buffers of
character data, but it's also similar because in both cases the calling
thread is pulling data from a stream.
You can also have pure push mode event producers. The most common
kind writes events directly to event handlers and doesnt use any kind of
input abstraction to indicate the datas source; its not parsing XML text.
67
3 January 2002 10:08
68 Chapter 3: Producing SAX2 Events
We discuss several types of such producers later in this chapter. Using
thr eads, you could also create a producer that lets you write raw XML text,
a buf fer at a time, to an XMLReader that parses the text; thats another
kind of push mode producer.
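
As a minimal sketch of the first kind of push mode producer, the following class writes a small, fixed event stream straight to a ContentHandler without parsing any XML text. The names DirectProducer, produce(), and record() are hypothetical, and the element and attribute are borrowed from the lichen example used elsewhere in this chapter.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class DirectProducer {
    // Pushes a tiny, fixed document to the consumer; no XML text is parsed.
    static void produce(ContentHandler out) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "color", "color", "CDATA", "red");
        char[] text = "crusty".toCharArray();

        out.startDocument();
        out.startElement("", "lichen", "lichen", atts);
        out.characters(text, 0, text.length);
        out.endElement("", "lichen", "lichen");
        out.endDocument();
    }

    // Runs the producer against a recording consumer and returns what it saw.
    static List<String> record() {
        List<String> events = new ArrayList<>();
        DefaultHandler consumer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start " + qName + " color=" + atts.getValue("color"));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("text " + new String(ch, start, length));
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end " + qName);
            }
        };
        try {
            produce(consumer);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }
}
```

The consumer can't tell whether these callbacks came from a parser or from code like this, which is what makes such producers easy to plug into existing SAX pipelines.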
The XMLReader Interface
The SAX overview presented the most important parts of the XMLReader
interface. Here we discuss the whole thing, in functional groups. Most of
the handlers are presented in more detail in the next chapter, which
focuses on the consumption side of the SAX event streaming process.
Each handler has get and set accessor methods, and has a default value of
null.

XMLReader has the following functional groups:

void parse(String uri)
void parse(InputSource in)
    There are two methods to parse documents. In most cases, the Java
    environment is able to resolve the document's URI; the form with the
    absolute URI should be used when possible. (You may need to convert
    filenames to URIs before passing them to SAX. SAX specifically
    disallows passing relative URIs.) The second form is discussed in
    more detail along with the InputSource class. Both of these methods
    can throw a SAXException or java.io.IOException, as presented
    earlier. A SAXException is normally thrown only when an event handler
    throws it to terminate parsing. That policy is best encapsulated in
    an ErrorHandler, but handler methods can make such decisions
    themselves.

    Only one thread may call a given parser's parse() method at a time;
    applications are responsible for ensuring that threads don't share
    parsers that are in active use. (SAX parsers aren't necessarily going
    to report applications that break that rule, though!) The thread
    doing the parsing will normally block only while it's waiting for
    data to be delivered to it, or if a handler's processing causes it to
    block.

void setContentHandler(ContentHandler handler)
ContentHandler getContentHandler()
    Key parts of the ContentHandler interface were presented as part of
    the SAX overview; ContentHandler packages the fundamental parsing
    callbacks used by SAX event consumers. This interface is presented
    in more detail in Chapter 4, in the section "Other ContentHandler
    Methods."

void setDTDHandler(DTDHandler handler)
DTDHandler getDTDHandler()
    The DTDHandler is presented in detail later, in Chapter 4 in the
    section "The DTDHandler Interface."

void setEntityResolver(EntityResolver handler)
EntityResolver getEntityResolver()
    The EntityResolver is presented later in this chapter, in the section
    "The EntityResolver Interface." It is used by the parser to help
    locate the content for external entities (general or parameter) to be
    parsed.

void setErrorHandler(ErrorHandler handler)
ErrorHandler getErrorHandler()
    The ErrorHandler was presented in the section "ErrorHandler
    Interface" in Chapter 2. It is often used by consumer code that
    interprets events reported through other handlers, since they may
    need to report errors detected at higher levels than XML syntax.

void setFeature(String uri, boolean value)
boolean getFeature(String uri)
    Parser feature flags were discussed in Chapter 2, and are presented
    in more detail later in this chapter in the section "XMLReader
    Feature Flags."

void setProperty(String uri, Object value)
Object getProperty(String uri)
    Parser properties are used for data such as additional event
    handlers, and are presented in more detail later in this chapter in
    the section "XMLReader Properties."

All the event handlers and the entity resolver may be reassigned inside
event callbacks. At this level, SAX guarantees late binding of handlers.
Layers built on top of SAX might use earlier binding, which can optimize
event processing.
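
The late binding guarantee can be seen in a small sketch where one handler hands the rest of the document to another from inside a callback. The class name HandlerSwitch, the element name body, and the JAXP bootstrap are assumptions chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerSwitch {
    // Returns { elements seen by the first handler, elements seen by the second }.
    static int[] count(String xml) {
        int[] counts = new int[2];
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            DefaultHandler second = new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    counts[1]++;
                }
            };

            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    counts[0]++;
                    if ("body".equals(qName)) {
                        // Reassign the handler mid-parse; later events go to "second".
                        parser.setContentHandler(second);
                    }
                }
            });

            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return counts;
    }
}
```

With a document such as <doc><head/><body><p/><p/></body></doc>, the first handler should see doc, head, and body, and the second should see only the two p elements.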

Many SAX parsers let you set handlers to null as a way to ignore the
events reported by that type of handler. Strictly speaking, they don't
need to do that; they're allowed to throw a NullPointerException when you
use null. So if you need to restore the default behavior of a parser, you
should use a DefaultHandler (or something implementing the appropriate
extension interface) just in case, rather than use the more natural idiom
of setting the handler to its default value, null.

If for any reason you need a push mode XML parser, which takes blocks
of character or byte data (encapsulating XML text) that you write to a
parser, you can easily create one from a standard pull mode parser. The
cost is one helper thread and some API glue. The helper thread will call
parse() on an InputSource that uses a java.io.PipedInputStream to read
text. The push thread will write such data blocks to an associated
java.io.PipedOutputStream when it becomes available. Most SAX parsers
will in turn push the event data out incrementally, but there's no
guarantee (at least from SAX) that they won't buffer megabytes of data
before they start to parse.
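
The helper-thread arrangement just described can be sketched roughly as follows. The class name PushParser and its pushChunks() method are hypothetical, the parser is obtained through JAXP only so the sketch runs on a stock JDK, and a real implementation would need more careful error handling if the parser gives up early.

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PushParser {
    // Writes each chunk to a pipe while a helper thread parses the other end.
    // Returns the number of elements the parser reported.
    static int pushChunks(String... chunks) {
        AtomicInteger elements = new AtomicInteger();
        AtomicReference<Exception> failure = new AtomicReference<>();
        try {
            PipedOutputStream sink = new PipedOutputStream();
            PipedInputStream source = new PipedInputStream(sink);

            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    elements.incrementAndGet();
                }
            });

            // The helper thread blocks in parse(), pulling bytes from the pipe.
            Thread helper = new Thread(() -> {
                try {
                    parser.parse(new InputSource(source));
                } catch (Exception e) {
                    failure.set(e);
                }
            });
            helper.start();

            // The calling thread is the "push" side.
            try (PipedOutputStream out = sink) {
                for (String chunk : chunks) {
                    out.write(chunk.getBytes(StandardCharsets.UTF_8));
                }
            }   // closing the pipe is what signals end of input

            helper.join();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        if (failure.get() != null) {
            throw new IllegalStateException(failure.get());
        }
        return elements.get();
    }
}
```

A call such as PushParser.pushChunks("<a><b/>", "<c/></a>") feeds one document in two pieces; the chunk boundaries don't need to line up with markup boundaries.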

The InputSource Class
The InputSource class shows up in both places where SAX needs to parse
data: for the document itself, through parse(), and for the external
parsed entities it might reference through the EntityResolver interface.

In almost all cases you should simply pass an absolute URI to the
XMLReader.parse() method. (If you have a relative URI or a filename, turn
it into an absolute URI first.) However, there are cases when you may
need to parse data that has no URI. It might be in unnamed storage like a
String; or it might need to be read using a specialized access scheme
(maybe a java.io.PipedInputStream, or POST input to a servlet, or
something named by a URN). The web server for the URI might misidentify
the document's character encoding, so you'd need to work around that
server bug. In such cases, you must use the alternative
XMLReader.parse() method and pass an InputSource object to the parser.

InputSource objects are fundamentally holders for one or two things: an
entity's URI and the entity text. (There can be a public ID too, but it's
rarely useful.) When only one of those is needed, an application's work
for setting up the InputSource might end with choosing the right
constructor. Whenever you provide the entity text, you need to pay
attention to some character encoding issues. Because character encoding
is easy to get wrong, avoid directly providing entity text when you can.

Always provide absolute URIs
You should try to always provide the fully qualified (absolute) URI of
the entity as its systemId, even if you also provide the entity text.
That URI will often be the only data you need to provide. You must
convert filenames to URIs (as described later in this chapter in the
section "Filenames Versus URIs"), and turn relative URIs into absolute
ones. Some parsers have bugs and will attempt to turn relative URIs into
absolute ones, guessing at an appropriate base URI. Do not rely on such
behavior.

If you don't provide that absolute URI, then diagnostics may be useless.
More significantly, relative URIs within the document can't be correctly
resolved by the parser if the base URI is forgotten. XML parsers need to
handle relative URIs within DTDs. To do that they need the absolute
document (or entity) base URIs to be provided in InputSource (or parse()
methods) by the application. Parsers use those base URIs to absolutize
relative URIs, and then use EntityResolver to map the URIs (or their
public identifiers) to entity text. Applications sometimes need to do
similar things to relative URIs in document content. The xml:base
attribute may provide an alternative solution for applications to
determine the base URI, but it is normally needed only when relative URIs
are broken. This can happen when someone moves the base document without
moving its associated resources, or when you send the document through
DOM (which doesn't record base URIs). Moreover, relative URIs in an
xml:base attribute still need to be resolved with respect to the real
base URI of the document.
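
Applications that need to absolutize URIs in document content can do the same arithmetic the parser does. The following is a minimal sketch using java.net.URI from the standard library; the class name BaseUris and the example.com addresses are made up for illustration.

```java
import java.net.URI;

class BaseUris {
    // Resolves a (possibly relative) reference against an absolute base URI,
    // the way a parser absolutizes a system identifier.
    static String absolutize(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }
}
```

With the base http://example.com/docs/book.xml, the reference chapters/ch1.dtd resolves to http://example.com/docs/chapters/ch1.dtd. A relative xml:base value such as images/ is itself resolved against the document's base first, and only then used as the base for references inside its scope.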

The following methods are used to provide absolute URIs:

InputSource(String uri)
    Use this constructor when you are creating an InputSource consisting
    only of a fully qualified URI in a scheme understood by the JVM you
    are using. Such schemes commonly include http://, file://, ftp://,
    and increasingly https://.

InputSource.setSystemId(String uri)
    Use this method to record the URI associated with text you are
    providing directly.

For example, these two ways to parse a document are precisely
equivalent:

String uri = ...;
XMLReader parser = ...;

parser.parse (uri);
// or
parser.parse (new InputSource (uri));

Providing entity text
For data without a URI, or that uses a URI scheme not supported by your
JVM, applications must provide entity text themselves. There are two ways
to provide the text through an InputSource: as character data or as
binary data, which needs to be decoded into character data before it can
be parsed. In both cases your application will create an open data stream
and give it to the parser. It will no longer be owned by your
application; the parser should later close it as part of its end-of-input
processing. If you provide binary data, you might know the character
encoding used with it and can give that information to the parser rather
than turning it to character data yourself using something like an
InputStreamReader.

InputSource(java.io.Reader in)
    Use this constructor when you are providing predecoded data to the
    parser, which will then ignore what any XML or text declaration says
    about the character encoding. (Also, call setSystemId(uri) when
    possible.) This constructor is useful for parsing data from a
    java.io.Reader such as java.io.CharArrayReader and for working
    around configuration bugs in HTTP servers.

    Some HTTP servers will misidentify the text encoding used for XML
    documents, using the content type text/xml for non-ASCII data,
    instead of text/xml;charset=... or application/xml.* If you know a
    particular server does this, and that the encoding won't be
    autodetected, create an InputSource by using an InputStreamReader
    that uses the correct encoding. If the correct encoding will be
    autodetectable, you can use the InputStream constructor.

    * application/xml is the safest MIME type to use for *.xml, *.dtd,
      and other XML files. See RFC 3023 for information about XML MIME
      types and character encodings.

InputSource(java.io.InputStream in)
    Use this constructor when you are providing binary data to a parser
    and expect the parser to be able to detect the encoding from the
    binary data. (Also, call setSystemId(uri) when possible.)

    For example, UTF-16 text always includes a Byte Order Mark, a
    document beginning <?xml ... encoding="Big5"?> is understood by most
    parsers as a Big5 (traditional Chinese) document, and UTF-8 is the
    default for XML documents without a declaration identifying the
    actual encoding in use.

InputSource.setEncoding(String id)
    Use this method if you know the character encoding used with data
    you are providing as a java.io.InputStream. (Or provide a
    java.io.Reader if you can, though some parsers know more about
    encodings than the underlying JVM does.)* If you don't know the
    encoding, don't guess. XML parsers know how to use XML and text
    declarations to correctly determine the encoding in use. However,
    some parsers don't autodetect EBCDIC encodings, which are mostly
    used with IBM mainframes. You can use this method to help parsers
    handle documents using such encodings, if you can't provide the
    document in a fully interoperable encoding such as UTF-8.

    All XML parsers support "UTF-8" and "UTF-16" values here, and most
    support other values, such as "US-ASCII" and "ISO-8859-1". Consult
    your parser documentation for information about other encodings it
    supports. Typically, all encodings supported by the underlying JVM
    will be available, but they might be inconsistently named. (As one
    example, Sun's JDK supports many EBCDIC encodings, but gives them
    unusual names that don't suggest they're actually EBCDIC.) You
    should use standard Internet (IANA) encoding names, rather than Java
    names, where possible. In particular, don't use the name "UTF8"; use
    "UTF-8".
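
As a small sketch of the setEncoding() case, the following class hands the parser a byte stream that has no encoding declaration and is not UTF-8, and tells it the encoding explicitly. The class name EncodedInput and the parseLatin1() helper are hypothetical; the JAXP bootstrap is used only so the sketch runs on a stock JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EncodedInput {
    // Encodes the text as ISO-8859-1 bytes, then parses those bytes while
    // telling the parser which encoding was used. Returns the character data.
    static String parseLatin1(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
            InputSource in = new InputSource(new ByteArrayInputStream(bytes));
            in.setEncoding("ISO-8859-1");   // IANA name, not a Java-specific alias

            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);   // may arrive in several pieces
                }
            });
            parser.parse(in);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }
}
```

Without the setEncoding() call, a parser would have to assume UTF-8 for this input, and any non-ASCII byte would typically be rejected as malformed.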
So if you want to parse some XML text you have lying around in a charac-
ter array or String, the natural thing to do is package it as a java.io.Reader
and wrap it up in something like this:
String text = "<lichen color='red'/>";
Reader reader = new StringReader (text);
XMLReader parser = ... ;
parser.setContentHandler (...);
parser.parse (new InputSource (reader));

* JDK 1.4 includes public APIs through which applications can support
  new character encodings. Some applications may need to use those APIs
  to support encodings beyond those the JVM handles natively.

In the same way, if you're implementing a servlet's POST handler and the
servlet accepts XML text as its input, you'll create an InputSource. The
InputSource will never have a URI, though you could support URIs for
multipart/related content (sending a bundle of related components, such
as external entities). Example 3-1 handles the MIME content type
correctly, though it does so by waving a magic wand: it calls a routine
that implements the rules in RFC 3023. That is, text/* content is
US-ASCII (seven-bit code) by default, and any charset=... attribute is
authoritative. When parsing XML requests inside a servlet, you'd
typically apply a number of configuration techniques to speed up
per-request processing and maintain security.*

* You might have a pool of parsers, to reduce bootstrap costs. You'd use
  an entity resolver to turn most entity accesses from remote ones into
  local ones. Depending on your application, you might even prevent all
  access to nonlocal entities so the servlet won't hang when remote
  network accesses get delayed.

  Some security policies would also involve the entity resolver.
  Basically, every entity access requested by the client (through a
  reference in the document) is a potential attack. If it's not known to
  be safe (for example, access to standard DTD components), it may be
  important to prevent or nullify the access. (This does not always
  happen in the entity resolver; sometimes system security policies will
  be more centralized.) In a small trade-off against performance,
  security might require that the request data always be validated, and
  that validity errors be treated as fatal, because malformed input data
  is likely to affect system integrity.

Example 3-1. Parsing POST input to an HTTP Servlet

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
import gnu.xml.util.Resolver;

public void doPost (HttpServletRequest request, HttpServletResponse response)
throws IOException, ServletException
{
    String type = request.getContentType ();
    InputSource in;
    XMLReader parser;

    // accept only content that is labeled as XML
    if (type == null
            || !(type.startsWith ("text/xml")
                || type.startsWith ("application/xml"))) {
        response.sendError (HttpServletResponse.SC_UNSUPPORTED_MEDIA_TYPE,
            "non-XML content type: " + type);
        return;
    }

    // there's no URI for this input data!
    in = new InputSource (request.getInputStream ());

    // use any encoding associated with the MIME type
    in.setEncoding (Resolver.getEncoding (request.getContentType ()));

    try {
        parser = XMLReaderFactory.createXMLReader ();
        ...
        parser.setContentHandler (...);
        parser.parse (in);
        // content handler expected to handle response generation

    } catch (SAXException e) {
        response.sendError (HttpServletResponse.SC_BAD_REQUEST,
            "bad input: " + e.getMessage ());
        return;

    } catch (IOException e) {
        // maybe a relative URI in the input couldn't be resolved
        response.sendError (HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
            "i/o problem: " + e.getMessage ());
        return;
    }
}

You might have some XML text in a database, stored as a "binary large
object" (BLOB, accessed using java.sql.Blob) and potentially referring to
other BLOBs in the database. Constructing input sources for such data
should be slightly different because of those references. You'd want to
be sure to provide a URI, so the references can be resolved:

String key = "42";
byte data [] = Storage.keyToBlob (key);
InputStream stream = new ByteArrayInputStream (data);
InputSource source = new InputSource (stream);
XMLReader parser = ... ;
source.setSystemId ("blob:" + key);
parser.parse (source);

In such cases, where you are using a URI scheme that your JVM doesn't
support directly, consider using an EntityResolver to create the
InputSource objects you hand to parse(). Such schemes might be standard
(such as members of a MIME multipart/related bundle), or they might be
private to your application (like this blob: scheme). (Example 3-3 shows
how to package handling for such nonstandard URI schemes so that you
can use them in your application, even when your JVM does not understand
them. You may need to pass such URIs using public IDs rather than
system IDs, so that parsers won't report errors when they try to resolve
them.)

Filenames Versus URIs
Filenames are not URIs, so you may not provide them as system identifiers
where SAX expects a system identifier: in parse() or in an InputSource
object. If you are depending on JDK 1.2 or later, you can rely on
new File(name).toURL().toString() to turn a filename into a URI. To be
most portable, you may prefer to use a routine as shown in Example 3-2,
which handles key issues like mapping DOS or Mac OS filenames into
legal URIs.

Example 3-2. File.toURL() analogue for JDK 1.1

public static String fileToURL (File f)
throws IOException
{
    String temp;

    if (!f.exists ())
        throw new IOException ("no such file: " + f.getName ());
    temp = f.getAbsolutePath ();
    // URIs always use forward slashes, whatever the platform uses
    if (File.separatorChar != '/')
        temp = temp.replace (File.separatorChar, '/');
    if (!temp.startsWith ("/"))
        temp = "/" + temp;
    if (!temp.endsWith ("/") && f.isDirectory ())
        temp = temp + "/";
    return "file:" + temp;
}

If you're using the GNU software distribution that is described earlier,
gnu.xml.util.Resolver.fileToURL() is available so you won't need to
enter that code yourself.
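
On JDK 1.4 and later there is one more option worth knowing about: java.io.File.toURI(), which (unlike toURL()) also escapes characters such as spaces that are not legal in URIs. The sketch below assumes such a JDK; the class name FileUris and its toSystemId() method are hypothetical.

```java
import java.io.File;

class FileUris {
    // Turns a filename (relative or absolute) into an absolute file: URI
    // suitable for use as a SAX system identifier.
    static String toSystemId(String filename) {
        return new File(filename).toURI().toString();
    }
}
```

For a name like my docs/a.xml the result is an absolute file: URI in which the space appears as %20; the leading part depends on the platform and the current directory.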

Bootstrapping an XMLReader
There are several ways to obtain an XMLReader. Here we'll look at a few
of them, focusing first on the most commonly available ones. These are
the "pure SAX" solutions.

It's good policy to reuse parsers, rather than constantly discard and
recreate them. Some parsers are more expensive to create than others, so
such reuse can improve performance if you parse many documents.
Similarly, factory approaches add some fixed costs to achieve vendor
neutrality, and those costs can add up. In contexts like servlets, where
any number of threads may need to parse XML concurrently, parsers are
often pooled so those bootstrapping costs won't increase per-request
service times.
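
A parser pool can be quite small. The following is a minimal sketch, assuming a fixed pool size and the JAXP bootstrap; the class name ParserPool and its methods are hypothetical. Returned parsers get fresh DefaultHandler objects rather than null handlers, for the portability reason given earlier in this chapter.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParserPool {
    private final BlockingQueue<XMLReader> idle;

    // Pays the bootstrap cost once, up front, for "size" parsers.
    ParserPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            for (int i = 0; i < size; i++) {
                idle.add(factory.newSAXParser().getXMLReader());
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Blocks until a parser is free, so no parser is shared between threads.
    XMLReader borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a parser", e);
        }
    }

    // Detaches the previous user's handlers before making the parser available.
    void giveBack(XMLReader parser) {
        DefaultHandler none = new DefaultHandler();
        parser.setContentHandler(none);
        parser.setDTDHandler(none);
        parser.setErrorHandler(none);
        parser.setEntityResolver(none);
        idle.add(parser);
    }

    int available() {
        return idle.size();
    }
}
```

A request handler would call borrow(), install its own handlers, parse, and call giveBack() in a finally block so that a failed parse doesn't leak a parser.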

The XMLReaderFactory Class
The simplest way to get a parser is to use the default parser for your
environment, as we saw earlier:

import org.xml.sax.helpers.XMLReaderFactory;
...
XMLReader parser = null;

try {
    parser = XMLReaderFactory.createXMLReader ();
    // success!
} catch (SAXException e) {
    System.err.println ("Can't get default parser: " + e.getMessage ());
}

Normally, the default parser is defined by setting the org.xml.sax.driver
system property. Application startup should set that property, normally
using JVM invocation flags. (In a very few cases System.setProperty()
may be appropriate.)

$ java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader

Unfortunately, in many cases the original reference implementation of
that method is used. This is problematic in two situations: when the
system property isn't set and when security permissions are set to
prevent access to that system property; this is common for many applets.
Good SAX2 distributions will ensure that this factory method succeeds in
the face of such errors. The current release of the SAX2 helper classes
makes this easy to do.*

* The current version of XMLReaderFactory has more intelligence and
  supports additional configuration mechanisms. For example, your
  application or parser distribution can configure a
  META-INF/services/org.xml.sax.driver resource into its class path,
  holding a single string to be used if the system property hasn't been
  set. SAX2 parser distributions are expected to work even if the system
  property or class path resource hasn't been set.

Because of that problem, you may choose to code your application so
parser choice is a configuration option encoded through some other
mechanism than system properties. You can't keep it in your application's
XML-format configuration file, since you would need a parser to read that
file. Once you get that configuration data you'll probably use a
different XMLReaderFactory call:

import org.xml.sax.helpers.XMLReaderFactory;
...
XMLReader parser = null;
String className = ...;

try {
    parser = XMLReaderFactory.createXMLReader (className);
    // success!
} catch (SAXException e) {
    System.err.println ("Can't get parser " + className + ": "
        + e.getMessage ());
}

Using this factory call, the class name identifies the SAX parser you
want to use. It may well be one of the entries in Table 3-1, though some
frameworks bundle other parsers.

Table 3-1. SAX2 XMLReader implementation classes

Parser (and type)                  Class name
Ælfred (nonvalidating)             gnu.xml.aelfred2.SAXDriver
Ælfred (optionally validating)     gnu.xml.aelfred2.XmlReader
Crimson (optionally validating)    org.apache.crimson.XmlReaderImpl
Xerces (optionally validating)     org.apache.xerces.parsers.SAXParser

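
One way to combine application-level configuration with a safety net is to try the configured class names in order and fall back to the environment's default. This is only a sketch: the class name ParserChooser and its choose() method are hypothetical, and on recent JDKs XMLReaderFactory is marked deprecated (though still present), which is why the warning is suppressed.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

class ParserChooser {
    // Tries each configured class name in order, then the environment's default.
    @SuppressWarnings("deprecation")
    static XMLReader choose(String... classNames) {
        for (String name : classNames) {
            try {
                return XMLReaderFactory.createXMLReader(name);
            } catch (SAXException e) {
                // Class is missing or can't be instantiated here; try the next one.
            }
        }
        try {
            return XMLReaderFactory.createXMLReader();
        } catch (SAXException e) {
            throw new IllegalStateException("no SAX2 parser available", e);
        }
    }
}
```

The candidate names would normally come from your configuration data, for example entries like those in Table 3-1, rather than being hardwired into source code.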
If you're using a parser without a settable option for validation, you
may want to let distinct parsers be configured for validating and
nonvalidating usage, assuming that your application needs both. Parsers
with validation support are significantly larger than ones without it,
which is partly why Ælfred still has a nonvalidating class.

Calling Parser Constructors
If you need to force the use of some particular parser, you can invoke
its constructor directly. Every SAX2 XMLReader must have a default
constructor in order to work with the XMLReaderFactory class. Since it
exists, you can invoke it directly using the same class names you may
have passed to the XMLReaderFactory, if you used application-level
configuration:

import org.xml.sax.XMLReader;
import gnu.xml.aelfred2.XmlReader;
...
XMLReader parser = new XmlReader ();

In some cases you may actually prefer to force use of some particular
parser. In other cases, you may have no option, maybe because of class
loader or security configuration. If you run into trouble with those
mechanisms, you may not be able to use factory APIs to access parsers
unless they are visible through the system class loader.

In general, avoid such nonportable coding decisions; use a factory API
wherever you can.

Using JAXP
Sun's JAXP 1.1 supports yet another way to bootstrap SAX parsers. It's a
more complex process, taking several steps instead of just one:

1. First, get a javax.xml.parsers.SAXParserFactory.
2. Tell it to return parsers that will do the kind of processing needed
   by your application.
3. Ask it to give you a JAXP parser of type javax.xml.parsers.SAXParser.
4. Finally, ask the JAXP parser to give you the XMLReader that is
   normally lurking inside of it.

Conceptually this is like the no-parameters
XMLReaderFactory.createXMLReader() method, except it's complicated by
expecting the factory to return preconfigured parsers.* Configuring the
parser using the SAX2 flags and properties directly is preferable; the
API "surface area" is smaller.

* You can also look at this as choosing between parsers. For example,
  JAXP 1.2 will probably say how to request that schema validation be
  done. That's most naturally done as a layer on top of SAX, with a
  parser filter postprocessing the output of some other SAX parser.

Other than having different default namespace-processing modes, the
practical difference is primarily availability: many implementations
ensure that a JAXP system default is always accessible, but they haven't
paid the same attention to providing the default SAX2 parser. (Current
versions of the SAX2 classes make that easier, but you might not be using
such versions.)

The code to use the JAXP bootstrap API to get a SAX2 parser looks like
this:

import org.xml.sax.*;
import javax.xml.parsers.*;

XMLReader parser;

try {
    SAXParserFactory factory;

    factory = SAXParserFactory.newInstance ();
    factory.setNamespaceAware (true);
    parser = factory.newSAXParser ().getXMLReader ();
    // success!

} catch (FactoryConfigurationError err) {
    System.err.println ("can't create JAXP SAXParserFactory, "
        + err.getMessage ());
} catch (ParserConfigurationException err) {
    System.err.println ("can't create XMLReader with namespaces, "
        + err.getMessage ());
} catch (SAXException err) {
    System.err.println ("Hmm, SAXException, " + err.getMessage ());
}

Rather than calling newInstance(), you can hardcode the constructor for a
particular factory, probably using one of the classes listed in
Table 3-2. It's better to keep implementation preferences as
configuration issues though, and not hardwire them into source code. For
situations where you may have several parsers in your class path (or a
tree of class loaders, as found in many recent servlet engines), JAXP
offers several methods to configure such preferences. You can associate
the factory class name value with the key
javax.xml.parsers.SAXParserFactory by using the key to name a system
property (which sets the default parser for your JVM instance) or by
putting it in the $JAVA_HOME/jre/lib/jaxp.properties property file (which
sets the default policy for that JVM implementation). I prefer the
jaxp.properties solution; with the other method the default parser is a
function of your class path settings and even the names assigned to
various JAR files. You can also embed this preference in your
application's JAR files as a META-INF/services/... file, but that
solution is similarly sensitive to class loader configuration issues.

Table 3-2. JAXP SAXParserFactory implementation classes

JAXP factory    Class name
Ælfred          gnu.xml.aelfred2.JAXPFactory
Crimson         org.apache.crimson.jaxp.SAXParserFactoryImpl
Xerces          org.apache.xerces.jaxp.SAXParserFactoryImpl

If you're using JAXP to bootstrap a SAX2 parser, rather than the SAX2
APIs, the default setting for namespace processing is different: JAXP
parsers don't process namespaces by default, while SAX2 parsers do.
SAX2 normally removes all xmlns* attributes, reports namespace scope
events, and may hide the namespace prefixes actually used by element
and attribute names. JAXP does none of that unless you make it; in fact,
the default parser mode for some current implementations is the "illegal"
SAX2 mode described in the previous chapter. The example code in this
section made the JAXP factory follow SAX2 defaults.
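
You can check the difference in defaults for yourself by asking the parser for its namespaces feature flag. The following sketch uses only standard JAXP and SAX2 calls; the class name JaxpDefaults and its two helper methods are hypothetical, and the feature values you see depend on the JAXP implementation in use.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class JaxpDefaults {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // What a freshly created JAXP factory says before any configuration.
    static boolean factoryDefault() {
        return SAXParserFactory.newInstance().isNamespaceAware();
    }

    // The SAX2 namespaces flag of a parser built with the given factory setting.
    static boolean namespacesFeature(boolean aware) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(aware);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            return parser.getFeature(NAMESPACES);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If namespacesFeature(false) reports false, code written against SAX2 defaults (namespace URIs and local names in startElement(), prefix mapping events) won't behave as expected until you call setNamespaceAware(true).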

This book encourages you to use SAX2 directly, rather than through the
JAXP factory mechanism. Even if JAXP is available, it's more complex to
use. Also, the resulting parser is configured differently, so many of the
examples in this book would break.

Configuring XMLReader Behavior
A configuration mechanism was one of the key features added in the
SAX2 release. Parsers can support extensible sets of named Boolean
feature flags and property objects. These function in similar ways,
including using URIs to identify any number of features and properties.
The exception model, presented in Chapter 2 in the section "SAX2 Feature
Flags," is used to distinguish the three basic types of feature or
property: the current value may be read-only, read/write, or undefined.
Some flags and properties may have rules about when they can be changed
(typically not while parsing) or read.

Applications access property objects and feature flags through get*() and
set*() methods and use URIs to identify the characteristic of interest.
Since SAX does not provide a way to enumerate such URIs as supported
by a parser, you will need to rely on parser documentation, or the tables
in this section, to identify the legal identifiers. (Or consult the
source code, if you have access to it.)
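
Since there is no enumeration call, probing is the only programmatic way to find out how a parser treats a given identifier. The sketch below classifies a feature flag using the exception model; the class name FeatureProbe, its methods, and the example.com identifier in the usage note are hypothetical. It reports "read-only" for any flag that refuses the change, which may also mean the flag just can't be changed right now.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {
    // Classifies a feature flag as "undefined", "read-only", or "read/write".
    static String classify(XMLReader parser, String uri) {
        boolean current;
        try {
            current = parser.getFeature(uri);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return "undefined";
        }
        try {
            parser.setFeature(uri, !current);   // try flipping it ...
            parser.setFeature(uri, current);    // ... and put it back
            return "read/write";
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return "read-only";
        }
    }

    // Convenience wrapper that probes whatever parser JAXP hands out.
    static String classifyDefault(String uri) {
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return classify(parser, uri);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Probing a made-up identifier such as http://example.com/features/no-such-flag should report "undefined"; what a standard flag reports depends on the parser.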

If you happen to be defining new handlers or features using the SAX2
framework, you don't have to ask for permission to define new property
or feature flag IDs. Since they are identified using URIs, just start
your ID with a base URI that you control. (Only the SAX maintainers would
start with the http://xml.org/sax/ URI, for example.) Typically, it will
be easiest to make up some HTTP URL based on a fully qualified domain
name that you control. As with namespace URIs, these are used purely as
identifiers rather than as locations from which data would be retrieved.
(The "I" in URI stands for "identifier.")

XMLReader Properties
SAX2 defines two XMLReader calls for accessing named property objects.
One of the most common uses for such objects is to install non-core event
handlers. Accessing properties is like accessing feature flags, except
that the values associated with these names are objects rather than
Booleans:

XMLReader producer = ...;
String uri = ...;
Object value = ...;

// Try getting and setting the property
try {
    System.out.println ("Initial property setting: "
        + producer.getProperty (uri));
    // if we get here, the property is supported

    producer.setProperty (uri, value);
    // if we get here, the parser set the property

} catch (SAXNotSupportedException e) {
    // bad value for property ... maybe wrong type, or parser state
    System.out.println ("Can't set property: "
        + e.getMessage ());
    System.exit (1);

} catch (SAXNotRecognizedException e) {
    // property not supported by this parser
    System.out.println ("Doesn't understand property: "
        + e.getMessage ());
    System.exit (1);
}

You'll notice the URIs for these standard properties happen to have a
common prefix. This means that you can declare the prefix
(http://xml.org/sax/properties/) as a constant string and construct the
identifiers by string concatenation.

Her e ar e the standard properties:
http://xml.or g/sax/properties/declaration-handler
This property holds an implementation of or g.xml.sax.ext.DeclHan-
dler, used for reporting the DTD declarations that arent reported
thr ough or g.xml.sax.DTDHandler callbacks or for the root element
name declaration, or g.xml.sax.ext.LexicalHandler callbacks. This han-
dler is presented in the section The DeclHandler Interface.
lfr ed, Crimson, and Xerces support this property. In fact, all JAXP-
compliant processors must do so.
http://xml.org/sax/properties/dom-node
Only specialized parsers will support this property: parsers that
traverse DOM document nodes to produce streams of corresponding
SAX events. (Typical SAX2 parsers parse XML text instead of DOM
content.) When read, this property returns the DOM node corresponding
to the current SAX2 callback. The property can only be written
before a parse, to specify that the DOM node beginning and ending
the SAX event stream need not be an org.w3c.dom.Document. This
type of parser is presented later in this chapter, in the section
"DOM-to-SAX Event Production (and DOM4J, JDOM)."
One example of such a parser is gnu.xml.util.DomParser, which is
currently packaged along with the Ælfred parser. At this time, neither
Crimson nor Xerces includes such functionality.
http://xml.org/sax/properties/lexical-handler
This property holds an implementation of org.xml.sax.ext.LexicalHandler,
used for reporting various events mostly (but not exclusively)
relating to details of XML text that have no semantic or structural
meaning, such as comments. This handler is presented in Chapter 4 in
the section "The LexicalHandler Interface."
Ælfred, Crimson, and Xerces support this property. In fact, all
JAXP-compliant processors must do so.
http://xml.org/sax/properties/xml-string
This property returns a literal string of characters associated with the
current parser callback event. Exactly which characters are returned
isn't specified by SAX2. An example would be returning all the
characters in the start tag of an element, including unexpanded entity and
character references as well as excess whitespace and the exact type
of quote characters (single, double) used to delimit attribute values.
(This feature is intended to be of use when constructing certain kinds
of XML editors, or DTD analyzers, that are willing to re-parse this
data.)
No widely available open source SAX2 parser currently supports this
property.
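As an illustration of handler-valued properties, here is a sketch that registers a LexicalHandler and collects comments. It assumes the JAXP parser bundled with a current JDK and the DefaultHandler2 convenience class from org.xml.sax.ext; the CommentCollector class and its collect() method are made-up names for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; CommentCollector is not part of SAX.
final class CommentCollector extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();

    // LexicalHandler callback; only delivered if the property was accepted
    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // Parses the XML text and returns the comments it contained.
    static List<String> collect(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CommentCollector handler = new CommentCollector();

        reader.setContentHandler(handler);
        // throws SAXNotRecognizedException if the parser lacks the property
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.comments;
    }
}
```

Note that the handler is passed as the property value, an Object, which is why properties rather than feature flags are used to register extension handlers.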
Applications may find it useful to define their own types of handler
interfaces, assembling sequences of SAX event "atoms" into higher-level
event "molecules" that incorporate essential application-level semantics
(and probably some procedural validation). W3C's XML schema processing
follows the same kind of model: the Post-Schema-Validation Infoset (PSVI)
additions incorporate semantics suited to processing with that kind of
schema. Most applications need to associate even more semantics with data
than are easily captured by such simple rules (including DTDs and all
types of schema). Those semantics would likely not be understood by any
common XMLReader, but other kinds of SAX processing components can help
manage such application-level handlers.
You can see an example of this technique in Example 6-3.
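To make the atoms-to-molecules idea concrete, here is a small sketch; it is not the book's Example 6-3. The ItemHandler interface, the ItemAssembler class, and the item element with its name attribute are all invented for illustration, and nested item elements aren't handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical application-level callback: one call per complete <item>.
interface ItemHandler {
    void item(String name, String text);
}

// Assembles startElement/characters/endElement "atoms" into item "molecules".
final class ItemAssembler extends DefaultHandler {
    private final ItemHandler target;
    private final StringBuilder text = new StringBuilder();
    private boolean inItem;
    private String name;

    ItemAssembler(ItemHandler target) { this.target = target; }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if ("item".equals(qName)) {
            inItem = true;
            name = atts.getValue("name");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inItem) text.append(ch, start, length);   // may arrive in pieces
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("item".equals(qName)) {
            inItem = false;
            target.item(name, text.toString());       // emit the "molecule"
        }
    }

    // Parses the XML text and lists each item as "name=text".
    static List<String> demo(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new ItemAssembler((n, t) -> seen.add(n + "=" + t)));
        return seen;
    }
}
```

The application code behind ItemHandler never sees SAX callbacks at all, which is the point: the assembler is where procedural validation and application semantics can be layered in.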
XMLReader Feature Flags
The previous chapter showed how to access feature flags from SAX
parsers and used the standard validation flag as the primary example.
Accessing feature flags follows the same model as accessing properties,
except the values are boolean, not Object. There are a handful of
standard SAX2 feature flags, which are all you normally need. The namespace
for features is different from the namespace for properties. You can't set a
property to a java.lang.Boolean value and expect to have the same effect
as setting the feature flag that happens to use the same identifier.
As with properties, the URIs for these standard feature flags happen to
have a common prefix: http://xml.org/sax/features/. It's good programming
practice to declare the prefix as a constant and construct these feature
identifiers by string concatenation, helping reduce errors. Also, remember
that flags aren't necessarily either settable (read/write)* or readable
(supported); some parsers won't recognize all these flags, and in some cases
these flags expose parser behaviors that don't change.
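Because support varies from parser to parser, it can help to probe flags before relying on them. The following sketch assumes the JAXP parser bundled with the JDK; the FeatureProbe class and its probe() method are illustrative names, not part of SAX.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper for reporting which feature flags a parser knows.
final class FeatureProbe {
    static final String FEATURES = "http://xml.org/sax/features/";

    // Returns "true", "false", "unrecognized", or "unsupported".
    static String probe(XMLReader reader, String name) {
        try {
            return String.valueOf(reader.getFeature(FEATURES + name));
        } catch (SAXNotRecognizedException e) {
            return "unrecognized";     // parser doesn't know this flag
        } catch (SAXNotSupportedException e) {
            return "unsupported";      // known, but not readable right now
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        for (String name : new String[] {
                "namespaces", "namespace-prefixes", "validation",
                "string-interning", "no-such-flag" }) {
            System.out.println(name + ": " + probe(reader, name));
        }
    }
}
```

The values printed depend on the parser and on how it was created, so treat the output as a description of one configuration rather than of SAX2 in general.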
The standard flags are as follows:
http://xml.org/sax/features/external-general-entities
The default value for this flag is parser-specific. When the parser is
validating, and in most other cases, the flag is true, indicating that the
parser reads all external entities used outside the DTD. When the flag
is false, the XML parser won't expand references to external general
entities, so applications won't see the entire body of documents using
such entities. This value can't be changed during parsing.
Crimson and Xerces only support true for this flag. (For such
parsers, you can get most of the effect of setting this flag to false by
using an EntityResolver that returns zero-length entities after the first
startElement() event.) Ælfred supports changing the value of this
flag.
http://xml.org/sax/features/external-parameter-entities
The default value for this flag is parser-specific. When the parser is
validating, and in most other cases, the flag is true, indicating the
DTD will be completely processed. When the flag is false, the XML
parser will skip any external DTD subset, as well as named external
parameter entities, so it won't necessarily read the entire DTD for a
document. This value can't be changed during parsing.
* SAX could support write-only flags too, but these are rarely a good idea.
Skipping these entities means attributes declared in them will not be
defaulted or normalized as expected, and their types won't be known.
As a result, default namespace declarations may get dropped. Parts of
the internal subset after a reference to a skipped external parameter
entity will be ignored. It also means some general entities might not
be declared, making it impossible to correctly distinguish whether
references to undefined entities are well-formedness errors.
Normally, you are better off providing an entity resolver that accesses
locally cached copies of your DTD components, or not using DTDs,
rather than disabling processing of external parameter entities. But
don't assume all the XML you work with will have these DTD entities
processed; the XML processors in some web browsers will not read
these entities by default.
Xerces and Crimson only support true for this property. (For such
parsers, you can get an effect similar to setting this to false by using
an EntityResolver that returns zero-length entities before the first
startElement() event. The parser won't correctly ignore declarations
found later in the DTD.) Ælfred supports changing the value of this
property.
http://xml.org/sax/features/is-standalone
This feature flag derives its value from the document being parsed, so
it is read-only and only available after the first part of the document
has been parsed. When the flag is true, the document has been
declared to be standalone. If that declaration is correct, then all
external entities may be safely ignored. The standalone declaration is
part of XML 1.0 and is intended to reduce the cost of parsing some
documents.
This flag should be part of an upcoming SAX extensions release.
http://xml.org/sax/features/lexical-handler/parameter-entities
The default value for this flag is parser-specific and is implicitly false if
the parser doesn't support the LexicalHandler through a parser
property. When the flag is true, the parser will report the beginning and
end of parameter entities through LexicalHandler calls. (Skipped
parameter entities are always reported, through the appropriate
ContentHandler call.) Parameter entities are distinguished from general
entities because the first character of their entity name will be a
percent sign (%). The value can't be changed during parsing.
Currently, only the Ælfred parser reports parameter entities.
http://xml.org/sax/features/namespaces
This flag defaults to true in XML parsers, which indicates the parser
performs namespace processing, reporting xmlns attributes by
startPrefixMapping() and endPrefixMapping() calls and providing
namespace URIs for each element or attribute. Otherwise no such
processing is done at the parser level. This can't be changed during parsing.
You will leave this flag at its default setting unless your XML
documents aren't guaranteed to conform to the XML Namespaces
specification. Setting this to false usually gives some degree of parsing
speed improvement, although it will likely not provide a significant impact
on overall application performance. If you disable namespaces, make
sure you first enable the namespace-prefixes feature.
This is supported by all SAX2 XML parsers. Ælfred, Crimson, and
Xerces support changing the value of this property.
http://xml.org/sax/features/namespace-prefixes
This flag defaults to false in XML parsers, indicating the parser will not
present xmlns* attributes in its startElement() callbacks. Unless the
flag is true, parsers won't portably present the qualified names (which
include the prefix) used in an XML document for elements or
attributes. The value can't be changed during parsing.
If you want to see the namespace prefixes for any reason, including
for generating output without further postprocessing or for performing
layered DTD validation, make sure this flag is set. Also make sure this
flag is set if you completely disable namespace processing (with the
namespaces feature flag), because otherwise the behavior of a SAX2
parser is undefined.
This is supported by all SAX2 parsers. Ælfred, Crimson, and Xerces
support changing the value of this property.
http://xml.org/sax/features/string-interning
The default value for this flag is parser-specific. When true, this
indicates that all XML name strings (except those inside attribute values)
and namespace URIs returned by this parser will have been interned
using String.intern(). Some kind of interning is almost always done
to improve the performance of parsers, and this flag exposes this
work for the benefit of applications. This value can't be changed
during parsing.
When applications know interning has been done, they know they
can rely on fast, identity-based tests for string equality (== or !=)
rather than the more expensive String.equals() method. Using
equality testing for strings will always work, but it can be much
slower than identity testing. Java automatically interns all string
constants. Lots of startElement() processing needs to match element and
attribute name strings (as sketched in Example 2-8), so this kind of
optimization can often be a win.
Ælfred interns all strings. Some older versions of Crimson don't
recognize this flag, but all versions should correctly intern those strings.
Xerces reports that it does not intern these strings.
http://xml.org/sax/features/validation
The default value for this flag is parser-specific; in most cases it is
false. When the flag is true, the parser is performing XML validation
(with a DTD, unless you've requested otherwise). When the flag is
false, the parser isn't validating. The value can't be changed while
parsing.
Ælfred (when packaged with its optional validator), Crimson, and
Xerces support both settings. Ælfred also includes a nonvalidating
parser, which supports only false for this flag.
A few additional standard extension features will likely be defined,
providing even more complete Infoset support from SAX2 XML parsers.
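The interaction between the namespaces and namespace-prefixes flags is easier to see in code. This sketch assumes the JDK's bundled JAXP parser, which supports both settings of each flag; note that a parser obtained through SAXParserFactory isn't namespace-aware unless you ask, so the sketch sets both flags explicitly. PrefixDemo and rootAttributeCount() are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of how namespace-prefixes changes startElement() data.
final class PrefixDemo {
    static final String FEATURES = "http://xml.org/sax/features/";

    // Counts the attributes reported for the root element.
    static int rootAttributeCount(String xml, boolean prefixes)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // set both flags explicitly, before parsing starts
        reader.setFeature(FEATURES + "namespaces", true);
        reader.setFeature(FEATURES + "namespace-prefixes", prefixes);

        final int[] count = { -1 };
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (count[0] < 0) count[0] = atts.getLength();
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }
}
```

For a root element such as <p:a xmlns:p='urn:x' p:b='1'/>, the xmlns:p declaration should show up as an attribute only when namespace-prefixes is true; with the flag false, it is reported solely through startPrefixMapping().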
Of the widely available parsers, only Xerces has nonstandard feature flags.
(The Xerces distribution includes full documentation for those flags.) As a
rule, avoid most of these, because they are parser-specific and even
version-specific. Some are used to disable warnings about extra definitions
that aren't errors. (Most parsers don't bother reporting such nonerrors;
Xerces reports them by default.) Others promote noncompliant XML
validation semantics. Here are a few flags that you may want to use.
http://apache.org/xml/features/validation/schema
This tells the parser to validate with W3C-style schemas. The
document needs to identify a schema, and the parser must have
namespaces and validation enabled. (Defaults to false.)
W3C XML schema validation does not need to be built into XML
parsers. In fact, most currently available schema validators are layered.
http://apache.org/xml/features/validation/schema-full-checking
This flag controls whether W3C schema validation involves all the
specified tests. By default, some of the more expensive checks are
not performed; Xerces is not fully conforming by default.
http://apache.org/xml/features/allow-java-encodings
This flag defaults to false, limiting the encodings that the parser
accepts to a handful. When the flag is set to true, more encoding
names are supported. Most other SAX2 parsers effectively have true as
their default. A few of those additional encoding names are
Java-specific (such as UTF8); most of them are standard encoding names,
either the preferred version or recognized alternatives.
http://apache.org/xml/features/continue-after-fatal-error
When set, this flag permits Xerces to continue parsing after it invokes
ErrorHandler.fatalError() to report a nonrecoverable error. If the
error handler doesn't abort parsing by throwing an exception, Xerces
will continue. The XML specification requires that no more event data
be reported after fatal errors, but it allows additional errors to be
reported. (Of course, depending on the initial error, many of the
subsequent reports might be nonsense.)
The EntityResolver Interface
As mentioned earlier, this interface is used when a parser needs to access
and parse external entities in the DTD or document content. It is not used
to access the document entity itself. Cases where an EntityResolver should
be used include:
• When more local copies of entity data should be used. Such copies
might be from a local filesystem or from a smart caching proxy. A
normal web server may be unavailable or may only be accessible through
a slow or congested network link; such remote access can cause
application slowdowns and failures. This is generically called catalog
or cache processing.
• When the entity's systemId uses a URI scheme that is not understood
by the underlying JVM. Built-in schemes usually include http://, file://,
ftp://, and increasingly https://. Schemes not supported by the JVM
include urn: and application-specific schemes. (You may need to put
such URI schemes into publicID values, in order to prevent problems
resolving relative URIs.)
• When entities need to be constructed dynamically, or not through the
standard URI resolution scheme. For example, entity text might be the
result of a query through some user interface or another computation.
• When the XML source text doesn't provide usable URIs. SGML-style
systems sometimes use system identifiers that aren't really URIs; they
might be relative to some base URI other than the base URI of the
appropriate entity (document or DTD). Avoid this practice for
XML-based systems; it's not very interoperable because most XML
processors strongly expect system IDs in XML documents to be valid URIs,
relative to the actual base URI of their declaration.
Applications that handle documents with DTDs should plan to use an
EntityResolver so they work robustly in the face of partial network failures,
and so they avoid placing excessive loads on remote servers. That is, they
should try to access local copies of DTD data even when the document
specifies a remote one. There are many examples of sloppily written
applications that broke when a remote system administrator moved a DTD
file. Examples range from purely informative services like most RSS feeds
to fee-based services like some news syndication protocols.
You can implement a useful resolver with a data structure as simple as a
hash table that maps identifiers to URIs. There is normally no reason to
have different parsers use different entity resolvers; documents shouldn't
use the same public or (absolute) system identifiers to denote different
entities. You'll normally just have one resolver, and it could adaptively
cache entities if you like.
More complex catalog facilities may be used by applications that follow
the SGML convention that public identifiers are Formal Public Identifiers
(FPIs). FPIs serve the role that Uniform Resource Names (URNs) serve for
Internet-oriented systems. Such mappings can also be used with URIs, if
the entity text associated with URIs is as stable as an FPI. (Such stability is
one of the goals of URNs.)
Applications pass objects that implement the EntityResolver interface to the
XMLReader.setEntityResolver() method. The parser will then use the
resolver with all external parsed entities. The EntityResolver interface has
only one method, which can throw a java.io.IOException as well as the
org.xml.sax.SAXException most other callbacks throw.
InputSource resolveEntity(String publicId, String systemId)
Parsers invoke this method to map entity identifiers either to other
identifiers or to data that they will parse. See the discussion in the
section "The InputSource Class," earlier in this chapter, for information
about how the InputSource class is used. If null is returned, then
the parser will resolve the systemId without additional assistance. To
avoid parsing an entity, return a value that encapsulates a zero-length
text entity.
The systemId will always be present and will be a fully resolved URI.
The publicId may be null. If it's not null, it will have been
normalized by mapping sequences of consecutive whitespace characters to a
single space character.
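The zero-length entity trick can be sketched as follows. This assumes the JDK's bundled JAXP parser; the SkipEntities class, its EMPTY resolver, and the text() helper are names invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: substitute empty text for every external entity.
final class SkipEntities {
    // Resolver that replaces each external entity with zero-length text.
    static final EntityResolver EMPTY = (publicId, systemId) -> {
        InputSource empty = new InputSource(new StringReader(""));
        empty.setSystemId(systemId);   // keep the original identifier
        return empty;
    };

    // Parses the XML text and returns the character content reported.
    static String text(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        StringBuilder text = new StringBuilder();

        reader.setEntityResolver(EMPTY);
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

Since the resolver supplies the entity text itself, the parser has no reason to open a network connection for the entity's system identifier. A real application would normally return empty text only for the entities it wants to skip and null (or a local copy) for the rest.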
Example 3-3 shows a simple resolver that does three things: it substitutes
for a web-based time service running on the local machine, it interprets a
private URI scheme, and it maps public identifiers to alternative URIs
using a dictionary that's externally maintained somehow. (For example, you
might prime a hashtable with the public IDs for the XHTML 1.0, XHTML 1.1,
and DocBook 4.0 XML DTDs to point to local files.) It delegates to another
resolver for other cases.
Example 3-3. Entity resolver, with chaining
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Date;
import java.util.Dictionary;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class MyResolver implements EntityResolver
{
    private EntityResolver next;
    private Dictionary map;

    // n -- optional resolver to consult on failure
    // m -- mapping public ids to preferred URLs
    public MyResolver (EntityResolver n, Dictionary m)
        { next = n; map = m; }

    public InputSource resolveEntity (String publicId, String systemId)
    throws SAXException, IOException
    {
        // magic URL?
        if ("http://localhost/xml/date".equals (systemId)) {
            InputSource retval = new InputSource (systemId);
            Reader date;
            date = new StringReader (new Date().toString ());
            retval.setCharacterStream (date);
            return retval;
        }
        // nonstandard URI scheme?
        // (Storage is application-specific; it isn't defined here)
        if (systemId.startsWith ("blob:")) {
            InputSource retval = new InputSource (systemId);
            String key = systemId.substring (5);
            byte data [] = Storage.keyToBlob (key);
            retval.setByteStream (new ByteArrayInputStream (data));
            return retval;
        }
        // use table to map public id to local URL?
        if (map != null && publicId != null) {
            String url = (String) map.get (publicId);
            if (url != null)
                return new InputSource (url);
        }
        // chain to next resolver?
        if (next != null)
            return next.resolveEntity (publicId, systemId);
        return null;
    }
}
Traditionally, public identifiers are mainly used as keys to find local copies
of entities. In SGML, system identifiers were optional and system-specific,
so public identifiers were sometimes the only ones available. (XML
changed this: system identifiers are mandatory and are URIs.) In essence,
public identifiers were used in SGML to serve the role that URNs serve in
web-oriented architectures. An ISO standard for FPIs exists, and now
RFC 3151 (available at http://www.ietf.org/rfc/rfc3151.txt) defines a
mapping from FPIs to URNs. (The FPI is normalized and transformed, then
gets a urn:publicid: prefix.) When public identifiers are used with XML
systems, it's largely by adopting FPI policies to interoperate with such
SGML systems; however, XML public identifiers don't need to be FPIs. You
may prefer to use URN schemes in newer systems. If so, be aware that
some XML processing engines support only URLs as system identifiers. By
letting applications interpret public IDs as URNs, SAX offers more power
than some other XML APIs do.
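Here is a rough sketch of that normalize-and-transform step. The character mappings below are written from memory of the transcription table in RFC 3151, so check them against the RFC before depending on them; the PublicIdUrn class and toUrn() method are illustrative names.

```java
// Hypothetical helper; the class and method names are illustrative.
final class PublicIdUrn {
    // Transcribes a public identifier into a urn:publicid: URN.
    // Mappings assumed from RFC 3151: "//" -> ":", "::" -> ";",
    // space -> "+", and a few reserved characters -> %XX escapes.
    static String toUrn(String publicId) {
        // normalize: trim, and collapse whitespace runs to one space
        String fpi = publicId.trim().replaceAll("[ \\t\\r\\n]+", " ");
        StringBuilder urn = new StringBuilder("urn:publicid:");

        for (int i = 0; i < fpi.length(); i++) {
            char c = fpi.charAt(i);
            boolean doubled =
                i + 1 < fpi.length() && fpi.charAt(i + 1) == c;

            if (c == '/' && doubled) {
                urn.append(':'); i++;          // "//" separates FPI parts
            } else if (c == ':' && doubled) {
                urn.append(';'); i++;          // "::" becomes ";"
            } else {
                switch (c) {
                    case ' ':  urn.append('+');   break;
                    case '+':  urn.append("%2B"); break;
                    case ':':  urn.append("%3A"); break;
                    case '/':  urn.append("%2F"); break;
                    case ';':  urn.append("%3B"); break;
                    case '\'': urn.append("%27"); break;
                    case '?':  urn.append("%3F"); break;
                    case '#':  urn.append("%23"); break;
                    case '%':  urn.append("%25"); break;
                    default:   urn.append(c);
                }
            }
        }
        return urn.toString();
    }
}
```

An entity resolver could apply such a mapping to the publicId it receives and then look the resulting URN up in its table, which is one way of letting applications interpret public IDs as URNs.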
If you want richer catalog-style functionality than the table mapping
shown earlier, look for open source implementations of the XML version
of the OASIS SGML/Open Catalog (SOCAT). At this time, a specification
for such a catalog is a stable draft, still in development; see
http://www.oasis.org/committees/entity/ for more information. This
specification defines an XML text representation of mappings; the mappings
can be significantly more complex than the tabular one shown earlier.
Other Kinds of SAX2 Event Producers
Normally, an XMLReader turns XML text into SAX event callbacks. This
book encourages you to think of those event consumer callbacks as the
most important part of the process, so using XML text as input is just one
option for feeding those consumers.
For example, some SAX parsers have turned HTML text into SAX
callbacks; there have even been SAX wrappers around the limited
javax.swing.text.html parser. These wrappers can help migrate to XHTML,
first by making sure tags are properly formed, paired, and nested, then by
helping make the XHTML be valid so more tools can work with it.
Malformed HTML is a huge problem; there's lots of brain-dead HTML text on
the Web.* In practice, no generally available SAX HTML parser is quite
good enough to substitute for tools like HTML Tidy (see
http://tidy.sourceforge.net) combined with manual fixup for problem cases,
but that could change.
DOM-to-SAX Event Production (and DOM4J, JDOM)
It's so typical to want to turn a DOM node into a series of SAX events that
SAX2 defined a standard way to do this. Several of the projects that claim
to improve on DOM by being more Java-friendly, such as DOM4J and
JDOM, have similar functionality.
In conjunction with any sort of SAX text output API (such as an
XMLWriter), this technique is an easy way to turn a DOM tree into text.
Utilities to turn a DOM node into text all need to do more or less the same
thing: traverse the tree and emit the right sort of text. Using SAX (and SAX
utilities) you can do this without needing support for any optional DOM
Level 3 modules and without relying on any vendor-specific DOM
extensions. (It's also a fine technique to use when you need a debugging
snapshot and can't afford the memory needed to deep-clone a DOM
document.)
* One early browser development policy was that "there's no such thing as
broken HTML," so parsers needed to accept pretty much everything. The
policy helped simplify content creation when there were few tools beyond
text editors, but it also led to serious problems with browser
incompatibility which are only now beginning to go away. It's also helped
spread tools fostering malformed HTML (including flaky CGI scripts) and
made it harder to present HTML on low-cost systems (it takes a fat parser
to handle even a fraction of the different kinds of broken HTML).
The draconian error-handling policy of the XML specification (if it's not
well formed, it must be rejected) was a reaction to those problems: XML
parsers don't need to compete on how well they can make sense of garbage
input. It was added at the request of the main browser vendors, which were
then Netscape and Microsoft. This policy makes it a lot easier to create
tools to process XML text, including presentation tools (XHTML browsing)
that can even work on limited resource systems (such as PDAs or cell
phones), content management tools, and screen scrapers for mining XHTML
presentation text (to repurpose the data shown there).
Of course, any other processing can be done too, such as validating the
output. After initializing and connecting an appropriate event producer,
consumer-side validator, and ErrorHandler, just produce the events and
watch for reports of validity errors. In some cases (as with DOM-to-SAX
converters), you can look at individual element subtrees; in other cases,
you'll need to examine entire documents.
Turning DOM trees into SAX events
To turn a DOM node into SAX events, you'll need to use a special parser
class; normal SAX parsers require text as input and won't know the first
thing about DOM. If it's a Level 2 DOM and is using namespace support,
you'll probably need to manually patch up the namespace data, since
DOM isn't guaranteed to maintain it. Patching can be done before or after
you generate SAX events; I prefer to use a single, generic SAX2 processing
component to handle namespace fixups no matter where the problem
arose, since DOM isn't the only culprit. Given such a parser class (the
GNU version is used here), your code will look like this:
import gnu.xml.util.DomParser;
import gnu.xml.util.XMLWriter;
import org.w3c.dom.Node;
import org.xml.sax.ContentHandler;
import org.xml.sax.XMLReader;
XMLReader parser;
Node node = ...;
ContentHandler contentHandler = new XMLWriter (System.out);
parser = new DomParser ();
parser.setContentHandler (contentHandler);
// you may also set DTDHandler, LexicalHandler, and DeclHandler
parser.setProperty ("http://xml.org/sax/properties/dom-node", node);
parser.parse ("dom-node value gets parsed");
Neither Crimson nor Xerces currently includes support for such
DOM-to-SAX transformations.
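If you don't have a DOM-walking parser class at hand, one alternative (not the dom-node property mechanism described above) is to push a DOM node through a JAXP identity transform whose result is a SAXResult. The sketch below assumes a JDK with the standard javax.xml.transform package; DomToSax and emit() are illustrative names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: replays a DOM node as SAX events using JAXP.
final class DomToSax {
    // Walks the DOM node and delivers SAX callbacks to the handler.
    static void emit(Node node, ContentHandler handler) throws Exception {
        TransformerFactory.newInstance()
            .newTransformer()                    // identity transform
            .transform(new DOMSource(node), new SAXResult(handler));
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = doc.createElement("yoga");
        root.appendChild(doc.createElement("class"));
        doc.appendChild(root);

        emit(doc, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                System.out.println("start " + qName);
            }
        });
    }
}
```

A SAXResult can also carry a LexicalHandler (through setLexicalHandler()), so comments in the DOM tree don't have to be lost along the way.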
Turning DOM4J trees into SAX events
In DOM4J (http://www.dom4j.org), it works like this. The current version
of DOM4J isn't as flexible or complete as a DOM-to-SAX converter, though
it has a few more options than JDOM. See the current release for more
information.
import gnu.xml.util.XMLWriter;
import org.dom4j.Document;
import org.dom4j.io.SAXWriter;
import org.xml.sax.ContentHandler;
SAXWriter parser;
ContentHandler contentHandler = new XMLWriter (System.out);
Document doc = ...;
parser = new SAXWriter ();
parser.setContentHandler (contentHandler);
// you may also set DTDHandler and LexicalHandler
parser.write (doc);
Turning JDOM trees into SAX events
Here's how to do this conversion in JDOM (http://www.jdom.org). As this
is being written, the current version of JDOM doesn't support the level of
flexibility of a DOM-to-SAX parser; it only handles JDOM document
nodes. It also doesn't support LexicalHandler or DeclHandler events.
JDOM could support some of the LexicalHandler events, such as those for
comments and CDATA section boundaries. See the current release for
more information.
import gnu.xml.util.XMLWriter;
import org.jdom.Document;
import org.jdom.output.SAXOutputter;
import org.xml.sax.ContentHandler;
SAXOutputter parser;
ContentHandler contentHandler = new XMLWriter (System.out);
Document doc = ...;
parser = new SAXOutputter (contentHandler);
// you may also set DTDHandler
parser.output (doc);
Push Mode Event Production
Since SAX event handlers are just objects, your application software can
call their methods directly. This is a common technique for application
code that needs to convert data structures to XML: turn them into SAX
event streams for processing by other components. One such component
could be an XMLWriter sending data across the web to a partner, but you
can do other kinds of processing too. Such application code normally has
no reason to be wrapped as an implementation of XMLReader.
When used with in-memory data structures, this is part of what's
sometimes called serialization. Be careful not to confuse this with the more
specialized meaning in Java RMI, where serialization is a binary data
format tied to individual Java classes. Other words used to describe this
kind of process include marshaling, encoding, and pickling. Reversing the
process is an important parallel problem, since most of the time
applications must both produce and consume XML data. That is, most
applications round-trip data, rather than just consuming it or producing it.
This event generation technique is not restricted to data structures that
were originally stored in memory. You can use it with data from
databases, data stored on filesystems, and data entered through user
interfaces. The same general technique is used in all these cases.
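As a small illustration of push mode, the sketch below has application code call ContentHandler methods directly and uses a JAXP TransformerHandler as the consumer that writes XML text. It assumes a JDK with javax.xml.transform; PushDemo, pushTeachers(), and toXml() are names invented for the example, and the element and attribute names are just sample data.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo of push-mode event production.
final class PushDemo {
    // Application code acting as the event producer: no XMLReader involved.
    static void pushTeachers(String[] teachers, ContentHandler out)
            throws SAXException {
        out.startDocument();
        out.startElement("", "yoga", "yoga", new AttributesImpl());
        for (String teacher : teachers) {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "teacher", "teacher", "CDATA", teacher);
            out.startElement("", "class", "class", atts);
            out.endElement("", "class", "class");
        }
        out.endElement("", "yoga", "yoga");
        out.endDocument();
    }

    // Sends the events to a JAXP serializer and returns the XML text.
    static String toXml(String[] teachers) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = factory.newTransformerHandler();
        StringWriter text = new StringWriter();

        serializer.getTransformer()
            .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setResult(new StreamResult(text));
        pushTeachers(teachers, serializer);
        return text.toString();
    }
}
```

Because pushTeachers() only knows about ContentHandler, the same producer can feed a validator, a filter pipeline, or a DOM builder instead of a serializer.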
Turning CSV files into SAX events
Comma Separated Values, or CSV, is a data format that is widely used for
some data interchange problems. Many spreadsheets and databases can
read and write it, and it can be used to publish fairly large databases. It's
one of the more widely understood flat file text formats, and it's not
uncommon to need to translate data from CSV formats into XML. With luck,
the meaning of each field will be documented or maybe obvious from
context. A simple CSV list of some yoga classes might have five fields per
record and look like this:
daniela,4:30-5:45pm,ashtanga,sun,mixed
(staff),10:30am-12:00m,sivanenda,daily,open
philippe,7-9:00pm,ashtanga,mon,mixed
larry,4:30-5:45pm,ashtanga,wed,rocket
mahadevi,6-8:00pm,sivanenda,wed,advanced
savonn,7-8:30pm,vinyasa,wed,2-3
kei,9:30-11am,vinyasa,thu,intermediate
patti,7:30-9pm,iyenegar,thur,1-2
regan,9:30-11am,bikram,fri,open
mark,12m-2pm,ashtanga,sat,mysore
The translation is easier than the parsing of CSV itself. Details like
handling of empty or missing fields, quoted values, and inconsistent value
syntax are messy, and critical when importing lots of data. In fact, it's so
messy that Example 3-4 completely avoids such lexical issues for CSV
input data. (Nonlexical issues should be delegated to XML processing
layers.) The example shows one way to translate; it's packaged more simply
than a real-world application would probably expect. (Making an
XMLReader that emits SAX events is possible and might be convenient.) This
approach turns each CSV record into a single element by using attributes
(with a sneak peek at a helper class we'll see later). It prints the output as
XML text, which is probably not how you'd normally work with such data;
the output is more naturally sent through a processing pipeline.
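To give a feel for the lexical issues that Example 3-4 sidesteps, here is a sketch of a record splitter that copes with quoted values and empty fields. It assumes the common convention that fields may be wrapped in double quotes and that a doubled quote inside a quoted field stands for one quote character; CSV dialects vary, so treat this as one possibility. CsvFields and split() are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one CSV record into its fields.
final class CsvFields {
    // Honors double-quoted fields ("" is an escaped quote) and keeps
    // empty fields.  Assumes the record has no embedded line breaks.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean quoted = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quoted) {
                if (c != '"') {
                    field.append(c);              // commas are data here
                } else if (i + 1 < line.length()
                        && line.charAt(i + 1) == '"') {
                    field.append('"');            // escaped quote
                    i++;
                } else {
                    quoted = false;               // closing quote
                }
            } else if (c == '"') {
                quoted = true;                    // opening quote
            } else if (c == ',') {
                fields.add(field.toString());     // end of field
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());             // last field
        return fields;
    }
}
```

Unlike StringTokenizer, which silently skips empty tokens, this splitter reports an empty string for each empty field, so a count of five still means five columns.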
Example 3-4. Producing SAX2 events from CSV input
import java.io.*;
import java.util.StringTokenizer;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import gnu.xml.util.XMLWriter;
public class csv
{
// stdin = (simple) CSV, stdout = XML
public static void main (String argv [])
{
BufferedReader in;
XMLWriter out;
ErrorHandler errs;
String line;
try {
in = new BufferedReader (new InputStreamReader (System.in));
out = new XMLWriter (System.out);
errs = new DefaultHandler () {
public void fatalError (SAXParseException e) {
System.err.println ("** parse error: "
+ e.getMessage ());
}
};
out.startElement ("", "", "yoga", new AttributesImpl ());
while ((line = in.readLine ()) != null)
parseLine (line, out, errs);
out.endElement ("", "", "yoga");
out.flush ();
} catch (Exception e) {
System.err.println ("** error: " + e.getMessage ());
e.printStackTrace (System.err);
System.exit (1);
}
}
// this doesn't handle quoted strings (with commas inside),
// empty fields, tabs used as delimiters, or column headers.
private static void parseLine (
String line,
ContentHandler out,
ErrorHandler errs
) throws SAXException
{
StringTokenizer tokens = new StringTokenizer (line.trim (), ",");
String values [] = new String [5];
// if there aren't exactly five values, it's malformed
if (tokens.countTokens () != 5) {
errs.fatalError (
new SAXParseException ("expected exactly five values", null));
return;
}
for (int i = 0; i < 5; i++)
values [i] = tokens.nextToken ();
// now that we parsed the line safely, report its contents
// the AttributesImpl class is shown later
AttributesImpl atts = new AttributesImpl ();
atts.addAttribute ("", "", "teacher", "CDATA", values [0]);
atts.addAttribute ("", "", "time", "CDATA", values [1]);
atts.addAttribute ("", "", "type", "CDATA", values [2]);
atts.addAttribute ("", "", "date", "CDATA", values [3]);
atts.addAttribute ("", "", "level", "CDATA", values [4]);
out.ignorableWhitespace ("\n  ".toCharArray (), 0, 3);
out.startElement ("", "", "class", atts);
out.endElement ("", "", "class");
}
}
The output of that program looks somewhat like this:
<yoga>
<class teacher="daniela" time="4:30-5:45pm" type="ashtanga"
date="sun" level="mixed"></class>
<class teacher="(staff)" time="10:30am-12:00m" type="sivanenda"
date="daily" level="open"></class>
<class teacher="philippe" time="7-9:00pm" type="ashtanga"
date="mon" level="mixed"></class>
<class teacher="larry" time="4:30-5:45pm" type="ashtanga"
date="wed" level="rocket"></class>
<class teacher="mahadevi" time="6-8:00pm" type="sivanenda"
date="wed" level="advanced"></class>
<class teacher="savonn" time="7-8:30pm" type="vinyasa"
date="wed" level="2-3"></class>
<class teacher="kei" time="9:30-11am" type="vinyasa"
date="thu" level="intermediate"></class>
<class teacher="patti" time="7:30-9pm" type="iyenegar"
date="thur" level="1-2"></class>
<class teacher="regan" time="9:30-11am" type="bikram"
date="fri" level="open"></class>
<class teacher="mark" time="12m-2pm" type="ashtanga"
date="sat" level="mysore"></class></yoga>
This included some ignorable whitespace to prevent the output from
appearing as one big line of text; enabling pretty printing would do as
well. Notice that the output needed to be flushed, else the JVM would
normally exit with data still buffered in memory. We haven't yet looked at
the endDocument() callback that would normally flush the data. Finally,
notice that handling of any CSV conversion errors is delegated to a SAX
error handler, which in this case adopts a very permissive strategy.
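Example 3-4 sidesteps CSV's lexical problems entirely. As a rough
illustration of what a slightly more tolerant splitter might look like, the
sketch below handles double-quoted values (with embedded commas and doubled
quotes) and empty fields. The CsvFields class and its split() method are
hypothetical names, not part of the example above or of any library, and
the sketch still ignores records that span several lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SAX or of Example 3-4.
class CsvFields {
    // Splits one CSV record into fields. Handles "quoted, values",
    // doubled quotes ("") inside quoted values, and empty fields.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"');   // doubled quote: literal "
                        i++;
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.append(c);         // commas are data in here
                }
            } else if (c == '"') {
                inQuotes = true;               // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());        // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("larry,\"4:30-5:45pm, sharp\",,wed,rocket"));
    }
}
```

Unlike StringTokenizer, which silently merges adjacent delimiters, this
keeps an empty field as an empty string, so the "exactly five values" check
in parseLine() would still be meaningful.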
Turning objects into SAX events
For simple objects, something like the following Address example works.
For a more complex object, such as a purchase order with multiple addresses
for shipping and billing, you'll likely have routines that encode other
data and use routines like this one as subroutines. You won't need to use
any other handler interfaces, though you might want to embed comments or
create CDATA boundaries using a LexicalHandler. Notice that startElement()
calls always have matching endElement() calls, just as if the text was
generated by an XML parser. This example declares and uses namespaces; you
don't need to do that on the producer side if you patch them up later, but
it's a reasonable practice to adopt. As used here, the AttributesImpl class
just creates an empty set of attributes to pass on because null values
can't be used:
static final String nsURI = "http://example.com/xml/address";
void toXML (Address addr, ContentHandler stream)
throws SAXException
{
char temp [];
Attributes atts;
// create an empty set of attributes
atts = new AttributesImpl ();
// <address xmlns="http://example.com/xml/address">
stream.startPrefixMapping ("", nsURI);
stream.startElement (nsURI, "address", "address", atts);
// <street>...</street>
stream.startElement (nsURI, "street", "street", atts);
temp = addr.getStreet ().toCharArray ();
stream.characters (temp, 0, temp.length);
stream.endElement (nsURI, "street", "street");
// <city>...</city>
stream.startElement (nsURI, "city", "city", atts);
temp = addr.getCity ().toCharArray ();
stream.characters (temp, 0, temp.length);
stream.endElement (nsURI, "city", "city");
// <country>...</country>
stream.startElement (nsURI, "country", "country", atts);
temp = addr.getCountry ().toCharArray ();
stream.characters (temp, 0, temp.length);
stream.endElement (nsURI, "country", "country");
// ... there would probably be more elements,
// but not all application data in the "Address"
// would be shared with the recipient.
// </address>
stream.endElement (nsURI, "address", "address");
stream.endPrefixMapping ("");
}
If you're printing such output, you might want to add some ignorable
whitespace to keep all the text from appearing on a single line. The
resulting XML text will be easier to read, though having text without line
breaks should not matter otherwise. (Better yet: use an XMLWriter with
pretty-printing support.) If you are working with many namespaces, you may
want to use the NamespaceSupport class (see the section "The
NamespaceSupport Class" in Chapter 5) to track and select the prefixes used
in the element and attribute names you write.
It may also be a good idea to write unmarshaling code (taking such events
and recreating, or looking up, application objects) at the same time you
write marshaling code (like the preceding code, creating SAX events from
application objects). That helps test the code: when round-trip processing
works for many different data items (save a lot of test cases), you know
it's behaving. Unmarshaling code can also be an appropriate place to test
for semantic validity of data: you might have reason to trust that your
current marshaling code is correct, but changes made next year could break
something, and it's not good to expect everyone else will marshal
correctly.
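As a sketch of the unmarshaling side, the handler below rebuilds an address
from the kind of events that toXML() produces and applies one simple
semantic check. The Address class shown here is a hypothetical stand-in
with plain fields (the real application class isn't shown in this chapter),
and AddressUnmarshaler and its parse() helper are illustrative names only.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the application's Address class.
class Address {
    String street, city, country;
}

// Rebuilds an Address from <address>, <street>, <city>, <country> events.
class AddressUnmarshaler extends DefaultHandler {
    static final String NS_URI = "http://example.com/xml/address";

    private final StringBuilder text = new StringBuilder();
    private Address current;
    Address result;

    @Override
    public void startElement(String uri, String lname, String qname,
                             Attributes atts) throws SAXException {
        if (!NS_URI.equals(uri))
            throw new SAXException("unexpected namespace: " + uri);
        if ("address".equals(lname))
            current = new Address();
        text.setLength(0);               // start collecting a fresh value
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        text.append(buf, offset, len);   // may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String lname, String qname)
            throws SAXException {
        if (current == null)
            throw new SAXException("element outside <address>: " + lname);
        switch (lname) {
            case "street":  current.street = text.toString();  break;
            case "city":    current.city = text.toString();    break;
            case "country": current.country = text.toString(); break;
            case "address":
                // semantic check: refuse obviously incomplete data
                if (current.street == null || current.city == null)
                    throw new SAXException("incomplete address");
                result = current;
                current = null;
                break;
            default:
                break;                   // tolerate elements we don't know
        }
    }

    // Convenience for round-trip tests: XML text in, Address out.
    static Address parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            AddressUnmarshaler handler = new AddressUnmarshaler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In a real round-trip test you would connect the marshaling code directly to
this handler (it is a ContentHandler), skipping XML text altogether, and
compare the resulting object with the one you started from.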
Data modeling concerns
As a rule of thumb, avoid assuming that your XML data model ought to match
your application's data structures. Such policies can sometimes be
appropriate, but more often, your application's internal data structures
were optimized for something unrelated to communicating with other
applications. Most systems that automatically marshal and unmarshal data
structures (maybe using reflection in Java) will make such assumptions;
they lead to tightly coupled systems. Tight coupling tends to cause
fragility in the face of system evolution, since upgrades normally occur
incrementally on widely distributed systems (such as almost all web-based
applications).
For example, when you interchange the results of a complex set of queries
from your database (perhaps for a large purchase order), it is typically
appropriate to mask the exact relational structure used in your
application. The recipient of your XML may well have adopted a different
relational normalization. The recipient might not even expect to perform
database operations on such data. Data displays may need to address
usability issues that are completely unrelated to how applications think
about the same data. Similar logic applies when the application data isn't
stored in a database or is only partially stored in one.
On the other hand, if you're using XML to transfer a relation from one
database to another, encoding a java.sql.ResultSet (or CSV table) into a
series of elements (one element per table row, without duplications) may
be exactly the right model. (The reverse transformation would be
unmarshaling: consuming XML to populate a database.) You won't always want
to denormalize, even though the ability to easily do that is one of the
great strengths of using XML to interchange data. Many common messaging
scenarios involve the kind of data model that serves as input to
normalization processes, and are oriented to individual cases, not
aggregates.
When you're encoding individual data items, such as integers, dates, or
binary data encoded using BASE64, you should consider using the datatyping
facilities in Part 2 of the W3C XML Schema specification
(http://www.w3.org/TR/xmlschema-2/). Those simple datatypes are intended to
be used in many specifications. Their association with the particular
schema system described in other parts of the W3C XML Schema specification
can be viewed as a historical accident; you don't need to use W3C schemas
to use these datatypes.
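To make that concrete, here is a minimal sketch of converting a few Java
values to and from the lexical forms those datatypes define (date, int, and
base64Binary). The SimpleTypes class is a hypothetical helper; it relies on
java.time and java.util.Base64, which come from Java releases much newer
than the ones this chapter was written against, and it ignores time zones
and the finer points of the specification.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.util.Base64;

// Hypothetical helper: converts Java values to and from the lexical
// forms that XML Schema Part 2 defines for a few simple datatypes.
class SimpleTypes {
    // date: ISO 8601 calendar date, YYYY-MM-DD (time zone omitted here)
    static String encodeDate(LocalDate date) {
        return date.toString();
    }

    static LocalDate decodeDate(String text) {
        return LocalDate.parse(text.trim());
    }

    // int: optional sign followed by decimal digits
    static String encodeInt(int value) {
        return Integer.toString(value);
    }

    static int decodeInt(String text) {
        return Integer.parseInt(text.trim());
    }

    // base64Binary: the MIME decoder tolerates embedded line breaks
    static String encodeBinary(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] decodeBinary(String text) {
        return Base64.getMimeDecoder().decode(text);
    }

    public static void main(String[] args) {
        System.out.println(encodeDate(LocalDate.of(2002, 1, 7)));
        System.out.println(encodeInt(-42));
        System.out.println(
            encodeBinary("SAX2".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The strings these methods produce are what you'd pass to characters() or
use as attribute values when generating events.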
Producing Well-Formed Event Streams
If you are generating SAX2 events from any event producer that's not an
actual XML parser (maybe by using an HTML parser or code that traverses
data structures), you may need to ensure the event stream is legal before
passing it to other components (maybe by printing it as XML text). There
are issues of well-formedness to think about: startElement() calls need
matching endElement() calls, other calls require similar start/end nesting,
carriage returns are prohibited in line ends, and more. Correct reporting
of namespace information is important: prefixes must be declared and
correctly used. Validity will also be an issue in many contexts as a policy
of eliminating data format errors as early as possible. (It's cheaper to
fix bugs before you ship them in products than afterward, and validation
tools make some bugs easy to find.)
The particular issues you may have depend on what kind of event producer
you use and what kinds of events you generate. DOM streams can easily be
namespace-invalid; for example, prefixes are often undeclared or missing.
Code that generates events directly is particularly prone to violate
element nesting and closure requirements and to omit namespace
declarations. Few tools prevent all kinds of illegal content; ]]> could
appear in CDATA sections, and -- (two hyphens) within comments, both of
which will prevent generation of legal XML text.
With high-quality producer-side code, you'll have fixed all those problems
before the code is released. But you'll still probably want code that
dynamically verifies that there's no problem, to use when debugging or
troubleshooting. If you adopt a good SAX2 event pipeline framework, it can
easily support components that monitor event streams to ensure they meet
those data integrity constraints or, in some cases (like namespaces), patch
event streams so they are correct.
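A monitoring component of that kind can be quite small. The sketch below
checks only element nesting: every endElement() must match the most recent
unclosed startElement(), there must be exactly one root, and no character
data may appear outside it. NestingChecker and its verify() helper are
hypothetical names; it compares qualified names, so it assumes the producer
supplies them, and it doesn't attempt the namespace or content checks
mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical debugging aid: verifies element nesting in an event stream.
class NestingChecker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean sawRoot;

    @Override
    public void startDocument() {
        open.clear();
        sawRoot = false;
    }

    @Override
    public void startElement(String uri, String lname, String qname,
                             Attributes atts) throws SAXException {
        if (open.isEmpty()) {
            if (sawRoot)
                throw new SAXException("second root element: " + qname);
            sawRoot = true;
        }
        open.push(qname);
    }

    @Override
    public void endElement(String uri, String lname, String qname)
            throws SAXException {
        if (open.isEmpty())
            throw new SAXException("unmatched end of element: " + qname);
        String expected = open.pop();
        if (!expected.equals(qname))
            throw new SAXException("expected end of " + expected
                                   + " but got end of " + qname);
    }

    @Override
    public void characters(char[] buf, int offset, int len)
            throws SAXException {
        if (open.isEmpty())
            throw new SAXException("character data outside root element");
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty())
            throw new SAXException("unclosed element: " + open.peek());
        if (!sawRoot)
            throw new SAXException("no root element");
    }

    // Replays "+name" (start) and "-name" (end) shorthand through a
    // checker; returns null if the stream nests properly, else a complaint.
    static String verify(String... events) {
        NestingChecker checker = new NestingChecker();
        try {
            checker.startDocument();
            for (String e : events) {
                String name = e.substring(1);
                if (e.charAt(0) == '+')
                    checker.startElement("", "", name, new AttributesImpl());
                else
                    checker.endElement("", "", name);
            }
            checker.endDocument();
            return null;
        } catch (SAXException ex) {
            return ex.getMessage();
        }
    }
}
```

In a pipeline you would more likely write this as a filter that passes each
event along after checking it, so it can sit between a suspect producer and
its consumer while you debug.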
The XMLFilter Interface
SAX2 added the XMLFilter interface. XMLFilter is just an XMLReader that can
be associated with a parent reader. What's interesting is the expectation
that the parent is producing the events and the filter postprocesses them;
the filter parses and modifies Infoset data, not XML text. From the
perspective of your application code, a filter that you use as an XMLReader
is doing some postprocessing of your parser requests, some processing on
the XML data, then passing you the results; it's a preprocessor for infoset
data.
The XMLFilter interface adds these methods to XMLReader:
void setParent(XMLReader parser)
XMLReader getParent()
The parent of an XMLFilter is accessed using standard JavaBeans
property-naming conventions. Use this property to control which parser (or
filter) generates the events to be filtered.
The role of the XMLFilter implementation is primarily to intercept and
process SAX content events. Because its real work is to process those
events, the code in such a filter is acting as a consumer. Implementing the
XMLReader interface is a facade to make that consumer code look like a pull
API (XMLReader) and let it intercept requests to an underlying parser. That
is, it supports one kind of XML pipeline model.
Since the interesting issues are all on the consumer side, XMLFilter is
discussed later with other kinds of SAX event pipeline models, in the
section "XML Pipelines" in Chapter 4, along with the XMLFilterImpl helper
class.
If you're using these filters as event producers, you'll need to pay
attention to a secondary role of an XMLFilter: intercepting and modifying
parser requests. This kind of filter is a compound object. It consists of
the filter, plus a reader (which might in turn be another filter), handler
bindings, and settings for feature flags and properties. The
interrelationships of these parts can get murky. In simple cases you can
ignore the distinction, treating this type of SAX filter just like another
reader. But in other cases you may need to remember that the filter and its
parent are distinct objects with different behaviors.
For example, sometimes you'll find implementations of XMLFilter that don't
use mechanisms such as the EntityResolver or ErrorHandler. When you need to
use those mechanisms, you'd need to bind such objects to the parent parser.
But most filters pass those objects on to the parent and may even need to
use them internally, so you'd bind them to the filter instead. You'll need
to know which kind of filter you have. In a similar way, if an underlying
parser interns its strings, but the filter changes them (for example,
swapping one namespace URI for another) and doesn't intern those strings,
then code that talks to the filter can't use identity tests to replace the
slower equality tests. The filter would have to expose a different setting
for such feature flags than the parent parser.
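The namespace-swapping case can be sketched with the XMLFilterImpl helper
class. NamespaceSwapFilter below is a hypothetical filter, and the two
example.com URIs are placeholders. It rewrites element namespaces and
prefix mappings only (attribute namespaces are left alone to keep the
sketch short), and it reports the standard string-interning feature as
false, since the strings it substitutes aren't interned.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: replaces one element namespace URI with another.
class NamespaceSwapFilter extends XMLFilterImpl {
    private final String from, to;

    NamespaceSwapFilter(String from, String to) {
        this.from = from;
        this.to = to;
    }

    private String swap(String uri) {
        return from.equals(uri) ? to : uri;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        super.startPrefixMapping(prefix, swap(uri));
    }

    @Override
    public void startElement(String uri, String lname, String qname,
                             Attributes atts) throws SAXException {
        super.startElement(swap(uri), lname, qname, atts);
    }

    @Override
    public void endElement(String uri, String lname, String qname)
            throws SAXException {
        super.endElement(swap(uri), lname, qname);
    }

    // The filter's answer differs from whatever its parent would say.
    @Override
    public boolean getFeature(String name)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if ("http://xml.org/sax/features/string-interning".equals(name))
            return false;
        return super.getFeature(name);
    }

    // Demo: parse through the filter, report the root's namespace URI.
    static String rootNamespace(String xml) {
        final String[] seen = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            NamespaceSwapFilter filter = new NamespaceSwapFilter(
                "http://example.com/old", "http://example.com/new");
            filter.setParent(parser);        // the parser feeds the filter
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String lname,
                                         String qname, Attributes atts) {
                    if (seen[0] == null)
                        seen[0] = uri;
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Notice that the application binds its handler to the filter and calls
parse() on the filter, never on the parent; XMLFilterImpl takes care of
hooking itself up to the parent's callbacks.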
Chapter 4: Consuming SAX2 Events
In this chapter:
More About ContentHandler
The LexicalHandler Interface
Exposing DTD Information
Turning SAX Events into Data Structures
XML Pipelines
Most of the power of SAX is exposed through event callbacks. In previous
chapters you've seen some of the most widely used event callbacks as well
as how to ensure that all the callbacks are generated and reported to
application code.
This chapter presents the rest of the standard SAX event-handling
interfaces (including the extension handlers), then talks about some of the
common ways that event consumers use those interfaces. These interfaces are
primarily implemented by application code that consumes events and needs to
solve particular problems. You might also write custom event producers,
which call these interfaces directly rather than expecting some type of
XMLReader to issue them.
More About ContentHandler
In the section "Basic ContentHandler Events" in Chapter 2, we looked at the
most important APIs used to handle XML document content. Some other APIs
were deferred to this section because they aren't used as widely. Depending
on what problems you're solving, you may rely heavily on some of these
additional methods.
Other ContentHandler Methods
Five ContentHandler callbacks were discussed in Chapter 2: the section
"Essential ContentHandler Callbacks" explained how characters and element
boundaries were reported, and the section "ContentHandler and Prefix
Mappings" explained how namespace-prefix scopes were reported. But the
interface has six other methods. Here's what they do and when you'll want
to use them:
void setDocumentLocator (Locator l)
This is normally the first callback from a parser; the single parameter is
a Locator, discussed later. Strictly speaking, SAX parsers are not required
to provide a locator or to make this callback; however, you'd want to avoid
parsers that don't provide this information. Your implementation of this
callback will normally just save the locator; it can't do much more since
it's the only SAX event callback that can't throw a SAXException:
class MyHandler implements ContentHandler ... {
private Locator locator;
...
public void setDocumentLocator (Locator l)
{ locator = l; }
...
}
Use this object as discussed later in this chapter, in the section "The
Locator Interface." It is the standard way to report the base URI of the
XML text currently being parsed; that information is essential for
resolving relative URIs. It's also essential for diagnostics that tell you
where application code detects errors in large quantities of XML text.
void startDocument ()
void endDocument ()
These two callbacks bracket processing for a document, and they are
normally used to manage application state associated with the document
being parsed. If you're parsing a document, these methods will always be
called once each, even when parsing is cut short by a thrown exception. No
other methods have such guarantees.
startDocument() is always called before any data is reported from the
parser, and is normally used to initialize application data structures. It
will usually be the second callback from the parser; parsers that provide a
Locator will report that first. You can't rely on a setDocumentLocator()
call before startDocument(); structure your initialization code to do the
real work in the callback guaranteed to be available.
endDocument() is always called to report that no more document data will be
provided. The normal application response is to clean up all state
associated with the current parse. The parser closes any input data streams
you gave it using an InputSource (discussed later), so the application
doesn't need to do that. Cleanup would include forgetting any saved Locator
since that object is no longer usable when the parse is complete. Also,
you'd likely close other files or sockets that were opened while processing
this document:
class MyHandler implements ContentHandler ... {
...
public void startDocument ()
throws SAXException
{
// initialize data structures for ALL handlers here
...
}
public void endDocument ()
throws SAXException
{
// free those same data structures
locator = null;
elementStack = null;
...
}
...
}
These two calls are widely used in robust SAX code because they provide
such good hooks to control memory usage and manage associated file
descriptors. However, some SAX2 parsers have a bug that reduces the
robustness offered by SAX; they won't correctly call endDocument() when
parsing is aborted by throwing exceptions.
void processingInstruction (target, data)
Processing Instructions (PIs) are used in XML for data that doesn't obey
the rules of a DTD. They can be placed anywhere in a document, including
within the DTD, except inside other markup constructs like tags. Unlike
comments, PIs are designed for applications to use. They're part of the
document structure that programmatic logic must understand; they can follow
rules, just not ones found in a DTD or schema. This method has two
parameters:
String target
XML applications use this parameter to determine how to handle the PI. You
can rely on the fact that it'll never be the string "xml" (in any
combination of upper- and lowercase characters) because XML and text
declarations are not processing instructions.
Some documents follow the convention that the target of a PI names a
notation (perhaps the fully qualified URI found in its system identifier)
and the meaning is associated with the notation rather than the name.
That's a fine practice to follow, but it isn't essential. Most code just
compares target names as strings, rather than use data reported with
DTDHandler.notationDecl() to figure out what a target name should mean.
String data
This parameter is data associated with the PI, and it may be the null
string if no data was provided after the target name. Some applications use
the syntax of an attribute here; others don't bother.
Processing instructions are natural to use in template systems and other
document-oriented applications.*
Processing instructions are normally safe to ignore when your processing
doesn't recognize them (passing them on to any subsequent processing
stage), or to store. If the parser does recognize them, it normally acts on
them immediately. For example, an <?xml-stylesheet ...?> PI might select a
particular XSLT stylesheet to use for generating a servlet's output. The
processing instruction event is used later, in Example 6-9.
void ignorableWhitespace(buf,offset,len)
This is an optional callback, made by most parsers (including all that are
validating) to report whitespace that separates elements in element content
models, like those of the form (title,para*,sect1*) but not
(#PCDATA|para|comment)*, ANY, or EMPTY. Whitespace before or after the
document's root element is not treated as ignorable and is completely
discarded. Providing this information is a requirement of the XML
specification, since this kind of whitespace is defined to be markup rather
than document content. If the parser doesn't see such a content model
declaration for any reason, it can't use this callback; it'll use
characters() instead, and applications will need to figure out if the
whitespace is part of markup or part of content.
The parameters are exactly the same as those of the characters() callback,
except that you know the characters in the specified range will all be
spaces, tabs, or newlines. (Keep that in mind if you're directly producing
ignorable whitespace to feed some event consumer. Using CRLF- or CR-style
line ends here is a bug, though you might not see immediate consequences.)
Like characters(), this method can be called several times in a row, to
complete processing a single stretch of characters.
* For example, the syntax of PHP, the web page scripting tool, looks like a
processing instruction, <?php ...?>. For various reasons, PHP is not
actually an XML document syntax.
There are two popular ways to handle this callback. My favorite is to drop
all the characters; they're only in the source document to make the
elements lay out nicely, so they won't ever mean anything. There's rarely a
reason to even look at the data, much less save it. The other option is to
delegate handling and just call the characters() callback with the
whitespace.
void skippedEntity (String name)
The parameter is a String that identifies an internal or external parsed
entity. General entity names are presented as found in their declarations
(dudley). Parameter entity names begin with a percent sign (%nell). The
external DTD subset is special; it's an unnamed parameter entity and is
reported with the name [dtd]. You might not be able to tell if the skipped
entity was an internal or external entity, even using DeclHandler events.
You probably don't ever want to see this call, since it means that part of
your document has been hidden. XML 1.0 processors are required to report
this case; SAX 1.0 didn't, and most other parser-level APIs (such as DOM
Level 2) still don't. This is a call that only nonvalidating parsers may
issue, and even then only if they are not parsing all the external entities
referred to in documents; that is, where one or both of the external
entities feature flags is set to false, to disable reading external general
or parameter entities. No widely used Java parsers clear those flags by
default, so this is a rare call in Java. However, some C parsers, such as
Expat (used in Mozilla), won't normally parse external entities, so the
notion isn't exotic in all languages.
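Pulling those callbacks together, a handler might look something like the
following sketch. CallbackLogger and its run() helper are hypothetical
names, and the policy shown (act on xml-stylesheet, ignore other processing
instructions, drop ignorable whitespace, record skipped entities) is just
one reasonable choice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler showing typical treatment of these callbacks.
class CallbackLogger extends DefaultHandler {
    final List<String> log = new ArrayList<>();
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator l) {
        locator = l;                    // just save it; can't throw here
    }

    @Override
    public void startDocument() {
        log.clear();                    // initialize per-document state
        log.add("startDocument");
    }

    @Override
    public void endDocument() {
        log.add("endDocument");
        locator = null;                 // not usable once the parse is over
    }

    @Override
    public void processingInstruction(String target, String data) {
        // act on the PIs we recognize, silently ignore the rest
        if ("xml-stylesheet".equals(target))
            log.add("stylesheet: " + data);
    }

    @Override
    public void ignorableWhitespace(char[] buf, int offset, int len) {
        // layout only; drop it
    }

    @Override
    public void skippedEntity(String name) {
        String line = (locator == null)
            ? "?" : String.valueOf(locator.getLineNumber());
        log.add("skipped entity " + name + " near line " + line);
    }

    // Demo: parse XML text and return what was logged.
    static List<String> run(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CallbackLogger handler = new CallbackLogger();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.log;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because startDocument() and endDocument() bracket everything else, they
always end up as the first and last entries in the log.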
The Locator Interface
This useful interface is sometimes overlooked. It gives information that is
essential for providing location-sensitive diagnostics and is often given
to SAXParseException constructors. That same information is also needed to
resolve relative URIs in document content or attribute values (such as
xml:base). Parsers provide one instance of this class, which can be used
inside event callbacks to find what entity triggered the event and
approximately where. Use that locator only during such callbacks. There are
only a few methods in this class.
String getSystemId ()
This is the most important method in this interface. It returns the base
URI (system ID) for the entity being parsed; this is always an absolute
URI. (However, versions of Xerces that are current at this writing have a
bug here. They sometimes return nonabsolute URIs.) Use this method to
identify the document or external entity in diagnostics or to resolve
relative URIs (perhaps in conjunction with xml:base attributes).
If the parser doesn't know this value, null is returned. This normally
indicates that the parser was not given such a URI inside of an InputSource
encapsulating document text. That's bad practice except when it's
unavoidable, such as parsing in-memory data or input to the POST method in
a servlet.
int getLineNumber ()
int getColumnNumber ()
These two functions approximate the current position of a parser within an
entity. The position reflected is where the relevant event's data ended. It
is only an approximation for diagnostics, but most parsers do try to be
accurate about the line number.
These numbers count up from 1 as appropriate for user-oriented diagnostics.
Not all implementations will provide these values; the value -1 is returned
to indicate that no value was provided.
String getPublicId ()
A public identifier may be provided with this method. Otherwise null is
returned. This may be useful for diagnostics in some cases.
One common use for a locator is to report an error detected while an
application processes document content. The SAXParseException class has two
constructors that take locator parameters. (The descriptive string is
always first, the locator is second, and an optional "root cause" exception
is third.) Once you create such an exception, it can be thrown directly,
which always terminates a parse. Or you pass it to an ErrorHandler to
centralize error-handling policy in your application:
// "locator" was saved when setDocumentLocator() was called earlier
// or was initialized to null; this is safe in both cases
try {
...
engine.setWarpFactor (11);
...
} catch (DriveException e) {
SAXParseException spe = new SAXParseException (
"The warp engine's gonna blow!",
locator,
e);
errHandler.error (spe);
// we'll get here whenever such problems are ignored
}
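For the other style, throwing the exception directly, a complete (if
contrived) sketch follows. StrictHandler, its firstProblem() helper, and
the "forbidden" element name are all hypothetical; the point is only that
the exception caught by whoever called parse() carries the position
information from the saved locator.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that rejects one element name, with a location.
class StrictHandler extends DefaultHandler {
    private Locator locator;            // stays null if none is provided

    @Override
    public void setDocumentLocator(Locator l) {
        locator = l;
    }

    @Override
    public void startElement(String uri, String lname, String qname,
                             Attributes atts) throws SAXException {
        if ("forbidden".equals(qname)) {
            // the constructor accepts a null locator, so this is safe
            throw new SAXParseException(
                "element not allowed here: " + qname, locator);
        }
    }

    // Demo: returns the exception that stopped the parse, or null.
    static SAXParseException firstProblem(String systemId, String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new StrictHandler());
            InputSource in = new InputSource(new StringReader(xml));
            in.setSystemId(systemId);   // gives the locator a base URI
            reader.parse(in);
            return null;
        } catch (SAXParseException e) {
            return e;                   // thrown by our handler
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SAXParseException e = firstProblem(
            "http://example.com/doc.xml", "<doc>\n<forbidden/>\n</doc>");
        System.err.println(e.getSystemId() + ":" + e.getLineNumber()
                           + ": " + e.getMessage());
    }
}
```

A diagnostic that names the document and line is far more useful than the
bare message, which is why it pays to set a system ID on every InputSource
when you can.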
To resolve relative URIs in document content (for example, one found in an
<xhtml:a href="..."/> reference in a link checker), you'd use code like
this (ignoring xml:base complications):
public void startElement (String uri, String lname, String qname,
Attributes atts) throws SAXException
{
if (xhtmlURI.equals (uri)) {
if ("a".equals (lname)) {
String href = atts.getValue ("href");
if (href != null) {
// ASSUMES: locator is nonnull
System.out.println ("Found href to: " +
java.net.URI.create (locator.getSystemId ()).resolve (href));
}
// else presumably <xhtml:a name="...">
}
} ...
}
Some of the XMLReader implementations cannot possibly call
ContentHandler.setDocumentLocator() with a Locator. When parsing in-memory
data structures, such as a DOM document, a locator will normally be
meaningless. When parsing in-memory buffers like a String (with a
StringReader), there won't usually be a URI in the locator.
If your application supports the layered xml:base convention (which lets
documents "lie" about their true locations for purposes of resolving
relative URIs), it will need to track those attributes itself, as part of a
context stack mechanism. (An example of such a stack is shown later, in
Example 5-1.) Such attributes can sometimes help make up for SAX event
sources that can't provide locator information, such as DOM-to-SAX
producers. But they can confuse things too: in the following example,
xml:base would apply to the top element and its direct children, but
nothing within the external entity reference. (Let's assume, for the sake
of discussion, that no other element has an xml:base attribute.)
<top xml:base="http://www.example.com/moved/doc2.xml">
<xhtml:a href="abc.xml"/>
<xhtml:div> &external; </xhtml:div>
<xhtml:a href="xyz.xml"/>
</top>
When character content of an element is reported, characters from different
external entities will get different callbacks, so the locator can be used
to tell those different entities apart from each other.
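As a rough sketch of the context stack idea for xml:base (this is not
Example 5-1, and BaseUriTracker is a hypothetical name), the handler below
keeps one base URI per open element and uses it to resolve unprefixed href
attributes. It deliberately ignores the entity-boundary complication just
described: a careful implementation would consult the locator's system ID
whenever content comes from a different external entity.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: tracks xml:base per element, resolves href values.
class BaseUriTracker extends DefaultHandler {
    // LinkedList rather than ArrayDeque: entries may be null (no base known)
    private final LinkedList<URI> bases = new LinkedList<>();
    private Locator locator;
    final List<String> links = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator l) {
        locator = l;
    }

    @Override
    public void startElement(String uri, String lname, String qname,
                             Attributes atts) {
        // inherit from the parent element, else from the locator
        URI base = null;
        if (!bases.isEmpty())
            base = bases.peek();
        else if (locator != null && locator.getSystemId() != null)
            base = URI.create(locator.getSystemId());

        // an xml:base attribute overrides (or refines) the inherited base
        String xmlBase = atts.getValue(XMLConstants.XML_NS_URI, "base");
        if (xmlBase != null)
            base = (base == null) ? URI.create(xmlBase)
                                  : base.resolve(xmlBase);
        bases.push(base);

        String href = atts.getValue("", "href");
        if (href != null)
            links.add(base == null ? href : base.resolve(href).toString());
    }

    @Override
    public void endElement(String uri, String lname, String qname) {
        bases.pop();                     // leave this element's scope
    }

    // Demo: returns every href found, resolved where a base is known.
    static List<String> findLinks(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            BaseUriTracker handler = new BaseUriTracker();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.links;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here the document is parsed from a String, so the locator has no system ID
to offer; the xml:base attribute is the only source of a base URI, which is
exactly the situation where the convention helps.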
Internationalization Concerns
One of the goals of XML was to bring Unicode into widespread use so that
the Web could really become worldwide in terms of people, not just
technology. This brings several concerns into text management. You may not
need to worry about these if you're working only in ASCII or with just one
character encoding. While you're just starting out with Java and XML you
should certainly avoid worrying about these details. Some other users of
SAX2 will need to understand these issues. Since they surface primarily
with ContentHandler event callbacks, we briefly summarize them here.
If your application works with MathML, or in various languages whose
character sets gained support in Unicode 3.1 through the so-called "Astral
Planes," you will need to know that what Java calls a char is not really
the same thing as a Unicode character or an XML character. If you aren't
using such languages, you'll probably be able to ignore this issue for a
while. Still, you might want to read about Unicode 3.1 to learn more about
this and minimize trouble later. By the time you read this, the W3C may
even have completed its "Blueberry" XML update, intended to allow the use
of some such characters within XML names.
In the case of such characters, whose Unicode code point is above the value
U+FFFF (the maximum 16-bit code point), these characters are mapped to two
Java char values, called a surrogate pair. The char values are in a range
reserved for surrogate characters, with a high surrogate always immediately
followed by a low surrogate. (This is called a big-endian sequence.)
Surrogate pairs can show up in several places in XML, and hence in SAX2: in
character content, processing instructions, attribute values (including
defaults in the DTD), and comments.
At this time, Java does not have APIs to explicitly support characters
using surrogate pairs, although character arrays and java.lang.String will
hold them as if the char values weren't part of the same character. The
java.lang.Character class doesn't recognize surrogate pairs. The best
precaution seems to be to prefer APIs that talk in terms of slices of
character arrays (or Strings), rather than in terms of individual Java char
values. This approach also handles other situations where more than one
char value is needed per character.
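That describes the Java releases available as this was written. Later
releases (J2SE 5.0 and up) added code point methods to java.lang.Character
and String, and they fit the slice-oriented style recommended here. The
sketch below applies them to the kind of buffer, offset, and length triple
that characters() delivers; the CodePoints class is a hypothetical helper.

```java
// Hypothetical helper: works on characters()-style slices of a char array,
// treating each surrogate pair as one Unicode character (code point).
class CodePoints {
    // Number of Unicode characters in the slice, not the number of chars.
    static int count(char[] buf, int offset, int len) {
        return Character.codePointCount(buf, offset, len);
    }

    // Lists each code point in the slice in U+XXXX notation.
    static String describe(char[] buf, int offset, int len) {
        StringBuilder out = new StringBuilder();
        int end = offset + len;
        for (int i = offset; i < end; ) {
            int cp = Character.codePointAt(buf, i, end);
            if (out.length() > 0)
                out.append(' ');
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);   // 1 for most, 2 for a pair
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', then U+1D538 (written as a surrogate pair), then 'b'
        char[] buf = "a\uD835\uDD38b".toCharArray();
        System.out.println(buf.length + " chars, "
            + count(buf, 0, buf.length) + " characters: "
            + describe(buf, 0, buf.length));
    }
}
```

The example prints "4 chars, 3 characters: U+0061 U+1D538 U+0062", which
is the char-versus-character distinction in miniature.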
Depending on the character encodings you're using and the applications
you're implementing, you may also need to pay attention to the W3C
Character Model (http://www.w3.org/TR/WD-charmod/ at this writing) and
Unicode Normalization Form C. Briefly, these aim to eliminate undesirable
representations of characters and to handle some other cases where Unicode
characters aren't the same as XML characters or a Java char, such as
composite characters. For example, many accented characters are represented
by composing two or more Unicode characters. Systems work better when they
only need to handle one way to represent such characters, and Form C
addresses that problem.
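To see what Form C does in practice, the sketch below uses the
java.text.Normalizer class, which was added to Java (in Java 6) after this
was written. The FormC class is a hypothetical wrapper; the example text is
an "e" followed by a combining acute accent, which Form C composes into the
single character U+00E9.

```java
import java.text.Normalizer;

// Hypothetical wrapper around java.text.Normalizer for Form C.
class FormC {
    // Returns the text in Unicode Normalization Form C.
    static String toNfc(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // True if the text is already in Form C.
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";        // 'e' + combining acute
        String composed = toNfc(decomposed);  // single character U+00E9
        System.out.println(decomposed.length() + " -> " + composed.length());
        System.out.println(composed.equals("\u00E9"));
    }
}
```

A producer might normalize text this way before passing it to characters(),
so that consumers comparing strings only ever see one representation.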
The LexicalHandler Interface
This extension interface is new in SAX2. It's in the org.xml.sax.ext
package, which means among other things that it is optional and isn't
supported by all SAX APIs and layers, such as DefaultHandler. However, any
SAX2 parser that can be bootstrapped with JAXP supports this interface.
Parsers that support LexicalHandler expose comment text and the boundaries
of CDATA sections, DTDs, and most parsed entities. There is no
setLexicalHandler() method; bind these handlers to parsers like this:
XMLReader producer = ...;
LexicalHandler handler = ...;
producer.setProperty ("http://xml.org/sax/properties/lexical-handler",
handler);
// throws SAXNotSupportedException if parameter isn't a LexicalHandler
// throws SAXNotRecognizedException if parser doesn't support it.
The information this exposes is needed for applications that need more in
the way of round-tripping support than the SAX2 core allows. That is,
less of the information read by parsers will be completely discarded. The
application needs SAX to provide more complete support for the XML
Infoset (or for the XPath data model). To completely support DOM, XPath,
or XSLT on top of a SAX2 parser, this interface is as necessary as the
namespaces exposed in the SAX2 ContentHandler and Attributes interfaces.
The downside is that much of this information is in the category of
information applications shouldn't want to deal with. Be careful how you
use these callbacks; don't assume that just because the information is
available, you should use it.
LexicalHandler has the following methods:
void comment(buf,offset,len)
Reports characters inside a <!--...--> comment section (without
the delimiting characters). For many applications, this event is the only
reason to use this interface. This is almost the same convention
ContentHandler uses to report character content or ignorable whitespace;
the parameters are identical. Comments are always reported in a single
callback. Two consecutive comment() calls mean two consecutive
comments, while two consecutive characters() calls just enlarge a
given logical span of text.
char buf []
A character array that holds the comment text. As with the
ContentHandler.characters() callback, you must ignore characters in
this buffer that are outside of the specified range.
int offset
The index of the first comment character in the buffer.
int len
How many comment characters are in the buffer, beginning at the
specified offset.
Comments show up in the XPath data model, so they are reflected in
layers (such as XSLT, XPointer, and XLink) that build on XPath. Strictly
speaking, applications should ignore comments except when they
round-trip data provided during authoring. Instead, they should use
processing instructions when they need to work with annotations.
You might need to use comment data with HTML processors because
HTML doesn't support processing instructions. For example, HTML
documents often use comments to wrap CSS data, JavaScript code, or
server-side includes.
There are two good ways to handle comments. One is just to discard
them and make the implementation of this method do nothing. (I like
that one!) The other is to create a new String using the method
parameters and save the string somewhere. Avoid parsing comment
content; if you're tempted to do that in new applications, try to use
PIs (which were designed for such purposes).
public void comment (char buf [], int offset, int len)
throws SAXException
{
    String value = new String (buf, offset, len);
    // ... now that you have it, what do you want to do?
}
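To make the "save the string somewhere" option concrete, here is a minimal sketch that binds a handler through the lexical-handler property and keeps each comment in a list. The class name CommentCollector and its collect() helper are made up for illustration, and the sketch assumes a JAXP-bootstrapped parser that supports LexicalHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;

// Hypothetical helper: saves each comment as a String and ignores
// every other lexical event.
class CommentCollector implements LexicalHandler {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] buf, int offset, int len) {
        // Only the slice [offset, offset + len) belongs to the comment.
        comments.add(new String(buf, offset, len));
    }

    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }

    // Parses the given XML text and returns the comments in document order.
    static List<String> collect(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CommentCollector handler = new CommentCollector();
        producer.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        producer.parse(new InputSource(new StringReader(xml)));
        return handler.comments;
    }
}
```

Because each comment arrives in a single callback, the list holds one entry per comment; no buffering across calls is needed, unlike with characters().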
void startDTD(name, publicId, systemId)
void endDTD()
The startDTD() event reports the beginning of a document's DTD,
and endDTD() reports the end. These events can be useful when you
save DTD information, such as the partial support in DOM Level 2. They
are also important when you create SAX event streams that may need
to print as documents that include a DTD.
String name
The declared name of the root element for the document. It is
never omitted, though for invalid documents it may not correspond
to the name of the root element.
String publicId
Normalized version of the public ID declared for the external subset,
or null if no such subset was provided.
String systemId
The system ID declared for the external subset, or null if no such
ID was provided. Note that this URI is not absolutized.
When the end of the DTD is reported, all other declarations that
should have been reported (with DeclHandler or DTDHandler callbacks)
will have been reported. If any ContentHandler.skippedEntity()
calls were made for external parameter entities, applications
will normally infer that some declarations were not processed.
Parsers are not required to distinguish the internal and external subsets.
There are two mechanisms applications can use, but both of
them are optional. The natural method is to rely on external parameter
entity boundary reports, using other methods in this interface. Not
all parsers report those entities; you can check the
lexical-handler/parameter-entities feature flag to see if this mechanism
will work for you. The other mechanism compares base URIs as reported through
the Locator.getSystemId() method; base URIs for external subset
components will differ from those of the document itself. Most
parsers support this method, but it's awkward to use for this purpose.
If you're saving DTD content, these methods will bracket a lot of
work where you squirrel data away for later use. Otherwise, you'll
probably arrange to ignore all the other DTD events and will only
need to decide what to do with comments and processing instructions,
if you don't just ignore them. Ignoring them within DTDs is a
popular strategy even when they're not ignored elsewhere. This is
because comments or PIs inside a DTD would seem to apply to DTD
contents, while most applications are instead working with document
contents.
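The strategy of ignoring comments inside the DTD can be sketched with a flag that the two DTD callbacks toggle. This is only an illustration: the class name BodyCommentCounter and its fields are hypothetical, and it builds on org.xml.sax.ext.DefaultHandler2, which is assumed to be available in your SAX distribution.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: counts comments in the document body only.
class BodyCommentCounter extends DefaultHandler2 {
    private boolean inDTD;
    String doctypeName;   // name from the DOCTYPE declaration, null if no DTD
    int bodyComments;

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        inDTD = true;
        doctypeName = name;
    }

    @Override
    public void endDTD() {
        inDTD = false;
    }

    @Override
    public void comment(char[] buf, int offset, int len) {
        // Comments between startDTD() and endDTD() describe the DTD,
        // not the document, so this sketch skips them.
        if (!inDTD) {
            bodyComments++;
        }
    }

    static BodyCommentCounter scan(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        BodyCommentCounter handler = new BodyCommentCounter();
        producer.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        producer.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```

The same flag works for processing instructions if the handler is also registered as the ContentHandler.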
void startCDATA()
void endCDATA()
These methods report the beginning and end of a <![CDATA[ ... ]]>
text section; the bracketing characters are not reported. Any content
within a CDATA section is reported with characters() events; the <
and & characters within CDATA sections are parsed like normal
characters, not like delimiters for markup.
Most software has little reason to care whether character content is
contained in CDATA sections. Unless you are trying to round-trip data
while preserving those lexical artifacts (to simplify potential future
work done with text editors), the right response to CDATA events is to
ignore them.
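A minimal sketch shows why ignoring these events is usually safe: the text arrives through characters() either way, and the CDATA callbacks only mark where the section began and ended. The class name TextCollector is hypothetical; DefaultHandler2 from org.xml.sax.ext is assumed to be available.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: gathers all character content, whether or not
// it was written inside a CDATA section.
class TextCollector extends DefaultHandler2 {
    final StringBuilder text = new StringBuilder();
    int cdataSections;

    @Override
    public void characters(char[] buf, int offset, int len) {
        text.append(buf, offset, len);
    }

    @Override
    public void startCDATA() {
        // Counted only for illustration; most code can leave this empty.
        cdataSections++;
    }

    static TextCollector scan(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollector handler = new TextCollector();
        producer.setContentHandler(handler);
        producer.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        producer.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```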
void startEntity(String name)
void endEntity(String name)
These methods report the beginning and end of internal or external
entity expansion. The entity is named using the same rules as the
ContentHandler.skippedEntity() callback. If you need to indicate which
kind of entity is being expanded, record information from the
DeclHandler.externalEntityDecl() callback and consult it in these
methods. (That means you'll likely really want an extended
DefaultHandler or XMLFilterImpl that supports both of the standardized
extension classes.)
Expansions of general entity references, like &dudley;, are reported
everywhere except inside attribute values. Such expansions within
attribute values can't meaningfully be reported, since all markup within
start tags is reported at the same time.
Not all parsers report expansion of parameter entities, like %nell;, in
DTDs. There is a special parser feature flag
(lexical-handler/parameter-entities) that determines whether parsers
report such events. As with general entity references, not all parameter
entity expansions can be meaningfully reported. Parameter entities that
expand as part of markup declarations or conditional section markers
won't be seen, since markup declarations are reported only in their entirety.
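Recording declarations so that startEntity() can tell which kind of entity is being expanded might look like the following sketch. The class name EntityTracker and the "kind:name" strings are made up for this example; DefaultHandler2 (assumed available) stands in for the extended DefaultHandler mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: remembers which entity names were declared as
// external, then labels each expansion it sees.
class EntityTracker extends DefaultHandler2 {
    final Set<String> externalNames = new HashSet<>();
    final List<String> expansions = new ArrayList<>();

    @Override
    public void externalEntityDecl(String name, String publicId, String systemId) {
        externalNames.add(name);
    }

    @Override
    public void startEntity(String name) {
        String kind;
        if (name.equals("[dtd]")) {
            kind = "subset";      // the external DTD subset, not a declared entity
        } else if (externalNames.contains(name)) {
            kind = "external";
        } else {
            kind = "internal";
        }
        expansions.add(kind + ":" + name);
    }

    static EntityTracker scan(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EntityTracker handler = new EntityTracker();
        // Both extension handlers are bound through properties.
        producer.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        producer.setProperty(
            "http://xml.org/sax/properties/declaration-handler", handler);
        producer.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```

The lookup works because all declarations are reported before the document content that references them.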
Exposing DTD Information
SAX2 exposes DTD information through three different interfaces. Part of
it is exposed through the LexicalHandler extension interface: the DTD's
root element type declaration and boundaries of the various entities. The
rest is exposed through two DTD-specific interfaces, presented here.
When you're working with streams of SAX event data, remember that all
DTD event data is seen before the document data it describes. This means
that if you need it inside the document, you'll need to plan ahead to save
the DTD data. It also means that if you need to merge streams of event
data, such DTD data may create a problem. Unless you know the DTD
data in advance, you'd need to dam up the event stream until all data that
needs to go into downstream DTD events is in hand. Only then can you
send the events downstream (with the DTD first). Luckily, merging event
streams with unknown DTD data isn't common.
DTD information is automatically used inside XML parsers when they
parse XML documents. That includes expansion of conditional sections
and parameter entities in DTDs, expanding general entities, and normalizing
or defaulting attributes. Most DTD validation can be cleanly layered on
top of SAX2 since these declaration callbacks provide all the most important
information.* SAX2 enables application-level processing of DTD constraints;
the only internal support it provides for DTDs is a feature flag to
expose parser support for validation. When applications need to construct
valid documents, they can use DTD information as they make changes,
instead of needing to save the document and reparse the whole thing.
The support for working with DTDs provided by most XML tools is not as
good as the support provided by SAX2. For example, DOM Level 2 provides
weaker support, and the TRAX support for SAX
(javax.xml.transform.sax) doesn't support DeclHandler at all.
Note that while a fully featured SAX2 parser will let you re-create the
internal subset, it will not let you round-trip any external parameter
entities. That's because parameter entities will be expanded. You will not see
conditional sections in external PEs, or declarations being built up from
parameter entities. Instead, you'll see the actual declarations that apply to
your documents. This may help you to understand exactly what a complex
DTD is doing.
* The exceptions relate to lexical constraints that should arguably be well-formedness
constraints. Entity nesting is supposed to match nesting of grammatical constructs within
DTDs; that's a validity constraint. However, the analogous constraint in a document body
affects well-formedness instead.
The DeclHandler Interface
This extension interface is new in SAX2. It's in the org.xml.sax.ext package,
which means among other things that it is optional and not all SAX
APIs support it. (DefaultHandler is one example of an API that does not.)
However, any SAX2 parser that can be bootstrapped with JAXP must support
this interface. There is no setDeclHandler() method; bind these handlers
to parsers like this:
XMLReader producer = ...;
DeclHandler handler = ...;
producer.setProperty ("http://xml.org/sax/properties/declaration-handler",
    handler);
// throws SAXNotSupportedException if parameter isn't a DeclHandler.
// throws SAXNotRecognizedException if parser doesn't support it.
Parsers that support DeclHandler are essential for applications that need
to work with declarations of elements and attributes or with parsed entities.
DOM requires such support for parsed entities, although even Level 2
hides or ignores element and attribute type data. This interface is the most
common way SAX2 exposes type constraints (the primary role of a Document
Type Declaration) from DTDs, so if you need to see those constraints,
you'll use this handler. It has four API callbacks:
void attributeDecl(eName,aName,type,mode,value)
This callback reports <!ATTLIST ... > declarations in a DTD. A
given declaration produces one callback for each attribute in the
declaration. Much of this information will also be provided through
Attributes methods if an instance of that element appears in a document.
String eName
This is the name of the element whose attribute is being declared.
String aName
This is the name of the attribute associated with that element.
String type
This is one of the strings CDATA, ID, IDREF, IDREFS, NMTOKEN,
NMTOKENS, ENTITY, or ENTITIES, or two types of enumerated values.
Enumerated values are encoded with parenthesized strings such
as (a|b|c) to indicate that strings a, b, or c are permissible. If the
string is an enumeration of notation names, "NOTATION " (which
includes one space) precedes that parenthesized string.
This type information is more complete than information you get
through the Attributes object provided with startElement(),
because Attributes reports enumerations only as being either
NOTATION or NMTOKEN. However, at this time several widely available
SAX2 parsers conform to a beta test version of this API and
don't correctly report enumerations. You may need to get a bug-fixed
version of your parser if you're depending on this support.
String mode
This describes the kind of default value applied to this attribute:
#IMPLIED (the application determines the value), #REQUIRED (the
value must be given; defaulting is not permitted), #FIXED (only
one value is permitted), or null indicating that value is the default.
Unless the document provided a value, you won't see #IMPLIED
attributes in the Attributes object provided with startElement();
if you need to know this information, save it when you get this
callback.
String value
This parameter is either null or a string with the default value for
this attribute. That might be the only permitted value if the
attribute mode is #FIXED. The value will be reported exactly as
applications will see it: normalized and with character and entity
references replaced.
XML structure editors can use this information to constrain the choices
presented to document authors so that only valid documents can be
created. Other tools that construct documents will also benefit from
having this information. When you're mostly reading documents
rather than creating them, the most important data here tends to be
declaration of ID, IDREF, and IDREFS attributes, which are used to
build links within and between XML documents.
If more than one declaration for an attribute is provided, only the first
one will be used. (The second one will be ignored; unlike the analogous
case for element declarations, attribute redeclaration is not a
validity error.) Normally code to implement this callback would first
retrieve any existing per-element data structure, or it would create one
(with a null content model) if none is yet known. Then if there is no
record of an attribute with this name for that element, a per-attribute
data structure instance would be created and saved in the element
data structure, keyed by attribute name.
void elementDecl(name,model)
This method reports <!ELEMENT ... > declarations in a DTD.
String name
This is the element name.
String model
This is the element content model, with all whitespace removed.
For example, element content models like (a,(b|c)+,d?), mixed
content models like (#PCDATA|one|two|three)*, and simple mod-
els like ANY and EMPTY may all be found in the same document.
Note that parsers may do more than just remove the whitespace,
as long as an equivalent content model is reported.
Because the content model is provided as a string, applications using
it must always parse it themselves. Similarly, if applications want to
validate against that model, they must provide code to do that. Except
for the case of element content, such work is straightforward. Validating
element content models requires constructing and using some sort
of finite state automaton, and it takes a bit of work to parse the
model. Mixed content models are easier to handle since they can be
parsed with a java.util.StringTokenizer and because the validation
logic is simpler.
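As a rough sketch of the easy case, a mixed content model can be split into its permitted child element names with a StringTokenizer. The class name MixedModel and the convention of returning null for models that aren't mixed are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for mixed content models such as
// (#PCDATA|one|two|three)* as reported by elementDecl().
class MixedModel {
    // Returns the element names a mixed content model permits, or null
    // if the model is not a mixed content model at all.
    static List<String> allowedChildren(String model) {
        // Parsers normally strip whitespace already; this is a precaution.
        String compact = model.replaceAll("\\s+", "");
        if (!compact.startsWith("(#PCDATA")) {
            return null;
        }
        List<String> names = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(compact, "()|*");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (!token.equals("#PCDATA")) {
                names.add(token);
            }
        }
        return names;
    }
}
```

Validating against such a model then reduces to checking that each child element's name is in the returned list; order and repetition don't matter for mixed content.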
If more than one declaration for an element is provided, only the first
one will be used. (The second one will be considered a validity error;
element type redeclaration is not allowed.) Normally the code
implementing this callback would create a new per-element data structure
to save the name and content model and store it in a data structure
(hash table or other map) keyed by element name. Such a data structure
might already exist if an element attribute was declared before
the element. In this case, this callback just provides the content
model, which was previously unknown.
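The bookkeeping described for attributeDecl() and elementDecl() can be sketched as a map of per-element records, each holding a map of per-attribute records. All three class names here (AttributeInfo, ElementInfo, DtdTable) are hypothetical, and DefaultHandler2 is assumed to be available.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical per-attribute record; not part of SAX.
class AttributeInfo {
    final String type, mode, value;
    AttributeInfo(String type, String mode, String value) {
        this.type = type;
        this.mode = mode;
        this.value = value;
    }
}

// Hypothetical per-element record; not part of SAX.
class ElementInfo {
    String model;   // stays null until the <!ELEMENT ...> declaration is seen
    final Map<String, AttributeInfo> attributes = new HashMap<>();
}

class DtdTable extends DefaultHandler2 {
    final Map<String, ElementInfo> elements = new HashMap<>();

    // Retrieves the record for an element, creating it on first use.
    private ElementInfo element(String name) {
        return elements.computeIfAbsent(name, n -> new ElementInfo());
    }

    @Override
    public void elementDecl(String name, String model) {
        ElementInfo info = element(name);
        if (info.model == null) {
            info.model = model;      // only the first declaration is used
        }
    }

    @Override
    public void attributeDecl(String eName, String aName,
            String type, String mode, String value) {
        // putIfAbsent keeps the first declaration and ignores later ones.
        element(eName).attributes.putIfAbsent(
            aName, new AttributeInfo(type, mode, value));
    }

    static DtdTable scan(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DtdTable handler = new DtdTable();
        producer.setProperty(
            "http://xml.org/sax/properties/declaration-handler", handler);
        producer.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```

Note that the record for an element may be created by an attribute declaration before the element declaration arrives, which is why the content model starts out null.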
void externalEntityDecl(name,publicId,systemId)
This callback reports <!ENTITY ... > declarations in a DTD for
parsed external entities. These may be either general or parameter
entities.
String name
This is the entity name; it is always provided. Names that start
with % are parameter entities; all others are general entities.
String publicId
This is the public ID for the entity and can be omitted (provided
as null). If public IDs are provided, any embedded whitespace is
normalized, so these strings may be directly compared. They may
be used to determine a location for the entity, for example, by
using an SGML Formal Public Identifier with some sort of catalog.
String systemId
This is the system ID for the entity and is always provided. It is an
absolute URI, which parsers normally use to retrieve the entity
before parsing it. However, some SAX2 parsers have a bug, and
won't report the absolute URI here.
Applications usually ignore all parameter entity declarations and use
the org.xml.sax.EntityResolver when they want to provide local copies
of these entities to a parser. If applications don't ignore these
declarations, redeclaration should be ignored (it is not an error). XML editors
may want to offer menus of external (and internal) entities when editing
element content. And in some cases you may want to track external
entities by name so that you can tell when
LexicalHandler.startEntity() is reporting the start of one; this is useful
for applications that use xml:base attributes to change the application's
view of the actual URI that contains an element, using the
Locator.getSystemId() method. (Perhaps the actual location was not known,
or should for some reason be ignored.)
void internalEntityDecl(name,value)
This callback reports <!ENTITY ... > declarations in a DTD for
(parsed) internal entities. These may be either general or parameter
entities.
String name
This is the entity name. Names that start with % are parameter
entities; all others are general entities.
String value
This is the entity value, which contains arbitrary XML content
(including elements and nested entity references) that will be
reparsed when this entity is expanded.
Applications normally ignore all parameter entity declarations. If
applications don't ignore these declarations, redeclaration for a name
should be ignored (it is not an error). XML editors may want to
offer menus of internal entities when they edit attribute values or
element content. However, SAX2 does not report entity references inside
the attribute values it parses. This means that you won't be able to
re-create such text without heuristics.
The DTDHandler Interface
The DTDHandler interface was carried unchanged from SAX1 into SAX2
and is primarily useful for applications that work with two specific SGML
notions: notations and unparsed entities. Some DTDs, such as XML
DocBook, use notations in such traditional roles. DOM also requires such
support. Use XMLReader.setDTDHandler() to bind this handler to a parser. You
probably won't ever need to use it for new code. On the Web, those
SGML notions correspond roughly to MIME types and URIs respectively,
web concepts that are much more widely understood and supported. The
interface has only two API callbacks, provided to meet specific
requirements in the XML 1.0 specification:
void notationDecl(name,publicId,systemId)
This callback reports a <!NOTATION ...> declaration in a DTD.
String name
This is the notation name; it is always provided. These names are
used explicitly in unparsed entity declarations and in some kinds
of attribute declaration (elements can have one such attribute,
used to associate type with the element). Also, some applications
follow a convention that they may be used to identify processing
instruction targets.
String publicId
This is the public ID for the notation and may be omitted (provided
as null). If public IDs are supplied, then any embedded
whitespace is normalized, so these strings may be directly compared.
These may be used to assign a meaning to the notation,
for example, by using an SGML Formal Public Identifier in a role
much like a MIME type.
String systemId
This is the system ID for the notation and may be omitted (provided
as null). When provided, it is an absolute URI. However,
some SAX2 parsers have a bug, and won't report the absolute URI
here. These may be used to assign a meaning to the notation, for
example, by using a URI to identify a type or command.
In addition to assigning types to unparsed entities, a NOTATION attribute
may also associate a type with an element or processing instruction.
Some DTDs provide extensive catalogs of notation declarations
specifically for such uses.
Note that notation declarations are the one place in XML syntax where
you can provide a public ID without a system ID, and that at least
one identifier (public or system) must always be provided. If
applications don't ignore these declarations, redeclaration should be ignored
(it is not an error).
void unparsedEntityDecl(name,publicId,systemId,notation)
This callback reports <!ENTITY ... > declarations with NDATA annotations
to associate them with a notation (such as jpeg or png).
Unparsed entities are used only in attributes that are declared to be of
type ENTITY or ENTITIES.
String name
This is the name of the unparsed entity; it is always provided.
String publicId
This is the public ID for the entity and may be omitted (provided
as null). If public IDs are provided, any embedded whitespace
is normalized, so these strings may be directly compared.
These may be used to assign a location to the entity, for example,
by using an SGML Formal Public Identifier in a role much like a
URN.
String systemId
This is the system ID for the entity and is always provided. It
is normally an absolute URI. However, some SAX2 parsers have a
bug, and won't report the absolute URI here. These may be used
to assign a location to the entity.
String notation
This is the name of the notation associated with the entity; it is
always provided. The role of these names is much like that of an
external MIME type annotation for the entity.
In XML, unparsed entities are declared to parsers but pass through
them without being parsed. Classic examples of unparsed entities
include JPEG or PNG image files. Such entities may also be used for
XML text that just doesn't need to be parsed in a given processing
stage. If applications don't ignore these declarations, redeclaration
should be ignored (it is not an error).
Most XML applications that care about unparsed entities and notations do
so because they interface with SGML systems that use them or are
migrating such systems to use the XML generation of tools. XML editors
supporting this functionality might use these event callbacks to create
menus of notations or unparsed entities when they are editing attributes
that hold such values.
Applications that use this interface will normally use the callbacks to
create two tables, keyed by entity or notation name respectively, that are
used to interpret element attributes. More rarely, notations will be used to
determine the operation corresponding to a given processing instruction
target name. Secure applications will never use notations to directly
encode system commands, but will always redirect through application
controlled tables. For example, it would be foolish to rely on system IDs
found in a document. System IDs such as rm -rf /, when run through a
Unix or Linux shell, would remove all files accessible through the local
system.
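The two tables, and the habit of redirecting through an application-controlled table, can be sketched as follows. The class name NotationTables, the mediaTypeFor() method, and the entries in the trusted table are all made up for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds the notation and unparsed entity tables.
class NotationTables extends DefaultHandler {
    // notation name -> system ID as found in the document (untrusted data)
    final Map<String, String> notations = new HashMap<>();
    // unparsed entity name -> notation name
    final Map<String, String> entityNotations = new HashMap<>();
    // application-controlled table; these entries are only examples
    private final Map<String, String> trustedTypes = new HashMap<>();

    NotationTables() {
        trustedTypes.put("png", "image/png");
        trustedTypes.put("jpeg", "image/jpeg");
    }

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        notations.putIfAbsent(name, systemId);        // redeclarations ignored
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
            String systemId, String notation) {
        entityNotations.putIfAbsent(name, notation);  // redeclarations ignored
    }

    // Interprets an ENTITY attribute value. The document supplies only the
    // names; the meaning comes from the trusted table, never from the
    // identifiers declared in the document. Returns null if unknown.
    String mediaTypeFor(String entityName) {
        String notation = entityNotations.get(entityName);
        return notation == null ? null : trustedTypes.get(notation);
    }

    static NotationTables scan(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        NotationTables handler = new NotationTables();
        producer.setDTDHandler(handler);
        producer.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```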
Turning SAX Events into Data Structures
As described earlier, one of the great strengths of SAX is that it lets
applications use appropriate data structures, instead of forcing the use of
generic data structures. In the section "Push Mode Event Production" in
Chapter 3, we looked at the problem of producing SAX events from data
structures. Here we look at the reverse process: producing data structures
from SAX events. This is a process that most SAX applications handle to
one degree or another. One of the most traditional names for this process
is unmarshaling; it's also sometimes called deserializing. (I tend to avoid
using the latter term with Java except when talking about RMI.)
We'll first look at how to turn SAX into generic DOM (and DOM-like) data
structures. If you're working with such data structures, you may find it's
advantageous to build them using SAX. With SAX, you can easily discard
data you don't need, filtering it out so you don't need to pay its costs.
Afterward we'll look briefly at some of the concerns associated with working
with data structures that are more specialized to your application.
SAX-to-DOM Consumers
It's easy to turn a SAX event stream into a complete DOM document tree,
or into a DOM-like data structure such as DOM4J or JDOM. Most open
source DOM parsers build those data structures directly from SAX event
streams. (Xerces has the only such DOM I know that doesn't work that
way.) Building a DOM document from a SAX2 event stream requires
implementing all four event consumer interfaces: ContentHandler, of
course; LexicalHandler to report boundaries of entity references and
CDATA sections as well as comments; and both DeclHandler and
DTDHandler to provide the subset of DTD information that DOM requires.
The implementations of those interfaces must use nonstandard DOM functions,
because key functionality is missing from public DOM APIs. This
means that if you're using generic code to construct a DOM tree, you
won't be able to implement every behavior DOM specifies. If that doesn't
seem like a feature to you, you'll need builder code that's specialized to a
particular DOM implementation.
Table 4-1 shows the classes that various DOM implementations provide
for turning a SAX2 event stream into a DOM tree.* Most classes have
configuration options to let you discard some of the minimally useful data,
instead of saving it and making your application code ignore it later. Except
as noted, they implement all four consumer interfaces. Each one has a
way to present the DOM data it produces, usually with a getDocument()
method; consult documentation (or source code) for full information.
Table 4-1. SAX-to-DOM consumer classes
Implementation  Class name                                  Comment
Crimson         org.apache.crimson.tree.XmlDocumentBuilder  Implements all the event consumer handlers.
DOM4J           org.dom4j.io.SAXContentHandler              Extends DefaultHandler; does not implement DeclHandler.
GNUJAXP         gnu.xml.dom.Consumer                        Uses the gnu.xml.pipeline framework.
JDOM            org.jdom.input.SAXHandler                   Extends DefaultHandler.
Example 4-1 uses the DOM implementation from Crimson to illustrate
how easy it is to construct a DOM tree from SAX events.
* As presented in Chapter 3, in the section DOM-to-SAX Event Production (and DOM4J,
JDOM), most of these packages also support DOM-to-SAX event producers.
Example 4-1. Converting SAX events to a DOM document (Crimson)
public Document SAX2DOM (String uri)
throws SAXException, IOException
{
    XmlDocumentBuilder consumer;
    XMLReader producer;
    consumer = new XmlDocumentBuilder ();
    producer = XMLReaderFactory.createXMLReader ();
    producer.setContentHandler (consumer);
    producer.setDTDHandler (consumer);
    producer.setProperty
        ("http://xml.org/sax/properties/lexical-handler",
        consumer);
    producer.setProperty
        ("http://xml.org/sax/properties/declaration-handler",
        consumer);
    producer.parse (uri);
    return consumer.getDocument ();
}
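If none of the builder classes in Table 4-1 is at hand, a similar wiring can be sketched with the TRAX classes in javax.xml.transform.sax, where an identity TransformerHandler copies SAX events into a DOMResult. This is an alternative to the Crimson example, not a description of it: the class name JaxpSaxToDom is hypothetical, the cast assumes the default TransformerFactory is a SAXTransformerFactory, and as noted earlier this route has no DeclHandler support.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds a DOM tree from SAX events using only JAXP.
class JaxpSaxToDom {
    static Document build(InputSource input) throws Exception {
        SAXParserFactory parsers = SAXParserFactory.newInstance();
        parsers.setNamespaceAware(true);
        XMLReader producer = parsers.newSAXParser().getXMLReader();

        // Assumption: the platform's factory supports the SAX extensions;
        // careful code would check SAXTransformerFactory.FEATURE first.
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        // With no stylesheet, the handler is an identity transform.
        TransformerHandler consumer = factory.newTransformerHandler();
        DOMResult result = new DOMResult();
        consumer.setResult(result);

        // TransformerHandler is a ContentHandler, DTDHandler, and
        // LexicalHandler, but not a DeclHandler.
        producer.setContentHandler(consumer);
        producer.setDTDHandler(consumer);
        producer.setProperty(
            "http://xml.org/sax/properties/lexical-handler", consumer);
        producer.parse(input);
        return (Document) result.getNode();
    }
}
```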
Pruning Noise Data from a DOM Tree
For various historical reasons, DOM provides much information that just
adds overhead to applications. When you build a DOM with SAX2, it's
particularly easy to prune that information out of DOM trees: you can
simply arrange never to deliver it! Similar techniques are frequently used
when feeding SAX event data to a component. It's often easier to let the
component see only parts of the Infoset that you care about than to
remove the resulting data noise later.
The simplest example of this would be just to hook up the
ContentHandler to a SAX parser and ignore the other three handlers. The
resulting DOM will not have DTD information, but that's no loss, because
even DOM Level 2 doesn't provide enough of the DTD information to be
useful. (You can save more complete DTD information using custom SAX
handlers, if you need it.) Because the LexicalHandler isn't provided, you
won't see comment nodes or entity reference nodes (or their read-only
children, which really complicate your code). Also, any CDATA text nodes
will be transparently merged with any adjacent normal text nodes. A
DOM without such information is a lot easier to work with; your code
won't need to handle special cases that come from storing such data. It
will also need somewhat less memory and take less time to construct the
DOM tree.
To further streamline your data, override ignorableWhitespace() and
discard whitespace characters. While such events won't always be available
even for documents that include DTDs, discarding ignorable characters
can save significant amounts of memory. The savings vary widely based
on DTDs and documents; documents that use mostly elements with element
content models (often, but not always, data-oriented DTDs) have the
biggest savings. Space savings of ten percent aren't unreasonable and are
coupled with some time savings for DOM tree construction, but such
savings are highly data dependent. (You may be able to discard processing
instructions, depending on your application.)
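One way to do that override without subclassing a particular DOM builder is a small filter placed between the parser and the consumer. This is a minimal sketch; the class name WhitespaceStripper is made up, and it relies only on org.xml.sax.helpers.XMLFilterImpl.

```java
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: passes every event along except ignorable
// whitespace, which it silently drops.
class WhitespaceStripper extends XMLFilterImpl {
    WhitespaceStripper() { }

    WhitespaceStripper(XMLReader parent) {
        super(parent);
    }

    @Override
    public void ignorableWhitespace(char[] buf, int offset, int len) {
        // Deliberately empty: the event is not passed downstream.
    }
}
```

To use it, wrap the parser (new WhitespaceStripper(producer)), set the DOM-building consumer as the filter's ContentHandler, and call parse() on the filter instead of on the parser.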
Discarding lots of the DOM data is so common that when you use JAXP to
build a DOM tree, you can configure it to automatically discard some of
the data. (Unfortunately, the default is to include all of that data.) You
might not even need to strip out the events yourself. That configuration
information gets sent directly to the SAX handler code that builds the
DOM, and you can usually use it directly without needing to subclass.
Example 4-2, a modified version of the previous example, shows this less
noisy setup.
Example 4-2. Converting SAX events to DOM, discarding noise (Crimson)
public Document SAX2DOM (String uri)
throws SAXException, IOException
{
    XmlDocumentBuilder consumer;
    XMLReader producer;
    consumer = new XmlDocumentBuilder ();
    consumer.setIgnoreWhitespace (true);
    producer = XMLReaderFactory.createXMLReader ();
    producer.setContentHandler (consumer);
    producer.parse (uri);
    return consumer.getDocument ();
}
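The JAXP configuration mentioned above lives on DocumentBuilderFactory. The following sketch shows the settings that correspond to the pruning described in this section; the class name QuietDom is hypothetical, and which settings you want depends on your application.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: asks JAXP for a DOM tree with less noise in it.
class QuietDom {
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setIgnoringComments(true);        // no Comment nodes
        factory.setCoalescing(true);              // CDATA merged into adjacent text
        factory.setExpandEntityReferences(true);  // no EntityReference nodes
        // setIgnoringElementContentWhitespace(true) is also available, but it
        // depends on content models from a DTD, so it is left out here.
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }
}
```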
Building a Partial DOM
Often an even better solution for working with DOM is not to build an
entire org.w3c.dom.Document object. You can build just the individual
subtrees you need, never paying memory for the rest. Unfortunately, the
classes listed earlier are set up to build entire document objects, so they
won't help. However, it's easy to use SAX events to assemble trees of
DOM nodes.
Here's one way to do it. This example defines an interface that exposes an
element type using a namespace URI and a local name. It also exposes an
event handler method to call with a DOM subtree that holds only such
elements and their children. In effect, DOM subtrees are streamed, rather
than SAX events. Such a model could work well with documents that are
huge but highly regular, if the subtrees were processed then immediately
discarded to save memory. Such structures might represent a series of
composite records built from database queries, for example.
Example 4-3 uses JAXP to bootstrap an empty DOM document, which is
used as a factory to create DOM elements and text nodes. The factory
should be used for attributes too, in a more complete example, and perhaps
for processing instructions. Notice how the SAX document traversal
exactly matches a walk over the DOM tree being constructed, and how
the partial DOM tree serves as the only state that's needed. Also, notice
that DOM handles namespaces slightly differently than SAX does. If you need
to build DOM trees with SAX, your code doesn't need to be much more
complicated than this (other than passing attributes along) unless you try
to implement all the gingerbread ornamenting the data model exposed by
DOM.
Example 4-3. Using SAX to stream DOM subtrees
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

// a kind of event handler
interface DomListener
{
    public String getURI ();
    public String getLocalName ();
    public void processTree (Element tree) throws SAXException;
}

public class DomFilter extends DefaultHandler
{
    private Document factory;
    private Element current;
    private DomListener listener;

    public DomFilter (DomListener l)
        { listener = l; }

    public void startDocument ()
    throws SAXException
    {
        // all this just to get an empty document;
        // we need one to use as a factory
        try {
            factory = DocumentBuilderFactory
                .newInstance ()
                .newDocumentBuilder ()
                .newDocument ();
        } catch (Exception e) {
            throw new SAXException ("can't get DOM factory", e);
        }
    }

    public void startElement (String uri, String local,
            String qName, Attributes atts)
    throws SAXException
    {
        // start a new subtree, or ignore
        if (current == null) {
            if (!listener.getURI ().equals (uri))
                return;
            if (!listener.getLocalName ().equals (local))
                return;
            current = factory.createElementNS (uri, qName);

        // Add to current subtree, descend.
        } else {
            Element e;

            if ("".equals (uri))
                e = factory.createElement (qName);
            else
                e = factory.createElementNS (uri, qName);
            current.appendChild (e);
            current = e;
        }
        // NOTE: this example discards all attributes!
        // They ought to be saved to the current element.
    }

    public void endElement (String uri, String local, String qName)
    throws SAXException
    {
        Node parent;

        // ignore?
        if (current == null)
            return;
        parent = current.getParentNode ();

        // end subtree?
        if (parent == null) {
            current.normalize ();
            listener.processTree (current);
            current = null;

        // else climb up one level
        } else
            current = (Element) current.getParentNode ();
    }

    // if saving, append and continue
    public void characters (char buf [], int offset, int length)
    throws SAXException
    {
        if (current != null)
            current.appendChild (factory.createTextNode (
                new String (buf, offset, length)));
    }
}
You can use similar techniques to construct other kinds of data structures
and to perform more interesting filter functions. For example, perhaps
more than one element type is interesting, or some types of elements
should be reported through different event handler callbacks. It's also easy
to transform the data as you read it; the DOM trees you construct don't
need to match the document structure that the parser reports.
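Example 4-3 leaves attribute handling as an exercise. One way to pass attributes along might look like the following sketch. The class AttributeCopier and its demo() method are hypothetical names used only here; the DOM and SAX calls are the standard ones. It assumes the SAX producer supplies qualified names for attributes, which most parsers do even though SAX2 does not strictly require it by default.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, named only for this sketch.
class AttributeCopier
{
    // Copy each SAX attribute onto a DOM element, using the
    // namespace-aware DOM call only when there is a namespace URI.
    static void copy (Attributes atts, Element e)
    {
        for (int i = 0; i < atts.getLength (); i++) {
            String uri = atts.getURI (i);
            String qName = atts.getQName (i);

            if ("".equals (uri))
                e.setAttribute (qName, atts.getValue (i));
            else
                e.setAttributeNS (uri, qName, atts.getValue (i));
        }
    }

    // Builds an element by hand, the way startElement() might.
    static Element demo ()
    {
        Document factory;
        AttributesImpl atts = new AttributesImpl ();
        Element e;

        try {
            factory = DocumentBuilderFactory
                .newInstance ()
                .newDocumentBuilder ()
                .newDocument ();
        } catch (ParserConfigurationException x) {
            throw new IllegalStateException ("can't get DOM factory", x);
        }
        atts.addAttribute ("", "id", "id", "CDATA", "7");
        atts.addAttribute ("http://www.w3.org/1999/xlink", "href",
            "xlink:href", "CDATA", "next.xml");
        e = factory.createElement ("record");
        copy (atts, e);
        return e;
    }
}
```

In a handler like DomFilter, the copy() call would go right after each element is created in startElement(), while the Attributes object is still valid.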
Turning SAX Events into Custom Data Structures
If your application data structure or interchange syntax is already defined,
you may not be able to unmarshal it using software based on the
numerous schema-oriented tools. However, lots of software uses SAX to do
this efficiently. Once you understand how SAX models data in XML
documents, you can treat unmarshaling much like any other parsing problem.
It's closely associated with marshaling your data structures to XML. Here
we'll look at some of the issues you may want to consider when
transforming XML into your data structures.
You may find that some individual data items, such as integers and dates,
use the low-level encoding rules that are specified in Part 2 of the W3C
XML Schema specification (http://www.w3c.org/TR/xmlschema-2/). Those
encodings are low-level policy decisions, and they're conceptually
independent of the rest of the W3C Schema; you can use them even if you
don't buy the W3C approach to those schemas. Some other schema
systems, such as Relax-NG, incorporate those low-level encoding policies
without adopting more problematic parts of the W3C XML Schema
specification. Your application might likewise want to use these policies.
One basic high-level encoding issue is how closely the XML structures and
application structures should match. For example, an element will be
easier to unmarshal by mapping its attributes (or child elements) directly to
properties of a single application object rather than by mapping them to
properties of several different objects. The latter design is more complex,
and for many purposes it could be much more appropriate, but such
unmarshaling code needs more complex state.
Regularity of the various structures is another issue. It's usually less work
to handle regular structures, since it's easy to create general methods and
reuse them. Bugs are less frequent and more easily found than when
every transformation involves yet another special case.
You'll need to figure out how much state you need to track and what
techniques you will use. You might be able to use extremely simple
parsing state machines; one of these is shown later, in Example 6-2.
Sometimes it might be easier to unmarshal fragments into an intermediate
form (as in the DOM subtrees example earlier), and map that form to your
application structure before discarding them.
Often some sort of recursive-descent parsing algorithm that explicitly
tracks the state of your parsing activities will be useful. It will often be
helpful to keep a stack of pending elements and attributes, as shown later
(in Example 5-1). But since the XML structures might not map directly to
your application structures, you might also need to stack objects you're in
various stages of unmarshaling.
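As a rough illustration of that kind of state tracking, here is a minimal sketch that unmarshals a made-up order format. Everything about the format and the class names (OrderUnmarshaler, Order, the order and item elements, the id attribute) is an assumption invented for this example. It keeps one stack of open element names and another stack of partially built objects.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical unmarshaler for a made-up <orders> vocabulary.
class OrderUnmarshaler extends DefaultHandler
{
    static class Order
    {
        String id;
        List<String> items = new ArrayList<String> ();
    }

    final List<Order> orders = new ArrayList<Order> ();

    // names of the elements that are currently open
    private final Deque<String> open = new ArrayDeque<String> ();
    // objects that are only partly unmarshaled so far
    private final Deque<Order> pending = new ArrayDeque<Order> ();
    private final StringBuilder text = new StringBuilder ();

    public void startElement (String uri, String local,
            String qName, Attributes atts)
    {
        open.push (qName);
        text.setLength (0);
        if ("order".equals (qName)) {
            Order o = new Order ();

            o.id = atts.getValue ("id");
            pending.push (o);
        }
    }

    public void characters (char buf [], int offset, int length)
        { text.append (buf, offset, length); }

    public void endElement (String uri, String local, String qName)
    {
        open.pop ();
        if ("item".equals (qName) && "order".equals (open.peek ()))
            pending.peek ().items.add (text.toString ().trim ());
        else if ("order".equals (qName))
            orders.add (pending.pop ());
    }

    static List<Order> unmarshal (String xml)
    {
        try {
            XMLReader producer;
            OrderUnmarshaler consumer = new OrderUnmarshaler ();

            producer = SAXParserFactory.newInstance ()
                .newSAXParser ().getXMLReader ();
            producer.setContentHandler (consumer);
            producer.parse (new InputSource (new StringReader (xml)));
            return consumer.orders;
        } catch (Exception e) {
            throw new IllegalStateException ("can't unmarshal", e);
        }
    }
}
```

The sketch ignores namespaces and uses qualified names throughout, which is enough for a vocabulary this small; a production unmarshaler would normally compare namespace URIs and local names instead.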
The worst scenario is when neither the XML text nor the application data
structures are very regular. Software to work with that kind of system
quickly gets fragile as it grows, and you'll probably want to change some
of your application constraints.
XML Pipelines
In Chapter 2, the section "XMLWriter: an Event Consumer" briefly
discussed the concept of an XML pipeline. In that simple case, it involved
reading, transforming, and then writing XML text. This concept is a
powerful model for working with SAX; it is the natural framework for
developing SAX components. These components won't usually be JavaBeans-style
components, intended for use with graphical code builder tools, but they
will still be specialized and easily reusable.
Exactly what is a SAX event pipeline? It's a series of components, each a
pipeline stage connected so consumers act as producers for the next stage,
as shown in Figure 4-1. The components pass events through, perhaps
changing them on the fly to filter, reorganize, augment, or otherwise
transform the data as it streams through. (The term filter is sometimes used
to mean the same thing as a stage, though it's only one type of role for a
stage.) The first producer could be a parser, or some other program
component. The last consumer will probably have some defined output, such
as XML text (XMLWriter), a DOM document (using the classes shown
earlier), or an application-specific data structure. Intermediate stages in the
pipeline have at least one pipeline stage as output, and they might
produce other outputs such as data structures. Or they might only be used to
analyze or condition the inputs to later stages.
Figure 4-1. SAX2 event pipeline (Producer, stage1, stage2, Consumer)
Pipeline stages can be used to create functional layers, or they can simply
be used to define clean module boundaries. Some stages may work well
with fragments of XML, while others may expect to process entire
documents. The order in which processing tasks occur could be critically
important or largely incidental. Stages can be application specific or
general purpose. In addition to reading and writing XML, examples of such
general-purpose stages include:
• Cleaning up namespace information to re-create prefix declarations
  and references, replace old URIs with current ones, or give
  unqualified names a namespace.
• Performing XSLT transformations.
• Validating against an appropriate DTD or schema.
• Transforming input text to eliminate problematic character
  representations. (Several recent W3C specifications require using Unicode
  Normalization Form C.)
• Supporting the xml:base model for determining base URIs.
• Passing data through pipeline stages on remote servers.
• Implementing XInclude or similar replacements for DTD-based
  external entity processing.
• Performing well-formedness tests to guard against sloppy producers
  (parsers won't need this).
More application-specific pipeline stages might include:
• Performing validation using procedural logic with access to system
  state.
• Collecting links, to support tasks such as verifying they all work.
• Unmarshaling application-specific data structures.
• Stripping out data that later processing must never see. For example,
  SOAP 1.1 messages must never include processing instructions or
  DTDs, and some kinds of XHTML rendering engines must not see
  <font> tweaks.
This process is different from how a workflow is managed in a data
structure API such as DOM. In both cases you can assemble workflow
components, with intermediate work products represented as data structures.
With SAX, those workflow components would be pipelines; pipeline
stages wouldn't necessarily correspond to individual workflow
components, although they might. With a data structure API, the intermediate
work products must always use that API; with SAX they can use whatever
representation is convenient, including XML text or a specialized
application data structure.
Beyond defining the event consumer interfaces and how to hook them up
to XML parsers, SAX includes only limited support for pipelines. That is
primarily through the XMLFilterImpl class. The support is limited in part
because XMLFilterImpl doesn't provide full support for the two extension
handlers so that by default it won't pass enough of the XML Infoset to
support some interesting tasks (including several in the previous lists).
In the rest of this section we talk about that class, XSLT and the
javax.xml.transform package, and about a more complete framework (the
gnu.xml.pipeline package), to illustrate one alternative approach.
You might also be interested in the pipeline framework used in the
Apache Cocoon v2 project. Cocoon is designed for managing large web
sites based on XML. One difference between the current Cocoon pipeline
framework and the GNU pipeline framework is that Cocoon excludes the
two SAX DTD-handling interfaces, making Cocoon pipelines unsuitable
for tasks that need such DTD information. (Examples include DTD-based
validation and parts of the XML Base URI specification that require
detection of external entity boundaries.) At this writing, Cocoon 2.0 has just
shipped its first stable release, ending its beta cycle.
The XMLFilterImpl Class
The XMLFilterImpl class is new in SAX2, though a similar layer was in use
on top of SAX1 parsers. Think of this class as a hybrid between an event
consumer and an event producer, which can be used in either mode:
• In its event consumer role, it's a base class that forwards events to
  another consumer. Callers push events through the filter, which
  postprocesses them. Subclasses would normally override methods for
  those events and invoke the superclass methods when they choose to
  pass them on (after postprocessing the data to be reported).
• In its event producer role, it's a specialized XMLReader that registers
  itself as the consumer for a parent reader and delegates parsing to that
  parent. Callers pull data through the filter by calling parse(); it looks
  like a SAX parser that preprocesses Infoset data before reporting it.
When you subclass XMLFilterImpl, you'll primarily be concerned with its
role as an event consumer because you'll be writing event handler code.
The bulk of the work in a filter is event handling. When you need to filter
DeclHandler or LexicalHandler events, XMLFilterImpl won't know how to
handle them. You'll have to add code to handle those events; get the
source code for that SAX class, and follow the model used for
ContentHandler support. The following code snippet shows how this is set up.
It supports the producer side (parsing a document and automatically
filtering its events). It also shows the consumer-side infrastructure, meaning
events are normally passed through untouched, but subclasses will
override methods to intercept events and change how they get handled:
public class ExtendedFilter extends XMLFilterImpl
    implements LexicalHandler, DeclHandler
{
    DeclHandler declHandler;
    LexicalHandler lexicalHandler;

    private static String declID =
        "http://xml.org/sax/properties/declaration-handler";
    private static String lexicalID =
        "http://xml.org/sax/properties/lexical-handler";

    public void setProperty (String uri, Object handler)
    throws SAXNotRecognizedException, SAXNotSupportedException
    {
        if (declID.equals (uri))
            declHandler = (DeclHandler) handler;
        else if (lexicalID.equals (uri))
            lexicalHandler = (LexicalHandler) handler;
        else
            super.setProperty (uri, handler);
    }

    // support producer mode operations
    public void parse (InputSource in)
    throws SAXException, IOException
    {
        XMLReader parent = getParent ();

        if (parent != null) {
            parent.setProperty (declID, this);
            parent.setProperty (lexicalID, this);
        }
        super.parse (in);
    }

    // support consumer mode operations
    public void comment (char buf [], int offset, int length)
    throws SAXException
    {
        if (lexicalHandler != null)
            lexicalHandler.comment (buf, offset, length);
    }

    // ... likewise for other LexicalHandler and DeclHandler methods
}
When you're using such a filter just as a consumer, you'll have to register
it as a handler for the event classes you're interested in, using methods
like setContentHandler() as you would for any other event consumer. In
such a case there's never any confusion about which XMLReader to use to
parse since any filter component is only postprocessing.
When you use an XMLFilterImpl to produce events, you need to provide a
parent parser, probably by using XMLFilter.setParent(). When you
invoke parse(), the filter sets itself up to proxy all of the SAX core event
handler methods (as shown earlier for one of the extension methods) as
well as EntityResolver and ErrorHandler. You'll need to pay particular
attention that you invoke the filter, instead of that real parser. It's easy to
run into bugs that way, particularly if you're chaining multiple filters
together. Although every filter stage has a parse() method, you only want
to invoke it on the last postprocessing stage. It's easy to get confused
about that.
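The two modes can be compared side by side in a small sketch. The classes here (RenameFilter, NameCollector, FilterModes) are hypothetical and exist only for this illustration; they use nothing beyond XMLFilterImpl and the JAXP bootstrap. The first method pushes events from a parser through one filter, and the second chains two filters and calls parse() on the last one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: renames one element type, passes the rest through.
class RenameFilter extends XMLFilterImpl
{
    private final String from, to;

    RenameFilter (String from, String to)
        { this.from = from; this.to = to; }

    private String map (String name)
        { return from.equals (name) ? to : name; }

    public void startElement (String uri, String local,
            String qName, Attributes atts)
    throws SAXException
        { super.startElement (uri, map (local), map (qName), atts); }

    public void endElement (String uri, String local, String qName)
    throws SAXException
        { super.endElement (uri, map (local), map (qName)); }
}

// Records the local names of the elements it sees.
class NameCollector extends DefaultHandler
{
    final StringBuilder names = new StringBuilder ();

    public void startElement (String uri, String local,
            String qName, Attributes atts)
    {
        if (names.length () > 0)
            names.append (' ');
        names.append (local);
    }
}

class FilterModes
{
    private static XMLReader newParser () throws Exception
    {
        SAXParserFactory spf = SAXParserFactory.newInstance ();

        spf.setNamespaceAware (true);
        return spf.newSAXParser ().getXMLReader ();
    }

    // Consumer mode: the real parser is invoked, and it pushes
    // events through the filter to the collector.
    static String consumerMode (String xml)
    {
        try {
            XMLReader parser = newParser ();
            RenameFilter filter = new RenameFilter ("b", "c");
            NameCollector out = new NameCollector ();

            filter.setContentHandler (out);
            parser.setContentHandler (filter);
            parser.parse (new InputSource (new StringReader (xml)));
            return out.names.toString ();
        } catch (Exception e) {
            throw new IllegalStateException (e);
        }
    }

    // Producer mode: two chained filters; parse() is invoked on
    // the last stage only, never on the parser or the first filter.
    static String producerMode (String xml)
    {
        try {
            RenameFilter first = new RenameFilter ("b", "c");
            RenameFilter last = new RenameFilter ("c", "d");
            NameCollector out = new NameCollector ();

            first.setParent (newParser ());
            last.setParent (first);
            last.setContentHandler (out);
            last.parse (new InputSource (new StringReader (xml)));
            return out.names.toString ();
        } catch (Exception e) {
            throw new IllegalStateException (e);
        }
    }
}
```

In producer mode, calling parse() on the first filter or on the parser would silently bypass the later stage, which is the kind of bug the text warns about.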
Some XMLFilter implementations only operate in producer mode. That is
unfortunate since it means that they only accept input like a parser; they
can't be used to postprocess SAX events.
XMLFilter Examples
This book includes some examples that use XMLFilterImpl as a base class,
supporting both filter modes:
• Example 6-3 shows a custom handler interface, delivering
  application-specific unmarshaled data. This interface can be used either to
  postprocess or to preprocess SAX events, without additional setup.
• Example 6-9 replaces processing instructions with the content of an
  included document so that downstream stages won't know about the
  substitution. When used to postprocess events, the handler may need
  to be set up with appropriate EntityResolver and ErrorHandler
  objects.
Sun is developing a Multi-Schema Validator engine, which uses SAX
filters to implement validators for schema systems including RELAX (also
called ISO RELAX), TREX, RELAX-NG (combining the best of RELAX and
TREX), and W3C XML schemas. This work ties in to the
org.iso_relax.verifier framework for validator APIs (at
http://iso-relax.sourceforge.net), which also supports using SAX objects
(such as filters and content handlers) that validate schemas.
If you're using RDDL (http://www.rddl.org) as a convention for associating
resources with XML namespaces, you may find the
org.rddl.sax.RDDLFilter class to be useful. It parses RDDL documents and
lets you determine the various resources associated with namespaces,
such as a DTD, a preferred CSS or XSLT stylesheet, or the schema using
any of several schema languages. This is another producer-mode only filter.
The javax.xml.transform.sax Package
The javax.xml.transform APIs provide ways to apply XSLT transforms to
XML data. The top level APIs work with the pull model, and map one
XML representation into another one with a
Transformer.transform(source,result) call. Those representations can include
XML text, DOM trees, or some kinds of SAX event streams. Except for that
SAX support, you can look at the package as supporting three-stage
pipelines, with the middle stage always XSLT (or else a null transform). The
javax.xml.transform.sax APIs let you integrate XSLT into longer SAX
pipelines in several ways, including one flexible pure push mode.
The SAXTransformerFactory class is important for most such pipeline
usage. You could use code like this to get a factory matching the
code fragments shown later:
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.*;

String stylesheetURI = ...;
String documentURI = ...;
ContentHandler contentHandler = ...;
LexicalHandler lexicalHandler = ...;

TransformerFactory tf;
SAXTransformerFactory stf;
SAXSource stylesheet;

tf = TransformerFactory.newInstance ();
if (!tf.getFeature (SAXTransformerFactory.FEATURE)
        || !tf.getFeature (SAXSource.FEATURE))
    throw new Exception ("not enough API support");
stylesheet = new SAXSource (new InputSource (stylesheetURI));
stf = (SAXTransformerFactory) tf;
Most Java XSLT engines, such as SAXON (available at
http://saxon.sourceforge.net) and Xalan (available at
http://xml.apache.org/xalan-j) fully support the additional SAX-oriented
APIs, although that is not required.
SAX in Push-Mode with XSLT
The approach that's most flexible involves a TransformerHandler
initialized to apply a specific XSLT transform. These are event consumer stages,
set up to push their results through to other stages. They support the
ContentHandler, LexicalHandler, and DTDHandler interfaces, but not
DeclHandler. This is best used in conjunction with the SAXResult class,
which packages both non-DTD SAX handlers so they can collect the
output of a transform. After getting the factory as shown in the preceding
code, make sure it supports SAXResult, then get and use the handler in a
manner such as the following:
XMLReader producer;
SAXResult out;
TransformerHandler handler;

if (!tf.getFeature (SAXResult.FEATURE))
    throw new Exception ("not enough API support");
handler = stf.newTransformerHandler (stylesheet);
out = new SAXResult ();
out.setContentHandler (contentHandler);
out.setLexicalHandler (lexicalHandler);
// no DTD support from the SAXResult class!!
handler.setResult (out);

producer = XMLReaderFactory.createXMLReader ();
producer.setContentHandler (handler);
producer.setDTDHandler (handler);
producer.setProperty ("http://xml.org/sax/properties/lexical-handler",
    handler);
producer.parse (documentURI);
This style of usage is particularly well suited to XML pipelines. It's just a
DTD-deprived pipeline stage, except that the output setup needs a
non-SAX class. The reason that approach is particularly useful for pipeline
processing is that both the input and output to the XSLT transform use SAX
event streams, so it can easily be spliced between any two parts of an
event pipeline. It also means you can use push mode event producers,
which invoke SAX callbacks directly.
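To see the push-mode arrangement run end to end, here is a self-contained sketch. The class PushModeXslt, its nested Echo handler, and the tiny inline stylesheet are all made up for this illustration, and the stylesheet is read through a StreamSource rather than the SAXSource used in the fragments above. It assumes the default JAXP transformer supports the SAX-oriented features, which the code checks before casting.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demonstration class, named only for this sketch.
class PushModeXslt
{
    // Illustrative stylesheet: reports how many <item> elements it saw.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/'>"
        + "<out><xsl:value-of select='count(//item)'/></out>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Final consumer: rebuilds a crude text form of the events it gets.
    static class Echo extends DefaultHandler
    {
        final StringBuilder text = new StringBuilder ();

        public void startElement (String uri, String local,
                String qName, Attributes atts)
            { text.append ('<').append (qName).append ('>'); }

        public void endElement (String uri, String local, String qName)
            { text.append ("</").append (qName).append ('>'); }

        public void characters (char buf [], int offset, int length)
            { text.append (buf, offset, length); }
    }

    static String run (String xml)
    {
        try {
            TransformerFactory tf = TransformerFactory.newInstance ();

            if (!tf.getFeature (SAXTransformerFactory.FEATURE)
                    || !tf.getFeature (SAXResult.FEATURE))
                throw new Exception ("not enough API support");
            SAXTransformerFactory stf = (SAXTransformerFactory) tf;

            // the XSLT stage is an event consumer ...
            TransformerHandler handler = stf.newTransformerHandler (
                new StreamSource (new StringReader (STYLESHEET)));
            Echo echo = new Echo ();

            // ... which pushes its output on to the next consumer
            handler.setResult (new SAXResult (echo));

            SAXParserFactory spf = SAXParserFactory.newInstance ();
            spf.setNamespaceAware (true);
            XMLReader producer = spf.newSAXParser ().getXMLReader ();

            producer.setContentHandler (handler);
            producer.setDTDHandler (handler);
            producer.setProperty (
                "http://xml.org/sax/properties/lexical-handler", handler);
            producer.parse (new InputSource (new StringReader (xml)));
            return echo.text.toString ();
        } catch (Exception e) {
            throw new IllegalStateException (e);
        }
    }
}
```

Because the handler is just a ContentHandler, the parser here could be replaced by any other event producer, including application code that invokes the callbacks directly.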
SAX in Pull-Mode with XSLT
You can also get a pull-style API, using an XMLFilter that is initialized to
apply a specific XSLT transform. Such filters may be used as event
producers, only at one end of a SAX pipeline. After getting the factory as shown
in the previous code listing, you would make sure it supports this
functionality, then get and use the filter like this:
XMLFilter producer;

if (!tf.getFeature (SAXTransformerFactory.FEATURE_XMLFILTER))
    throw new Exception ("not enough API support");
producer = stf.newXMLFilter (stylesheet);
producer.setContentHandler (contentHandler);
producer.setProperty ("http://xml.org/sax/properties/lexical-handler",
    lexicalHandler);
producer.parse (documentURI);
Such a call would use the XSLT stylesheet to preprocess input to the
handlers you provide. The SAXResult class, shown earlier, supports a similar
processing model. If your transformer can accept one of those, a
pull-mode Transformer.transform() call pushes preprocessed results into a
ContentHandler and LexicalHandler, like the XMLFilter.parse() call.
You can also use SAX in a pull-mode Transformer.transform() call by
using a SAXSource object. That lets you provide an InputSource (as shown
earlier) as well as an XMLReader, which may be set up with a particular
ErrorHandler and EntityResolver (not shown). To use that in a SAX event
pipeline, you can make that reader be an XMLFilter that preprocesses the
input to the XSLT transform.
You can combine both SAXSource and SAXResult objects to get a kind of
pull mode pipeline including one XSLT transform stage, without even
needing to use the SAXTransformerFactory class. To get multiple XSLT
transform stages without needing intermediate storage (XML text, a DOM
tree, or so on), use the TransformerHandler class as shown earlier,
postprocessing results through in a SAXResult. Or if you prefer, package an
XMLFilter from a SAXTransformerFactory to preprocess data through a
SAXSource that you provide to the Transformer.transform() call. (I
recommend sticking to the pure TransformerHandler approach, since it's not
as confusing.)
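The SAXSource and SAXResult combination might be sketched as follows. The class PullModeXslt and its Names handler are hypothetical, and an identity (null) transform stands in for a real stylesheet to keep the example short; passing a stylesheet Source to newTransformer() would give a real XSLT stage.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demonstration class, named only for this sketch.
class PullModeXslt
{
    // Records the qualified names of the elements it is sent.
    static class Names extends DefaultHandler
    {
        final StringBuilder text = new StringBuilder ();

        public void startElement (String uri, String local,
                String qName, Attributes atts)
        {
            if (text.length () > 0)
                text.append (' ');
            text.append (qName);
        }
    }

    static String run (String xml)
    {
        try {
            TransformerFactory tf = TransformerFactory.newInstance ();

            if (!tf.getFeature (SAXSource.FEATURE)
                    || !tf.getFeature (SAXResult.FEATURE))
                throw new Exception ("not enough API support");

            // the reader could just as well be an XMLFilter that
            // preprocesses the input to the transform
            SAXParserFactory spf = SAXParserFactory.newInstance ();
            spf.setNamespaceAware (true);
            XMLReader reader = spf.newSAXParser ().getXMLReader ();

            // identity transform, standing in for a stylesheet
            Transformer transformer = tf.newTransformer ();
            Names names = new Names ();

            // one pull-mode call drives the whole pipeline
            transformer.transform (
                new SAXSource (reader,
                    new InputSource (new StringReader (xml))),
                new SAXResult (names));
            return names.text.toString ();
        } catch (Exception e) {
            throw new IllegalStateException (e);
        }
    }
}
```

Note that only the plain TransformerFactory is needed here; the SAXTransformerFactory cast from the earlier fragments is not used.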
The gnu.xml.pipeline Framework
This framework takes a different approach to building pipelines than
XMLFilterImpl or XMLFilter. Two key characteristics are its built-in support for
all the SAX2 handlers, including the extension handlers, and its exclusive
focus on the postprocessing model. In addition, it has several utility filters
and some factory methods that can automate construction and
initialization of pipelines. The core interface is EventConsumer:
public interface EventConsumer
{
    public ContentHandler getContentHandler ();
    public DTDHandler getDTDHandler ();
    public Object getProperty (String id)
    throws SAXNotRecognizedException;
    public void setErrorHandler (ErrorHandler handler);
}
With that interface, pipelines are normally set up beginning with the last
consumer and then working toward the first consumer. There is a formal
convention that states pipeline stages have a constructor that takes an
EventConsumer parameter, which is used to construct pipelines from
simple textual descriptions (which look like Unix-style command pipelines).
That convention makes it easy to construct a pipeline by hand, as shown
in the following code. Stages are strongly expected to share the same
error handling; the error handler is normally established after the pipeline
is set up, when a pipeline is bound to an event producer.
There is a class that corresponds to the pure consumer mode
XMLFilterImpl, except that it implements all the SAX2 event consumer interfaces,
not just the ones in the core API. LexicalHandler and DeclHandler are
fully supported. This class also adds convenience methods such as the
following:
public class EventFilter
    implements EventConsumer, ContentHandler, DTDHandler,
        LexicalHandler, DeclHandler
{
    // ... lots omitted ...

    // hook up all event consumer interfaces to the producer
    // map some known EventFilters into XMLReader feature settings
    public static void bind (XMLReader producer, EventConsumer consumer)
        { /* code omitted */ }

    // wrap a "consumer mode" XMLFilterImpl
    public void chainTo (XMLFilterImpl next)
        { /* code omitted */ }

    // ... lots omitted ...
}
Example 4-4 shows how one simple event pipeline works using the GNU
pipeline framework. It looks like it has three pipeline components (in
addition to the parser), but in this case it's likely that two of them will be
optimized away into parser feature flag settings: NSFilter restores
namespace-related information that is discarded by SAX2 parser defaults (bind()
sets namespace-prefixes to true and discards that filter), and
ValidationConsumer is a layered validator that may not be necessary if the
underlying parser can support validation (in which case the validation flag is
set to true and the filter is discarded). Apart from arranging that validation
errors are reported and using the GNU DOM implementation instead of
Crimson's, this code does exactly what the first SAX-to-DOM example above
does.*
* There is a generic DomConsumer class that bootstraps using whatever JAXP sets up as the
default DOM. Such a generic consumer can't know the implementation-specific back
doors needed to implement all the bells and whistles DOM demands.
Example 4-4. SAX events to DOM document (using GNU DOM)
import gnu.xml.pipeline.*;

public Document SAX2DOM (String uri)
throws SAXException, IOException
{
    DomConsumer dom;
    EventConsumer consumer;
    XMLReader producer;

    // keep a typed reference to the DOM builder; the stages that
    // wrap it are only known as EventConsumers
    dom = new gnu.xml.dom.Consumer ();
    consumer = new ValidationConsumer (dom);
    consumer = new NSFilter (consumer);
    producer = XMLReaderFactory.createXMLReader ();
    producer.setErrorHandler (new DefaultHandler () {
        public void error (SAXParseException e)
        throws SAXException
            { throw e; }
    });
    EventFilter.bind (producer, consumer);
    producer.parse (uri);
    return dom.getDocument ();
}
There are some interesting notions lurking in this example. For instance,
when validation is a postprocessing stage, it can be initialized with a
particular DTD and hooked up to an XMLReader that walks DOM nodes.
That way, that DOM content can be incrementally validated as
applications change it. Similarly, application code can produce a SAX event
stream and validate content without saving it to a file. This same
postprocessing approach could be taken with validators based on any of the
various schema systems.
There are a variety of other utility pipeline stages and support classes in
the gnu.xml.pipeline package. One is briefly shown later (in Example
6-7). Others include XInclude and XSLT support, as well as a
TeeConsumer to send events down two pipelines (like a tee joint used in
plumbing). This can be useful to save output for debugging; you can write XML
text to a file, or save it as a DOM tree, and watch the events that come out
of a particular pipeline stage to find problematic areas.
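The tee idea doesn't depend on the GNU framework. A stripped-down version for the core ContentHandler events could be sketched like this in plain SAX. Note that TeeHandler, Trace, and TeeDemo are hypothetical classes written for this example; this is not the gnu.xml.pipeline TeeConsumer class, which also covers the DTD and extension handlers.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical tee stage: sends a few core events to two consumers.
class TeeHandler extends DefaultHandler
{
    private final ContentHandler first, rest;

    TeeHandler (ContentHandler first, ContentHandler rest)
        { this.first = first; this.rest = rest; }

    public void startDocument () throws SAXException
        { first.startDocument (); rest.startDocument (); }

    public void endDocument () throws SAXException
        { first.endDocument (); rest.endDocument (); }

    public void startElement (String uri, String local,
            String qName, Attributes atts)
    throws SAXException
    {
        first.startElement (uri, local, qName, atts);
        rest.startElement (uri, local, qName, atts);
    }

    public void endElement (String uri, String local, String qName)
    throws SAXException
    {
        first.endElement (uri, local, qName);
        rest.endElement (uri, local, qName);
    }

    public void characters (char buf [], int offset, int length)
    throws SAXException
    {
        first.characters (buf, offset, length);
        rest.characters (buf, offset, length);
    }
}

// Simple consumer that logs what it sees, for debugging.
class Trace extends DefaultHandler
{
    final StringBuilder log = new StringBuilder ();

    public void startElement (String uri, String local,
            String qName, Attributes atts)
        { log.append ('<').append (qName).append ('>'); }

    public void endElement (String uri, String local, String qName)
        { log.append ("</").append (qName).append ('>'); }

    public void characters (char buf [], int offset, int length)
        { log.append (buf, offset, length); }
}

class TeeDemo
{
    // Push a few events by hand; return what each branch recorded.
    static String [] run ()
    {
        try {
            Trace debug = new Trace ();
            Trace real = new Trace ();
            TeeHandler tee = new TeeHandler (debug, real);
            char text [] = "hi".toCharArray ();

            tee.startDocument ();
            tee.startElement ("", "a", "a", new AttributesImpl ());
            tee.characters (text, 0, text.length);
            tee.endElement ("", "a", "a");
            tee.endDocument ();
            return new String [] {
                debug.log.toString (), real.log.toString () };
        } catch (SAXException e) {
            throw new IllegalStateException (e);
        }
    }
}
```

A fuller version would also forward prefix mappings, ignorable whitespace, processing instructions, and the document locator.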
Even if you don't use that GNU framework, you should keep in mind that
SAX pipeline stages can be used to package significant and reusable XML
processing components.
5
Other SAX Classes
In this chapter:
Helper Classes
SAX1 Support
The preceding chapters have addressed all of the most important SAX2
classes and interfaces. You may need to use a handful of other classes,
including simple implementations of a few more interfaces and SAX1
support. This chapter briefly presents those remaining classes and interfaces.
Your parser distribution should have SAX2 support, with complete javadoc
for these classes. Consult that documentation if you need more
information than found in this book. The API summary in Appendix A should also
be helpful.
Helper Classes
There are several classes in the org.xml.sax.helpers package that you will
probably find useful from time to time.
The AttributesImpl Class
This is a general-purpose implementation of the SAX2 Attributes interface.
As well as reading attribute information (as defined in the interface), you
can write and modify it. This class is quite handy when your application
code is producing SAX2 events, perhaps because it is converting data
structures to a SAX event stream.
Remember the attributes provided to the ContentHandler.startElement()
event callback are only valid for the duration of that call. If you need a
copy of those attributes for later use, it's simplest to use this class; just
create a new instance using the copy constructor. That copy constructor is
one of the most widely used APIs in this class, other than the Attributes
methods.
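A short sketch shows why the copy matters. The class SavedAttributes and its methods are hypothetical names for this example; the reuse of the "live" object is simulated by hand here, whereas in real code a parser might reuse its attribute object between callbacks.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, named only for this sketch.
class SavedAttributes
{
    // Take a private copy that stays valid after the callback returns.
    static Attributes snapshot (Attributes live)
        { return new AttributesImpl (live); }

    static String demo ()
    {
        AttributesImpl live = new AttributesImpl ();
        Attributes saved;

        live.addAttribute ("", "id", "id", "CDATA", "42");
        saved = snapshot (live);

        // simulate the producer reusing its attribute object
        // for the next element
        live.clear ();
        live.addAttribute ("", "id", "id", "CDATA", "other");

        // the snapshot still holds the original value
        return saved.getValue ("id");
    }
}
```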
It's often handy to keep a stack around to track the currently open
elements and attributes. If you support xml:base, you'll also want to track
base URIs for the document and for any external parsed entities. This is
easy to implement using another key method provided by this class,
addAttribute(). Example 5-1 shows how to maintain such a stack with
xml:base support. It shows full support for XML namespaces, unlike
Example 2-2, which is simple and attribute-free (shown in Chapter 2 in
the section "Basic ContentHandler Events").
Example 5-1. Maintaining an element and attribute stack
import java.io.IOException;
import java.net.URL;
import java.util.Hashtable;
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class XStack extends DefaultHandler
    implements LexicalHandler, DeclHandler
{
    static class StackEntry
    {
        final String nsURI, localName;
        final String qName;
        final Attributes atts;
        final StackEntry parent;

        StackEntry (
            String namespace, String local,
            String name,
            Attributes attrs,
            StackEntry next
        ) {
            this.nsURI = namespace;
            this.localName = local;
            this.qName = name;
            this.atts = new AttributesImpl (attrs);
            this.parent = next;
        }
    }

    private Locator locator;
    private StackEntry current;
    private Hashtable extEntities = new Hashtable ();

    private static final String xmlNamespace
        = "http://www.w3.org/XML/1998/namespace";

    private void addMarker (String label, String uri)
    throws SAXException
    {
        AttributesImpl atts = new AttributesImpl ();

        if (locator != null && locator.getSystemId () != null)
            uri = locator.getSystemId ();

        // guard against InputSource objects without system IDs
        if (uri == null)
            throw new SAXParseException ("Entity URI is unknown", locator);

        // guard against illegal relative URIs (Xerces)
        try { new URL (uri); }
        catch (IOException e) {
            throw new SAXParseException ("parser bug: relative URI",
                locator);
        }
        atts.addAttribute (xmlNamespace, "base", "xml:base", "CDATA", uri);
        current = new StackEntry ("", "", label, atts, current);
    }

    // walk up stack to get values for xml:space, xml:lang, and so on
    public String getInheritedAttribute (String uri, String name)
    {
        String retval = null;
        boolean useNS = (uri != null && uri.length () != 0);

        for (StackEntry here = current;
                retval == null && here != null;
                here = here.parent) {
            if (useNS)
                retval = here.atts.getValue (uri, name);
            else
                retval = here.atts.getValue (name);
        }
        return retval;
    }

    // knows about XML Base recommendation, and xml:base attributes
    // can be used in callbacks for elements, PIs, comments,
    // characters, ignorable whitespace, and so on.
    public URL getBaseURI ()
    throws IOException
    {
        return getBaseURI (current);
    }

    private URL getBaseURI (StackEntry here)
    throws IOException
    {
        String uri = null;
        while (uri == null && here != null) {
            uri = here.atts.getValue (xmlNamespace, "base");
            if (uri != null)
                break;
            here = here.parent;
        }

        // marker for document or entity boundary? absolute.
        if (here.qName.charAt (0) == '#')
            return new URL (uri);

        // else it might be a relative uri.
        int offset = uri.indexOf (":/");

        if (offset == -1 || uri.indexOf (':') < offset)
            return new URL (getBaseURI (here.parent), uri);
        else
            return new URL (uri);
    }

    // from ContentHandler interface
    public void startElement (
        String namespace,
        String local,
        String name,
        Attributes attrs
    ) throws SAXException
        { current = new StackEntry (namespace, local, name, attrs,
            current); }

    public void endElement (String namespace, String local, String name)
        throws SAXException
        { current = current.parent; }

    public void setDocumentLocator (Locator l)
        { locator = l; }

    public void startDocument ()
        throws SAXException
        { addMarker ("#DOCUMENT", null); }

    public void endDocument ()
        { current = null; }

    // DeclHandler interface
    public void externalEntityDecl (String name, String publicId,
            String systemId)
        throws SAXException
    {
        if (name.charAt (0) == '%')
            return;
        // absolutize URL
        try {
            URL url = new URL (locator.getSystemId ());
            systemId = new URL (url, systemId).toString ();
        } catch (IOException e) {
            // what could we do?
        }
        extEntities.put (name, systemId);
    }

    public void elementDecl (String name, String model) { }

    public void attributeDecl (String element, String name,
        String type, String mode, String defaultValue) {}

    public void internalEntityDecl (String name, String value) { }

    // LexicalHandler interface
    public void startEntity (String name)
        throws SAXException
    {
        String uri = (String) extEntities.get (name);
        if (uri != null)
            addMarker ("#ENTITY", uri);
    }
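
    public void endEntity (String name)
        throws SAXException
    {
        // pop only the markers that startEntity() pushed; internal,
        // parameter, and DTD entities never got one
        if (extEntities.get (name) != null)
            current = current.parent;
    }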

    public void startDTD (String root, String publicId, String systemId) {}
    public void endDTD () {}
    public void startCDATA () {}
    public void endCDATA () {}
    public void comment (char buf[], int off, int len) {}
}
With such a stack of attributes, it's easy to find the current values of
inherited attributes like xml:space, xml:lang, xml:base, and their
application-specific friends. For example, an application might have a
policy that all unspecified attributes with #IMPLIED default values are
inherited from some ancestor element's value or are calculated using
data found in such a context stack.
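To make the inheritance lookup concrete, here is a minimal, self-contained sketch of the same idea, reduced to a stack of attribute sets. The class name InheritedLang, the leaf element, and the langOfLeaf() helper are all invented for illustration, and the sketch assumes a JAXP parser is available through SAXParserFactory. It is not a replacement for Example 5-1, since it ignores entity boundaries.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the xml:lang in effect at the last <leaf>.
class InheritedLang extends DefaultHandler {
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    private final Deque<Attributes> stack = new ArrayDeque<>();
    String langAtLeaf;

    // Search from the innermost element outward, in the spirit of
    // getInheritedAttribute() in Example 5-1.
    String inherited(String uri, String local) {
        for (Attributes atts : stack) {
            String value = atts.getValue(uri, local);
            if (value != null)
                return value;
        }
        return null;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // Copy the attributes: the parser may reuse its own object.
        stack.push(new AttributesImpl(atts));
        if ("leaf".equals(local))
            langAtLeaf = inherited(XML_NS, "lang");
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        stack.pop();
    }

    // Parses the text with a namespace-aware parser and returns the
    // inherited xml:lang value seen at the last <leaf>, or null.
    static String langOfLeaf(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            InheritedLang handler = new InheritedLang();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.langAtLeaf;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(langOfLeaf(
            "<doc xml:lang='en'><mid><leaf/></mid></doc>"));
    }
}
```

The nearest declaration wins because the search starts at the top of the stack, so a closer ancestor's xml:lang overrides one set further out.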
Notice how this code added marker entries on the stack with synthetic
xml:base attributes holding the true base URIs for the document and
external general entities. That information is needed to correctly
implement the XML Base recommendation, and it lets getBaseURI() work
entirely from this stack. If you need such functionality very often, you
might want to provide a more general API, not packaged as internal to
one handler implementation.
The LocatorImpl Class
This is a general-purpose implementation of the Locator interface. As
well as reading location properties (as defined in the interface), you
can write and modify them. It's part of SAX1 and is still useful in SAX2.
The locator provided through ContentHandler.setDocumentLocator() can be
used during any event callback, but the values it returns will change
over time. If you need a copy of those values for later use, it's
simplest to use this class; just create a new instance using the copy
constructor. More typically, you will pass the locator to the
constructor for some kind of SAXException, or just save the current base
URI to use with relative URIs you find in document (or attribute)
content.
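As a small sketch of the copy-constructor technique, the following handler snapshots the locator at each start tag so the positions can be examined after parsing. The class name ElementPositions and its scan() helper are hypothetical, and the sketch assumes a JAXP parser that supplies a locator (the SAX specification encourages this but does not require it).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical handler: keeps one frozen Locator per start tag.
class ElementPositions extends DefaultHandler {
    private Locator locator;
    final List<Locator> snapshots = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator l) {
        locator = l;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // The live locator keeps changing; LocatorImpl's copy
        // constructor captures its current values.
        if (locator != null)
            snapshots.add(new LocatorImpl(locator));
    }

    static List<Locator> scan(String xml) {
        try {
            ElementPositions handler = new ElementPositions();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.snapshots;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Locator where : scan("<a>\n<b/>\n</a>"))
            System.out.println("start tag reported on line "
                + where.getLineNumber());
    }
}
```

Storing the live locator itself would leave every list entry showing the same final position, which is why the copy matters.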
The NamespaceSupport Class
When your code needs to track namespaces or their prefixes, use this
SAX2 class. One audience for this class is authors of XML parsers;
that's probably not you. More likely you're writing code that, like
XPath or W3C's XML schemas, needs to parse prefixed names when they're
found in attribute values or element content; this class can help. Or
you may be writing code to select or generate element or attribute name
prefixes for output. (If you only need to put those names in element or
attribute names, you should be able to package that work in an event
filter component that postprocesses your output and ensures that its
namespace content matches XML 1.0 rules.)
What this class does is maintain a stack of namespace contexts, in which
each context holds a set of prefix-to-URI mappings; the contexts
normally correspond to an element. This is the right model to use when
you're writing an XML parser. If you try to use this class in a layer on
top of a SAX2 parser, you'll notice a slight mismatch: all the
prefix-mapping events for an element's namespace context precede the
startElement() event for that element. That is, you'll need to create
and populate new contexts before you see the element that signifies a
new context.* One simple way to work around this is with a Boolean flag
indicating whether a new context is active yet.
* This is true unless xmlns* attributes get reported with
startElement(), and you only use that form of the prefix-mapping events.
To use this class with a SAX2 parser that's set to report namespace
prefix mappings, you have to modify some of your ContentHandler
callbacks to maintain that stack of contexts. This is done in much the
same way as you produce those callbacks yourself:
1. Instantiate a NamespaceSupport object using its default constructor
(the only one). A good time to do this is when you start your event
stream, at the ContentHandler.startDocument() event callback. When
you do this, set a Boolean contextActive flag to false, so that you'll
create a new context for the root element.
2. When you get (or make) a
ContentHandler.startPrefixMapping(prefix,uri) event, see if
contextActive is true. If not, call pushContext() and set that flag to
true. Then call declarePrefix(prefix,uri). (It returns false if you
give it illegal inputs.)
3. At the end of any ContentHandler.startElement() event, see if
contextActive is true. If not, call pushContext(). Then set that flag
to false, forcing any child element's namespace declarations to create
a new context.
4. Finally, at the end of any ContentHandler.endElement() event, call
popContext().
5. Call reset() to forcibly reset all state before you reuse the class.
Doing this at the end of the ContentHandler.endDocument() callback
should work.
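The five steps above can be sketched as a handler like the following. This is only an illustration under assumptions: the class name PrefixTracker, the type attribute it inspects, and the resolve() helper are invented here, and a namespace-aware JAXP parser is assumed. The application logic simply resolves a prefixed name found in an attribute value.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical handler that tracks prefix bindings with NamespaceSupport.
class PrefixTracker extends DefaultHandler {
    private NamespaceSupport ns;
    private boolean contextActive;
    String resolved;    // "{uri}local" for the last type='...' value seen

    @Override
    public void startDocument() {                       // step 1
        ns = new NamespaceSupport();
        contextActive = false;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) {  // step 2
        if (!contextActive) {
            ns.pushContext();
            contextActive = true;
        }
        ns.declarePrefix(prefix, uri);
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // Application logic: interpret a qualified name in an attribute
        // value using the bindings in scope for this element.
        String type = atts.getValue("", "type");
        if (type != null) {
            String[] parts = ns.processName(type, new String[3], false);
            resolved = (parts == null)
                ? null : "{" + parts[0] + "}" + parts[1];
        }
        if (!contextActive)                             // step 3
            ns.pushContext();
        contextActive = false;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        ns.popContext();                                // step 4
    }

    @Override
    public void endDocument() {
        ns.reset();                                     // step 5
    }

    static String resolve(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PrefixTracker handler = new PrefixTracker();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), handler);
            return handler.resolved;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(
            "<doc xmlns:u='urn:units'><price type='u:currency'/></doc>"));
    }
}
```

Every element ends up with exactly one context, whether or not it declared any prefixes, so the single popContext() call in endElement() always balances.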
If you follow these rules, you can use processName() to interpret
element and attribute names that you find according to the current
prefix bindings, or you can use getPrefix() to choose a prefix given a
particular namespace URI:
String [] processName(qName,parts,isAttribute)
Use this method to find the namespace name corresponding to a
qualified element or attribute name (perhaps as found inside an
attribute value or element content). Parameters are:
String qName
This is the qualified name, such as units:currency or fare, that is
being examined.
String parts[3]
This is a three-element array. If this method succeeds in processing
the name, the first array element will hold the namespace URI, the
second will hold the local (unprefixed) name, and the third will hold
the qName you passed in. The first string will be an empty string if
the qName has no prefix and no default namespace URI is applicable.
boolean isAttribute
Pass this value as true if the qName parameter identifies an
attribute; otherwise, pass this as false. This information is
needed because unprefixed element names are interpreted using
any default namespace URI, but attribute names are not.
If this method succeeds, the parts parameter is filled out and
returned. Otherwise the name includes a reference to an undeclared
prefix, and null will be returned.
String getPrefix(String uri)
Use this method to choose a prefix to use when constructing a
qualified name. This returns a currently defined prefix associated
with the specified namespace URI or null if no such prefix is defined.
When no such prefix is defined, the default namespace URI (associated
with element names that have no prefixes) might still be appropriate.
If so, then getURI("") will return this URI.
Consult the class documentation (javadoc) for full details about the
methods on this class.
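Here is a short sketch of those calls used directly, without a parser. The class QNameDemo, its helper methods, and the urn:example URIs are made up for illustration; the bindings are declared by hand, as if two xmlns declarations had just been seen.

```java
import org.xml.sax.helpers.NamespaceSupport;

class QNameDemo {
    // Hypothetical sample bindings: one prefix plus a default namespace.
    static NamespaceSupport sampleContext() {
        NamespaceSupport ns = new NamespaceSupport();
        ns.pushContext();
        ns.declarePrefix("units", "urn:example:units");
        ns.declarePrefix("", "urn:example:default");
        return ns;
    }

    // Returns "{uri}local", or null when the prefix is undeclared.
    static String expand(NamespaceSupport ns, String qName,
                         boolean isAttribute) {
        String[] parts = ns.processName(qName, new String[3], isAttribute);
        if (parts == null)
            return null;
        return "{" + parts[0] + "}" + parts[1];
    }

    public static void main(String[] args) {
        NamespaceSupport ns = sampleContext();
        System.out.println(expand(ns, "units:currency", false));
        System.out.println(expand(ns, "fare", false));  // default applies
        System.out.println(expand(ns, "fare", true));   // no default
        System.out.println(expand(ns, "bogus:fare", false));
        System.out.println(ns.getPrefix("urn:example:units"));
        System.out.println(ns.getURI(""));              // default namespace
    }
}
```

Note that getPrefix() never hands back the empty prefix, so code that generates names has to check getURI("") separately before deciding that an unprefixed element name will do.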
SAX1 Support
This section provides a brief overview of the SAX1 classes and migration
support and of differences between SAX1 and SAX2. SAX1 is a subset of
SAX2, so SAX2 is backward compatible. The only reason you might not want
to have the SAX2 classes and interfaces in your class path is to avoid
compiler warnings telling you when you're using now-deprecated APIs.
You shouldn't be using SAX1 APIs to write new code, but you may need to
maintain or migrate older code written using these classes. As soon as
possible, plan a maintenance step that involves switching to the new
SAX2 versions of the APIs. This may include getting rid of some
home-brew solutions for namespace support. (Some applications have found
previously unsuspected bugs when they've made such changes; be alert!)
This section has been written to highlight those changes.
If your parser supplier hasn't provided SAX2 support by now, it's
probably also time to switch suppliers; however, you can use the
ParserAdapter class to make these changes without changing parsers. In
fact, if you're using ParserFactory to get the system default parser and
haven't set a SAX2 XMLReader default, the reference XMLReaderFactory
distribution will automatically wrap the SAX1 parser you've probably
already identified using the org.xml.sax.parser system property. That
is, just putting the SAX2 classes in your class path normally lets you
start using SAX2 without needing to change your application
configuration. (You can go the other way around with an XMLReaderAdapter
if you want to use a more current parser while letting the application
code continue to use older SAX1 APIs.)
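The wrapping step might look like the following sketch. FixedSax1Parser is a made-up stand-in for a real SAX1 parser: it ignores its input and emits a fixed event stream, which is enough to show ParserAdapter turning SAX1 element events and xmlns attributes into SAX2 namespace-aware callbacks. AdapterDemo and the urn:example URIs are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.xml.sax.*;
import org.xml.sax.helpers.AttributeListImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.ParserAdapter;

// Hypothetical SAX1 "parser" that reports one element with a
// namespace declaration, whatever input it is given.
@SuppressWarnings("deprecation")
class FixedSax1Parser implements Parser {
    private DocumentHandler handler;

    public void setLocale(Locale locale) { }
    public void setEntityResolver(EntityResolver resolver) { }
    public void setDTDHandler(DTDHandler h) { }
    public void setErrorHandler(ErrorHandler h) { }
    public void setDocumentHandler(DocumentHandler h) { handler = h; }

    public void parse(String systemId) throws SAXException {
        parse(new InputSource(systemId));
    }

    public void parse(InputSource in) throws SAXException {
        AttributeListImpl atts = new AttributeListImpl();
        atts.addAttribute("xmlns:u", "CDATA", "urn:example:units");
        handler.startDocument();
        handler.startElement("u:fare", atts);
        handler.endElement("u:fare");
        handler.endDocument();
    }
}

@SuppressWarnings("deprecation")
class AdapterDemo {
    // Returns a log of the SAX2 events the application-side handler saw.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        try {
            XMLReader reader = new ParserAdapter(new FixedSax1Parser());
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startPrefixMapping(String prefix, String uri) {
                    seen.add("map " + prefix + "=" + uri);
                }
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    seen.add("start {" + uri + "}" + local);
                }
            });
            // The stand-in parser never reads this (made-up) system ID.
            reader.parse(new InputSource("urn:example:ignored"));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        for (String event : run())
            System.out.println(event);
    }
}
```

With a real SAX1 parser you would pass that parser to the ParserAdapter constructor instead; the application code on the XMLReader side stays the same.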
You'll most likely be interested in these classes if you're working with
an older, SAX1-based application or tool, such as the XT 0.5 XSLT
engine. This includes applications written to the JAXP 1.0 API
specification, which doesn't include SAX2 support. If so, the main
difference you'll see is that SAX1 has a much simpler way of naming
elements and attributes: it only needs to support the qName (qualified
name) access style, not the namespace-aware style. This eliminates some
opportunities for confusion, unless you're writing namespace-aware
applications.
The following classes provide SAX1 support:
org.xml.sax.Parser
This interface corresponds to the SAX2 XMLReader. It uses the
DocumentHandler interface (instead of ContentHandler) and has no getter
methods for handlers or the entity resolver. The SAX2 feature and
property management methods are not available. There is a setLocale()
method to control the locale used with diagnostics, which was dropped in
SAX2.
With SAX1, there was no standard way to indicate whether a parser
validated or not. SAX1 applications had to be written to not rely on
having validity errors reported, unless either a configuration mechanism
enforced the use of a validating parser (specifying validating or
nonvalidating classes) or use of some specific implementation's
alternative configuration mechanism was hardwired.
Similarly, SAX1 had no standard way to provide the additional infoset
data that SAX2 shows using the DeclHandler and LexicalHandler
interfaces. Applications needing such support needed to use
implementation-specific APIs.
org.xml.sax.DocumentHandler
This interface corresponds to the SAX2 ContentHandler interface.
Namespace information is not available on the element callbacks, and
startElement() uses AttributeList. Prefix-mapping scopes are not
reported. In SAX2, skipped entities are reported; this was an XML 1.0
conformance requirement that was not met by the SAX1 API. SAX1 will not
report skipped entities even if you were to wire it into a SAX2
environment.
org.xml.sax.HandlerBase
This class corresponds to the SAX2 DefaultHandler class, except that
it's a core class, not a helper class. (Consider that an evolutionary
accident.) It supports the older DocumentHandler interface.
org.xml.sax.AttributeList
This interface corresponds to the SAX2 Attributes interface. It doesn't
include namespace information and is accordingly much simpler. The only
name for an attribute is what the namespace specification called the
qName. (In SAX2, providing the qName is optional unless the
namespace-prefixes feature has been set, but most parsers provide it at
all times.)
org.xml.sax.helpers.AttributeListImpl
This class corresponds to the SAX2 AttributesImpl class. It doesn't
include namespace information and is accordingly much simpler.
org.xml.sax.helpers.ParserAdapter
This class is intended to help migrate SAX1 parser implementations to
the SAX2 namespace-aware API. If you have a SAX1 parser (perhaps it
turns some non-XML data into a SAX1 event stream), you can use this
class to bring it into the SAX2 world.
org.xml.sax.helpers.ParserFactory
This class corresponds to the SAX2 XMLReaderFactory class. It returns a
SAX1 Parser and it is controlled only using the org.xml.sax.parser
system property. It throws many more exceptions than its SAX2 analogue.
org.xml.sax.helpers.XMLReaderAdapter
This class supports backward migration of SAX2 parsers into SAX1-based
applications. You probably won't ever need to use it.
If your environment supports SAX1 but not SAX2, you can just add the
SAX2 version of sax.jar to your class path, somewhere before the older
SAX1 files. (Otherwise, you might get package-sealing violations,
because the JVM might mix versions of the package. It may be best if you
remove older copies of the SAX1 classes from your class path.) If you
set the SAX1 org.xml.sax.parser system property to point to a SAX1
parser so that applications can rely on
org.xml.sax.helpers.ParserFactory bootstrapping, you'll be glad that the
SAX2 org.xml.sax.helpers.XMLReaderFactory knows how to use this property
as a backup in case no default SAX2 parser has been configured.
6
Putting It All Together
In this chapter:
Rich Site Summary: RSS
XML and Messaging
Including Subdocuments
The preceding chapters have shown most of what you'll need to know to
use SAX2 effectively, but as individual techniques, in small bits and
pieces. In this chapter, we'll look at more substantial examples, which
tie those techniques together. The examples here should help you to
understand the kinds of modules you'll need to put together similar
SAX2-based applications. You'll also see some of the options you have
for building larger processing tasks from SAX components.
Rich Site Summary: RSS
One of the first popular XML document standards is hidden in the guts of
web site management toolsets. It dates back to when XML wasn't fully
crystallized. Back then, there was a lot of interest in using XML to
address a widespread problem: how to tell users about updates to web
sites so they didn't need to read the site several times a day. A
channel-based model was widely accepted, building on the broadcast
publisher's analogy of a web site as a TV channel. Microsoft shipped an
XML-like format called Channel Definition Format (CDF), and other update
formats were also available, but the solution that caught on was from
Netscape. It is called RSS. This originally stood for RDF Site Summary,*
but it was simplified and renamed the Rich Site Summary format before it
saw any wide adoption.
* RDF stands for Resource Description Framework. For more information,
see http://www.w3.org/RDF/.
RSS 0.91 was the mechanism used to populate one of the earliest
customizable web portals, My Netscape. The mechanism is simple: RSS
presents a list of recently updated items from the web site, with
summaries, as an XML file that could be fetched across the Web. Sites
could update static summary files along with their content or generate
them on the fly; site management tools could do either task
automatically. It was easy for sites to create individualized views that
aggregated the latest news from any of the numerous web sites providing
RSS feeds.
There's essentially been a fork in the development of RSS. In recent
surveys, about two thirds of the RSS sites use RSS Classic, based on the
0.91 DTD and often with 0.92 extensions. (Mostly, the 0.92 spec removed
limits from the non-DTD parts of the 0.91 spec.) Relatively recently,
New RSS was created. Also called RSS 1.0 (though not with the support of
all the developers who had been enhancing RSS), this version is more
complex. It uses RDF and XML namespaces and includes a framework with
extension modules to address the complex content syndication and
aggregation requirements of larger web sites. RSS toolkits tend to
support both formats, but RDF itself is still not widely adopted. This
is what part of one RSS Classic feed looks like, from the URL
http://xmlhack.com/rss.php:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
    "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    <title>xmlhack</title>
    <link>http://www.xmlhack.com</link>
    <description>Developer news from the XML community</description>
    <language>en-us</language>
    <managingEditor>[email protected]</managingEditor>
    <webMaster>[email protected]</webMaster>
    <item>
      <title>BEEP implementation for .NET/C#</title>
      <link>http://www.xmlhack.com/read.php?item=1470</link>
    </item>
    <item>
      <title>MinML-RPC, Sandstorm XML-RPC framework</title>
      <link>http://www.xmlhack.com/read.php?item=1469</link>
    </item>
    <item>
      <title>XSLT as query language</title>
      <link>http://www.xmlhack.com/read.php?item=1467</link>
    </item>
    <item>
      <title>Exclusive XML Canonicalization in Last Call</title>
      <link>http://www.xmlhack.com/read.php?item=1466</link>
    </item>
    <!--many items were deleted for this example-->
  </channel>
</rss>
In this section we use some of the techniques we've seen earlier and
will look at both sides (client and server) of some simple RSS tools for
RSS Classic. A full RSS toolset would need to handle New RSS, and would
likely need an RDF engine to work with RDF metadata. Such RDF
infrastructure should let applications work more with the semantics of
the data, and would need RDF schema support. That's all much too complex
to show here.*
* If you're interested in the RDF approach, look at sites like the Open
Directory Project, at http://www.dmoz.org/, to see one way of using RDF.
First we'll build a simple custom data model, then write the code to
marshal and unmarshal it, and finally see how those components fit into
common types of RSS applications. In a microcosm, this is what lots of
XML applications do: read XML into custom data structures, process them,
and then write out more XML.
Data Model for RSS Classic
Here are the key parts of the RSS 0.91 DTD; it also incorporates the
HTML 4.0 ISO Latin/1 character entities, which aren't shown here, and
various other integrity rules that aren't expressed by this DTD:
<!ELEMENT rss (channel)>
<!ATTLIST rss
    version CDATA #REQUIRED> <!-- must be "0.91"> -->
<!ELEMENT channel (title | description | link | language | item+
    | rating? | image? | textinput? | copyright?
    | pubDate? | lastBuildDate? | docs? | managingEditor?
    | webMaster? | skipHours? | skipDays?)*>
<!ELEMENT image (title | url | link | width? | height? | description?)*>
<!ELEMENT item (title | link | description)*>
<!ELEMENT textinput (title | description | name | link)*>
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT link (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT rating (#PCDATA)>
<!ELEMENT language (#PCDATA)>
<!ELEMENT width (#PCDATA)>
<!ELEMENT height (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT pubDate (#PCDATA)>
<!ELEMENT lastBuildDate (#PCDATA)>
<!ELEMENT docs (#PCDATA)>
<!ELEMENT managingEditor (#PCDATA)>
<!ELEMENT webMaster (#PCDATA)>
<!ELEMENT hour (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT skipHours (hour+)>
<!ELEMENT skipDays (day+)>
In short, the DTD includes a wrapper that gives the version, one channel
with some descriptive data, and a bunch of items. RSS 0.92 changes it
slightly: data length limits (which a DTD can't describe) are removed,
along with a few other changes. If you're working with RSS, you should
know that most RSS feeds incorporate at least a few of those 0.92
extensions and have your code handle the issues. And if you're
generating an RSS feed for your web site, you'll want to know that many
aggregators present the image as the channel's icon, along with the
newest items and the text input box, to provide quick access to your
site.
When you work with XML-based systems and SAX, one of the first things
you'll want to do is decide on the data structures you'll use. Sometimes
you'll have a pre-existing data structure that must be matched; in cases
like this RSS code, you have the luxury of a blank slate to write on.
I'm a big believer in designing appropriate data structures, rather than
expecting some development tool to come up with a good answer; as a
rule, a good manual design beats code generator output in any
maintainable system. In the case of RSS Classic, simple structures like
those shown in Example 6-1 can do the job:
Example 6-1. RSS data structures
import java.util.Vector;

public class RssChannel {
    // (optional, not part of RSS) URI for the RSS file
    public String sourceUri;

    // Five required items
    public String description = "";
    public Vector items = new Vector ();
    public String language = "";
    public String link = "";
    public String title = "";

    // Lots of optional items
    public String copyright = "";
    public String docs = "";
    public RssImage image;
    public String lastBuildDate = "";
    public String managingEditor = "";
    public String pubDate = "";
    public String rating = "";
    // public Days skipDays;
    // public Hours skipHours;
    public RssTextInput textinput;
    public String webMaster = "";

    // channels have a bunch of items
    static public class RssItem
    {
        public String description = "";
        public String link = "";
        public String title = "";
    }

    // Text input is used to query the channel
    static public class RssTextInput
    {
        public String description = "";
        public String link = "";
        public String name = "";
        public String title = "";
    }

    // Image used for the channel
    static public class RssImage
    {
        public String link = "";
        public String title = "";
        public String url = "";
        // optional
        public String description = "";
        public String height = "";
        public String width = "";
    }
}
Note that these classes don't include any methods; methods can be added
later, as application code determines what's really necessary. There are
a variety of features that would be good to constrain with such methods,
which you'll see if you look at the RSS specifications. Even pure value
objects benefit from such internal consistency checks. For example, you
may prefer to use beans-style accessor functions, but they would only
complicate this example. (So would the class and field documentation,
which has been deleted for simplicity.)
There's one type of code that is certainly needed but was intentionally
put into different classes: marshaling data to RSS and unmarshaling it
from RSS. Such choices are design policies; while it's good to keep
marshaling code in one place, that place doesn't need to be the data
structure class itself. It's good to separate marshaling code and data
structure code because it's easier to support several different kinds of
input and output syntax. Examples include different versions of RSS, as
well as transfers to and from databases with JDBC. To display RSS in a
web browser, different versions of HTML may be appropriate. Sometimes,
embedding a stylesheet processing instruction into the XML text may be
the way to go. Separate marshaling code needs attention when data
structures change, but good software maintenance procedures will ensure
that's never a problem.
Consuming and Producing RSS Parsing Events
Earlier chapters have touched on ways to marshal and unmarshal data with
SAX. This section shows these techniques more completely, for a
real-world application data model.
Example 6-2 shows what SAX-based unmarshaling code can look like,
without the parser hookup. In this case it's set up to be the endpoint
on a pipeline. This just turns infoset atoms into RSS molecules and
stops. Note that it isn't particularly thorough in how it handles all
the various types of illegal, or just unexpected, RSS that's found on
the Web, although it handles many RSS Classic sites perfectly well. For
example, the controls to skip fetches on particular days (perhaps
weekends) or hours (nonbusiness hours) aren't usually supported, so
they're just ignored here. With a more complex DTD, unmarshaling might
not be able to rely on such a simple element stacking scheme; you might
need to stack the objects you're unmarshaling and use a more complex
notion of context to determine the appropriate actions to take.
Example 6-2. Unmarshaling SAX events into RSS data
import java.util.Stack;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import RssChannel.RssItem;
import RssChannel.RssImage;
import RssChannel.RssTextInput;

public class RssConsumer extends DefaultHandler {
    private RssChannel channel;
    private RssItem item;
    private RssImage image;
    private RssTextInput input;
    private Stack stack;
    private Locator locator;

    public RssChannel getChannel ()
        { return channel; }

    private String getCurrentElementName ()
        { return (String) stack.peek (); }

    // only need a handful of ContentHandler methods
    public void setDocumentLocator (Locator l)
        { locator = l; }

    public void startDocument () throws SAXException
    {
        channel = new RssChannel ();
        if (locator != null)
            channel.sourceUri = locator.getSystemId ();
        stack = new Stack ();
    }

    public void startElement (
        String namespace,
        String local,
        String name,
        Attributes attrs
    ) throws SAXException
    {
        stack.push (name);
        if ("item".equals (name))
            item = new RssItem ();
        else if ("image".equals (name))
            image = new RssImage ();
        else if ("textinput".equals (name))
            input = new RssTextInput ();
        // parser misconfigured?
        else if (name.length () == 0)
            throw new SAXParseException ("XML names not available", locator);
    }

    public void characters (char buf [], int off, int len)
        throws SAXException
    {
        String top = getCurrentElementName ();
        String value = new String (buf, off, len);

        if ("title".equals (top)) {
            if (item != null)
                item.title += value;
            else if (image != null)
                image.title += value;
            else if (input != null)
                input.title += value;
            else
                channel.title += value;
        } else if ("description".equals (top)) {
            if (item != null)
                item.description += value;
            else if (image != null)
                image.description += value;
            else if (input != null)
                input.description += value;
            else
                channel.description += value;
        } else if ("link".equals (top)) {
            if (item != null)
                item.link += value;
            else if (image != null)
                image.link += value;
            else if (input != null)
                input.link += value;
            else
                channel.link += value;
        } else if ("url".equals (top)) {
            image.url += value;
        } else if ("name".equals (top)) {
            input.name += value;
        } else if ("language".equals (top)) {
            channel.language += value;
        } else if ("managingEditor".equals (top)) {
            channel.managingEditor += value;
        } else if ("webMaster".equals (top)) {
            channel.webMaster += value;
        } else if ("copyright".equals (top)) {
            channel.copyright += value;
        } else if ("lastBuildDate".equals (top)) {
            channel.lastBuildDate += value;
        } else if ("pubDate".equals (top)) {
            channel.pubDate += value;
        } else if ("docs".equals (top)) {
            channel.docs += value;
        } else if ("rating".equals (top)) {
            channel.rating += value;
        } // else ignore ... skipDays and so on.
    }

    public void endElement (
        String namespace,
        String local,
        String name
    ) throws SAXException
    {
        if ("item".equals (name)) {
            // patch item.link
            channel.items.addElement (item);
            item = null;
        } else if ("image".equals (name)) {
            // patch image.link
            // (patch image.url)
            channel.image = image;
            image = null;
        } else if ("textinput".equals (name)) {
            // patch input.link
            channel.textinput = input;
            input = null;
        } else if ("channel".equals (name)) {
            // patch channel.link
        }
        // balance the push in startElement(), so that later
        // characters() calls see the enclosing element's name
        stack.pop ();
    }
}
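Example 6-2 leaves out the parser hookup. As a rough, self-contained sketch of that step, the following stand-in handler collects only the item titles from a feed and shows the wiring to a JAXP-supplied XMLReader. The class ItemTitles and its read() helper are hypothetical names, not part of the RSS classes above; RssConsumer would be connected the same way, with getChannel() called after parse() returns.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for RssConsumer: gathers item titles only.
class ItemTitles extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    private boolean inItem;
    private StringBuilder text;  // non-null only inside an item's <title>

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if ("item".equals(qName))
            inItem = true;
        else if (inItem && "title".equals(qName))
            text = new StringBuilder();
    }

    @Override
    public void characters(char[] buf, int off, int len) {
        // One title may arrive in several chunks, so accumulate.
        if (text != null)
            text.append(buf, off, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("item".equals(qName)) {
            inItem = false;
        } else if (text != null && "title".equals(qName)) {
            titles.add(text.toString());
            text = null;
        }
    }

    // The hookup: get an XMLReader, attach the handler, parse.
    static List<String> read(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
            ItemTitles handler = new ItemTitles();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.titles;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(
            "<rss version='0.91'><channel><title>feed</title>"
            + "<item><title>A &amp; B</title></item>"
            + "<item><title>C</title></item></channel></rss>"));
    }
}
```

For a real feed you would pass an InputSource built from the feed's URL instead of a string, and decide whether the parser should be allowed to fetch the DTD named in the document type declaration.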
If you think in terms of higher-level parsing events, rather than in
terms of data structures, you might want to define an application-level
event handler interface and package your code as an XMLFilterImpl, as
shown in Example 6-3. This is the "atoms into molecules" pattern for
handlers, as sketched in Chapter 3. In the case of RSS, both item and
channel might reasonably be expected to be molecules that get reported
individually as application-level events. If you report finer grained
structures (like item) it might be easier to assemble higher-level data
structures, but we won't show that here.
Example 6-3. Building SAX events into an RSS event handler
public interface RssHandler {
    void channelUpdate (RssChannel c) throws SAXException;
}

public class RssConsumer extends XMLFilterImpl {
    // ... as above (notice different base class!) but also:
    private RssHandler handler;

    public static String RssHandlerURI =
        "http://www.example.com/properties/rss-handler";

    public void setProperty (String uri, Object value)
        throws SAXNotSupportedException, SAXNotRecognizedException
    {
        if (RssHandlerURI.equals (uri)) {
            if (value instanceof RssHandler) {
                handler = (RssHandler) value;
                return;
            }
            throw new SAXNotSupportedException ("not an RssHandler");
        }
        super.setProperty (uri, value);
    }

    public Object getProperty (String uri)
        throws SAXNotSupportedException, SAXNotRecognizedException
    {
        if (RssHandlerURI.equals (uri))
            return handler;
        return super.getProperty (uri);
    }

    public void endDocument ()
        throws SAXException
    {
        if (handler == null)
            return;
        handler.channelUpdate (getChannel ());
    }
}
A filter written in that particular way can be used almost interchangeably
with the handler-only class shown earlier in Example 6-2. In fact it's just a
bit more flexible than that, though it may not be a good pipeline-style
component. That's because it doesn't pass the low-level events through
consistently; the ContentHandler methods this implements don't pass their
events through to the superclass, but all the other methods do. That's
easily fixed, but it's likely that you'd either want all the XML "atoms" to be
visible (extending the XML Infoset with RSS-specific data abstractions) or
none of them (and use an RSS-only infoset).
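One way to make such a filter pass events through consistently is to have
each overridden ContentHandler method do its own bookkeeping and then call
the superclass version, which forwards the event downstream. The following
is only a minimal sketch of that idea, not the RssConsumer code itself; the
names PassThroughFilter and PassThroughDemo, and the item-counting logic,
are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: counts "item" elements, yet forwards every event.
class PassThroughFilter extends XMLFilterImpl {
    int items;

    PassThroughFilter (XMLReader parent) { super (parent); }

    public void startElement (String uri, String local, String qName,
            Attributes atts) throws SAXException {
        if ("item".equals (qName))
            items++;
        // XMLFilterImpl forwards the event to the downstream handler
        super.startElement (uri, local, qName, atts);
    }
}

class PassThroughDemo {
    // Parses a document through the filter; returns the element names
    // seen by the downstream handler, followed by the filter's item count.
    static List<String> run (String xml) {
        final List<String> seen = new ArrayList<String> ();
        try {
            XMLReader parser = SAXParserFactory.newInstance ()
                .newSAXParser ().getXMLReader ();
            PassThroughFilter filter = new PassThroughFilter (parser);
            filter.setContentHandler (new DefaultHandler () {
                public void startElement (String uri, String local,
                        String qName, Attributes atts) {
                    seen.add (qName);
                }
            });
            filter.parse (new InputSource (new StringReader (xml)));
            seen.add ("items=" + filter.items);
        } catch (Exception e) {
            throw new RuntimeException (e);
        }
        return seen;
    }
}
```

Because the filter calls super.startElement(), the downstream handler sees
the same "atoms" the filter does; dropping that call is what would turn the
class into an RSS-only consumer.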
Example 6-4 shows what the core marshaling code can look like, without
the hookup to an XMLWriter or the XMLWriter setup. For simplicity, this
example takes a few shortcuts: it doesn't marshal the channel's icon
description or most of the other optional fields. But notice that it does
take care to write out the DTD and provide some whitespace to indent
the text. (It uses only newlines for end-of-line; output code is responsible
for mapping those to CRLF or CR when needed.) Also, notice that it just
generates SAX2 events; this data could be fed to an XMLWriter, or to the
RssConsumer class, or to any other SAX-processing component.
Example 6-4. Marshaling RSS data to SAX events
import java.util.Enumeration;
import org.xml.sax.*;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.AttributesImpl;
import RssChannel.RssItem;

public class RssProducer implements RssHandler
{
    private static char lineEnd [] = { '\n', '\t', '\t', '\t' };

    private ContentHandler content;
    private LexicalHandler lexical;

    public RssProducer (ContentHandler n)
        { content = n; }

    public void setLexicalHandler (LexicalHandler l)
        { lexical = l; }

    // emit a newline followed by up to three tabs of indentation
    private void doIndent (int n)
    throws SAXException
    {
        n++; // NL
        if (n > lineEnd.length)
            n = lineEnd.length;
        content.ignorableWhitespace (lineEnd, 0, n);
    }

    // emit one indented element holding only character data
    private void element (int indent, String name, String val,
            Attributes atts)
    throws SAXException
    {
        char contents [] = val.toCharArray ();
        doIndent (indent);
        content.startElement ("", "", name, atts);
        content.characters (contents, 0, contents.length);
        content.endElement ("", "", name);
    }

    public void channelUpdate (RssChannel channel)
    throws SAXException
    {
        AttributesImpl atts = new AttributesImpl ();

        content.startDocument ();
        if (lexical != null) {
            lexical.startDTD ("rss",
                "-//Netscape Communications//DTD RSS 0.91//EN",
                "http://my.netscape.com/publish/formats/rss-0.91.dtd");
            lexical.endDTD ();
        }
        atts.addAttribute ("", "", "version", "CDATA", "0.91");
        content.startElement ("", "", "rss", atts);
        atts.clear ();
        doIndent (0);
        content.startElement ("", "", "channel", atts);

        // describe the channel
        // four required elements
        element (1, "title", channel.title, atts);
        element (1, "link", channel.link, atts);
        element (1, "description", channel.description, atts);
        element (1, "language", channel.language, atts);

        // optional elements (compared by value, not by reference)
        if (!"".equals (channel.managingEditor))
            element (1, "managingEditor", channel.managingEditor, atts);
        if (!"".equals (channel.webMaster))
            element (1, "webMaster", channel.webMaster, atts);
        // ... and many others, notably image/icon and text input

        // channel contents: at least one item
        for (Enumeration e = channel.items.elements ();
                e.hasMoreElements ();
                /**/) {
            RssItem item = (RssItem) e.nextElement ();

            doIndent (1);
            content.startElement ("", "", "item", atts);
            if (!"".equals (item.title))
                element (2, "title", item.title, atts);
            if (!"".equals (item.link))
                element (2, "link", item.link, atts);
            if (!"".equals (item.description))
                element (2, "description", item.description, atts);
            doIndent (1);
            content.endElement ("", "", "item");
        }
        content.endElement ("", "", "channel");
        content.endElement ("", "", "rss");
        content.endDocument ();
    }
}
Since this code implements the RssHandler interface shown earlier, an
instance of this class could be assigned as the RSS handler for the
XMLFilter shown here. That could be useful if you wanted to round-trip
RSS data. Round-tripping data can be a good way to test marshaling and
unmarshaling code. You can create collections of input documents, and
automatically unmarshal and remarshal their data. If you compare inputs
and outputs, you can ensure that you haven't discarded any important
information or added inappropriate text.
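A round-trip test of this kind can be sketched with nothing but the JDK.
The code below does not use the RSS classes from this chapter; as an
assumption made for brevity, the "channel" is reduced to a list of titles,
and the JAXP TransformerHandler stands in for an XMLWriter as the component
that turns SAX events back into XML text. The class name RoundTripDemo is
hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class RoundTripDemo {
    // Marshal: report each title as a <title> element inside <channel>.
    static void marshal (List<String> titles, ContentHandler out)
    throws SAXException {
        AttributesImpl atts = new AttributesImpl ();
        out.startDocument ();
        out.startElement ("", "channel", "channel", atts);
        for (String title : titles) {
            char text [] = title.toCharArray ();
            out.startElement ("", "title", "title", atts);
            out.characters (text, 0, text.length);
            out.endElement ("", "title", "title");
        }
        out.endElement ("", "channel", "channel");
        out.endDocument ();
    }

    // Unmarshal: collect the text of every <title> element.
    static List<String> unmarshal (String xml) throws Exception {
        final List<String> titles = new ArrayList<String> ();
        DefaultHandler handler = new DefaultHandler () {
            private StringBuilder text;
            public void startElement (String uri, String local,
                    String qName, Attributes atts) {
                if ("title".equals (qName))
                    text = new StringBuilder ();
            }
            public void characters (char buf [], int off, int len) {
                if (text != null)       // parsers may split text runs
                    text.append (buf, off, len);
            }
            public void endElement (String uri, String local, String qName) {
                if ("title".equals (qName)) {
                    titles.add (text.toString ());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance ().newSAXParser ()
            .parse (new InputSource (new StringReader (xml)), handler);
        return titles;
    }

    // Round trip: data -> SAX events -> XML text -> SAX events -> data.
    static List<String> roundTrip (List<String> titles) {
        try {
            SAXTransformerFactory factory = (SAXTransformerFactory)
                TransformerFactory.newInstance ();
            TransformerHandler serializer = factory.newTransformerHandler ();
            StringWriter buffer = new StringWriter ();
            serializer.setResult (new StreamResult (buffer));
            marshal (titles, serializer);
            return unmarshal (buffer.toString ());
        } catch (Exception e) {
            throw new RuntimeException (e);
        }
    }
}
```

Comparing the list passed to roundTrip() with the list it returns is the
whole test; titles containing characters such as "&" or "<" are worth
including, since they exercise the escaping done by the serializer.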
Building Applications with RSS
One of the most fundamental things you can do in an RSS application is
act as a client: fetch a site's summary data and present it in some useful
format. Often, your personal view of a web site is decorated with pages or
sidebars that summarize the latest news as provided by other sites; they
fetch RSS data, cache it, and reformat it as HTML or XHTML so your web
browser shows it. That is, the web server acts as a client to RSS feeds and
generates individualized pages where you can read and click on the latest
headlines.
Example 6-5 is a simple client that dumps its output as text. It's simple to
write a servlet or JSP that does this for a set of RSS feeds, formatting them
as nice XHTML sidebar tables so that a site's pages will be more useful.*
* If you do this in a server, you should handle one very important task
that's not shown here: cache the RSS data! Do not make servers fetch the
summary before each page view. That makes for a very slow user experience
and can overload remote RSS feeds. There are two basic techniques to use to
create such a cache. One is to put a caching proxy between your server and
all the RSS feeds. The other is to write a page cache module, preferably one
that uses HTTP conditional GET (the If-Modified-Since HTTP header field) to
avoid excess cache updates. You can save RssChannel data or store channel
information in a local database, as variants of the page cache technique.
One extremely important point shown here is that this code uses a resolver
to force the use of a local copy of the RSS DTD. Servers should always use
local copies of DTDs. Some RSS applications got a rude reminder of that
fact in April 2001, when Netscape accidentally removed the DTD when it
reorganized its web site. Suddenly, those badly written applications
stopped working on many RSS feeds! Of course, those that were properly
set up with local copies of that DTD had no problems at all.
Example 6-5. An RSS data dump
import gnu.xml.util.Resolver;
import java.io.File;
import java.util.Hashtable;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import RssChannel.RssItem;

public class RssMain
{
    private static String featurePrefix =
        "http://xml.org/sax/features/";

    // Invoke with one argument, a URI or filename
    public static void main (String argv [])
    {
        if (argv.length != 1) {
            System.err.println ("Usage: RssMain [file|URL]");
            System.exit (1);
        }
        try {
            XMLReader reader;
            RssConsumer consumer;
            Hashtable hashtable;
            Resolver resolver;

            reader = XMLReaderFactory.createXMLReader ();
            consumer = new RssConsumer ();
            reader.setContentHandler (consumer);

            // handle the "official" DTD server being offline
            hashtable = new Hashtable (5);
            hashtable.put (
                "-//Netscape Communications//DTD RSS 0.91//EN",
                Resolver.fileNameToURL ("rss-0_91.dtd"));
            resolver = new Resolver (hashtable);
            reader.setEntityResolver (resolver);

            // we rely on qNames, and 0.91 doesn't use namespaces
            reader.setFeature (featurePrefix + "namespace-prefixes", true);
            reader.setFeature (featurePrefix + "namespaces", false);

            argv [0] = Resolver.getURL (argv [0]);
            reader.parse (argv [0]);

            RssChannel channel = consumer.getChannel ();

            System.out.println ("Partial RSS 0.91 channel info");
            System.out.println ("SOURCE = " + channel.sourceUri);
            System.out.println ();
            System.out.println (" Title: " + channel.title);
            System.out.println (" Description: " + channel.description);
            System.out.println (" Link: " + channel.link);
            System.out.println (" Language: " + channel.language);
            System.out.println (" WebMaster: " + channel.webMaster);
            System.out.println ("ManagingEditor: "
                + channel.managingEditor);
            System.out.println ();
            System.out.println (" Item Count: " + channel.items.size ());

            for (int i = 0; i < channel.items.size (); i++) {
                RssItem item = (RssItem)
                    channel.items.elementAt (i);

                System.out.println ("ITEM # " + i);
                if (item != null) {
                    System.out.println (" Title: " + item.title);
                    System.out.println (" Description: "
                        + item.description);
                    System.out.println (" Link: " + item.link);
                }
            }

        // Good error handling is not shown here, for simplicity
        } catch (Exception e) {
            System.err.println ("Whoa: " + e.getMessage ());
            System.exit (1);
        }
        System.exit (0);
    }
}
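Example 6-5 relies on the GNU Resolver class to map the DTD's public
identifier to a local file. If that class isn't available, much the same
effect can be had with a few lines of plain SAX. The following is a rough
sketch under that assumption; the class name LocalDtdResolver and its
register() method are hypothetical, not part of SAX or of the GNU library.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps public identifiers to local copies.
class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> localCopies =
        new HashMap<String, String> ();

    // Register a local URI (such as a file: URL) for a public identifier.
    void register (String publicId, String localUri) {
        localCopies.put (publicId, localUri);
    }

    // Return the local copy if one is registered; returning null tells
    // the parser to fall back to the system identifier in the document.
    public InputSource resolveEntity (String publicId, String systemId) {
        String local = (publicId == null)
            ? null
            : localCopies.get (publicId);
        if (local == null)
            return null;
        InputSource source = new InputSource (local);
        source.setPublicId (publicId);
        return source;
    }
}
```

An instance would be passed to XMLReader.setEntityResolver() before
parsing, just as the Resolver object is in Example 6-5. Keying the table on
the public identifier means the lookup keeps working even if a feed points
at a system identifier that has moved or disappeared.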
Besides servlets that present RSS data in HTML form to a web site's clients,
another kind of servlet is important in the world of RSS applications:
servlets that deliver a site's own RSS feed as XML. Servers often arrange
that the current channel data is always ready to serve at a moment's
notice. You've probably worked with sites that give you HTML forms to
publish either short articles (web log entries or discussion follow-ups) or
long ones (perhaps XML DocBook source that's then formatted). When
such forms post data through a servlet, it's easy to ensure the servlet
updates the site's RSS channel data when it updates other site data for
those articles.
While the mechanics of such a servlet would be specific to the procedures
used at a given web site, almost any site could use code like that in
Example 6-6 to actually deliver the RSS feed. Notice that the XML text is
delivered with an encoding that any XML parser is guaranteed to handle,
using CRLF-style line ends (part of the MIME standard for text/* content
types), and that the servlet sets the Last-Modified HTTP timestamp so it
supports HTTP caches based on either conditional GET or on explicit
timestamp checks with the HEAD request.
Example 6-6. Servlet generating RSS data
import gnu.xml.util.XMLWriter;
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;
import org.xml.sax.SAXException;

// a "Globals" class is used here to access channel and related data
public class RssGenServlet extends HttpServlet
{
    public void doGet (HttpServletRequest request,
            HttpServletResponse response)
    throws IOException, ServletException
    {
        RssProducer producer;
        XMLWriter consumer;

        response.addDateHeader ("Last-Modified", Globals.channelModified);
        response.setContentType ("text/xml;charset=UTF-8");

        consumer = new XMLWriter (response.getWriter ());
        consumer.setEOL ("\r\n");
        try {
            producer = new RssProducer (consumer);
            producer.setLexicalHandler (consumer);
            producer.channelUpdate (Globals.channel);
        } catch (SAXException e) {
            throw new ServletException (e);
        }
    }
}
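On the client side, the Last-Modified timestamp set by a servlet like this
one is what makes conditional GET possible. As a hedged sketch (the class
name ConditionalFetch and its methods are assumptions, and real code would
also need timeouts and error handling), a caching client might build its
request along these lines using java.net.HttpURLConnection:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

class ConditionalFetch {
    // Prepare (but do not yet send) a GET carrying If-Modified-Since.
    // cachedModified is the Last-Modified time of the cached copy, in
    // milliseconds since the epoch, or zero if nothing is cached.
    static HttpURLConnection prepare (String feedUri, long cachedModified) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                URI.create (feedUri).toURL ().openConnection ();
            conn.setRequestMethod ("GET");
            if (cachedModified > 0)
                conn.setIfModifiedSince (cachedModified);
            return conn;
        } catch (IOException e) {
            throw new RuntimeException (e);
        }
    }

    // Returns the response body, or null if the cached copy is still good.
    static InputStream fetchIfChanged (String feedUri, long cachedModified)
    throws IOException {
        HttpURLConnection conn = prepare (feedUri, cachedModified);
        if (conn.getResponseCode () == HttpURLConnection.HTTP_NOT_MODIFIED)
            return null;            // 304: keep using the cached data
        return conn.getInputStream ();
    }
}
```

When fetchIfChanged() does return a stream, the caller would parse it and
remember the new Last-Modified value (available from the connection's
getLastModified() method) for the next request.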
As RSS 1.0 starts to become more widely supported and more RSS/RDF
modules are defined, more clever RSS-based services will become available.
For example, RSS aggregator services may begin to be able to
dynamically create new channels with information filtered from many
other channels. That is, you could define a channel that the
aggregator will fill with new articles on a particular topic, listed in any of
several hundred RSS feeds. Today, you'd have to scan each feed yourself
to do that. Such smarter services would also have better reasons to cache
information. Today, such a service would have a hard time knowing
which articles to remember while you were away on vacation, since there
would be far too many articles to remember them all.
XML and Messaging
Most technologies that fueled the Internet Revolution of the past few
years have been around in one form or another for decades; they were
just inaccessible to the volumes of people that were able to use them with
mass-market web browsers. Some of those technologies are now being
re-created: they are updated to work better in today's Internet, which is a
larger and more varied world than the earlier versions they were born
into. In this section we will look at why XML is an important part of the
re-creation of messaging technologies and at some of the roles Java plays
in this process. We also look at how lightweight SAX2-based infrastructure
supports XML messaging over the Web without requiring developers to
master new toolkits.
XML/Internet Versus Older Technologies
Many more developers work with web servers than have ever worked
with Remote Procedure Call (RPC) or message-queuing technologies.
However, the problem is largely unchanged: the core issue is still how to
exchange messages reliably and securely with services operated by other
organizations. The combination of XML and web-based messaging has
several basic technical benefits compared to those earlier technology
generations, especially most forms of RPC:
HTTP-based protocols have truly global reach
    HTTP is in essence a text-based RPC protocol: clients issue requests to
    objects identified by web server URIs, and those servers dynamically
    compute the responses. Because it's text-based, HTTP can be (and is)
    easily supported by almost all programming languages. Because of
    HTTPS (HTTP over SSL, a security protocol), HTTP security has been
    at least as good as any available with commercial RPC services.
    HTTP/HTTPS is now the most ubiquitous and functional RPC
    transport in the world.
XML is a more accessible and extensible message-encoding technology
    Previous technologies generally focused on binary-oriented
    technologies, which often rigidly defined the set of possible messages. In
    practice, most technologies were restricted to particular programming
    environments because developers needed an API toolkit to generate
    the correct binary data. XML has a clear win here since essentially
    every such environment supports text input and output. And unlike
    other encodings, XML doesn't impose any inherent policy on what
    such text means, which makes it more flexible. SAX is able to
    leverage that flexibility because it is data-structure agnostic. Much of the
    work in XML messaging is to establish and promote particular
    policies; SAX can support all the important ones.
The Internet biases toward larger, coarse-grained messages
    Before the Internet, applications were optimized for private local area
    networks (LANs) or for low-speed, application-specific wide area
    networks (WANs). Neither optimization point is a good match for today's
    typical Internet link (56 kbps modem, or megabit links for some home
    use and most enterprises). Two key Internet issues are network
    latency and reliability. Using HTTP with XML provides an opportunity
    to develop newer systems using a design policy that works with the
    Internet rather than against it: use bigger messages, less often. This is
    the antithesis of many RPC systems, which bias toward constant
    exchange of small messages just like they were local procedure calls.
XML favors loose coupling
    RPC-based systems were often developed to assume that clients and
    servers are in the same organization. Some even assumed only one
    vendor's product would be used. That is, developers often aimed for
    a monoculture and tended to characterize diversity as either a
    commercial threat, an inefficiency, a security problem, or just a support
    headache. Actually, diversity is a source of strength: human groups
    that are diverse are more adaptable and more resilient because they
    have more resources to draw on. Because XML messaging focuses on
    protocols and message formats, rather than vendor-specific
    implementations or APIs, it promotes diversity. That reduces inappropriate
    coupling and makes systems less vulnerable to the problems of any
    particular implementation.
In short, as the limitations of earlier messaging infrastructures became well
known, organizations of all sizes were investing in new, web-based
technology. Internet-savvy applications were developed with HTTP
technology, and the flexibility of XML, as well as its introduction to the web
developer community, made it the inevitable choice for the most widely
deployed messaging technologies.
While much of the current work is focused on business applications,
notably business-to-business integration, that's hardly the only type of
application it benefits. There's also interest in peer-to-peer (P2P) protocols
built with XML. P2P is usefully viewed as just messaging policies for
applications that have finally escaped from the "client" or "server" straitjacket.
Now, essentially anyone can run a server and act as a publisher for
information they have produced. These new publishing systems are most
naturally built with the same XML and HTTP technologies adopted elsewhere.
Another interesting way to compare these models is that while the RPC
model moves computations to where the data lives, the Web model moves
the data to where the computation lives. That has been called the
representational state transfer (REST) model. When code is downloaded, a
third model can be said to come into effect. The design of distributed
systems needs to balance among all these alternatives and not focus
exclusively on any single model.
Roles for Java in XML Messaging
Since Java was the first true Internet-integrated programming
environment and had XML support very early, it's no surprise that a huge amount
of XML messaging work is done in Java. There are a variety of higher-level
XML APIs and tools, all of which define particular messaging policies and
frameworks. This book may seem somewhat iconoclastic in its perspective
on such tools: many of them are overkill. Most applications will be fine
without any of the heavier-weight items on the API smorgasbord (for any
language!); a lighter meal will often be the healthier solution, even on an
expense account budget. There's plenty of scope for innovative
applications written without such toolkits, and it's easier to spread them if they
don't depend on first deploying lots of complex new infrastructure.
From an interoperability perspective, the most interesting work is
language-neutral development of protocols. Some such initiatives hide or
limit use of XML, such as XML-RPC. Others, notably BEEP and SOAP, let
applications provide their own payloads, although SOAP is usually
coupled with synchronous RPC-style messaging and with payloads that use
W3C XML Schema, which precludes full use of XML (such as DTDs). BEEP is a
standards-track peer-to-peer Internet protocol, building on decades of
community experience and supporting both synchronous and asynchronous
messaging models. http://www.beepcore.org has a wealth of relevant
information, including protocol specifications and toolkits in many languages
including Java. And as presented in various parts of this book, it's easy to
use HTTP/HTTPS directly with SAX; that approach is very lightweight.
Many applications can define XML messages and pass them using HTTP
without needing additional policies or APIs; it's only a small stretch to use
SMTP and email queues if you need asynchronous queuing.
To develop lightweight XML-based applications, get a JDK, an HTTP/
HTTPS servlet engine, an XML toolset with SAX2 support, and probably a
relational database that you can access through JDBC. That's enough for
quite a lot of web services. When you need to get beyond HTTP-centric
models, look at protocol frameworks like BEEP, which has long had Java
support. Remember to carefully document and review your XML messages
and protocols and to keep that documentation current. That is important
for maintaining your software, and such good practices will help uncover
design bugs early in system life cycles, when they're easy to fix.
XML Messaging over HTTP with SAX2
HTTP is a request/response protocol, loosely called an RPC transport.
Strictly speaking, RPC touches on APIs in some programming language
and makes them location transparent, but here we use the term in a
broader request/response sense. HTTP has several operations, sent to a
particular server port (typically 80 for nonencrypted HTTP) and directed
to a particular URI. For the purposes of XML messaging, the most
important HTTP operations are GET and POST.
HTTP's GET request asks the server to return the data associated with the
request's URI, as modified by various header fields. Other than the request
itself, this is a one-way data transfer, from server to client; the data is
returned using MIME as a typed envelope. For the purposes of this book,
that data is most interesting when it's XML text. Web browsers normally
issue GET to retrieve documents, and in Java when you read data from a
java.net.URL you are normally issuing a GET. In particular, when a client
passes a URI to the SAX XMLReader.parse(uri) call, the call uses GET
underneath. It's easy to dynamically generate XML content from Java
servlets, as shown in the section "Building Applications with RSS" earlier
in this chapter.
HTTP's POST request is more interesting. POST is very similar in structure
to GET, but it provides something that GET doesn't have: the request
includes MIME-encoded data. (Again, that's most interesting when it's XML
text.) That is, unlike GET, POST is a two-way data transfer; XML can be
sent to the server as part of the request, as well as returned in the
response. Another key difference is that GET is idempotent: clients are
expected to reissue GET calls, which must not change significant server
state. If you wanted to transfer money between bank accounts, POST is
the call to use since it's expected to execute exactly once.
It's a bit messy to issue XML-in/XML-out POST requests from Java clients.
We'll discuss how to do this in relatively pure SAX, but first let's look at
this process using a SAX-friendly API library. No matter how you actually
transfer this data, the real work of your application will be to turn the SAX
events into application work. You'll likely connect code resembling this
example to application-specific code that marshals and unmarshals custom
data structures needed to do its work.
The gnu.xml.pipeline.CallFilter class packages the entire process as a
pipeline stage, sending its input events as a POST request and parsing the
POST response to produce output events. That makes it easy to use POST
as a generic processing component. For example, in a batch processing
scenario you might want to POST an XML file to a server and print its
response. Such a server might schedule work as described in the
particular document, and it could easily have access to resources or privileges
unavailable to your client. This request can be issued programmatically or
in some cases using a standard command-line tool.
Example 6-7 shows one way to send an XML file to a server and save its
XML response as another file. As mentioned, the NSFilter class can be
(and in this case, is) optimized away. It's just making sure the namespace
prefix information in the event stream isn't missing anything important.
Example 6-7. Exchanging XML with a server (GNU pipeline version)
import gnu.xml.pipeline.*;
import gnu.xml.util.Resolver;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;

public class CallFile
{
    // argv [0] == in.xml (filename)
    // argv [1] == url for posting service
    public static void main (String argv [])
    {
        try {
            EventConsumer out;
            XMLReader in;

            // the pipeline is assembled from its last stage to its first
            out = new TextConsumer (System.out);
            out = new NSFilter (out);
            out = new CallFilter (argv [1], out);
            out = new NSFilter (out);

            in = XMLReaderFactory.createXMLReader ();
            EventFilter.bind (in, out);
            in.parse (Resolver.fileNameToURL (argv [0]));
        } catch (Exception e) {
            e.printStackTrace ();
            System.exit (1);
        }
    }
}
If you want to do the same thing without using that pipeline framework,
you have more work to do. You'll be driving the java.net.URLConnection
directly, ensuring the text encodings are correct. And you won't have a
generic way to group all SAX handlers together; you'd need to create an
analogue of gnu.xml.pipeline.EventConsumer or, as shown in Example
6-8, write code that knows the specific output class it's talking to.
Example 6-8. Exchanging XML with a server (SAX-only version)
import java.io.*;
import java.net.*;
import gnu.xml.util.Resolver;
import gnu.xml.util.XMLWriter;
import org.xml.sax.*;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class CallFile
{
    // same feature ID prefix as used in Example 6-5
    private static String featurePrefix =
        "http://xml.org/sax/features/";

    // argv [0] == in.xml (filename)
    // argv [1] == url for posting service
    public static void main (String argv [])
    {
        try {
            XMLReader in;
            CallWriter caller;
            XMLWriter out;

            out = new XMLWriter (System.out);
            caller = new CallWriter (new URL (argv [1]), out);
            in = XMLReaderFactory.createXMLReader ();
            in.setFeature (featurePrefix + "namespace-prefixes", true);
            bindAll (in, caller);
            in.parse (Resolver.fileNameToURL (argv [0]));
        } catch (Exception e) {
            e.printStackTrace ();
            System.exit (1);
        }
    }

    // register "out" for every kind of SAX handler it implements
    private static void bindAll (XMLReader in, Object out)
    throws SAXException
    {
        if (out instanceof ContentHandler)
            in.setContentHandler ((ContentHandler) out);
        if (out instanceof DTDHandler)
            in.setDTDHandler ((DTDHandler) out);
        try {
            if (out instanceof DeclHandler)
                in.setProperty
                    ("http://xml.org/sax/properties/"
                        + "declaration-handler", out);
        } catch (SAXNotRecognizedException e) { /* IGNORE */ }
        try {
            if (out instanceof LexicalHandler)
                in.setProperty
                    ("http://xml.org/sax/properties/"
                        + "lexical-handler", out);
        } catch (SAXNotRecognizedException e) { /* IGNORE */ }
    }

    // print input to server
    // block till response
    // print output as XML text to stdout
    private static class CallWriter extends XMLWriter
    {
        private URL target;
        private URLConnection conn;
        private XMLWriter next;

        CallWriter (URL url, XMLWriter out)
        {
            super ((Writer)null);
            target = url;
            next = out;
        }

        // Connect to remote object and set up to send it XML text
        public synchronized void startDocument () throws SAXException
        {
            try {
                conn = target.openConnection ();
                conn.setDoOutput (true);
                // "text/*" expects DOS-style EOL; this applies to the
                // request body that this writer sends, not to "next"
                setEOL ("\r\n");
                conn.setRequestProperty ("Content-Type",
                    "text/xml;charset=UTF-8");
                setWriter (new OutputStreamWriter (
                        conn.getOutputStream (),
                        "UTF8"), "UTF-8");
            } catch (IOException e) {
                fatal ("can't write (POST) to URI: " + target, e);
            }
            super.startDocument ();
        }

        // finish sending request
        // receive the POST response
        public void endDocument () throws SAXException
        {
            super.endDocument ();
            try {
                InputSource source = new InputSource
                    (conn.getInputStream ());
                XMLReader producer = XMLReaderFactory.createXMLReader ();
                String encoding;

                producer.setFeature (featurePrefix +
                    "namespace-prefixes", true);
                encoding = Resolver.getEncoding (conn.getContentType ());
                if (encoding != null)
                    source.setEncoding (encoding);
                bindAll (producer, next);
                producer.parse (source);
            } catch (IOException e) {
                fatal ("I/O Exception reading response, "
                    + e.getMessage (), e);
            }
        }
    }
}
In that example scenario, you might also be able to just use binary file I/O
and trust that the inputs and outputs are actually XML. But in general,
inputs won't be sitting in files, and output processing will involve more
than creating a new file. Both the CallFilter and CallWriter classes shown
here are structured to be reusable.
On the server side, it's also easy to handle POST. In fact, you've seen all
you need to know already! We saw how to pull XML data out of the POST
request using the XMLReader.parse(InputSource) method in Chapter 3, in
the section "Providing entity text." Writing XML data in the response works
exactly like it does for a GET, as shown earlier in this chapter in the
section "Building Applications with RSS." The main XML-specific issue is to
handle the character encoding correctly, as shown in both of those
examples. (UTF-8 is the safest over-the-wire encoding.) It's safest to use the
application/xml MIME content type whenever you pass XML using HTTP,
since there are fewer things that can (and will!) go wrong. You should
also make sure to use CRLF-style line ends whenever you use a text/*
MIME content type. You might want to pay attention to some
servlet-specific issues, such as structuring your code to support connection
keepalives or (less commonly) on-the-fly compression of response data.
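Handling the encoding correctly mostly means honoring the charset
parameter of the Content-Type header when one is present, and otherwise
letting the parser autodetect. Example 6-8 delegates that to the GNU
Resolver.getEncoding() method; the helper below is only a rough stand-in
for situations where that library isn't available. The class name
ContentTypes and the method charsetOf() are hypothetical, and the parsing
is deliberately simple rather than a full MIME header parser.

```java
import java.util.Locale;

class ContentTypes {
    // Returns the charset parameter of a MIME content type, or null if
    // there is none, so the XML parser can autodetect the encoding.
    static String charsetOf (String contentType) {
        if (contentType == null)
            return null;
        String parts [] = contentType.split (";");
        // parts [0] is the media type itself; parameters follow it
        for (int i = 1; i < parts.length; i++) {
            String param = parts [i].trim ();
            if (param.toLowerCase (Locale.ROOT).startsWith ("charset=")) {
                String value = param.substring ("charset=".length ()).trim ();
                // strip optional double quotes around the value
                if (value.length () >= 2 && value.startsWith ("\"")
                        && value.endsWith ("\""))
                    value = value.substring (1, value.length () - 1);
                return value.length () == 0 ? null : value;
            }
        }
        return null;
    }
}
```

A non-null result would be handed to InputSource.setEncoding() before
parsing the request or response body, as Example 6-8 does with the value
it gets from the GNU helper.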
In many cases it's probably good to have servlets' doPost() methods save
input to persistent storage, so that some other thread can pick it up as a
work item, and then just use the response data to acknowledge the
request. The client would collect any additional requests later, either when
it polled or when the server called back to the client (with another POST).
That approach avoids tying up connections for a long time and creates a
framework whereby many component failures will be transparently
recovered from. Using such an atomic transaction model correctly can let you
avoid the need for transactional roll-back mechanisms to recover from
common system failure modes.
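The "save it, then acknowledge it" pattern can be sketched without any
servlet code at all. In the following illustration, which is an assumption
about one possible design rather than anything prescribed here, requests
are spooled as files in a directory; a doPost() method would call accept()
and send its return value as the response, while a worker thread would
call processNext(). The class name RequestSpool and the "ack" element are
hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical spool of pending XML requests, one file per request.
class RequestSpool {
    private final Path dir;

    RequestSpool (Path dir) { this.dir = dir; }

    // Convenience for experiments: a spool in a fresh temporary directory.
    static RequestSpool inTempDir () {
        try {
            return new RequestSpool (Files.createTempDirectory ("spool"));
        } catch (IOException e) {
            throw new RuntimeException (e);
        }
    }

    // Persist the request body, then return a short acknowledgment.
    String accept (String requestXml) {
        String id = UUID.randomUUID ().toString ();
        try {
            Path tmp = dir.resolve (id + ".tmp");
            Files.write (tmp, requestXml.getBytes (StandardCharsets.UTF_8));
            // rename only after the write completes, so workers never
            // pick up a partially written request
            Files.move (tmp, dir.resolve (id + ".xml"));
        } catch (IOException e) {
            throw new RuntimeException (e);
        }
        return "<ack id=\"" + id + "\"/>";
    }

    // Take one pending request off the spool; null if none is waiting.
    // (A real worker would delete the file only after finishing the work.)
    String processNext () {
        try (DirectoryStream<Path> pending =
                Files.newDirectoryStream (dir, "*.xml")) {
            for (Path p : pending) {
                String xml = new String (Files.readAllBytes (p),
                        StandardCharsets.UTF_8);
                Files.delete (p);
                return xml;
            }
            return null;
        } catch (IOException e) {
            throw new RuntimeException (e);
        }
    }
}
```

Because the request is on disk before the acknowledgment is sent, a crash
after accept() returns loses no work; the spool is simply reread when the
worker restarts.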
Including Subdocuments
In XML, external parsed entities are used to merge one file into another.
This mechanism is used to partition larger XML documents (such as this
book) into smaller ones (such as this chapter). Such external entities aren't
quite the same as actual XML documents. They do not have DTDs; they
have zero or more top-level elements instead of exactly one; and they
have text declarations at the top instead of XML declarations.*
* These might show only the text encoding: <?xml encoding="Big5"?> is a
legal text declaration. To be an XML declaration, it would need to include a
version first, like version="1.0"; it's good practice to include both.
Documents that use encoding declarations with no version number cannot be
opened as XML directly. They can only be included in XML documents by way
of an entity.
Those entities are in some ways awkward to use. Some people don't like
to use DTDs, and their tools might not let them declare and create
references to such entities. In any case, DTDs add the requirement that such
entities be declared in advance. When you're building big documents out
of little ones, widely spreading such knowledge can be undesirable. It's
often easier to keep a local reference accurate than to update the remote
declarations it depends on. Also, documents nest inside others, and small
changes nested inside one document could force updates to many DTDs if
the document is included in several others. In short, external parsed
entities aren't as easy or natural to use as the #include "filename" syntax
widely known to C/C++ developers. This is often viewed as a problem.
The response is obvious: use some other part of XML syntax to define a
more natural inclusion construct. There's a W3C draft called XInclude,
which doesn't quite do this (in the most current draft). XInclude uses
element syntax, which is fine, but it doesn't just define a simple and familiar
inclusion mechanism. XInclude supports the XPointer superset of XPath to
embed almost arbitrary fragments of XML text. In effect, W3C's XInclude is
a generalized linking model, and one which depends on significant
infrastructure. The model hasn't met with widespread acceptance, and in any
case is too complex to use for an example here. That's really too bad;
normal inclusion is a strict streaming model, ideal for implementing with SAX,
and the model of including fragments is exotic pretty much everywhere
except within the linking community.
Here we show how to implement a variant of XInclude, which can
replace many uses of external entities because it doesn't use XPointer. To
emphasize the difference, we'll use a different syntax:
<?XInclude http://www.example.com/data/included.xml?>
<!-- instead of what XInclude uses: -->
<xi:include
    xmlns:xi="http://www.w3.org/2001/XInclude"
    href="http://www.example.com/data/included.xml"
    parse="xml"
    encoding="euc-jp"
    >
  content of xi:include is ignored,
  the whole element gets replaced
</xi:include>
This example highlights several different SAX2 mechanisms. It uses the
XMLFilterImpl class in two different modes and pays careful attention to
the data it passes through. The different modes are as follows:
• The outer filter must be used as a mixed event producer and consumer,
  with access to the full stream of event data as well as any
  ErrorHandler and EntityResolver objects in use. If it's not used this
  way, it won't be correct; it's hard to know such things about a SAX
  Filter unless they are discussed in the class documentation.
  The outer filter proxies the Locator so that applications see the right
  event locations and base URIs. It usually forwards the events to the
  true recipient of the event stream, but it will also handle nested
  inclusions when they are sent from the inner filter.
• The inner filter is used as a pure event consumer. It cooperates with
  the outer filter to keep the proxy working correctly, and is set up to
  strip out DTD-related events and forward the rest to the outer filter.
The code in Example 6-9 takes a few shortcuts but implements the essen-
tial inclusion functionality.
Example 6-9. XInclude processing instruction
import java.io.IOException;
import java.net.URL;
import java.util.Vector;
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

public final class XI extends XMLFilterImpl
    implements LexicalHandler, Locator
{
    // Act as a proxy for whatever the current locator is.
    private Locator locator;

    // to avoid circular inclusion
    private Vector pending = new Vector (5, 5);

    private LexicalHandler lexicalHandler;
    private static String lexicalID =
        "http://xml.org/sax/properties/lexical-handler";

    public void setDocumentLocator (Locator l)
    {
        locator = l;
        super.setDocumentLocator (this);
    }

    public String getSystemId ()
        { return (locator == null) ? null : locator.getSystemId (); }
    public String getPublicId ()
        { return (locator == null) ? null : locator.getPublicId (); }
    public int getLineNumber ()
        { return (locator == null) ? -1 : locator.getLineNumber (); }
    public int getColumnNumber ()
        { return (locator == null) ? -1 : locator.getColumnNumber (); }

    // Inner Filter Class: manage the current locator,
    // and filter out events that would be incorrect to report
    private class Scrubber extends XMLFilterImpl implements LexicalHandler
    {
        Locator savedLocator;
        LexicalHandler next;

        Scrubber (Locator l, LexicalHandler n)
            { savedLocator = l; next = n; }

        // maintain proxy locator
        // only one startDocument()/endDocument() pair per event stream
        public void setDocumentLocator (Locator l)
            { locator = l; }
        public void startDocument ()
            { }
        public void endDocument ()
            { locator = savedLocator; }

        private void reject (String message) throws SAXException
            { throw new SAXParseException (message, locator); }

        // only the DTD from the base document gets reported
        public void startDTD (String root, String publicId, String systemId)
        throws SAXException
            { reject ("DTD: " + systemId); }
        public void endDTD ()
        throws SAXException
            { reject ("DTD"); }
        // ... so this should never happen
        public void skippedEntity (String name) throws SAXException
            { reject ("entity: " + name); }

        // since we rejected DTDs, only built-in entities can be reported
        public void startEntity (String name)
        throws SAXException
            { next.startEntity (name); }
        public void endEntity (String name)
        throws SAXException
            { next.endEntity (name); }

        // other lexical events cause no worries
        public void startCDATA () throws SAXException
            { next.startCDATA (); }
        public void endCDATA () throws SAXException
            { next.endCDATA (); }
        public void comment (char buf[], int off, int len)
        throws SAXException
            { next.comment (buf, off, len); }
    }

    // count is zero in the document prologue and epilogue
    private int count;

    public void startElement (String u, String l, String q, Attributes a)
    throws SAXException
        { count++; super.startElement (u, l, q, a); }
    public void endElement (String u, String l, String q)
    throws SAXException
        { --count; super.endElement (u, l, q); }

    public void startDocument () throws SAXException
    {
        pending.addElement (locator.getSystemId ());
        super.startDocument ();
    }
    public void endDocument () throws SAXException
        { pending.clear (); super.endDocument (); }

    // handle <?XInclude URI> processing instructions
    public void processingInstruction (String target, String data)
    throws SAXException
    {
        if ("XInclude".equals (target)) {
            // this should do full XML base processing
            // instead we just handle relative and absolute URLs
            try {
                URL url = new URL (getSystemId ());
                url = new URL (url, data.trim ());
                data = url.toString ();
            } catch (Exception e) {
                throw new SAXParseException (
                    "XInclude, can't use URI: " + data, locator, e);
            }
            xinclude (data);
        } else
            super.processingInstruction (target, data);
    }

    // this might be called from startElement too
    private void xinclude (String uri)
    throws SAXException
    {
        XMLReader helper;
        Scrubber scrubber;

        if (count == 0)
            throw new SAXParseException (
                "XInclude, illegal location", locator);
        if (pending.contains (uri))
            throw new SAXParseException (
                "XInclude, circular inclusion", locator);

        // start with another parser acting just like us
        helper = XMLReaderFactory.createXMLReader ();
        helper.setEntityResolver (this);
        helper.setErrorHandler (this);

        // Set up the proxy locator and inner filter.
        scrubber = new Scrubber (locator, this);
        locator = null;
        scrubber.setContentHandler (this);
        helper.setContentHandler (scrubber);
        helper.setProperty (lexicalID, scrubber);
        // we INTEND to discard DTDHandler and DeclHandler events
        // Merge the included document, except its DTD
        try {
            pending.addElement (uri);
            helper.parse (uri);
        } catch (java.io.IOException e) {
            SAXParseException err;
            ErrorHandler handler;

            err = new SAXParseException (uri, locator, e);
            handler = getErrorHandler ();
            if (handler != null)
                handler.fatalError (err);
            throw err;
        } finally {
            pending.removeElement (uri);
        }
    }

    // LexicalHandler interface
    public void startEntity (String name)
    throws SAXException
        { if (lexicalHandler != null) lexicalHandler.startEntity (name); }
    public void endEntity (String name)
    throws SAXException
        { if (lexicalHandler != null) lexicalHandler.endEntity (name); }

    public void startDTD (String root, String publicId, String systemId)
    throws SAXException
    {
        if (lexicalHandler != null)
            lexicalHandler.startDTD (root, publicId, systemId);
    }
    public void endDTD () throws SAXException
        { if (lexicalHandler != null) lexicalHandler.endDTD (); }

    public void startCDATA () throws SAXException
        { if (lexicalHandler != null) lexicalHandler.startCDATA (); }
    public void endCDATA () throws SAXException
        { if (lexicalHandler != null) lexicalHandler.endCDATA (); }

    public void comment (char buf[], int off, int len) throws SAXException
        { if (lexicalHandler != null) lexicalHandler.comment (buf, off, len); }

    // so this works as a "consumer"
    public void setProperty (String uri, Object handler)
    throws SAXNotRecognizedException, SAXNotSupportedException
    {
        if (lexicalID.equals (uri))
            lexicalHandler = (LexicalHandler) handler;
        else
            super.setProperty (uri, handler);
    }

    // so this works as a "producer"
    public void parse (InputSource in)
    throws SAXException, IOException
    {
        XMLReader parent = getParent ();
        if (parent != null)
            parent.setProperty (lexicalID, this);
        super.parse (in);
    }
}
The most significant shortcut in this code is that, to simplify the example,
XML Base isn't supported. That's easily fixed using the technique shown
earlier, in Example 5-1. Similarly, the namespace reporting and validation
modes of the default parser are assumed to be OK; they should be copied
or specified as part of this event consumer's API.
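One way to copy those modes onto a helper parser is sketched below. The class name FeatureCopy and the choice of exactly these three flags are assumptions made for illustration; the feature identifiers themselves are the standard SAX2 ones. Flags that either reader doesn't recognize or support are skipped, which leaves the helper's default in place.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of Example 6-9.
class FeatureCopy {
    static final String[] FLAGS = {
        "http://xml.org/sax/features/namespaces",
        "http://xml.org/sax/features/namespace-prefixes",
        "http://xml.org/sax/features/validation"
    };

    // Copies each flag from one reader to another; returns how many were copied.
    static int copyFeatures(XMLReader from, XMLReader to) {
        int copied = 0;
        for (String id : FLAGS) {
            try {
                to.setFeature(id, from.getFeature(id));
                copied++;
            } catch (SAXException e) {
                // SAXNotRecognizedException or SAXNotSupportedException:
                // keep whatever default the target reader already has
            }
        }
        return copied;
    }
}
```

In the xinclude() method, such a call would sit right after the helper parser is created, using the filter's parent as the source of the settings.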
Merging SAX event streams from two different sources is quite simple,
except for DTD-related information. One basic problem is structural: DTD
events may be reported only at the beginning of a SAX event stream, and
the chance to do that has been lost by the time an included document is
processed. Another basic problem is semantic: the events from the two
sources could easily conflict with each other. Neither of those problems
can be solved with a pure stream processing model, unless the included
documents use the same DTD as the base document. Accordingly, this
example treats DTD events from included streams as errors.
The best way to use XML inclusions is with XML text that doesn't use
DTDs, perhaps using "XML 1.0 plus Namespaces" rules to help assign
meaning to individual elements and attributes. Eliminating DTDs means
some important bits of the XML Infoset will be unavailable, such as the
attribute-typing information that tells you which elements are used as IDs.
If all the files in question are themselves well-formed XML documents
with both version and encoding in any XML declarations (and without a
DTD), they can easily be included without significant restrictions. Such an
inclusion facility can be convenient in a variety of application contexts,
such as template-driven document processing and other cases where it's
important to build larger documents from smaller ones.
A
SAX2 API Summary
This appendix provides a quick reference to each of the SAX2 APIs
presented in this book. It shows API signatures and provides a brief overview
for each interface, class, and exception in alphabetical order.
Full documentation for these APIs is available for you to download or
browse online at the SAX web site (http://sax.sourceforge.net/), and it
should also be available with documentation for your SAX parser.
The org.xml.sax Package
The org.xml.sax package holds the interfaces and exceptions that are at
the core of SAX, including some deprecated SAX1 APIs.
The AttributeList Interface
This SAX1 interface is not used in SAX2; the Attributes interface, which
supports namespace identifiers, is used instead.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public interface AttributeList
{
public int getLength();
public String getType(int index);
public String getValue(int index);
// access to name info
public String getName(int index);
// access by XML 1.0-style names
public String getType(String qName);
public String getValue(String qName);
}
The Attributes Interface
This interface groups all the attributes associated with a given element in
the ContentHandler.startElement() call. Attribute characteristics are
frequently accessed using an indexed access model, though you can also
determine the index, type, or value of an attribute given XML 1.0-style
(qName) or namespace-style (uri, localName) versions of its name.
For more information, refer to the section "The Attributes Interface" in
Chapter 2.
public interface Attributes
{
public int getLength();
public String getType(int index);
public String getValue(int index);
// access to name info
public String getQName(int index);
public String getLocalName(int index);
public String getURI(int index);
// access by XML Namespace-style names
public int getIndex(String uri, String localName);
public String getType(String uri, String localName);
public String getValue(String uri, String localName);
// access by XML 1.0-style names
public int getIndex(String qName);
public String getType(String qName);
public String getValue(String qName);
}
The ContentHandler Interface
This is the primary SAX2 handler interface, which is used in almost all
applications.
For more information, refer to the section "Essential ContentHandler
Callbacks," and the section "ContentHandler and Prefix Mappings," both in
Chapter 2, as well as the section "Other ContentHandler Methods" in
Chapter 4.
public interface ContentHandler
{
// bookkeeping
public void setDocumentLocator(Locator locator);
public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
// content events
public void startElement(String uri, String localName, String qName,
Attributes attributes)
throws SAXException;
public void endElement(String uri, String localName, String qName)
throws SAXException;
public void characters(char buf[], int offset, int length)
throws SAXException;
public void processingInstruction(String target, String data)
throws SAXException;
// extra info
public void ignorableWhitespace(char buf[], int offset, int length)
throws SAXException;
public void startPrefixMapping(String prefix, String uri)
throws SAXException;
public void endPrefixMapping(String prefix) throws SAXException;
public void skippedEntity(String name) throws SAXException;
}
The DocumentHandler Interface
This SAX1 interface is not used in SAX2; the ContentHandler interface,
which reports namespace identifiers and scopes as well as skipped
entities, is used instead.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public interface DocumentHandler
{
// bookkeeping
public void setDocumentLocator(Locator locator);
public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
// content events
public void startElement(String qName, AttributeList attributes)
throws SAXException;
public void endElement(String qName) throws SAXException;
public void characters(char buf[], int offset, int length)
throws SAXException;
public void processingInstruction(String target, String data)
throws SAXException;
// extra info
public void ignorableWhitespace(char buf[], int offset, int length)
throws SAXException;
}
The DTDHandler Interface
This interface is used to report information that is useful to some
SGML-derived applications.
For more information, refer to the section "The DTDHandler Interface" in
Chapter 4.
public interface DTDHandler
{
public void notationDecl(String notationName,
String publicId, String systemId)
throws SAXException;
public void unparsedEntityDecl(String entityName,
String publicId, String systemId, String notationName)
throws SAXException;
}
The EntityResolver Interface
This interface encapsulates a strategy for resolving public or system
identifiers for parsed entities into data that a parser will read. It is
commonly used to ensure that local copies of DTDs are used, instead of DTDs
accessed across a network link that may be saturated or unavailable. It
can resolve general entities, used to store non-DTD parts of a document
in separate storage units.
For more information, refer to the section "The EntityResolver Interface" in
Chapter 3.
public interface EntityResolver
{
public InputSource resolveEntity(String publicId,
String systemId)
throws SAXException, java.io.IOException;
}
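The "local copies of DTDs" use can be sketched roughly as follows. The class name LocalCopyResolver, its register() method, and the idea of holding DTD text in memory are assumptions made to keep the example self-contained; a real resolver would more likely map identifiers to local files. Returning null asks the parser to fall back to its normal lookup.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps public identifiers to locally held DTD text.
class LocalCopyResolver implements EntityResolver {
    private final Map<String, String> localCopies = new HashMap<>();

    void register(String publicId, String dtdText) {
        localCopies.put(publicId, dtdText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String text = (publicId == null) ? null : localCopies.get(publicId);
        if (text == null)
            return null;    // no local copy: let the parser use the system ID

        InputSource in = new InputSource(new StringReader(text));
        // keep the identifiers so relative URIs and diagnostics stay meaningful
        in.setPublicId(publicId);
        in.setSystemId(systemId);
        return in;
    }
}
```

Such an object would be handed to a parser with XMLReader.setEntityResolver() before parsing starts.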
The ErrorHandler Interface
This interface encapsulates a strategy for handling different kinds of
errors. Parsers use its methods when reporting errors, and have default
policies that are used if the application's strategy doesn't result in
throwing an exception. Applications can benefit from sharing the same
mechanism to report their own errors. Implementations typically use the
problem's severity to choose first whether to emit a diagnostic, and then
whether to throw the parameter (to terminate processing) or return (to
continue processing).
For more information, refer to the section "ErrorHandler Interface" in
Chapter 2.
public interface ErrorHandler
{
public void error(SAXParseException x) throws SAXException;
public void fatalError(SAXParseException x) throws SAXException;
public void warning(SAXParseException x) throws SAXException;
}
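A minimal sketch of such a severity-based strategy follows. The class name SeverityPolicy, its counters, and the policy itself (print and continue for warnings and errors, throw for fatal errors) are illustrative assumptions, not a recommended or standard implementation.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler: emit diagnostics for warnings and recoverable
// errors, and terminate processing on fatal errors.
class SeverityPolicy implements ErrorHandler {
    int warnings;
    int errors;

    private static String describe(String level, SAXParseException x) {
        // the system ID is null and the line is -1 when no location is known
        return level + ": " + x.getSystemId() + " line "
            + x.getLineNumber() + ": " + x.getMessage();
    }

    public void warning(SAXParseException x) {
        warnings++;
        System.err.println(describe("warning", x));   // report, then continue
    }

    public void error(SAXParseException x) {
        errors++;
        System.err.println(describe("error", x));     // report, then continue
    }

    public void fatalError(SAXParseException x) throws SAXException {
        throw x;                                      // terminate processing
    }
}
```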
The HandlerBase Class
This SAX1 class is not used in SAX2; the org.xml.sax.helpers.DefaultHandler
class, which supports SAX2 features, is used instead.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public class HandlerBase
implements EntityResolver, DTDHandler,
DocumentHandler, ErrorHandler
{
public HandlerBase();
// DocumentHandler (SAX1)
public void setDocumentLocator(Locator locator);
public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
public void startElement(String qName,
AttributeList attributes)
throws SAXException;
public void endElement(String qName) throws SAXException;
public void characters(char buf[], int offset, int length)
throws SAXException;
public void ignorableWhitespace(char buf[], int offset,
int length)
throws SAXException;
public void processingInstruction(String target,
String data)
throws SAXException;
// DTDHandler ... NOTE: no "throws SAXException"!
public void notationDecl(String notationName,
String publicId, String systemId);
public void unparsedEntityDecl(String entityName,
String publicId, String systemId, String notationName);
// EntityResolver
public InputSource resolveEntity(String publicId,
String systemId)
throws SAXException;
// ErrorHandler
public void error(SAXParseException x) throws SAXException;
public void fatalError(SAXParseException x) throws SAXException;
public void warning(SAXParseException x) throws SAXException;
}
The InputSource Class
This class is used to encapsulate entities for consumption by an
XMLReader (or a SAX1 Parser). Applications should make every effort to
provide a usable system identifier (an absolute URI, rather than null). This
will ensure that relative URIs can be properly resolved so that diagnostics
are meaningful. Given that identifier, SAX parsers can do the rest, possibly
with assistance from an EntityResolver.
For more information, refer to the section "The InputSource Class" in
Chapter 3.
public class InputSource {
public InputSource();
public InputSource(String systemId);
public InputSource(java.io.InputStream in);
public InputSource(java.io.Reader in);
// getters
public String getPublicId();
public String getSystemId();
public java.io.InputStream getByteStream();
public String getEncoding();
public java.io.Reader getCharacterStream();
// setters
public void setPublicId(String publicId);
public void setSystemId(String systemId);
public void setByteStream(java.io.InputStream in);
public void setEncoding(String encodingName);
public void setCharacterStream(java.io.Reader in);
}
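When the text doesn't come from a URI at all, a system identifier can still be supplied alongside the stream. The helper below is a sketch; the class name InputSources and the method fromText() are hypothetical, and the URI passed in is simply whatever the application considers the document's home.

```java
import java.io.StringReader;
import org.xml.sax.InputSource;

// Hypothetical helper for wrapping in-memory XML text.
class InputSources {
    // The parser reads from the character stream, but the system ID is
    // still available for resolving relative URIs and for diagnostics.
    static InputSource fromText(String text, String absoluteUri) {
        InputSource in = new InputSource(new StringReader(text));
        in.setSystemId(absoluteUri);
        return in;
    }
}
```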
The Locator Interface
An event producer may invoke the ContentHandler.setDocumentLocator()
call to provide one of these objects. It may then be used inside event
callbacks, until the final ContentHandler.endDocument() call, to determine
the location of the data that triggered the event. A common use is to figure
out the base URI used to resolve relative URIs found in document content.
This is true even when xml:base attributes have been used to override the
real base URI of the document. Another common use is to build
SAXParseException objects for application-level diagnostics.
For more information, refer to the section "The Locator Interface" in
Chapter 4.
public interface Locator
{
public String getPublicId();
public String getSystemId();
public int getLineNumber();
public int getColumnNumber();
}
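Both common uses can be combined in one small handler, sketched here. The class name LinkResolver and its resolve() method are hypothetical; like Example 6-9, the sketch relies on java.net.URL for URI resolution and ignores xml:base.

```java
import java.net.MalformedURLException;
import java.net.URL;
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that remembers the locator it was given.
class LinkResolver extends DefaultHandler {
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator l) { locator = l; }

    // Resolves a relative URI found in content against the URI of the
    // entity currently being parsed; failures carry location information.
    String resolve(String relative) throws SAXParseException {
        String systemId = (locator == null) ? null : locator.getSystemId();
        if (systemId == null)
            throw new SAXParseException("no base URI for: " + relative, locator);
        try {
            return new URL(new URL(systemId), relative).toString();
        } catch (MalformedURLException e) {
            throw new SAXParseException("can't resolve URI: " + relative, locator, e);
        }
    }
}
```

A callback such as startElement() would call resolve() on attribute values that hold URIs.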
The Parser Interface
This SAX1 interface is no longer used in SAX2; the XMLReader is used
instead.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public interface Parser
{
// setters
public void setLocale(java.util.Locale locale)
throws SAXException;
public void setEntityResolver(EntityResolver resolver);
public void setDTDHandler(DTDHandler dtdHandler);
public void setDocumentHandler(DocumentHandler docHandler);
public void setErrorHandler(ErrorHandler errHandler);
// parsing
public void parse(InputSource in) throws SAXException,
java.io.IOException;
public void parse(String uri) throws SAXException,
java.io.IOException;
}
SAXException
This is the base SAX exception class. It can wrap another exception, and
(like most exceptions) a descriptive message.
For more information, refer to the section "SAX2 Exception Classes" in
Chapter 2.
public class SAXException extends Exception {
public SAXException(String message);
public SAXException(Exception cause);
public SAXException(String message, Exception cause);
// getters
public Exception getException();
public String getMessage();
}
SAXNotRecognizedException
This exception is used to report that the identifier for a feature flag, or
parser property, is not recognized. When this doesn't indicate a mistyped
identifier, it means that the parser isn't exposing that particular
information.
For more information, refer to the section "SAX2 Exception Classes" in
Chapter 2.
public class SAXNotRecognizedException extends SAXException {
public SAXNotRecognizedException(String message);
}
SAXNotSupportedException
This exception is used to report that while the identifier for a feature flag,
or parser property, was recognized, setting the value was not practical.
For example, read-only values can't be changed to nondefault values,
handler properties need to implement the appropriate interface, and some
values can't be changed while parsing, or accessed except while parsing.
For more information, refer to the section "SAX2 Exception Classes" in
Chapter 2.
public class SAXNotSupportedException extends SAXException {
public SAXNotSupportedException(String message);
}
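The difference between the two exceptions shows up when an application probes a reader. The sketch below assumes a hypothetical helper class, FeatureProbe, that turns the outcome into a short status string; only XMLReader.setFeature() and the two exception classes come from SAX2.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper that distinguishes the two failure modes.
class FeatureProbe {
    static String trySetFeature(XMLReader reader, String id, boolean value) {
        try {
            reader.setFeature(id, value);
            return "set";
        } catch (SAXNotRecognizedException e) {
            return "not recognized";    // unknown (or mistyped) identifier
        } catch (SAXNotSupportedException e) {
            return "not supported";     // known, but can't take this value now
        }
    }
}
```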
SAXParseException
This type of SAXException is reported to the ErrorHandler, and adds
information that is useful for locating (and hence fixing) problems in input
text. That information may include the line and character offset in the
entity being parsed, the entity's URL, and any public identifier associated
with the entity. When used to report application-level errors, any
exception caused by problematic data can be encapsulated, and associated with
such Locator information to help pinpoint problematic input data. It's safe
to provide null Locator or Exception objects.
For more information, refer to the section "SAX2 Exception Classes" as
well as the section "Errors and Diagnostics," which are both located in
Chapter 2.
public class SAXParseException extends SAXException {
public SAXParseException(String message, Locator where);
public SAXParseException(String message, Locator where,
Exception cause);
public SAXParseException(String message,
String publicId, String systemId, int line, int column);
public SAXParseException(String message,
String publicId, String systemId, int line, int column,
Exception cause);
// getters
public String getPublicId();
public String getSystemId();
public int getLineNumber();
public int getColumnNumber();
}
The XMLFilter Interface
This interface encapsulates the notion that one XMLReader may process
the output of another one before delivering it. The XMLFilterImpl helper
class is substantially more interesting, since it does the real work and can
be used in a postprocessing mode as well.
For more information, refer to the section "The XMLFilter Interface" in
Chapter 3.
public interface XMLFilter extends XMLReader
{
public void setParent(XMLReader parent);
public XMLReader getParent();
}
The XMLReader Interface
A SAX2 parser will normally be packaged as an implementation of this
interface. Such a parser often takes XML text as input, though it need not.
Some parsers need a DOM Document as input, and others parse non-XML
text and report it as if it were XML, to leverage the SAX event processing
model.
For more information, refer to the section "The XMLReader Interface" in
Chapter 3. For quick reference to the standard SAX2 feature and property
identifiers, refer to the section "XMLReader Feature Flags" and the section
"XMLReader Properties," both in Chapter 3.
public interface XMLReader
{
// getters
public ContentHandler getContentHandler();
public DTDHandler getDTDHandler();
public EntityResolver getEntityResolver();
public ErrorHandler getErrorHandler();
public boolean getFeature(String uri)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public Object getProperty(String uri)
throws SAXNotRecognizedException,
SAXNotSupportedException;
// setters
public void setContentHandler(ContentHandler contentHandler);
public void setDTDHandler(DTDHandler dtdHandler);
public void setEntityResolver(EntityResolver resolver);
public void setErrorHandler(ErrorHandler errHandler);
public void setFeature(String uri, boolean value)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public void setProperty(String uri, Object value)
throws SAXNotRecognizedException,
SAXNotSupportedException;
// parsing
public void parse(InputSource in)
throws java.io.IOException, SAXException;
public void parse(String uri)
throws java.io.IOException, SAXException;
}
The org.xml.sax.helpers Package
The org.xml.sax.helpers package holds support classes, including
vendor-neutral bootstrapping support and some support for the original SAX1
APIs. These classes are in a sense optional but are provided by all widely
available implementations. They're also required for conformance with
Sun's JAXP API.
The AttributeListImpl Class
This SAX1 class is not used in SAX2; the AttributesImpl class is used
instead.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public class AttributeListImpl implements AttributeList {
public AttributeListImpl();
public AttributeListImpl(AttributeList original);
// AttributeList (accessors only)
public int getLength();
public String getName(int index);
public String getType(int index);
public String getValue(int index);
public String getType(String qName);
public String getValue(String qName);
// mutators
public void setAttributeList(AttributeList original);
public void addAttribute(String qName, String type, String value);
public void removeAttribute(String qName);
public void clear();
}
The AttributesImpl Class
This class can be a convenient way to snapshot attribute information using
the copy constructor. Since the attributes provided by an event producer
are only valid during the particular ContentHandler.startElement() call
that provides them, applications may need such snapshots. The class also
supports construction of arbitrary attribute sets for filtering or event
production.
For more information, refer to the section "The AttributesImpl Class" in
Chapter 5.
public class AttributesImpl implements Attributes {
public AttributesImpl();
public AttributesImpl(Attributes original);
// Attributes (accessors only)
public int getLength();
public String getURI(int index);
public String getLocalName(int index);
public String getQName(int index);
public String getType(int index);
public String getValue(int index);
public int getIndex(String uri, String localName);
public int getIndex(String qName);
public String getType(String uri, String localName);
public String getType(String qName);
public String getValue(String uri, String localName);
public String getValue(String qName);
// setters
public void setLocalName(int index, String localName);
public void setQName(int index, String qName);
public void setType(int index, String type);
public void setURI(int index, String uri);
public void setValue(int index, String value);
// mutators
public void addAttribute(String uri, String localName,
String qName,
String type, String value);
public void clear();
public void removeAttribute(int index);
public void setAttribute(int index, String uri, String localName,
String qName, String type, String value);
public void setAttributes(Attributes original);
}
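The snapshot idiom amounts to one call to the copy constructor, as in this sketch. The class name AttributeRecorder and its snapshots list are hypothetical; the point is only that the stored object is a copy, so later changes to the producer's Attributes object don't affect it.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps the attributes of every element it sees.
class AttributeRecorder extends DefaultHandler {
    final List<Attributes> snapshots = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        // copy now: the producer may reuse 'atts' once this call returns
        snapshots.add(new AttributesImpl(atts));
    }
}
```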
The DefaultHandler Class
This class provides stub implementations of all the standard SAX2
handlers, including ErrorHandler, and the EntityResolver. Those stub
implementations do nothing, except that ErrorHandler.fatalError() throws its
argument. Extension handler callbacks are not supported; if you need
DeclHandler or LexicalHandler stubs you'll need to provide them yourself
(perhaps by subclassing).
For more information, refer to the section "The DefaultHandler Class" in
Chapter 2.
public class DefaultHandler
implements EntityResolver, DTDHandler, ContentHandler,
ErrorHandler
{
public DefaultHandler();
// ContentHandler
public void setDocumentLocator(Locator locator);
public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
public void startElement(String uri, String localName,
String qName, Attributes attributes)
throws SAXException;
public void endElement(String uri, String localName,
String qName)
throws SAXException;
public void characters(char buf[], int offset, int length)
throws SAXException;
public void ignorableWhitespace(char buf[], int offset,
int length)
throws SAXException;
public void processingInstruction(String target, String data)
throws SAXException;
public void startPrefixMapping(String prefix, String uri)
throws SAXException;
public void endPrefixMapping(String prefix)
throws SAXException;
public void skippedEntity(String name) throws SAXException;
// DTDHandler
public void notationDecl(String notationName,
String publicId, String systemId)
throws SAXException;
public void unparsedEntityDecl(String entityName,
String publicId, String systemId, String notationName)
throws SAXException;
// EntityResolver
public InputSource resolveEntity(String publicId,
String systemId)
throws SAXException;
// ErrorHandler
public void error(SAXParseException x) throws SAXException;
public void fatalError(SAXParseException x)
throws SAXException;
public void warning(SAXParseException x) throws SAXException;
}
The LocatorImpl Class
This class can provide a convenient way to snapshot locator information.
Since the locator provided by an event producer may report different
values during each event callback, applications may need such snapshots.
For more information, refer to the section "The LocatorImpl Class" in
Chapter 5.
public class LocatorImpl implements Locator {
public LocatorImpl();
public LocatorImpl(Locator locator);
// Locator
public String getPublicId();
public String getSystemId();
public int getLineNumber();
public int getColumnNumber();
// setters
public void setPublicId(String publicId);
public void setSystemId(String systemId);
public void setLineNumber(int line);
public void setColumnNumber(int column);
}
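A snapshot is again a single copy-constructor call. The wrapper class LocatorSnapshots below is hypothetical and exists only to make the sketch self-contained.

```java
import org.xml.sax.Locator;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical helper for freezing the values of a live locator.
class LocatorSnapshots {
    // The copy keeps the values the live locator reports right now,
    // so it can be stored and used after the callback returns.
    static Locator snapshot(Locator live) {
        return new LocatorImpl(live);
    }
}
```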
The NamespaceSupport Class
This class helps implement stacks of XML namespace context data. It's
mostly useful for applications that need to handle element or attribute
names within document content (including attributes) or for parser writers.
For more information, refer to the section "The NamespaceSupport Class"
in Chapter 5.
public class NamespaceSupport {
// fixed uri for the "xml" prefix
public static final String XMLNS;
public NamespaceSupport();
// manipulate binding stack
public void reset();
public void pushContext();
public void popContext();
public boolean declarePrefix(String prefix, String uri);
public String [] processName(String qName, String parts[],
boolean isAttribute);
// access currently visible prefix bindings
public String getURI(String prefix);
public java.util.Enumeration getPrefixes();
public String getPrefix(String uri);
public java.util.Enumeration getPrefixes(String uri);
public java.util.Enumeration getDeclaredPrefixes();
}
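The following sketch shows one way the binding stack might be used to split a qualified name. The class NamespaceSupportSketch, its resolve() helper, and the xhtml prefix binding are illustrative assumptions, not part of the API.

```java
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical example class: resolves a QName against one declared prefix.
class NamespaceSupportSketch {
    // Returns {namespace URI, local name, qName}, or null when the
    // prefix used in qName has no visible binding.
    static String[] resolve(String qName) {
        NamespaceSupport ns = new NamespaceSupport();
        ns.pushContext();                       // entering an element
        ns.declarePrefix("xhtml", "http://www.w3.org/1999/xhtml");
        String[] parts = ns.processName(qName, new String[3], false);
        ns.popContext();                        // leaving the element
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = resolve("xhtml:a");
        System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
    }
}
```

In a real consumer, pushContext() and popContext() would normally be paired with the start and end of each element rather than called around a single lookup.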
The ParserAdapter Class
This class is used to convert SAX1 Parser objects into XMLReader objects
by converting SAX1 event callbacks into SAX2 callbacks. It uses the
NamespaceSupport class internally to track namespaces so it can report
them for elements and attributes, as required by SAX2. If you need to
make a SAX1 parser report handling of validation or external entities
through feature flags, you can subclass ParserAdapter and override the
appropriate methods.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public class ParserAdapter implements XMLReader,
DocumentHandler {
public ParserAdapter() throws SAXException;
public ParserAdapter(Parser sax1);
// XMLReader getters
public boolean getFeature(String uri)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public ContentHandler getContentHandler();
public DTDHandler getDTDHandler();
public EntityResolver getEntityResolver();
public ErrorHandler getErrorHandler();
public Object getProperty(String uri)
throws SAXNotRecognizedException,
SAXNotSupportedException;
// XMLReader setters
public void setContentHandler(ContentHandler contentHandler);
public void setDTDHandler(DTDHandler dtdHandler);
public void setEntityResolver(EntityResolver resolver);
public void setErrorHandler(ErrorHandler errHandler);
public void setFeature(String uri, boolean value)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public void setProperty(String uri, Object value)
throws SAXNotRecognizedException,
SAXNotSupportedException;
// XMLReader parsing
public void parse(String uri) throws java.io.IOException,
SAXException;
public void parse(InputSource in) throws java.io.IOException,
SAXException;
// DocumentHandler (internals -- don't use)
public void setDocumentLocator(Locator locator);
public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
public void startElement(String qName, AttributeList
attributes)
throws SAXException;
public void endElement(String qName) throws SAXException;
public void characters(char buf[], int offset, int length)
throws SAXException;
public void ignorableWhitespace(char buf[], int offset,
int length)
throws SAXException;
public void processingInstruction(String target,
String data)
throws SAXException;
}
The ParserFactory Class
This SAX1 class is not used in SAX2; the XMLReaderFactory is used
instead. The org.xml.sax.parser system property was used to configure the
default SAX1 parser.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public class ParserFactory {
public static Parser makeParser()
throws ClassNotFoundException, IllegalAccessException,
InstantiationException, NullPointerException,
ClassCastException;
public static Parser makeParser(String classname)
throws ClassNotFoundException, IllegalAccessException,
InstantiationException, ClassCastException;
}
The XMLFilterImpl Class
This class implements all the standard SAX2 events received from its
parent XMLReader by passing them on to the handlers (or EntityResolver)
registered with it. It only supports filtering core events, because it ignores
the two extension handlers for declaration and lexical events.
This means you can use it in two modes. First, it can be a base class for
simple consumer pipelines, unless you need information that's provided
using extension handlers. Second, you can package a filter with a parser,
so it can produce events like an XMLReader that just happens to do a bit
of extra work (such as cleaning up input data).
For more information, refer to the section "The XMLFilterImpl Class" in
Chapter 4.
public class XMLFilterImpl
implements XMLFilter, EntityResolver, DTDHandler,
ContentHandler, ErrorHandler
{
public XMLFilterImpl();
public XMLFilterImpl(XMLReader parent);
public void setParent(XMLReader parent);
// EntityResolver
public InputSource resolveEntity(String publicId,
String systemId)
throws SAXException;
// DTDHandler
public void notationDecl(String notationName,
String publicId, String systemId)
throws SAXException;
public void unparsedEntityDecl(String entityName,
String publicId, String systemId,
String notationName)
throws SAXException;
// ContentHandler
public void setDocumentLocator(Locator locator);
public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
public void startElement(String uri, String localName,
String qName,
Attributes attributes)
throws SAXException;
public void endElement(String uri, String localName,
String qName)
throws SAXException;
public void characters(char buf[], int offset, int length)
throws SAXException;
public void ignorableWhitespace(char buf[], int offset,
int length)
throws SAXException;
public void processingInstruction(String target, String data)
throws SAXException;
public void startPrefixMapping(String prefix, String uri)
throws SAXException;
public void endPrefixMapping(String prefix) throws
SAXException;
public void skippedEntity(String name) throws SAXException;
// ErrorHandler
public void error(SAXParseException x) throws SAXException;
public void fatalError(SAXParseException x) throws
SAXException;
public void warning(SAXParseException x) throws SAXException;
// XMLFilter
public XMLReader getParent();
// XMLReader
public ContentHandler getContentHandler();
public DTDHandler getDTDHandler();
public EntityResolver getEntityResolver();
public ErrorHandler getErrorHandler();
public boolean getFeature(String uri)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public Object getProperty(String uri)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public void setContentHandler(ContentHandler contentHandler);
public void setDTDHandler(DTDHandler dtdHandler);
public void setEntityResolver(EntityResolver resolver);
public void setErrorHandler(ErrorHandler errHandler);
public void setFeature(String uri, boolean value)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public void setProperty(String uri, Object value)
throws SAXNotRecognizedException,
SAXNotSupportedException;
public void parse(InputSource in)
throws java.io.IOException, SAXException;
public void parse(String uri)
throws java.io.IOException, SAXException;
}
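As a rough illustration of the second mode (a filter packaged with a parser), the sketch below subclasses XMLFilterImpl to lower-case element names before passing events on. The class LowerCaseFilter and the helper elementNames() are hypothetical, and the underlying parser comes from JAXP only as a convenient assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example filter: normalizes element names to lower case.
class LowerCaseFilter extends XMLFilterImpl {
    LowerCaseFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        // Delegate to XMLFilterImpl, which forwards to the registered handler.
        super.startElement(uri, localName.toLowerCase(Locale.ROOT),
            qName.toLowerCase(Locale.ROOT), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName.toLowerCase(Locale.ROOT),
            qName.toLowerCase(Locale.ROOT));
    }

    // Parses the text through the filter and lists the names a consumer sees.
    static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader filter =
            new LowerCaseFilter(factory.newSAXParser().getXMLReader());
        List<String> names = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                names.add(qName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<Doc><ITEM/></Doc>"));
    }
}
```

The consumer registers its handler with the filter, not with the parser, so the filter looks like an ordinary XMLReader from the outside.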
The XMLReaderAdapter Class
This class is used to convert SAX2 XMLReader objects into Parser objects
by converting SAX2 event callbacks into SAX1 callbacks.
For more information, refer to the section "SAX1 Support" in Chapter 5.
public class XMLReaderAdapter implements Parser,
ContentHandler {
public XMLReaderAdapter() throws SAXException;
public XMLReaderAdapter(XMLReader reader);
// Parser
public void setLocale(java.util.Locale locale) throws
SAXException;
public void setEntityResolver(EntityResolver resolver);
public void setDTDHandler(DTDHandler dtdHandler);
public void setDocumentHandler(DocumentHandler docHandler);
public void setErrorHandler(ErrorHandler errHandler);
public void parse(String uri) throws java.io.IOException,
SAXException;
public void parse(InputSource in) throws java.io.IOException,
SAXException;
// ContentHandler (internals -- don't use)
public void setDocumentLocator(Locator locator);
public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
public void startElement(String uri, String localName,
String qName,
Attributes attributes)
throws SAXException;
public void endElement(String uri, String localName,
String qName)
throws SAXException;
public void characters(char buf[], int offset, int length)
throws SAXException;
public void ignorableWhitespace(char buf[], int offset,
int length)
throws SAXException;
public void processingInstruction(String target,
String data)
throws SAXException;
public void startPrefixMapping(String prefix, String uri);
public void endPrefixMapping(String prefix);
public void skippedEntity(String name) throws SAXException;
}
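A minimal sketch of the adapter in use follows: a SAX2 XMLReader is wrapped so that legacy SAX1 code, here a HandlerBase subclass, can consume its events. The class Sax1Bridge and its elementNames() helper are hypothetical; the SAX1 types are deprecated in current JDKs, so expect compiler warnings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderAdapter;

// Hypothetical example class: drives SAX1 callbacks from a SAX2 parser.
class Sax1Bridge {
    static List<String> elementNames(String xml) throws Exception {
        XMLReader sax2 =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Parser sax1 = new XMLReaderAdapter(sax2);   // SAX2 in, SAX1 out
        List<String> names = new ArrayList<>();
        sax1.setDocumentHandler(new HandlerBase() {
            @Override
            public void startElement(String name, AttributeList atts) {
                names.add(name);                    // SAX1 reports only qNames
            }
        });
        sax1.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<doc><item/></doc>"));
    }
}
```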
The XMLReaderFactory Class
This factory is the parser-independent bootstrapping API for SAX2. The
reference implementation uses the org.xml.sax.driver system property (or
META-INF/services/org.xml.sax.driver resource in the class path) to
determine the package-qualified name of the environment's default
implementation for the no-parameters call. Most implementations maintain that
behavior, but some resource-constrained environments can use simpler
policies with less configurability.
For more information, refer to the section "The XMLReaderFactory Class"
in Chapter 3.
public final class XMLReaderFactory {
public static XMLReader createXMLReader() throws
SAXException;
public static XMLReader createXMLReader(String classname)
throws SAXException;
}
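The no-parameters bootstrap call can be sketched as below. The class BootstrapSketch is hypothetical. Note that recent JDKs mark XMLReaderFactory as deprecated in favor of JAXP, so this compiles with a warning there; which parser class is returned depends on the environment.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: obtains the environment's default SAX2 parser.
class BootstrapSketch {
    static XMLReader defaultReader() throws SAXException {
        // Consults the org.xml.sax.driver configuration where it is supported.
        return XMLReaderFactory.createXMLReader();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(defaultReader().getClass().getName());
    }
}
```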
The org.xml.sax.ext Package
The org.xml.sax.ext package holds extension interfaces that not all SAX2
parsers are expected to implement. These classes are in a sense optional
but are provided by all widely used implementations and are required by
Sun's JAXP API.
Unlike the handlers in the SAX core, these handlers do not have type-safe
routines to bind them to XMLReader objects. They are identified using
URIs, and bindings are accessed using the getProperty() and
setProperty() methods.
The DeclHandler Interface
This is the primary way SAX2 exposes typing constraints from an XML
Document Type Declaration. It also reports entity declarations.
For more information, refer to the section "The DeclHandler Interface" in
Chapter 4.
public interface DeclHandler
{
// data typing
public void attributeDecl(String element, String attribute,
String type, String mode, String defaultValue)
throws SAXException;
public void elementDecl(String name, String model) throws
SAXException;
// entity info
public void externalEntityDecl(String name,
String publicId, String systemId)
throws SAXException;
public void internalEntityDecl(String name, String value)
throws SAXException;
}
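Because extension handlers are bound by property URI rather than by a typed setter, registration might look like the following sketch. The class DeclCollector and its declarationsIn() helper are hypothetical; the property URI is the standard SAX2 declaration-handler identifier, and the parser is assumed to support it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;

// Hypothetical example class: records attribute declarations from a DTD.
class DeclCollector implements DeclHandler {
    final List<String> attributes = new ArrayList<>();

    @Override
    public void attributeDecl(String eName, String aName, String type,
            String mode, String value) {
        attributes.add(eName + "/@" + aName + " : " + type);
    }

    // The remaining callbacks are not needed for this sketch.
    @Override public void elementDecl(String name, String model) { }
    @Override public void externalEntityDecl(String name, String publicId,
            String systemId) { }
    @Override public void internalEntityDecl(String name, String value) { }

    static List<String> declarationsIn(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DeclCollector collector = new DeclCollector();
        // Throws SAXNotRecognizedException if the parser lacks this extension.
        reader.setProperty(
            "http://xml.org/sax/properties/declaration-handler", collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.attributes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(declarationsIn(
            "<!DOCTYPE a [<!ELEMENT a EMPTY>"
            + "<!ATTLIST a kind CDATA #IMPLIED>]><a/>"));
    }
}
```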
The LexicalHandler Interface
Many parsers expose certain data even though it is, for most purposes, not
part of the information that an XML document intends to convey to
applications. This interface exposes some such data.
For more information, refer to the section "The LexicalHandler Interface"
in Chapter 4.
public interface LexicalHandler
{
public void startDTD(String root, String publicId,
String systemId)
throws SAXException;
public void endDTD() throws SAXException;
public void startEntity(String name) throws SAXException;
public void endEntity(String name) throws SAXException;
public void startCDATA() throws SAXException;
public void endCDATA() throws SAXException;
public void comment(char buf[], int offset, int length)
throws SAXException;
}
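As a small sketch, comments can be captured by registering a LexicalHandler through its property URI. The class CommentCollector and its commentsIn() helper are hypothetical; the property URI is the standard SAX2 lexical-handler identifier, and the parser is assumed to support it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;

// Hypothetical example class: gathers the text of every comment reported.
class CommentCollector implements LexicalHandler {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] buf, int offset, int length) {
        comments.add(new String(buf, offset, length));
    }

    // The remaining callbacks are not needed for this sketch.
    @Override public void startDTD(String root, String publicId,
            String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }

    static List<String> commentsIn(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CommentCollector collector = new CommentCollector();
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(commentsIn("<a><!-- hi --></a>"));
    }
}
```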
B
SAX2 and the XML Infoset
This appendix shows how the various parts of the XML Infoset are made
available through the SAX2 event consumer APIs. Think of it as a
structural index for concepts in SAX2, or for the underlying XML information
structure. Use it when you're trying to develop SAX2-based software that
needs access to particular data. It can also be viewed as an Infoset
conformance statement for SAX2; it will help you to understand what parts of
the XML Infoset aren't supported by SAX2 and to see where SAX2 lets you
access information beyond what the Infoset addresses. The Infoset is not a
data structure; what's important is that the information be provided, not
randomly accessible.
The presentation here is the same as used in the Infoset specification
itself; the structure and order are identical. Information items are similar to
object types, and each is presented in its own section. Information items
consist of sets of named [properties], each of which is presented in a table.
Properties can have one or more values, sometimes ordered, which are
provided in SAX2 using consumer callbacks. You should be able to make
sense of this without reading the Infoset specification if you know XML,
but you'll need it to understand some details.
As of this writing, the XML Infoset (http://www.w3.org/TR/xml-infoset/) has
recently been finalized. This appendix was written using the 24 October
2001 Recommendation, which omits almost all declarations found in the
DTD. Some other W3C specifications use related data models, like the
XPath Data Model. The W3C approach to XML Schemas augments this
core Infoset with additional data-typing information items, defining the
Post-Schema-Validation Infoset (PSVI) items and properties associated
with schema-valid XML text. Most of those PSVI properties relate to
data-typing models.
Event Producer Issues
Although the focus of this appendix is on how SAX2 event consumers see
Infoset data, you may also need to pay attention to some producer-side
issues beyond ensuring that the event stream itself is legal (and perhaps
valid). As the Infoset specification puts it, synthetic infosets might have
inconsistencies that real ones (from XML documents) don't. If you
produce a synthetic infoset, by writing SAX events directly rather than by
using a parser, make sure the event stream is properly constructed.
As noted earlier, you should make sure you always provide the document
URI when you invoke XMLReader.parse(). Not only is this needed to
correctly absolutize relative URIs found in the document's DTD (for notations
and all types of external entities) and to provide accurate diagnostics, but
it is essential for computing [base URI] properties in the document entity.
The namespace-prefixes feature on XMLReader instances has a
problematic default; set its value to true unless you're comfortable with parsers
hiding [namespace attributes] and [prefix] properties. (In this book, this is
called "mixed mode" namespace support.) SAX2 parsers aren't required to
support setting this feature value to true, but most do. If your parser
doesn't support this, you can re-create prefixes and declarations, but they
normally won't correspond to the original versions. This appendix
assumes you kept the default setting (true) for the namespaces feature
flag.
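The effect of the namespace-prefixes flag can be seen with a sketch like the following, which lists the attribute QNames reported for each element. The class NamespacePrefixesSketch and its attributeQNames() helper are hypothetical; the feature URI is the standard SAX2 identifier, and the JAXP parser is assumed to accept both settings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: shows which attributes a consumer sees.
class NamespacePrefixesSketch {
    static List<String> attributeQNames(String xml, boolean prefixes)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);    // keep the namespaces feature on
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setFeature(
            "http://xml.org/sax/features/namespace-prefixes", prefixes);
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    names.add(atts.getQName(i));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a xmlns:x='urn:x' x:b='1'/>";
        System.out.println("true:  " + attributeQNames(xml, true));
        System.out.println("false: " + attributeQNames(xml, false));
    }
}
```

With the flag set to true, the xmlns:x declaration shows up among the attributes; with it set to false, the declaration is hidden from the Attributes object.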
Some SAX2 XMLReader implementations may not produce all of this
information. Most of today's widely used SAX2 parsers are fully featured, so in
practice this won't be a common problem. However, information provided
through the optional SAX2 extension callbacks DeclHandler or
LexicalHandler might not be available. Similarly, reporting of [base URI]
ingredients through a Locator is also optional.
The SAX2 ErrorHandler exposes some data that is not addressed by the
XML Infoset: validity and well-formedness errors. Exposing such
information is required for parser conformance to the XML 1.0 specification.
Event Consumer Issues
The primary Infoset concern for SAX2 event consumers is to understand
how the stream of events represents the information structures used in the
Infoset. Applications need to track some state if they need access to some
of those structures or random access to anything. It's typical to track only
a few items, and ignore the rest as being incidental background noise.
Streaming processing discards items as soon as possible.
You really shouldn't care, but since the String datatype can't handle more
than two gigabytes of data, and strings are used to pass certain document
data to applications, there's a chance that some documents could cause
trouble by overflowing that limit. If you encounter such a document,
consult a pathologist. There really isn't much you can do about this.
Structural Issues
The [children] properties are arbitrarily sized, ordered sequences of
information items, which are presented in document order by SAX2 event
callbacks. Most other information items are not ordered, such as [notations],
[unparsed entities], and [attributes] properties. Only [children] properties
would need to be stored in order-preserving data structures.
While most information items are provided through a single callback,
some of the more complex ones involve matched, and (except in one
case) cleanly nested, pairs of calls to start() and end() the item. Such
items include the Document itself, its Document Type Declaration,
Elements, and Namespace Information. To track those items, applications
implement some kind of context stack tracking.
The [parent] properties of some information items are implicitly encoded
through such SAX2 nested event reports. Except for items that can be
direct children of the Document or Document Type Information Items,
applications often push stack entries when startElement() is called and
pop them when endElement() is called.
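The push/pop pattern just described can be sketched as follows. The class ParentTracker, its parentsIn() helper, and the "(document)" placeholder for the root's parent are all hypothetical choices made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: recovers each element's [parent] from nesting.
class ParentTracker extends DefaultHandler {
    private final Deque<String> stack = new ArrayDeque<>();
    final List<String> report = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        // The element on top of the stack is the parent of this one.
        String parent = stack.isEmpty() ? "(document)" : stack.peek();
        report.add(qName + " in " + parent);
        stack.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    static List<String> parentsIn(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ParentTracker tracker = new ParentTracker();
        reader.setContentHandler(tracker);
        reader.parse(new InputSource(new StringReader(xml)));
        return tracker.report;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parentsIn("<a><b><c/></b><d/></a>"));
    }
}
```

A real application would usually push a richer per-element object than a name, but the stack discipline is the same.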
The children of Document and Document Type Information Items have
curious restrictions: they don't always match the actual text structure. For
example, information items for notations and unparsed entities are found
in the Document Information Item, but they're textually part of the
Document Type; and comments are stripped out of DTDs. You can use more
natural structures in your applications if the descriptive Infoset structure
seems awkward.
Other complex information items are implicitly decoded from DTD
declarations. To track such items, applications must save declarations during
DTD processing, to ensure that they can be correlated with information in
the body of a document. Examples of such items include [notation]
properties for Unparsed Entities and processing instructions, most properties
for Unexpanded Entity References, and [references] properties of
attributes.
Base URIs, xml:base, and Locator Data
Some information items have a [base URI] property that is computed
according to xml:base rules. Except for two cases, these rules amount to
using Locator.getSystemId() to find the absolute base URI; the producer
needs to provide this information. SAX2 effectively augments every
information item with this information, as well as line and column location
within such entities. (However, applications can cause this information to
be lost if they provide InputSource objects without including those base
URIs as the system IDs.)
The two exceptional cases are for Elements and for processing instructions
within the document element. In these instances, the computation is
complex because xml:base attributes can play a role; it is demonstrated in
Example 5-1. Consumers must be able to invoke Locator.getSystemId()
to get the entity's URI in LexicalHandler.startEntity() when the entity is
shown to be external using DeclHandler.externalEntityDecl(). And they
must also maintain a stack of URIs, augmenting it with xml:base values.
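The URI stack can be sketched in a simplified form as below. This is not Example 5-1: the class BaseUriTracker and its basesFor() helper are hypothetical, and the sketch deliberately ignores external entity boundaries (it never consults startEntity()), so it only covers documents held in a single entity.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: tracks element [base URI] values.
class BaseUriTracker extends DefaultHandler {
    private static final String XML_NS = "http://www.w3.org/XML/1998/namespace";
    private final Deque<URI> bases = new ArrayDeque<>();
    private Locator locator;
    final List<String> report = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        URI current;
        if (bases.isEmpty()) {
            // Root element: start from the document entity's system ID.
            String sys = (locator == null) ? null : locator.getSystemId();
            current = URI.create(sys == null ? "" : sys);
        } else {
            current = bases.peek();
        }
        String xmlBase = atts.getValue(XML_NS, "base");
        if (xmlBase != null) {
            current = current.resolve(xmlBase);   // relative values accumulate
        }
        bases.push(current);
        report.add(qName + " -> " + current);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        bases.pop();
    }

    static List<String> basesFor(String xml, String systemId) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        BaseUriTracker tracker = new BaseUriTracker();
        reader.setContentHandler(tracker);
        InputSource in = new InputSource(new StringReader(xml));
        in.setSystemId(systemId);   // without this, the base URI is lost
        reader.parse(in);
        return tracker.report;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(basesFor(
            "<a><b xml:base='sub/'><c xml:base='x.xml'/></b></a>",
            "http://example.com/docs/root.xml"));
    }
}
```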
Application code should use Locator information to generate meaningful
diagnostics. However, conforming applications will use the URI computed
with xml:base when absolutizing relative URIs found in attribute values,
character data, processing instructions, or (primarily for HTML legacy data
models) comments. Except for the startDTD() call, all system identifiers
reported through SAX are delivered as absolute URIs. An upcoming
extension feature flag will probably let that behavior be changed, so you can
choose whether the parser or the application absolutizes the URIs.
Meanwhile, you should be aware that some SAX parsers have bugs in how they
report such identifiers.
Document Information Item
The Document Information Item is the root of the information found in an
XML document. There is only one such root item.
This information item begins with the ContentHandler.startDocument()
call and ends with the ContentHandler.endDocument() call. Many SAX2
event calls are used to construct its children or constituents.

Property: [children]
Explanation: See the sections for each type of Information Item:
Document Type Declaration (one, if present), Element (one), processing
instruction (possibly many), Comment (possibly many).

Property: [document element]
Explanation: This is the element in the [children] property.

Property: [notations]
Explanation: See the section on Notation Information Items. (Unordered.)

Property: [unparsed entities]
Explanation: See the section on Unparsed Entity Information Items.
(Unordered.)

Property: [base URI]
Callbacks: Locator.getSystemId(), or XMLReader.parse()
Explanation: Locator may be used during the startDocument() callback
(and earlier callbacks, unless they were made in the context of an
external parameter entity). Alternatively, for any parsers that don't
provide a Locator, applications using an XMLReader are responsible for
providing this information (if it exists) to the parse() method. This is
passed directly as the string parameter or indirectly as the systemId
property of an InputSource.

Property: [character encoding scheme]
Callbacks: unavailable; or InputSource.getEncoding()
Explanation: Normally this property is unavailable; it won't affect the
interpretation of character data in Java. However, applications will in
rare cases provide this to the parser when they call
XMLReader.parse(InputSource) to start parsing. It's likely that an
upcoming extension API will provide this information.

Property: [standalone]
Callbacks: XMLReader.getFeature()
Explanation: It's likely that an upcoming extension API will provide this
information using an is-standalone feature flag.

Property: [version]
Callbacks: unavailable
Explanation: You can probably assume the value of this property is "1.0"
for now. It's likely that an upcoming extension API will provide this
information.

Property: [all declarations processed]
Callbacks: ContentHandler.skippedEntity(); LexicalHandler.endDTD()
Explanation: When endDTD() is invoked, the value of this property is
known. If no external parameter entities are reported as skipped, then
the value is true. If the parser doesn't support the lexical handler, then
the later call to startElement() may be used instead of endDTD().

Because text in Java is always accessed using UTF-16 character strings or
arrays, most applications won't need to worry about encoding issues; the
SAX2 parser handles that. However, there are cases when encoding may
matter:
Input normalization
Some recent XML standards require that text be normalized. For
example, XML Canonicalization (as used in digital signature
applications) requires the use of Unicode Normalization Form C; some other
W3C specifications have the same requirement. Text originally
represented in UTF-8 or UTF-16 might need further normalization to
remove some deprecated character codes that can be represented
using those encodings.
Such encoding data is required on a per-entity basis, not a
per-document basis as implied by the Infoset specification. And for internal
entity expansions or defaulted attributes, you'll need to normalize if
the encoding associated with the original definition supported
denormalized text.
Output encoding
When using an output encoding that is not based on the Unicode
character set, you may not be able to represent XML names that use
particular characters. For example, ASCII cannot handle element or
attribute names using accented characters (used in Europe and Latin
America) or using ideographic characters (used in Asia).
The preferred encoding solution is to always use UTF-8 or UTF-16
when outputting XML, so that such problems cannot occur and so that
all XML processors can work with such output. Similar logic applies to
display systems like window systems: prefer font rendering systems
that use Unicode over those tied to some specific encoding.
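For the input normalization case above, one possible approach on a current JDK is java.text.Normalizer, which postdates this text; the sketch below is an assumption about how an application might apply Normalization Form C to character data it has received. The class NfcSketch and its toNfc() helper are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical example class: applies Unicode Normalization Form C.
class NfcSketch {
    static String toNfc(CharSequence text) {
        // Skip the copy when the text is already in Form C.
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text.toString();
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";          // 'e' plus combining acute
        System.out.println(toNfc(decomposed).length());
    }
}
```

Since characters() may split text at arbitrary points, a consumer would generally need to buffer adjacent character data before normalizing it.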
Element Information Items
An Element Information Item holds the most frequently needed data in an
XML document. There is one top-level element, associated with the
Document Information Item, and all but a handful of information items are its
descendants.
This information item starts with a ContentHandler.startElement() call,
and ends with a ContentHandler.endElement() call.

Property: [namespace name]
Callbacks: ContentHandler.startElement(), namespaceURI parameter

Property: [local name]
Callbacks: ContentHandler.startElement(), localName parameter

Property: [prefix]
Callbacks: ContentHandler.startElement(), qName parameter (when
available)
Explanation: The QName (namespace-prefixed name) includes any prefix
available; for example, a QName "xhtml:a" uses the prefix "xhtml".

Property: [children]
Explanation: See the sections for each type of information item: Element,
Processing Instruction, Unexpanded Entity Ref, Character, Comment.

Property: [attributes]
Callbacks: ContentHandler.startElement(), attributes parameter;
DeclHandler.attributeDecl()
Explanation: When the [namespace attributes] property value is
accessible, both groups of attributes are intermixed. Values that are
#IMPLIED, but not specified in the document text, are only visible
through the attributeDecl() callback. If you need to know about such
attributes, record them during DTD processing.

Property: [namespace attributes]
Callbacks: ContentHandler.startElement(), attributes parameter (when
available)
Explanation: If the namespace-prefixes feature flag is true, these
attributes are mixed with the [attributes] property. They're the ones with
QName values of "xmlns", or starting with "xmlns:". (Manually associate
these with the namespace URI http://www.w3.org/2000/xmlns/.) Otherwise,
this data is unavailable.

Property: [in-scope namespaces]
Explanation: See the section on Namespace Information Items.

Property: [base URI]
Callbacks: computed using xml:base
Explanation: In the absence of xml:base attributes, this is normally the
value that Locator.getSystemId() exposes during the startElement()
callback.

Property: [parent]
Explanation: Applications must keep track of this information item if it is
needed.

Attribute Information Items
The Attribute Information Items are the contents of the [attributes]
property in the element information item. Although the attributes are presented
in an order through the Attributes class, there is no expectation that this
order reflects an order in the document or its DTD.

Property: [namespace name]
Callbacks: Attributes.getURI()

Property: [local name]
Callbacks: Attributes.getLocalName()

Property: [prefix]
Callbacks: Attributes.getQName() (when available)
Explanation: The QName (namespace-prefixed name) includes any prefix
available; for example, a QName "xhtml:href" uses the prefix "xhtml".

Property: [normalized value]
Callbacks: Attributes.getValue()
Explanation: If you're generating a stream of Infoset data
programmatically, don't forget to normalize these values correctly. The XML
specification explains how to normalize this text; it mostly translates
whitespace (but not character references) into space characters and
eliminates unneeded spaces for values that aren't CDATA.

Property: [specified]
Callbacks: unavailable
Explanation: SAX2 does not distinguish between attribute values that
were specified in document text and those that have been defaulted from
a DTD. It's likely that an upcoming extension API will provide this
information.

Property: [attribute type]
Callbacks: Attributes.getType(), DeclHandler.attributeDecl()
Explanation: For most types of attribute, getType() gives all the type
data needed, but you may want to distinguish types that are actual
CDATA versus (invalid) ones that just look like CDATA because the
attribute was not declared. Attribute values that are constrained to an
enumerated set are reported with special syntax in attributeDecl()
callbacks. Enumerations use a parenthesized syntax, like (true|false), to
enumerate all possibilities. NOTATION enumerations prepend the string
"NOTATION " (with a space) to that syntax.

Property: [references]
Explanation: For NOTATION type values, see the section on Notation
Information Items. For ENTITY or ENTITIES type values, see the section
on Unparsed Entity Information Items. For IDREF or IDREFS type values,
applications must track attributes by using the [attribute type] IDs
reported as keys to application-specific representations of elements, and
they must be ready to handle forward references. (ENTITIES and IDREFS
values must be tokenized by the application.)

Property: [owner element]
Explanation: Attributes are associated with the element signified by the
startElement() call providing the Attributes object.

Note that DOM extends this information item to expose entities (expanded
or not) within attribute values. That is not widely believed to be a useful
feature. Since SAX doesn't extend the Infoset in that way, you can't
implement that part of DOM using pure SAX.
Processing Instruction Information Items
Processing instructions (PIs) are used within XML documents to capture
information that doesn't necessarily fit into the nested structure found
elsewhere. Such data doesn't need to relate to processing tasks, although
that's one historical use for such constructs.

Property: [target]
Callbacks: ContentHandler.processingInstruction(), target parameter

Property: [content]
Callbacks: ContentHandler.processingInstruction(), data parameter

Property: [base URI]
Callbacks: computed using xml:base
Explanation: In the absence of xml:base attributes, this property is
normally the value that Locator.getSystemId() exposes during the
processingInstruction() callback.

Property: [notation]
Explanation: See the section on Notation Information Items. Tracking
notations is the responsibility of applications.

Property: [parent]
Explanation: When startElement() is invoked with no matching
endElement(), the parent is the current element. Between calls to
LexicalHandler.startDTD() and LexicalHandler.endDTD(), the parent is
the Document Type Declaration. Otherwise, the document itself is the
parent.

Some applications use a convention that PI target names are matched
against notation declarations, and the notation's public (or system) IDs are
used to deduce the meaning behind a given PI. For example, such an ID
might indicate a particular tool to use on receipt of a document
(preferably redirecting through a table to facilitate useful security constraints).
This is purely a convention, but it's recognized by the XML specification.
It is not an XML error if such notations are undeclared. Moreover, PIs can
precede notation declarations in the DTD.
If the SAX2 implementation doesn't support the LexicalHandler, then
there is no way to determine whether processing instructions are part of
the DTD or a part of another section of the document prologue.
3 January 2002 10:06
212 Appendix B: SAX2 and the XML Infoset
Unexpanded Entity Reference
Infor mation Items
For any nonvalidating XML parser that doesnt read all external entities
possibly because it was congured not to do so or because it didnt
choose to implement that featurethe XML specication says you need to
be able to indicate when an entity that would normally be parsed wasnt
actually processed. These unexpanded entities are not the same as
unparsed entities, although neither kind of entity gets parsed.
The XML Infoset describes some information that can be made available in
one of those cases: when the entity was an external general entity. For
exter nal parameter entities, the Infoset is silent beyond dening a docu-
ment information item property to expose whether all declarations have
been processed; no declarations are exposed.
Property: [name]
Callbacks: ContentHandler.skippedEntity(), name parameter
Explanation: SAX2 makes this callback for all entities that have been
  skipped, including parameter and internal entities.

Property: [system identifier]
Callbacks: DeclHandler.externalEntityDecl(), systemId parameter
Explanation: If [all declarations processed] is false, this information
  may be unavailable. Otherwise, the application must have recorded this
  information for later use. Note that SAX parsers absolutize this
  property against the appropriate base URI before reporting it. However,
  some parsers have a bug here, and don't absolutize this URI.

Property: [public identifier]
Callbacks: DeclHandler.externalEntityDecl(), publicId parameter
Explanation: If [all declarations processed] is false, this information
  may be unavailable. Otherwise, the application must have recorded this
  information for later use.

Property: [declaration base URI]
Callbacks: Locator.getSystemId()
Explanation: If [all declarations processed] is false, this information
  may be unavailable. Otherwise, the application must have recorded this
  information for later use, when this entity was reported through a
  DeclHandler.externalEntityDecl() callback. (xml:base does not apply.)

Property: [parent]
Explanation: Applications must keep track of this information item if it
  is needed.
SAX2 effectively defines new types of information items for internal and
external entities. (So does DOM Level 1.) The XML Infoset doesn't expose
such entities except for this one case (for external entities), but
applications may use those extension information items for other purposes
if appropriate.
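Recording declarations "for later use" might look like the sketch below. It is an assumption-laden illustration rather than a prescribed design: the class name SkippedEntityRecorder and its output format are invented here, and DefaultHandler2 is assumed to be available from a later SAX2 extensions release.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: remembers external entity declarations so that a
// later skippedEntity() callback can be matched to its identifiers.
class SkippedEntityRecorder extends DefaultHandler2 {
    // entity name -> { publicId, systemId }, exactly as the parser reported
    private final Map<String, String[]> declared = new HashMap<>();
    final List<String> skipped = new ArrayList<>();

    @Override public void externalEntityDecl(String name, String publicId,
                                             String systemId) {
        declared.put(name, new String[] { publicId, systemId });
    }

    @Override public void skippedEntity(String name) {
        // Per the SAX2 documentation, parameter entity names start with
        // '%' and the external subset is reported as "[dtd]".
        String[] ids = declared.get(name);
        if (ids == null) {
            // No declaration was seen, for example because it lives in an
            // external subset that was itself skipped.
            skipped.add(name + " -> (declaration unavailable)");
        } else {
            skipped.add(name + " -> public=" + ids[0] + " system=" + ids[1]);
        }
    }
}
```

To use it with a parser, register the same object as the ContentHandler and as the value of the http://xml.org/sax/properties/declaration-handler property, so that both callbacks reach it.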
Character Information Items
Along with element and attribute information items, characters are one of
the core types of information used by XML applications. SAX2 reports
characters in groups, rather than one at a time.
Property: [character code]
Callbacks: ContentHandler.characters(),
  ContentHandler.ignorableWhitespace()
Explanation: These calls provide one or more characters in the UTF-16
  encoding. Normally, each Java char is a single [character code], but
  surrogate pairs are used to encode characters from the Astral Planes,
  which don't fit into 16 bits. (No whitespace characters need surrogate
  pairs.)

Property: [element content whitespace]
Explanation: When known, this Boolean property is encoded by using the
  ignorableWhitespace() callback instead of characters(). Most SAX
  parsers report this property even when they aren't validating, though
  that's not required. (If any external parameter entities are skipped,
  it is not possible to reliably provide this information.)

Property: [parent]
Explanation: Applications must keep track of this information item if it
  is needed.
SAX2 permits reporting of a character property that the XML Infoset
doesn't address: whether the characters are in a CDATA section. (DOM
requires this information.) Such section boundaries are reported using
methods in the LexicalHandler class.
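Here is a small sketch of how an application might combine these callbacks. The class CharacterCollector and its members are hypothetical names chosen for this example; it assumes a JAXP parser that supports the lexical-handler property and the DefaultHandler2 class from later SAX2 extension releases.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: gathers character data and notes how much of it
// was reported inside CDATA sections.
class CharacterCollector extends DefaultHandler2 {
    final StringBuilder text = new StringBuilder();
    int cdataChars = 0;             // Java chars seen inside CDATA sections
    private boolean inCdata = false;

    @Override public void startCDATA() { inCdata = true; }
    @Override public void endCDATA()   { inCdata = false; }

    @Override public void characters(char[] ch, int start, int length) {
        // Characters arrive in groups, and one run of text may be split
        // across several calls, so always append.
        text.append(ch, start, length);
        if (inCdata) {
            cdataChars += length;
        }
    }

    // Number of [character code] values: a surrogate pair counts once.
    int codePoints() {
        return text.codePointCount(0, text.length());
    }

    static CharacterCollector collect(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CharacterCollector collector = new CharacterCollector();
            reader.setContentHandler(collector);
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", collector);
            reader.parse(new InputSource(new StringReader(xml)));
            return collector;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For text containing a character outside the 16-bit range, text.length() is larger than codePoints(), which is the difference between Java char values and Infoset character information items.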
Comment Information Items
Comments are intended for human consumption; processing instructions
are reserved for application data. The main curiosity here is that the
Infoset doesn't believe in comments within DTDs, perhaps on the grounds
that they'd need to be associated with the declarations they describe.
(Some DTD documentation tools rely on magic comment syntax, much
like javadoc.)
Property: [content]
Callbacks: LexicalHandler.comment()
Explanation: The characters identified in this callback are the contents
  of the comment.

Property: [parent]
Explanation: When startElement() is invoked with no matching
  endElement(), the parent is the current element. The Infoset ignores
  comments reported between calls to LexicalHandler.startDTD() and
  LexicalHandler.endDTD(). Otherwise, the document itself is the parent.
Some legacy applications use comments to represent the sort of
information that processing instructions were designed to hold; an
example is wrapping of CSS rendering hints in HTML comments.
Document Type Declaration Information Item
This is a curious item in the Infoset, because it doesn't expose all the
DTD information. In particular, it doesn't include any declarations
(including the expected root element name) or comments in DTDs.
This information item starts with a LexicalHandler.startDTD() call and
ends with a LexicalHandler.endDTD() call.
Property: [system identifier]
Callbacks: LexicalHandler.startDTD(), systemId parameter
Explanation: If the DTD includes an external subset, this is its system
  identifier. Note that this URI is not absolutized.

Property: [public identifier]
Callbacks: LexicalHandler.startDTD(), publicId parameter
Explanation: External subsets are not required to have public
  identifiers. When provided, this value is normalized.

Property: [children]
Explanation: See the section on Processing Instruction Information Items.
  Comments within DTDs are not part of the Infoset, and the few
  declarations that are included (notations and unparsed entities) are
  separated from the DTD.

Property: [parent]
Explanation: This is the Document Information Item.
SAX2 exposes more information than the Infoset describes, though
somewhat less than XML allows. Comments may be reported using the
LexicalHandler. Element and attribute declarations, as well as external
and internal entity declarations, may be reported using the DeclHandler.
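As a rough sketch of collecting this extra information, the handler below records the startDTD() parameters and the names of declared elements. DoctypeRecorder and its fields are names made up for this example; it assumes a JAXP parser that supports both extension handler properties, and it supplies an empty external subset through an EntityResolver only so that the example never touches the network.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: captures what SAX2 reports about the DTD beyond
// what the Infoset's document type declaration item exposes.
class DoctypeRecorder extends DefaultHandler2 {
    String doctype;                               // "name|publicId|systemId"
    final List<String> elementDecls = new ArrayList<>();

    // LexicalHandler: root element name plus external subset identifiers.
    @Override public void startDTD(String name, String publicId, String systemId) {
        doctype = name + "|" + publicId + "|" + systemId;
    }

    // DeclHandler: one call per element type declaration.
    @Override public void elementDecl(String name, String model) {
        elementDecls.add(name);
    }

    static DoctypeRecorder record(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DoctypeRecorder recorder = new DoctypeRecorder();
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", recorder);
            reader.setProperty(
                "http://xml.org/sax/properties/declaration-handler", recorder);
            // Assumption for this sketch only: answer every external
            // entity request with empty text instead of fetching it.
            reader.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader("")));
            reader.parse(new InputSource(new StringReader(xml)));
            return recorder;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

When a document has only an internal subset, both identifiers passed to startDTD() are null, so the recorded string ends in "null|null".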
Unparsed Entity Information Items
When unparsed entities are used, these information items are normally
saved by applications during DTD processing (keyed by entity name) and
then accessed on demand. Unparsed entities are used only with attribute
values of type ENTITY or ENTITIES.
Property: [name]
Callbacks: DTDHandler.unparsedEntityDecl(), name parameter

Property: [system identifier]
Callbacks: DTDHandler.unparsedEntityDecl(), systemId parameter
Explanation: This ID should be absolutized by the parser. However, some
  parsers have a bug here and don't absolutize this URI.

Property: [public identifier]
Callbacks: DTDHandler.unparsedEntityDecl(), publicId parameter
Explanation: Unparsed entities are not required to have public
  identifiers. When provided, this value is normalized.

Property: [declaration base URI]
Callbacks: Locator.getSystemId()
Explanation: If a SAX parser provides a Locator, it may be used to
  determine the current base URI during parser callbacks. (xml:base does
  not apply.)

Property: [notation name]
Callbacks: DTDHandler.unparsedEntityDecl(), notationName parameter

Property: [notation]
Explanation: See the section on Notation Information Items. Locating
  notations is the responsibility of applications. It's best not to try
  accessing this property until all declarations have been processed.
Notation Information Items
When notations are used, these information items are normally saved by
applications during DTD processing (keyed by notation name) and then
accessed on demand. Notations are used with NOTATION attributes (at
most one per element) or with unparsed entities, and perhaps with
processing instruction target names.
Property: [name]
Callbacks: DTDHandler.notationDecl(), name parameter

Property: [system identifier]
Callbacks: DTDHandler.notationDecl(), systemId parameter
Explanation: Notations are not required to have system identifiers if
  they have a public identifier. This ID should be absolutized by the
  parser. However, some parsers have a bug here and don't absolutize this
  URI, although, because of an issue with early versions of the SAX1 and
  SAX2 specifications, some parsers might absolutize such URIs.

Property: [public identifier]
Callbacks: DTDHandler.notationDecl(), publicId parameter
Explanation: Notations are not required to have public identifiers if
  they have a system identifier. When provided, this value is normalized.

Property: [declaration base URI]
Callbacks: Locator.getSystemId()
Explanation: If a SAX event producer provides a Locator, it can be used
  to determine the current base URI during parser callbacks. (xml:base
  does not apply.)
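Saving notations and unparsed entities by name, then resolving them on demand, could be sketched like this. UnparsedEntityRegistry, load(), and describe() are hypothetical names for this example, and a JAXP parser is assumed. Lookups are deferred until parsing is done because declarations may arrive in any order.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps DTDHandler declarations keyed by name.
class UnparsedEntityRegistry extends DefaultHandler {
    // notation name -> { publicId, systemId }
    private final Map<String, String[]> notations = new HashMap<>();
    // entity name -> { publicId, systemId, notationName }
    private final Map<String, String[]> entities = new HashMap<>();

    @Override public void notationDecl(String name, String publicId,
                                       String systemId) {
        notations.put(name, new String[] { publicId, systemId });
    }

    @Override public void unparsedEntityDecl(String name, String publicId,
                                             String systemId, String notationName) {
        entities.put(name, new String[] { publicId, systemId, notationName });
    }

    // Resolves an ENTITY attribute value. Call this only after all
    // declarations have been processed; returns null for unknown names.
    String describe(String entityName) {
        String[] entity = entities.get(entityName);
        if (entity == null) {
            return null;
        }
        String[] notation = notations.get(entity[2]);
        String notationId = (notation == null) ? "undeclared"
            : (notation[0] != null ? notation[0] : notation[1]);
        return entityName + " -> " + entity[1]
            + " [" + entity[2] + ": " + notationId + "]";
    }

    static UnparsedEntityRegistry load(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            UnparsedEntityRegistry registry = new UnparsedEntityRegistry();
            reader.setDTDHandler(registry);
            reader.parse(new InputSource(new StringReader(xml)));
            return registry;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping the notation lookup lazy means the registry still works when an entity declaration names a notation that is declared later in the DTD, or never declared at all.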
Namespace Information Items
These information items expose namespace identifiers and the prefixes
currently used to associate element or attribute names with those
identifiers. With SAX2, applications that track these prefixes need to
use a stack to handle the lexical scoping rules: in the context of one
element and its children, a prefix may indicate a different namespace
than in parent elements because of a locally scoped redefinition. You can
use the NamespaceSupport helper class to manage this stack or write
something of your own.
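The scoping behavior can be seen by driving NamespaceSupport directly. In this sketch, the class PrefixScopeDemo and the urn:example URIs are invented for illustration; in a real handler the pushContext(), declarePrefix(), and popContext() calls would be made from the prefix mapping and element callbacks.

```java
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical demonstration of prefix scoping with NamespaceSupport.
class PrefixScopeDemo {
    // Returns the namespace URI bound to "x:item" in the outer element,
    // in a child that redeclares the prefix, and after the child ends.
    static String[] resolve() {
        NamespaceSupport ns = new NamespaceSupport();
        String[] parts = new String[3];   // { URI, local name, qName }

        ns.pushContext();                           // outer element starts
        ns.declarePrefix("x", "urn:example:one");   // xmlns:x on outer
        String outer = ns.processName("x:item", parts, false)[0];

        ns.pushContext();                           // child element starts
        ns.declarePrefix("x", "urn:example:two");   // local redefinition
        String inner = ns.processName("x:item", parts, false)[0];
        ns.popContext();                            // child element ends

        // Back in the outer scope, the original binding is visible again.
        String restored = ns.processName("x:item", parts, false)[0];
        ns.popContext();                            // outer element ends

        return new String[] { outer, inner, restored };
    }
}
```

Within each context, declare prefixes before processing any names, which matches the order in which a parser reports prefix mappings ahead of the element that carries them.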
These information items start with a ContentHandler.startPrefixMapping()
call and end with a ContentHandler.endPrefixMapping() call. These are the
only two start/end calls that SAX2 doesn't require to be cleanly nested.
Alternatively, if the namespaces feature flag is false, this information
can be reconstructed from the xmlns and xmlns:* element attributes.
Property: [prefix]
Callbacks: ContentHandler.startPrefixMapping(), prefix parameter

Property: [namespace name]
Callbacks: ContentHandler.startPrefixMapping(), uri parameter
Explanation: Since these values aren't dereferenced, they are exactly as
  provided in the XML source text. Don't assume dereferencing such URIs
  lets you do anything useful.
If the namespaces feature is set to false (its default is true) this
information is not made available except implicitly through the element
[attributes] property, which will implicitly include all [namespace
attributes]. (It is illegal to set namespaces to false without setting
namespace-prefixes to true.)