Generic XML Stream Parser API: An Easier Way to Use SAX and Xerces

Ian E. Gorman
iegorman@sourceforge.net

Abstract

Event-based Java parsers, like SAX and Xerces, are extended to a more powerful state-based XML stream parser API. Resource use is not much greater than for SAX, and application development is not much more difficult than with DOM and XSLT. The new API offers a good compromise in those applications where low resource requirements and ease-of-use are each a sufficiently important goal to justify a small sacrifice in achievement of the other goal.

Keywords: DOM; SAX; Parsing; API; Java

Ian E. Gorman

Ian is a software developer who occasionally writes about his work.

Generic XML Stream Parser API

An Easier Way to Use SAX and Xerces

Ian E. Gorman [none]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004, Ian E. Gorman. Reproduced with permission.

Introduction

Java developers use two kinds of parsers, DOM parsers that produce a memory-based tree containing the full document, and SAX parsers that report a series of "events" as the document is scanned, but without retaining any information. DOM parsers are easier to work with because they present the data in a way that more closely matches a designer's view of XML, and because they are supported by powerful tools like XSLT. On the other hand, SAX parsers can handle documents of unlimited size, and can parse a document faster than a DOM parser.

This paper presents an XML stream parser API that can be implemented over a SAX parser, making the SAX parser easier to use without greatly reducing the advantages of a SAX parser. The new API brings some of the advantages of the DOM parser, by presenting a data in a way that more closely follows a designer's view of an XML document, and by adding some supporting tools to make design easier.

    The paper comprises four sections:
  • Background material describing relevant features of DOM and SAX parsers, with a comparison to the XML Stream parser API
  • Detailed description of the XML stream parser API
  • How the XML stream parser API is implemented
  • A brief conclusion

Background

Java and XML represent reality in different ways. This section illustrates how DOM parsers and SAX parsers support Java representations of XML documents and why the difference between the Java representations and the XML representation sometimes causes problems.

XML Representation of Document

XML markup identifies the components of a document by inserting structural information, like tags, between or around parts of the actual data in the document. Some of the components, like character entities, are "atomic" and can be represented as simple objects that combine markup and data. Other components, like elements or CDATA sections, are not atomic. Those components are represented as complex objects, consisting of boundaries (start and end) and content. The content may contain other components, which can be either simple or complex

XML markup requires a document to be "well-formed", in the sense that the document has a hierarchy of complete components and, for any two components, one component contains the other or the components do not overlap at all. Current SAX and DOM parsers (and the “XML Stream Parser API” presented by this paper) recognize, use, and depend on the well-formed property of XML documents.

DOM Parser

DOM parsers load XML components into memory as a tree of Java objects, beginning with org.w3c.xml.Document as the root. Each XML component, whether simple or complex, is mapped to a single Java object as shown in Table 1.

Table 1: XML Components Mapped to DOM Classes
XML Component DOM Class
attribute Attr
CDATA section CDATASection
comment Comment
document Document
document type name DocumentType
element Element
entity Entity
notation Notation
processing instruction ProcessingInstruction
character data Text

The Java representation and the XML representation of the document differ in a significant way. In a book, titles will be handled differently for the book, chapters, and sections. In XML the <title> element will appear as a child of <book>, <chapter>, and <section> elements. The <title> element is contained in its parent element as a part of that parent element. In Java, the parent object identifies its children, but does not contain them. DOM parsers add additional information to the Java objects, so that the parent, child, predecessor, or successor of any element is easily identified.

DOM parsers store the entire document in memory before any of the application code is executed. This is both a strength and a weakness. The memory storage supports random access to all parts of the document, but very large documents may occupy too much memory, and no output can be produced until after reaching the end of the document.

DOM application developers can write code that reflects their understanding of the document and of XML, instead of code that requires them to know the (possibly esoteric) internal details of the tools they use. Application developers do not even need to know exactly how the document tree is maintained in memory, because they can use XPATH to describe the data they want, and to operate on the data they request. Using XPATH, a developer can describe an transformation in an XSLT file and then, in three simple operations, load the input document, load the XSLT file, and apply the XSLT file to the input document to produce the required output.

SAX Parser

SAX parsers scan the document once and invoke event methods to as they encounter each atomic XML components and each boundary of a large XML component, as shown in Table 2.

Table 2: XML Components that Generate SAX Events
XML Component SAX Event
attribute none1
CDATA section LexicalHandler.startCDATA
LexicalHandler.endCDATA
comment LexicalHandler.comment
document ContentHandler.startDocument
ContentHandler.endDocument
document type name none2
DTD LexicalHandler.startDTD
LexicalHandler.endDTD
element ContentHandler.startElement
ContentHandler.endElement
entity LexicalHandler.startEntity
LexicalHandler.endEntity
notation DTDHandler.notationDecl
processing instruction ContentHandler.processingInstruction
character data ContentHandler.characters
  1. Attributes are supplied with the element and are available only during the startElement event
  2. Document type name is supplied with the DTD and is available only during the startDTD event

Information from the SAX parser is available only during the event that presents the information. Consequently, documents that require simple processing can be of unlimited size, because memory resources depend only on the activity needed to translate one stream of input data to a second stream of output data. Moreover, output can begin as soon as enough of the input has been read to provide data for the beginning of output. However, more complex processing will require extensive work by the application designer, who must work without the range of tools that would be available for use with a DOM parser.

The events that a SAX parser produces from an XML document are not very close to the view that an application designer has of the document. The application designer thinks in terms of whole elements that contain other elements. When the SAX parser encounters a hierarchy of nested elements, it will first report the start of each element and then, in reverse order, the end of each element. The application must construct a view of the elements (and their relationship) from these events. Moreover, the application design is extremely vulnerable to changes in document structure, because the application view of the structure will almost certainly require change when the DTD changes.

Limitations of Current Parsers

DOM parsers and SAX parsers are limited in the sense that each kind lacks some of the strengths held by the other. Table 3 illustrates some of the differences between DOM parser API and the SAX parser API.

Table 3: Comparison of DOM and SAX Parser APIs
XML Component DOM Class SAX Event
attribute Attr none1
CDATA section CDATASection LexicalHandler.startCDATA
LexicalHandler.endCDATA
comment Comment LexicalHandler.comment
document Document ContentHandler.startDocument
ContentHandler.endDocument
document type name DocumentType none2
DTD none3 LexicalHandler.startDTD
LexicalHandler.endDTD
element Element ContentHandler.startElement
ContentHandler.endElement
entity Entity LexicalHandler.startEntity
LexicalHandler.endEntity
notation Notation DTDHandler.notationDecl
processing instruction ProcessingInstruction ContentHandler.processingInstruction
character data Text ContentHandler.characters
  1. Attributes are supplied with the element and are available only during the startElement event
  2. Document type name is supplied with the DTD and is available only during the startDTD event
  3. DOM parsers discard all of the input after parsing the input document into a memory-based tree.

Table 3 shows that the DOM parser API is closer than the SAX parser API to representing the application designer's view of an XML document. The DOM parser API represents CDATA sections, elements, entities, and even the entire document as single objects. The SAX parser API reports, as separate events, the entry to, and exit from each of these XML components.

Table 3 does not show some other differences:

  1. DOM constructs a document tree that can supply much useful information, like the predecessors, successors, parents, and children of each element.
  2. DOM provides access to all of the document at once, which is very useful for handling ID/IDREF attributes.
  3. DOM application developers can use tools like XSLT and XPATH that allow them to focus on XML objects rather than on finding ways to make their tools operate on XML objects.
  4. SAX is open-ended in the size of a document that can be processed.
  5. SAX produces transient data, valid only within the event that presents the data.

Although DOM and SAX parsers have proven to be good tools, they will be found wanting in situations that require a combination of the strengths from both DOM and SAX. Consequently, there is room for a stream parser that offers a compromise between SAX and DOM.

A Third Alternative: the XML Stream Parser

The XML Stream Parser API (Section “XML Stream Parser API”) offers an alternative midway between DOM and SAX in terms of speed, ease of use, and resource requirements. The differences between the three alternatives can be illustrated with a simple application that processes the simple document in Figure 1 to produce the output in Figure 2. The document and the application requirements are designed to create more difficulty in using a SAX parser than in using a DOM parser. The application is required to print the titles of sections without printing the document title or any other text. To do this, the application must recognize the document structure, which is shown in Figure 3.

Figure 1: A Simple Document Example
            <?xml version='1.0' encoding='us-ascii'?>
            <!DOCTYPE document  [
            <!ELEMENT document (title, section*)>
            <!ELEMENT title (#PCDATA)>
            <!ELEMENT section (title, para*)>
            <!ELEMENT para (#PCDATA)>
            <!ATTLIST section>
            ]>
            <document>
              <title>Document</title>
            <section>
              <title>Section One</title>
            </section>
            <section>
              <title>Section Two</title>
              <para>Some text</para>
            </section>
            </document>
          
Figure 2: Required Output from the Simple Document
            Section One
            Section Two
          
Figure 3: Document Tree for the Simple Document
                         document
                        /   |   \
                       /    |    \
                  title  section  section
                 /       /        |      \
                /       /         |       \
         #PCDATA     title      title      para
                       |          |          |
                       |          |          |
                    #PCDATA    #PCDATA    #PCDATA
          

Figure 4 shows how the document can be parsed in Java using a SAX parser. The document is parsed very quickly, but the logic for recognizing the document structure is scattered over several different event methods. That programming style introduces a kind of complexity that increases very rapidly as the size of the application increases. Moreover, the program is likely to be very sensitive to changes in a DTD because the program is, in effect, duplicating information in the DTD, and the information in both places must be consistent.

Figure 4: Parsing the Simple Document with SAX
Java code:

        saxParser.parse(inputSource, this);
        writer.close();

        public void startElement(String uri, String localName, String qName,
                                Attributes attributes)
                throws SAXException
        {
            if (qName.equals("section")) {
                isOutputEnabled = true;
            }
        }
    
        public void endElement(String uri, String localName, String qName)
                throws SAXException
        {
            try {
                if (qName.equals("section")) {
                    isOutputEnabled = false;
                } else if (qName.equals("title") && isOutputEnabled) {
                    writer.write(lineSeparator);
                    isOutputEnabled = false;
                }
            } catch(IOException ex) {
                throw new SAXException(ex);
            }
        }
    
        public void characters(char[] ch, int start, int length)
                throws SAXException
        {
            try {
                if (isOutputEnabled)  writer.write(ch, start, length);
            } catch(IOException ex) {
                throw new SAXException(ex);
            }
        }

Output:

    >java SaxParse ../input.xml
    Section One
    Section Two
    Elapsed time: 7 milliseconds
          

Figure 5 shows how the document can be parsed in Java using a DOM parser and an XSLT stylesheet. Structural information appears in only one place, as the "section/title" match condition in one template and the structural information is directly related to the application requirement. Other information, like the parentage of an element, is available when needed.

Figure 5: Parsing the Simple Document with DOM and XSLT
Java code:

        StreamSource stylesource
                = new StreamSource(new File(arg[0]));
        Transformer transformer
                = transformerFactory.newTransformer(stylesource);
        document = documentBuilder.parse(new File(arg[1]));
        transformer.transform(new DOMSource(document),
                              new StreamResult(System.out));

XSLT stylesheet:

        <?xml version="1.0"?>
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="text"/>
          <xsl:template match="*">
            <xsl:apply-templates select="*"/>
          </xsl:template>
          <xsl:template match="section/title">
            <xsl:apply-templates select="text()"/>
            <xsl:text>
        </xsl:text>
          </xsl:template>
        </xsl:stylesheet>

Output:

    >java DomParse stylesheet.xml ../input.xml
    Section One
    Section Two
    Elapsed time (stylesheet+parse+transform): 91 milliseconds
    Elapsed time (parse+transform): 11 milliseconds
          

Figure 6 shows how the document can be parsed in Java using the XML Stream Parser API (Section “XML Stream Parser API”. As with the XSLT stylesheet, structural information appears in only one place, in the element_title method, where output is enabled only for titles that have a section as parent. The logic for handling an element is entirely within a single method that handles that element. Information is neither as transient as with SAX, nor as long-lived as with DOM. For example, the attributes of any element remain available on a stack until the end of that element, not disappearing with the start tag, like it does with SAX, nor remaining available until the end of processing, like it does with DOM.

Figure 6: Parsing the Simple Document with Stream Parser API
Java code:

        xmlParser.parse(input);
        writerStack.close();

    public void characters(ParseReader reader)
            throws ListenerException, IOException {
        reader.writeTo(writerStack);
    }

    public void doDocument(CurrentDocument document) {
        try {
            writerStack.push(nullOutputWriter); // disable output                    
            document.parseContent();            // parse
            writerStack.pop();                  // re-enable output
            writerStack.close();                // cleanup
        } catch(Exception e) {
            e.printStackTrace();
        }
    }

    public void defaultElement(CurrentElement element)
            throws ListenerException, IOException {
        element.parseContent();
    }

    public void element_title(CurrentElement element)
            throws ListenerException, IOException {
        if (element.getParent().getQName().equals(SECTION)) {
            Writer writer = writerStack.pop();  // pop nullOutputWriter
            try {
                element.parseContent();     // parse with output enabled
                writerStack.write(lineSeparator);
            } finally {
                writerStack.push(writer);   // push nullOutputWriter
            }
        } else {
            element.parseContent();     // parse with output disabled
        }
    }

Output:

    >java -cp ~/j/lib/gxparse.jar:. GxParse ../input.xml
    Section One
    Section Two
    Elapsed time (including reflection): 12 milliseconds
    Elapsed time (excluding reflection): 11 milliseconds
          

An essential difference between the XML Stream parser API and the SAX and DOM APIs is that the Stream Parser API is flow-oriented and state-based. The flow of data is controlled by changing state as elements are entered and exited. In Figure 6, the characters method simply transfers characters from the input stream to an output stream. The output stream is actually a stack of output streams, and the actual destination is controlled by element methods that push and pop output streams.

The doDocument method, which is invoked before any element method, disables output by pushing a NullOutputWriter on the stack. When a title element is encountered, the element_title method is invoked. Upon entry, the method checks to see if the element is a child of a section element. If so, the NullOutputWriter is popped to enable output, the content of the element is parsed, and the NullOutputWriter is pushed again to disable output. Otherwise, output remains disabled while the element content is parsed.

The difference in the Stream Parser API supports a programming style that can be very convenient, even though the style may initially seem strange to a SAX or DOM developer. That programming style depends on both explicit and implicit use of stacks to save and restore state [Knuth-1973]. The programming style views XML processing as an activity on a serial stream of data, with processing controlled indirectly by changes of state that affect the output process, rather than by allowing the output process itself to make input-based decisions about what to do. Element methods exercise this control by pushing original state and modifying current state when an element is entered, and popping the original state when an element is exited. A change of state can be done, as in Figure 6 by actually pushing and popping objects on a stack, or the change of state can be done by maintaining data in local variables that the method discards upon exit. In this programming style, an element method is a good "object" for representing an element, because

  • The method corresponds directly to the activity of parsing the element
  • The nested invocation of methods corresponds to the hierarchy of elements in the document
  • The nesting of processing states created by the nested invocation of methods corresponds to the nesting of semantic interpretations implied by the hierarchy of elements and their attributes

If we can accept the notion of a method as an "object", the XML Stream Parser API can be considered to be an object-oriented approache to processing XML documents.

A Brief Comparison of Alternatives

The author was surprised to find that, at least in some circumstances, DOM is not much slower than SAX, and is just as fast as the Stream Parser API.

The execution times in the examples were obtained by preloading all classes to eliminate time consumed by the Java class loader. Following the preload, the DOM and Stream Parser API examples ran the parser twice, first to get processing times including one-time startup costs and then to get processing times excluding the startup costs. Startup cost for DOM with XLST is compilation of the XSLT stylesheet which only needs to be done once for any number of documents that require the same processing. Startup cost for the Stream Parser is looking up element methods by Java reflection, which only needs to be done the first time the element is encountered.

Table 4 summarizes the differences in processing time for the simple document example (Figure 1. After deducting startup costs, DOM took fifty percent longer than SAX, and took the same time as the XML Stream Parser API:

Table 4: Comparison of API Processing Times on the Simple Example
API Used Startup Costs Repetitive Processing One-time Processing
DOM (Figure 5) 80 mSec. 11 mSec. 91 mSec.
SAX (Figure 4) 0 mSec. 7 mSec. 7 mSec.
Stream Parser API (Figure 6) 1 mSec. 11 mSec. 12 mSec.

In repetitive processing, DOM seems to be as fast as the Stream Parser API, and SAX does not have enough extra speed to justify its greater development cost in many applications. In non-repetitive processing, DOM is significantly slower than SAX or the Stream Parser API, and the Stream Parser API does not lose much of its speed relative to SAX.

Class loading times were not measured precisely, but were about 120 mSec for DOM and about 150 mSec for SAX and the Stream Parser API.

These results indicate that, for repetitive processing, the Stream Parser API would need to be justified on some ground other than speed of processing. The justification is unlikely to be the programming style, because the Stream Parser API, while comparable to XSLT in ease-of-use, does not offer some of the features (like access to sibling elements) provided by XSLT.

The Stream Parser API becomes a good alternative when processing non-repetitive input, or when processing large documents. For non-repetitive processing, the Stream Parser API may offer speed advantages over DOM due to lower startup costs. For large documents, the Stream Parser API offers advantages over DOM because memory requirements are not much greater than for SAX. In both cases the Stream Parser API offers a programming style that, like XSLT, is oriented toward a designer's view of XML documents. In these circumstances, designers can obtain a good compromise between the low resource requirements of SAX and the ease-of-use provided by XSLT.

XML Stream Parser API

This section describes a stream parser API that can be implemented on event-based parsers like the SAX parser. The new API is implemented over an existing parser because event-based parsers are already doing part of the essential work of stream parsing. They recognize XML structure, finding and reporting the boundaries of XML components like elements and sequences of character data. All that needs to be added is the reporting of these components as entire objects, retention of important data (like the attributes and the parents of an element), easy access to the retained data, and an easier programming style.

XML Stream Parser Operation

DOM parsers and SAX parsers treat the incoming stream in different ways. DOM parsers view a document as a static object to be reconstructed from a serial stream. SAX parsers view a document as a stream to be translated during reception, where the precise nature of the translation is controlled and dynamically modified by the embedded markup. Because of this, a SAX parser provides no information at all about any part of the document that has already gone by. A SAX parser does not report an XML element as a single object, but reports the start and end of an element as two separate events, leaving it to the application to perceive a relationship between the two events. DOM parsers, on the other hand, reconstruct the document as a tree of static objects, which are represented in Java as class instances. Although the Java objects do not precisely represent corresponding XML objects, they are close enough to support XPATH syntax, which does represent XML objects well. Consequently, it is much easier to develop an XML application with DOm than with SAX, whenever the resource requirements and processing times of DOM are acceptable.

SAX parsers are used when it is more important to minimize the use of resources than it is to work easily with the XML document structure. DOM parsers are used when it is more important to have convenient access to the XML structure. A stream parser will be easier to use if it offers access to XML structure as conveniently as the DOM parser.

A DOM parser, receiving the simple document example (Figure 1), will reconstruct the hierarchy as a tree that provides access to all parts of the document and its structural information (see Figure 3).

A SAX parser, receiving the same document, will produce a sequence of events, interspersed with character data. If the application needs to recognize structural information, such as elements and the nesting of elements, that information must be reconstructed by the application, because the information is not supplied by the SAX parser. Figure 7 shows how an XML application would use SAX events from parsing the simple document example (Figure 1) to recognize the structure of that document (Figure 3).

Figure 7: SAX Events from the Simple Document Example

        SAX Event                         XML Element

        startElement(document)  --------------------------------+ document
                                                                |
        startElement(title)     ------------+ title             |
                                            |                   |
        characters()            -- #PCDATA  |                   |
                                            |                   |
        endElement(title)       ------------+                   |
                                                                |
        startElement(section)   ---------------------+ section  |
                                                     |          |
        startElement(title)     ------------+ title  |          |
                                            |        |          |
        characters()            -- #PCDATA  |        |          |
                                            |        |          |
        endElement(title)       ------------+        |          |
                                                     |          |
        endElement(section)     ---------------------+          |
                                                                |
        startElement(section)   ---------------------+ section  |
                                                     |          |
        startElement(title)     ------------+ title  |          |
                                            |        |          |
        characters()            -- #PCDATA  |        |          |
                                            |        |          |
        endElement(title)       ------------+        |          |
                                                     |          |
        startElement(para)      ------------+ para   |          |
                                            |        |          |
        characters()            -- #PCDATA  |        |          |
                                            |        |          |
        endElement(para)        ------------+        |          |
                                                     |          |
        endElement(section)     ---------------------+          |
                                                                |
        endElement(document)    --------------------------------+

The XML stream parser API allows each element to be handled by a single "element" method, instead of by two event methods (start and end of element). The element method is invoked when the start tag has been read, and has access to all of the information in the start tag. The element method does any necessary processing at the start of the element, and then invokes a parseContent method to return control to the parser, and to parse the content of the element. The parseContent method returns when the end tag of the element has been reached. At this time, the element method still has access to all of the start tag information, which will only be discarded after the element method returns. While the element content is being parsed, other parts of the application can access the start tag information by getting it from an element stack maintained by the API.

The simple document example (Figure 1) would be parsed by methods like those shown in Figure 6. Figure 8 shows how the methods would be invoked when parsing the simple document example. Each method name is simply the element name prefixed by "element_" (the prefix is chosen by the application developer).

A comparison of Figure 7 and Figure 8 will show that each element start event has been replaced by an element method invocation, and that the corresponding end event has been replaced by a return from the element method.

Figure 8: Stream Parser API Method Invocations from the Simple Document Example

        Stream API Method                 XML Element

        element_document()      --------------------------------+ document
                                                                |
        element_title()         ------------+ title             |
                                            |                   |
        characters()            -- #PCDATA  |                   |
                                            |                   |
                                <-----------+                   |
                                                                |
        element_section()       ---------------------+ section  |
                                                     |          |
        element_title           ------------+ title  |          |
                                            |        |          |
        characters()            -- #PCDATA  |        |          |
                                            |        |          |
                                <-----------+        |          |
                                                     |          |
                                <--------------------+          |
                                                                |
        element_section()       ---------------------+ section  |
                                                     |          |
        element_title           ------------+ title  |          |
                                            |        |          |
        characters()            -- #PCDATA  |        |          |
                                            |        |          |
                                <-----------+        |          |
                                                     |          |
        element_para()          ------------+ para   |          |
                                            |        |          |
        characters()            -- #PCDATA  |        |          |
                                            |        |          |
                                <-----------+        |          |
                                                     |          |
                                <--------------------+          |
                                                                |
                                <-------------------------------+

The XML stream parser API represents an element by a Java method, instead of by a Java class. This is appropriate, because a stream parser is producing data for an activity (translating of a flowing stream of characters) instead of producing static data (a document tree in memory) for input to a subsequent phase of processing. Each element method, upon entry, sets up a new processing state for the translation of characters, and upon exit, restores the previous state.

The processing state for an element would include all of the information about the enclosing elements. When method element_section is invoked in Figure 8, method element_document is suspended, but information from the document element is still available from an element stack, and can be obtained requesting the parent of the current element (document). Upon entry to the characters, the element stack contains three elements (document, section, and title) with all of the start tag information for each element. Each of the suspended element methods may have also stored some application-specific information that will be available to the characters method.

Table 5 lists the methods in XML Stream Parser API, and contrasts those methods with the objects produced by a DOM parser and the events produced by a SAX parser. Start and End methods in the SAX API become single methods in the stream parser API, and the new methods correspond to objects produced by a DOM parser. The Stream parser API can be implemented on a SAX parser, giving most of the advantages obtained by direct use of the SAX parser, but also allowing an application to deal explicitly with XML components, just as it could if using a DOM parser.

Table 5: XML Components Mapped to DOM Classes, SAX Events, and Stream Parser Methods
XML Component DOM Class SAX Event Stream Parser Method
attribute Attr none1 none2
CDATA section CDATASection LexicalHandler.startCDATA Listener.cdata
LexicalHandler.endCDATA
comment Comment LexicalHandler.comment Listener.comment
document Document ContentHandler.startDocument Listener.doDocument
ContentHandler.endDocument
document type name DocumentType none3 none4
DTD none5 LexicalHandler.startDTD Listener.doDtd
LexicalHandler.endDTD
element Element ContentHandler.startElement Listener.doElement
ContentHandler.endElement
entity Entity LexicalHandler.startEntity Listener.doEntity
LexicalHandler.endEntity
notation Notation DTDHandler.notationDecl Listener.notationDecl
processing instruction ProcessingInstruction ContentHandler.processingInstruction Listener.processingInstruction
character data Text ContentHandler.characters Listener.characters
  1. Attributes are supplied with the element and are available only during the startElement event
  2. Attributes are supplied with the element and are available in the scope of the doElement method
  3. Document type name is supplied with the DTD and is available only during the startDTD event
  4. Document type name is supplied with the DTD and is available in the scope of the doDtd method
  5. DOM parsers discard all of the input after parsing the input document into a memory-based tree.

Interface Details

The XML Stream Parser API can be implemented in a number of different languages, like python, perl, or C++. However, the author is more familiar with Java than perl or python, and finds Java easier than C++ to use for XML work. Consequently, the interface details are presented as they appear in a Java implementation.

Java package ca.gorman.xml.parse contains a large number of interfaces and classes, but three interfaces and two classes will be sufficient to illustrate the principles of the stream parser API

Parser Interface

A Parser provides

  • information about XML components to a Listener (Figure 10) as the components are encountered in the input stream.
  • access to a stack of elements (from the document root to the most recently entered element) that are currently being parsed
  • tests to find out whether an element with a specific name is one of the elements on the element stack
Figure 9: public interface Parser
public interface Parser
{
    public Listener getListener();

    public void setListener(Listener listener);

    public void parse(Input input)
            throws ListenerException, ParseException, IOException;

    public Element getCurrentElement();

    public Element [] getElementStack();

    public boolean isParsing(String prefixedName);

    public boolean isParsing(QName qName);

    public Location getLocation();
}

Listener Interface

A Listener

  • receives information from the Parser (Figure 9) about XML components and passes the information to the application for processing.
  • throws exceptions from the application back to the Parser.

In particular, the doElement method receives a CurrentElement (Figure 11) containing information that remains available from the time when the start tag has been read until the time the end tag has been read. The CurrentElement is at the top of a stack that contains all of the parent elements. Consequently, the application can get at this information at any time, but never needs to explicitly store the information.

The doElement method receives all elements, and uses the ElementMapper (Figure 13) to invoke the element method that should be invoked on any particular element.

In contrast to the corresponding SAX method, the characters method receives only the characters to be processed, and receives them as an input stream (a ParseReader, Figure 11), instead of receiving the characters in an array.

Figure 10: public interface Listener
public interface Listener
         extends ElementListener, EntityListener, DtdListener
{
    public void doDocument(CurrentDocument document)
            throws ListenerException, IOException;

    public void startPrefixMapping(final String prefix, final String uri)
            throws ListenerException, IOException;

    public void endPrefixMapping(final String prefix)
            throws ListenerException, IOException;

    public void characters(final ParseReader reader)
            throws ListenerException, IOException;

    public void ignorableWhitespace(final ParseReader reader)
            throws ListenerException, IOException;

    public void processingInstruction(final String target, final String data)
            throws ListenerException, IOException;

    public void skippedEntity(final String name)
            throws ListenerException, IOException;

    public Input resolveEntity(final String publicId, final String systemId)
            throws ListenerException, IOException;

    public void cdata(final CurrentMarkedSection cdata)
            throws ListenerException, IOException;

    public void comment(final ParseReader reader)
            throws ListenerException, IOException;
}

public interface ElementListener
{
    public void doElement(CurrentElement element)
            throws ListenerException, IOException;
}

public interface EntityListener
{
    public void doEntity(CurrentEntity entity)
            throws ListenerException, IOException;
}

/* DtdListener is seldom used */

public interface MessageListener
{
    public void warning(final String description, final Location location,
                        final Exception exception)
            throws ListenerException, IOException;

    public void error(final String description, final Location location,
                      final Exception exception)
            throws ListenerException, IOException;

    public void fatalError(final String description, final Location location,
                           final Exception exception)
            throws ListenerException, IOException;
}

CurrentElement Interface

Each invocation of an element method receives a CurrentElement that contains information about the element, and that provides access to information from the stack of enclosing elements, back to the document root element. Additionally, the CurrentElement contains a parseContent method that returns control to the parser, to parse the content of the element. The parseContent method will throw an exception if invoked more than once. The API will throw an exception if the element method returns without invoking the parseContent once.

The element method is invoked after the start tag has been read. No further input is read until the parseContent method is invoked. The parseContent method returns when the end tag has been read. At that time, the CurrentElement is still valid, and still provides all of the information that was available when the element method was invoked.

The caching of information differs from DOM in two ways. First, the only cached information is the start tags of the elements. In particular, character content is never cached by by the Stream Parser API, although the API does make it easy to cache text when required (see Section “Redirecting Output”). Second, only those elements from the element currently being parsed, back to the document root element are cached. This means that memory requirements are much less than for DOM, but it also means that the API provides less information at any one time than can be obtained from DOM.

Figure 11: public interface CurrentElement
public interface CurrentElement extends Element
{
    public int getDocumentLevel();

    public void parseContent();
}

public interface Element
{
    public String getPrefixedName();

    public QName getQName();

    public Element getParent();

    public Attribute[] getAttributes();

    public Attribute getAttribute(String prefixedName);

    public Attribute getAttribute(QName qName);

    public boolean hasParent(String prefixedName);

    public boolean hasParent(QName name);

    public boolean hasAncestor(String prefixedName);

    public boolean hasAncestor(QName name);

    public boolean equals(Object obj);

    public int hashCode();
}
          

ParseReader

Each invocation of the characters method in the Listener receives a ParseReader that contains characters to be processed by the application. In contrast to SAX API, the characters are in a Java stream (a subclass of java.io.Reader), instead of in a sub array of an array of characters. Also in contrast to the SAX API, the ParseReader contains only the characters of interest. Nevertheless, the presentation of characters in this way is consistent with the SAX API, because SAX views the input as a stream. The new API merely extends this view to the application side of the API.

ParseReader extends java.io.Reader) by adding some methods for more convenient access to the character data.

Figure 12: public class Reader
public class ParseReader extends java.io.Reader
{
    public String toString()

    public char [] toCharArray()

    public void appendTo(StringBuffer stringBuffer)

    public void writeTo(Writer writer) throws IOException
}

The Element Mapper

The Listener (Figure 10) invokes the same doElement for every element. This would not be very convenient for application developers, who would have to test every incoming element and invoke the appropriate element method. The XML Stream Parser API provides an ElementMapper (Figure 13) to avoid this inconvenience.

The first time a new element is encountered, ElementMapper uses Java reflection to find a method to invoke on that element. The method is stored in a hash table with the element name as key, so that the lookup only needs to be done once, no matter how many times that element is encountered. The default ElementMapper can map an XML name (including one in a namespace) to a Java method even when the XML name is not a valid Java method name.

Figure 13: public class ElementMapper
public abstract class ElementMapper implements ElementListener
{
    public static ElementMapper newInstance(String methodNamePrefix,
                                            Object defaultClassInstance,
                                            String defaultMethodName)
            throws NoSuchMethodException, IllegalAccessException {}

    public void doElement(CurrentElement element)
            throws ListenerException, IOException    {}
}
          

Figure 14 shows how ElementMapper would be used in an application that produces HTML from the simple docoument example in Figure 1.

One very strong advantage of the ElementMapper is that an element method can be added or removed without any other change in the program. If the method is present, it will be used. If the method is not present, a default method will be used instead.

Figure 14: Using the Element Mapper
        // Create the ElementMapper, mapping element names to methods in this
        // class, methods names prefixed by "element_", and mapping the element
        // to method "defaultElement" when there is no specific method defined
        elementMapper = ElementMapper.newInstance("element_", this,
                                                  "defaultElement");

    public void doDocument(CurrentDocument document)
            throws ListenerException, IOException {
        writerStack.write("<html><body>");
        document.parseContent();
        writerStack.write("</body></html>");
    }

    public void characters(ParseReader reader)
            throws ListenerException, IOException {
        // copy all text to the output
        reader.writeTo(writerStack);
    }

    public void doElement(CurrentElement element)
            throws ListenerException, IOException {
        // let the element mapper find the appropriate method
        elementMapper.doElement(element);
    }

    // use this element method when no specific element method is defined for
    // a particular element name
    public void defaultElement (CurrentElement element)
            throws ListenerException, IOException {
        element.parseContent();
    }

    public void element_title (CurrentElement element)
            throws ListenerException, IOException {
        writerStack.write("<h1>");
        element.parseContent();     // write the title text
        writerStack.write("</h1>");
    }

    public void element_para (CurrentElement element)
            throws ListenerException, IOException {
        writerStack.write("<p>");
        element.parseContent();     // write the paragraph text
        writerStack.write("</p>");
    }

Supplementary Features

The XML Stream Parser API delivers all characters to a single characters method in the Listener (Figure 10), which can be inconvenient when text requires different treatment, depending on which element contains the text.

Unlike DOM, the XML Stream Parser API does not offer random access to the input document, which would be very useful for applications that must handle ID and IDREF attributes during output to a sequential stream.

The XML Stream parser API supports output redirection to solve the first problem by providing control over the destination of output from the characters method, and output resequencing to solve the second problem by providing a kind of pseudo-random access to a sequential output stream. Section “Resequencing the Output” will show that random access output is just as useful as random access input for resolving ID and IDREF attributes.

Redirecting Output

Management of processing state can be enhanced by providing a Unix-like redirection facility within Java. A stack of Java Writers (WriterStack) can be used as the output, supporting temporary redirection by pushing and popping other Writers. Figure 15 shows how a WriterStack can be used to collect the content of one element and save the content until the end of the document for output, without a test inside the characters method to see which element is being processed.

Figure 15: Using the Writer Stack
String appendix = null;
WriterStack writerStack = new WriterStack(new StreamWriter(System.out));

doDocument(CurrentDocument document) {
    document.parseContent();  // parse the whole document
    writerStack.write(appendix);  // write data saved from element a
}

// redirect output while in element a
element_a(CurrentElement element) {
    writerStack.push(new CharArrayWriter());
    element.parseContent();  // give control back to the parser
    appendix = ((CharArrayWriter) writer.pop()).toString();
}

// write to current destination
characters(ParseReader reader) {
    reader.writeTo(writerStack);
}
          

Resequencing the Output

In an XML document, all forward and backward references are declared by attaching ID and IDREF attributes to the appropriate elements. These references can be resolved without using random access to the input, because ID and IDREF attributes obey two rules:

  1. A particular value can appear only once in an ID attribute anywhere in the document
  2. Any value that occurs in an IDREF attribute must appear in an ID attribute somewhere else in the document

These two rules make it possible to use a technique that has been used for almost 50 years to link separate code files into a single executable program:

  1. Write all output to a temporary storage area
  2. Whenever a symbol definition (the ID attribute) is encountered, store the definition
  3. Whenever a symbol use is encountered in the output, write a marker that identifies the symbol
  4. Reprocess the data in temporary storage to replace all the symbol markers by their corresponding definitions.

An example of this technique in the early years of microcomputers was the CP/M linker [Link-80].

It does not matter whether the second or third operation occurs first, because there is no need for the symbol value until the fourth operation commences, at which time all markers will have been written. The first three operations can be done together in a single sequential process, like parsing a document with the XML Stream Parser API. The last operation can be done automatically after that sequential process is completed.

All four operations are handled in the XML Stream Parser API by a Writer that implements the Resequencer interface. This is a straight-forward extension of java.io.Writer that uses the linking technique above to provide a virtual random access to the output:

  • All output from Resequencer goes to a temporary holding area. The temporary holding area is a memory block of configurable size, combined with a temporary file for data that overflows the memory allocation.
  • Normal output is written by the methods of the base class (java.io.Writer)
  • Out-of-sequence data is stored in memory, to be patched into the normal output during final processing. The Resequencer interface provides a mark method to store the data in memory, and a writeMark method to mark the places in the normal output that need to be patched with out-of-sequence data.
  • When an ID attribute is encountered, the application invokes a mark method to store some text using the ID value as the key for retrieval
  • When an IDREF attribute is encountered, the application invokes a writeMark method to write a marker to the output, which has the effect of embedding the attribute value as a key to be used for later retrieval of text
  • When the close method is invoked, ResequencingWriter copies the data from temporary storage to the final destination replacing all the key marks by the text that has been associated with each key

Since normal data can overflow to a file during temporary storage, and out-of-sequence data will usually be a very small proportion of the data, the Resequencer can handle documents of unlimited size, without requiring large amounts of memory.

Figure 16 shows a simple example of a document with ID and IDREF attributes. Figure 17 shows the required output from that document. Figure 18 shows how the output can be obtained using a DOM parser with XSLT. Figure 19 shows how to get the same output with the XML Stream Parser API. It is much easier to get this output from SAX indirectly via the XML Stream Parser API, than it is to get the output directly via the SAX API.

Figure 16: A Simple Example with ID and IDREF Attributes
<?xml version='1.0' encoding='us-ascii'?>
<!DOCTYPE document  [
<!ELEMENT document (para*)>
<!ELEMENT para (title?,(text|ref)*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT ref EMPTY>
<!ATTLIST para
          id ID #IMPLIED> 
<!ATTLIST ref
          idref IDREF #REQUIRED> 
]>
<document>
<para id="p1">
<title>The first title</title>
<ref idref="p2"/>
<text>The first para
</text>
</para>
<para id="p2">
<title>The last  title</title>
<text>The last para
</text>
<ref idref="p1"/>
</para>
</document>
Figure 17: Required Output from ID/IDREF Example
The first title
See "The last  title"
The first para
The last  title
The last para
See "The first title"
          

The quoted titles are obtained from forward or backward references via the ref elements.

Figure 18: Handling ID and IDREF Attributes with XSLT
XSLT stylesheet:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <xsl:template match="*">
        <xsl:apply-templates select="*"/>
        <xsl:text>
    </xsl:text>
      </xsl:template>
      <xsl:template match="para">
        <xsl:apply-templates select="*"/>
      </xsl:template>
      <xsl:template match="title">
        <xsl:apply-templates select="text()"/>
      </xsl:template>
      <xsl:template match="text">
        <xsl:text>
    </xsl:text>
        <xsl:apply-templates select="text()"/>
      </xsl:template>
      <xsl:template match="ref[@idref]">
        <xsl:param name="idvalue" select="@idref"/>
        <xsl:text>
    See "</xsl:text>
        <xsl:apply-templates select="//para[@id=$idvalue]/title"/>
        <xsl:text>"</xsl:text>
      </xsl:template>
    </xsl:stylesheet>

Output:
    >java  Idrefs stylesheet.xml ../input.xml
    The first title
    See "The last  title"
    The first para
    The last  title
    The last para
    
    See "The first title"
Figure 19: Handling ID and IDREF Attributes with the XML Stream Parser API
        // create an element mapper to automatically find the appropriate
        // method for each element
        elementMapper = ElementMapper.newInstance("element_", this,
                                                  "defaultElement");

        // create a Writer that supports both resequencing and redirection
        writerStack = new ResequencingWriterStack<String>(
                                        new ResequencingWriter<String>(
                                                new PrintWriter(System.out)));

            // get parser
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            xmlParser = new XMLParser(factory);
            xmlParser.setListener(this);

            // parse, translate, and write out the document
            xmlParser.parse(input);

            writerStack.close();

    public void characters(ParseReader reader)
            throws ListenerException, IOException {
        // copy input to the output
        reader.writeTo(writerStack);
    }

    /**<P>
     * Element dispatcher to select and invoke the appropriate element method.
     * @see ElementMapper
     */
    public void doElement(CurrentElement element)
            throws ListenerException, IOException {
        // let the element mapper find the appropriate method
        elementMapper.doElement(element);
    }

    public void defaultElement (CurrentElement element)
            throws ListenerException, IOException {
        element.parseContent();
    }

    public void element_title (CurrentElement element)
            throws ListenerException, IOException {

        // divert output to a CharArrayWriter so we can use the output twice
        CharArrayWriter charArrayWriter = new CharArrayWriter();
        writerStack.push(charArrayWriter);
        element.parseContent();
        writerStack.pop();

        // write to the normal destination
        charArrayWriter.writeTo(writerStack);
        writerStack.write(lineSeparator);

        // Using the ID value of the parent as a key, store the title in a Mark
        // for later use
        Attribute id = element.getParent().getAttribute(ATT_ID);
        if (id != null) {
            charArrayWriter.writeTo(writerStack.mark(id.getValue()));
        }
    }

    public void element_ref (CurrentElement element)
            throws ListenerException, IOException {
        Attribute idref = element.getAttribute(ATT_IDREF);

        // write a mark using the IDREF value as the key that identifies
        // the mark
        writerStack.write("See \"");
        writerStack.writeMark(idref.getValue());
        writerStack.write("\"");
        writerStack.write(lineSeparator);

        element.parseContent();
    }

XML Stream Parser Implementation

This section describes an implementation of the “XML Stream Parser API” over the SAX parser.

The greatest difficulty in the implementation is the need to run two synchronous activities, but without using the ordinary invoke-and-return paradigm of Java methods. The XML Stream Parser API requires that an element method suspend itself and allow the underlying SAX parser to run from its last position, even though the element method has not returned. At a subsequent point, the SAX parser will be suspended, and the element method will continue to its own exit, again causing the SAX parser to continue from its most recent position.

This section will show how coroutines are used to solve the problem managing and coordinating two synchronous activities to transform the SAX API (which produces a linear series of start and end events) into the XML Stream Parser API (which produces nested invocations of methods that each represent a pair of start and end events).

Coroutines

Sam Wilmott [Wilmott-2003] describes coroutines, which are not new [Conway-1963], shows how coroutines differ from the usual way of transferring control between two lines of execution, and explains how the coroutines are used to coordinate a pair of synchronous activities. Section “Using Coroutines to Implement the API on a SAX Parser” will explain why the XML Stream Parser API requires two synchronous lines of execution.

Transfer of control in a Java program occurs, as shown in figure 20, via the invoking of a method which

  1. Specifies the point at which execution begins in the destination, and suspends execution at the current point
  2. Subsequently returns to the point at which execution was suspended.
Figure 20: Transfer of Control via Method Invocation

          Main                        Subordinate

          V
          |
  suspend | --- invoke (1) ------------+-> | begin
                                      /    |
                                     /     |
                                    /      |
   resume | <-- return (1) --------+---+-- | end
          |                       /   /
          |                      /   /
          |                     /   /
  suspend | --- invoke (2) --->+   /
                                  /
                                 /
                                /
   resume | <-- return (2) ----+
          |
          V
        

Note the asymmetry of control transfer.

In effect, one line of execution is the manager of both lines. Coroutines, on the other hand, give equal standing to each line of execution. Each line chooses when to yield control to the other, but does not specify the starting point in the other line. Instead, the other line of execution continues from the point where it previously yielded control, as shown in Figure 21

Figure 21: Transfer of Control between Coroutines

          Coroutine 1                      Coroutine 2

          V
          |
  suspend | -- resume 2 -----------------> | continue
                                           |
                                           |
                                           |
 continue | <----------------- resume 1 -- | suspend
          |
          |
          |
  suspend | -- resume 2 -----------------> | continue
                                           |
                                           |
                                           |
 continue | <----------------- resume 1 -- | suspend
          |
          |
          V
        

Note the symmetry of control transfer.

Using Coroutines to Implement the API on a SAX Parser

The XML Stream Parser API transforms each pair of matched start and end events from the SAX parser into a single method invocation(see Figure 23. This means that each of the invoked methods must be suspended while the SAX parser runs to the next SAX event. Since the method has not yet completed, and was invoked by an event from the SAX parser, the method cannot share a stack with the SAX parser. In Java (and other common languages like C++), the only way to run with separate stacks is to run in separate threads (or separate processes in C++). In order to transfer control by suspending a method, instead invoking a method and returning from the method, it is necessary to use coroutines.

The SAX parser is run in a Coroutine, which extends Thread by adding methods to support the transfers of control illustrated in Figure 22

Figure 22: Running a SAX Parser as a Coroutine

          Application                       Parser

          V
          |
  suspend | -- start parser -------------->| begin
                                           |
                                           |
 continue | <----- yield to application -- | suspend
          |
          |
  suspend | -- yield to parser ----------> | continue
                                           |
                                           |
 continue | <----- yield to application -- | suspend
          |
          |
  suspend | -- yield to parser ----------> | continue
                                           |


          |
          V
        

Note that transfer of control transfer is symmetric after the parser has been started.

Figure 23 shows how the XML Stream Parser API would coordinate an application and a SAX parser when parsing a simple document containing two nested elements with some PCDATA in the inner element.

Figure 23: Application Running a SAX Parser as a Coroutine

     Application                              SAX Parser         API Method
      Activity                                 Activity      Activity and Scope
                                                                a   b   char

          V
          |
  suspend | -- start parser ----------------->| begin
                                              |
                                              |
 continue | <-------- yield to application -- | suspend
          | 
          |
          |
  suspend | -- yield to parser -------------> | continue
                                              |
                                              |
 continue | <------------ startElement (a) -- | suspend         |
          | begin element method (a)                            |
          |                                                     |
          |                                                     |
  suspend | -- yield to parser -------------> | continue        |
                   suspend element method (a) |                 -
                                              |                 -
 continue | <------------ startElement (b) -- | suspend         -   |
          | begin element method (b)                            -   |
          |                                                     -   |
          |                                                     -   |
  suspend | -- yield to parser -------------> | continue        -   |
                   suspend element method (b) |                 -   -
                                              |                 -   -
 continue | <------------ characters event -- | suspend         -   -   |
          | begin characters method                             -   -   |
          |                                                     -   -   |
          |                                                     -   -   |
  suspend | -- return ----------------------> | continue        -   -   |
                                              |                 -   -
                                              |                 -   -
 continue | <-------------- endElement (b) -- | suspend         -   |
          | resume element method (b)                           -   |
          |                                                     -   |
          |                                                     -   |
  suspend | -- return ----------------------> | continue        -   |
                                              |                 -
                                              |                 -
 continue | <-------------- endElement (a) -- | suspend         |
          | resume element method (a)                           |
          |                                                     |
          |                                                     |
  suspend | -- return ----------------------> | continue        |
                                              |
                                              |
 continue | <---------------------- resume -- | end
          |
          |
          V
        

The scope of element method a encloses the scope of element method b, because element method b is invoked and returns while element method a is suspended. Similarly, scope of the characters method is inside the scope of element method b.

Implementing the API in Other Languages

The author chose Java as the implementation language because of his familiarity with the language, and because of the availability of supporting software.

However, Java as a language does not offer any particular advantage over some other languages. Languages like perl and python would work as well, possibly even having an advantage with respect to operating on text.

The essential requirement for any language is that it allow an event-based parser and an application to run on separate stacks but synchronized with each other, as could be done with coroutines [Wilmott-2003].. Since Java does not support coroutines, it was necessary to use separate threads synchronized by exclusive locks. The same approach would work in python, using a python rlock. However, python supports generators. A generator produces output on request, but does not use the same stack as the requesting application. Consequently, a parser running in a generator would be more or less equivalent to a parser running in a coroutine.

Sam Wilmott took a different approach, creating a language [OmniMark] designed from the ground up as a markup stream processing language, in which an XML stream processing API would be superfluous.

Other Approaches

Jeremy Carroll [Carroll-2001] also advocates implementing another layer over an event-based parser like SAX but found, in his tests, that coroutine overhead uses 15 to 25 percent of the processing time, due to Java thread-swapping overhead. For this reason, he recommends using the lower level to drive a state machine in the higher level (he call this "inverting" the higher level parser) so that the lower level can run the higher level as a subroutine of the lower level.

The author has not followed up on this line of inquiry, but notes that his own preliminary tests show the Stream Parser API running at a speed more like that of DOM than SAX. However, the same tests show the Stream Parser API taking less than twice as long as SAX.

Conclusion

The choice is no longer between the low resource requirements of SAX and the greater ease-of-use and less complex design requirements of DOM. The “XML Stream Parser API” provides a middle ground for those applications that need the ease-of-use inherent in DOM parsers using XSLT and the ability, inherent in SAX, to handle documents of unlimited size.

The XML Stream Parser API makes SAX parsers easier to use, in much the same way that XSLT makes DOM parsers easier to use. In both cases, application designers can focus on what they want to do with the XML elements in a document, instead of focusing much of their effort on how to make a parser do what they want.

Resequenced output provides a pseudo-random access to the output stream. Consequently, it is easy to handle ID and IDREF attributes, and easy to divorce the sequence of the output and the sequence of the input document. Resequenced output is almost as flexible as the random access to input offered by DOM.

With large documents, the XML Stream Parser API will use much less memory than DOM parsers, even when using resequenced output.

When not using resequenced output, the XML Stream Parser API permits immediate output, prior to the end of the input, while still providing an ease-of-use that was previously available only from XSLT and DOM.

Output redirection within Java increases ease-of-use by making it possible to write element methods that are more self-contained and modular, giving stream parsers some of the advantages that were previously only available from XSLT and DOM.


Acknowledgments

Thanks to Sam Wilmott for suggestions and criticism.

Thanks to anonymous referees for some very helpful advice.


Bibliography

[Carroll-2001] Carroll, Jeremy J., CoParsing of RDF & XML, HP Laboratories Bristol, 2001

[Conway-1963] Conway, Melvin E., Design of a Separable Transition-Diagram Compiler, CACM Vol 6 No. 7, July 1963

[Knuth-1973] Knuth, Donald: "Stacks, Queues and Dqueues" in The Art of Computer Programming, Volume 1, Addison-Wesley, 1973

[Link-80] Digital Research Incorporated: Link-80 Operator's Guide, 1980 (A programmer's guide for the CP/M operating system)

[OmniMark] OmniMark programming language, http://www.stilo.com/

[Wilmott-2003] Wilmott, Sam: "What programming language designers should do to help markup processing", paper presented to Extreme Markup Languages 2003 Conference



Generic XML Stream Parser API

Ian E. Gorman [none]
iegorman@sourceforge.net