What can you do with half a parser?

Simon St.Laurent
simonstl@simonstl.com

Abstract

While most developers are happy parsing their XML with off-the-shelf parsers and working with fully-cooked results, there are times when developers need a little more control over their document processing. While XML is text, applying text processing tools directly to XML has some serious drawbacks. This presentation will explore the possibilities offered by a combination of XML parsing for context with text processing to manipulate that content.

Keywords: Parsing; Processing

Simon St.Laurent

Simon St. Laurent is an Associate Editor with O’Reilly & Associates. Prior to that, he’d been a web developer, network administrator, computer book author, and XML troublemaker. He lives in Ithaca, NY. His books include XML: A Primer, XML Elements of Style, and the upcoming Office 2003 XML Essentials. He is a contributing editor to xmlhack and an occasional contributor to XML.com.

What can you do with half a parser?

Simon St.Laurent [Associate Editor; O’Reilly & Associates]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 Simon St. Laurent. Reproduced with permission.

The status quo and its difficulties

XML has succeeded beyond its creators’ expectations, often in unexpected fields. XML’s relative simplicity has made it ubiquitous, thanks largely to the widespread availability of tools for processing and manipulating XML. While these tools have accomplished a lot, some categories of tools have still barely begun to develop. XML parsing has generally been treated as an opaque and uncontrollable process. In particular, most tools have focused on a subset of XML’s syntactic expressions, discarding information as it moves from the original document to the receiving application.

Life in an Infoset world

Most developers have spent the past five years of XML’s growth becoming ever more deeply trapped in a view of XML documents as node trees. While that view is useful in many circumstances, it is also severely limited by the many differences between the original text of XML documents and the results reported by the XML 1.0 [XML 1.0] processor, or worse, the XML 1.0+namespaces [XMLNS] processor. These differences are the result of hard work on the part of the processor (more commonly called the parser), but they have severely limited the applicability of text-based tools to XML and made it difficult to create transformations which change as little as possible of the original document. The tight integration of DTD processing with XML 1.0 parsing has both driven proposals for more cleanly layered models and made it difficult or impossible to update DTD processing for new features like namespaces.

In response to these problems, many developers have retreated more and more deeply into the node-based view, commonly describing their work as “Infoset manipulation” and paying little or no attention to the markup syntaxes which provide the foundation for the XML Infoset [Infoset]. While the Infoset-oriented community seems to believe that its approach provides better interoperability, Infoset-oriented tools also remove much of XML’s flexibility — particularly since DTDs play no direct role in the Infoset, which represents a processed view of the XML document.

So far as I can determine, XML parsers have been consistently built around what is effectively an Infoset model. The Simple API for XML [SAX2] and the Infoset are very similar, and while there are certainly differences between the Document Object Model [DOM] and XPath [XPath 1.0] models and the Infoset, all of these models are tightly bound to notions of nodes in a tree structure. Also, XML 1.0 is fairly specific about the processing required to be a “conforming XML 1.0 processor”, and performing that processing excises a lot of information from the original document. Wisely, most developers followed the path of least resistance and created tools which report fully-processed Infosets to the application.

Problems with the Infoset world

While many of the problems of Infoset-based processing are invisible to those who work only in the Infoset, they arise from a number of different situations:

  • Human-computer interaction — While a processor may only care about the complete XML document described by a complete parse, ignore namespace prefixes, and not use order information in attribute processing, humans tend to be fond of documents which remain familiar after processing, complete with features they used to assemble the document.
  • XML’s continuing evolution — XML 1.1 [XML 1.1] changes the character productions substantially, and further change, even if minor, remains likely. Schemas of various kinds have effectively deprecated DTDs for many users, while providing no replacement for character entities.
  • Leftover issues from XML 1.0 — Some of us need to be able to process external entities on a regular basis, thereby losing access to the DOCTYPE declaration which provided information about things like character entities.
  • Sugar-based processing — It’s sometimes useful to use syntactic mechanisms like whitespace and attribute quoting style to indicate content which needs special handling at some level of processing. Because this information is discarded by XML processors, developers have been unable to use these tricks once the XML parsing process has commenced — meaning that none of these tricks can be used reliably in situations which depend on markup or namespace context.

A half-parser by itself does not solve these problems, but it permits developers to solve them by applying their own logic to the processing of the complete lexical content of documents. By providing character-by-character reporting of the original document with markup and namespace context, a half-parser lets developers create their own tools for dealing with all of the above situations. This should be an improvement over banging their heads against the locked box of the XML 1.0 parser.

The Ripper parser

Because of my various frustrations with the Infoset approach, as well as a regular need to make automated but minimal transformations, I have been working for a while on this area, starting with an article [Layered] suggesting that XML parsing would benefit from a refactoring into separate components for syntax parsing, well-formedness checking, entity resolution, attribute defaulting, namespace processing, structural validation, and finally presentation to the application. All of these pieces of the XML puzzle are useful in isolation as well as in combination.

The Ripper parser performs some but not all of the functions of an XML 1.0 processor. Its primary function is to break documents down into components conforming to the markup grammar used by XML 1.0, hence its name. It performs some error reporting, primarily in cases where the markup itself violates the basic grammatical rules laid out in XML 1.0. Ripper keeps track of the element, attribute, and namespace contexts, and reports all of the content of the document to a handler, including tidbits like attribute quoting style and whitespace inside of tags.

Perhaps more important than what Ripper does is what Ripper leaves to the application. Ripper performs no DOCTYPE processing, entity processing, attribute defaulting, character checking, or normalization on the textual information it passes to the application. The application is responsible for performing any of these tasks as it deems appropriate, or it can ignore them and just process the raw information that is handed to it.

Communication between the application and Ripper is managed through two key interfaces, ContextI and DocProcI.

The ContextI interface

Ripper uses the ContextI interface to communicate information about the document to the application, ranging from the origin URI to currently-scoped namespace declarations to scoped attribute values to a brief element tree. The application can also modify this context, either to communicate with the parser or to communicate with other applications in a chain of processors. The context object also provides a small foundation of initial information, notably XML 1.0’s built-in entities and namespace URIs for the xml and xmlns prefixes. (Applications can provide customized implementations of ContextI to provide for their own needs.)

The ContextI interface provides a variety of methods for tracking information about a document (including information from the XML Declaration), its structure, namespaces, and scoped attributes. The interface combines get/set methods for document properties with tree-based structural tracking.

public void setOrigin(String origin);

public String getOrigin();

public boolean setXMLDeclaration(String declaration);

public void setVersion(String version);

public String getVersion();

public void setEncoding(String encoding);

public String getEncoding();

public void setStandalone(String standalone);

public String getStandalone();

public void setExplicitXMLDecl(boolean happened);

public boolean getExplicitXMLDecl();

public StackableComponentI getParent();

public StackableComponentI getCurrent();

public void startChild (StackableComponentI child);

public void endChild ();

public void addNode (StackableComponentI child);

public void setEntity(String name, Object value);

public Object resolveEntity(String entName);

public boolean isSpace(char c);

public void declarePrefix (String prefix, String URI);

public String getUri(String prefix);

public String getPrefix(String URI);

public void trackScopedAtts(boolean track);

public void addScopedAtt (NamingI att);

public void removeScopedAtt (NamingI att);

public String getScopedAttValue (NamingI att);

public void pushLevel(); //when elements start

public void popLevel(); //when elements end

public void reset();

The ContextI interface has also proven useful outside of Ripper — it’s a convenient lightweight tree-tracker — so it has become a member of the com.simonstl.common package rather than a part of com.simonstl.gorille.ripper. The common package also includes both a standard implementation of this functionality and a “Loud” version which reports activity for debugging purposes. ContextI also interoperates with MOE [Markup Object Events] [MOE], though the default implementation uses a lighter set of objects to track context.
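
As a small sketch of that tree-tracking role, an application might use a context on its own as shown below; the SimpleContext implementation class and the null return for an out-of-scope prefix are assumptions here, while the methods themselves come from the ContextI listing above.

ContextI context = new SimpleContext();        // hypothetical implementation class
context.pushLevel();                           // an element starts
context.declarePrefix("svg", "http://www.w3.org/2000/svg");
String inScope = context.getUri("svg");        // "http://www.w3.org/2000/svg"
context.popLevel();                            // the element ends
String outOfScope = context.getUri("svg");     // presumably null once the declaration leaves scope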

The DocProcI interface

The DocProcI interface provides a means for Ripper to communicate the actual textual content of the document, angle brackets, whitespace, and all, to the receiving application. Because Ripper is so focused on text, the API is almost completely text-oriented, using StringBuffer objects to represent everything. This tactic is unusual and would probably horrify most proper Java programmers, but it is appropriate to the kind of information Ripper provides. The API is also extremely close to the markup, as the following excerpt demonstrates:

public StringBuffer XMLDecl (StringBuffer content) throws GorilleException;

public StringBuffer DOCTYPE (StringBuffer content) throws GorilleException;

public StringBuffer startElementOTag (StringBuffer content) throws GorilleException;

public StringBuffer startElementCTag (StringBuffer content) throws GorilleException;

public StringBuffer elementName (StringBuffer content) throws GorilleException;

public StringBuffer tagSpace (StringBuffer content) throws GorilleException;

public StringBuffer attName (StringBuffer content) throws GorilleException;

public StringBuffer attEquals (StringBuffer content) throws GorilleException;

public StringBuffer attStartQuote (StringBuffer content) throws GorilleException;

public StringBuffer attEndQuote (StringBuffer content) throws GorilleException;

public StringBuffer endElementOTag (StringBuffer content) throws GorilleException;

public StringBuffer endElementETag (StringBuffer content) throws GorilleException;

public StringBuffer endElementCTag (StringBuffer content) throws GorilleException;

public StringBuffer chars (StringBuffer content) throws GorilleException;

public StringBuffer decCharRef (StringBuffer content) throws GorilleException;

public StringBuffer hexCharRef (StringBuffer content) throws GorilleException;

public StringBuffer entRef (StringBuffer content) throws GorilleException;

public StringBuffer commentStart (StringBuffer content) throws GorilleException;

public StringBuffer commentContent (StringBuffer content) throws GorilleException;

public StringBuffer commentEnd (StringBuffer content) throws GorilleException;

public StringBuffer PIStart (StringBuffer content) throws GorilleException;

public StringBuffer PITarget (StringBuffer content) throws GorilleException;

public StringBuffer PISpace (StringBuffer content) throws GorilleException;

public StringBuffer PIData (StringBuffer content) throws GorilleException;

public StringBuffer PIEnd (StringBuffer content) throws GorilleException;

public StringBuffer CDATAStart (StringBuffer content) throws GorilleException;

public StringBuffer CDATAEnd (StringBuffer content) throws GorilleException;

It’s not lovely code, but it does make it possible, even easy, to process lexical content and return lexical content. Recreating an XML document from Ripper events is a matter of concatenating all the returned StringBuffer objects to produce a file in the desired encoding. The original use case for Ripper was as a pre-processor to another parser, making modifications in the text before passing the document to the parser. Applications that have no need to modify content may simply ignore the StringBuffer return values, and they can likewise ignore events that don’t interest them. Most interestingly, of course, they can change or suppress content.
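
As an illustrative sketch of that reconstruction, a pass-through handler might look like the following; the EchoHandler class is an assumption, and the remaining DocProcI methods are elided, each following the same append-and-return pattern.

public class EchoHandler implements DocProcI {
    private final StringBuffer out = new StringBuffer();

    public StringBuffer chars(StringBuffer content) throws GorilleException {
        out.append(content);    // record the text exactly as reported
        return content;         // and pass it through unchanged
    }

    public StringBuffer startElementOTag(StringBuffer content) throws GorilleException {
        out.append(content);
        return content;
    }

    // ... every other DocProcI method appends and returns in the same way ...

    public String document() {
        return out.toString();  // the reassembled document, character for character
    }
}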

Challenges in writing Ripper

Ripper is not all that complicated a program, though its character-by-character parsing logic isn’t particularly delightful. For the most part, it trudges through the document, keeps track of both lexical and structural context, and reports what it finds. There are a few cases where markup is so badly wrong that it can’t be processed even at Ripper’s relatively simple level, and these are reported as errors.

The only particularly difficult part of writing Ripper was created by Namespaces in XML. Prior to namespaces, an XML document could be parsed directly in sequence. Everything a parser needed to know in order to parse a given part of a document came from earlier parts of the document. Because of namespaces, however, the parser frequently needs to read to the end of the start tag to interpret the element name at the beginning of the tag. Namespaced attributes often have the same problem, with namespace declarations that come after the prefix has already been used.

Traditionally, Infoset-like event-based parsers have reported the start tag as a single event, making it possible for them to avoid the problem of namespace declaration sequence. Unfortunately, because attribute values may contain entity references which Ripper reports separately rather than resolving, this approach is not possible in an API which reports individual lexical components as events.

Solving this problem requires processing the start tag twice. The first parse is used to set the context, including namespace context, and the second parse is used to report the text to the application. Using this approach, the application will have the namespace context it needs to interpret element and attribute names as they arrive.

Unfortunately, this double-parse has created some duplicate code, both for the double-parse itself and for ampersand and entity handling. Attribute values may, of course, include entities and character references, even attribute values which happen to be used by namespace declarations. For purposes of context, Ripper resolves these entities, but it then reports the unresolved entities separately during the reporting phase.
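
In outline, the start-tag handling looks roughly like the sketch below; scanAttributes, resolveValue, reportStartTag, and the RawAttribute holder type are hypothetical names for internal machinery, not part of Ripper's published interfaces, and tagBuffer, context, and handler are fields of the surrounding parse loop (not shown).

// First pass over the buffered start tag: update the context only.
// Entity references in namespace declaration values are resolved here
// so that the declared URIs are usable.
java.util.Iterator atts = scanAttributes(tagBuffer).iterator();    // hypothetical helper
while (atts.hasNext()) {
    RawAttribute att = (RawAttribute) atts.next();                  // hypothetical holder type
    if (att.isNamespaceDeclaration()) {
        context.declarePrefix(att.getPrefix(), resolveValue(att.getValue(), context));
    }
}
// Second pass over the same buffer: report each lexical piece (names,
// whitespace, quotes, unresolved entity references) to the handler,
// now that the namespace context is complete.
reportStartTag(tagBuffer, handler, context);                        // hypothetical helper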

Ripper licensing and availability

Ripper is distributed as part of the Gorille project, all of which is licensed under the MPL [Mozilla Public License]. The Gorille distribution includes Ripper, rules-based character testing code, and some common code used to create shorthand descriptions of document structures. Ripper (and all of Gorille) is written in Java, and requires at least Java 1.2.

Applications

There are a number of cases where this impractical-looking and not particularly efficient API may be useful. Although Ripper is still just getting started, the applications below represent a few classes of problems on which I’ve started work.

Custom character contexts

While most of the arguments about XML 1.1 focus on the NEL character and whether or not change of any kind is a good thing for the core of XML, XML 1.0 certainly helped create its own versioning problem. The list of characters included in XML 1.0 was illuminating and useful, but it was also built so deeply into parsers that changing it now is difficult.

Ripper can’t fix all of the old parsers, but it does offer an approach that may be useful in the future. Ripper builds only its expectations for the markup characters themselves into the parsing logic, and leaves determination of whitespace and other acceptable characters to the application. Gorille [Gorille] defines a mechanism for testing acceptable characters in markup (and whitespace) which can handle the shift from XML 1.0 to XML 1.1. As Gorille integration with Ripper proceeds, this information will become available through the Context object and character checking will be performed by a filter on the Ripper output.
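
Gorille's actual rule format is not shown here, but the general shape of such a check might look like the sketch below; the CharacterRules interface and the Xml10Rules and Xml11Rules classes are illustrative assumptions, not Gorille's own types.

interface CharacterRules {
    boolean isNameStartChar(int c);
    boolean isNameChar(int c);
    boolean isWhitespace(int c);
}

class NameChecker {
    private final CharacterRules rules;

    NameChecker(boolean useXml11) {
        // choose a rule set once; both implementation classes are hypothetical
        this.rules = useXml11 ? new Xml11Rules() : new Xml10Rules();
    }

    boolean acceptableName(String name) {
        // (surrogate-pair handling elided for brevity)
        if (name.length() == 0 || !rules.isNameStartChar(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!rules.isNameChar(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}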

Entity processing

The most prominent recent use case involves the current set of issues surrounding character entities, where DTDs, particularly the internal subset, are used to provide entity declarations in otherwise schema-centric (or completely unvalidated) environments. Some developers would prefer not to deal with DTDs at all, and there are now a fair number of environments (notably SOAP [SOAP] messages) where the DOCTYPE declaration is prohibited. This creates problems for some developers, notably those using MathML [MathML] with its many frequently-used entities.

Because Ripper reserves entity processing to the application, applications can solve problems like these with entity resolvers focused on their particular needs rather than the expectations of a given parser. An application could even resolve entities based on their current namespace or element scope, making it possible to create entity vocabularies which are associated with particular structural vocabularies rather than with a single document. This could potentially reduce name collisions between the entities used by different vocabularies, a problem avoided today by copying and coordination. (Resolving entities based on where the document came from rather than its vocabulary may also make sense in some cases.) This isn’t here yet, but opening up the XML parser makes such alternate approaches possible, even preferable.
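
Nothing like the following exists in Ripper today, but as a sketch of the idea, an application's entity handler might consult a per-vocabulary table keyed by the in-scope default namespace. The entityTables map, the use of the empty string as the default-namespace prefix, and the assumption that entRef receives the reference complete with its '&' and ';' delimiters are all mine rather than Ripper's.

public StringBuffer entRef(StringBuffer content) throws GorilleException {
    // context and entityTables are fields of the enclosing handler (not shown)
    String name = content.toString();
    name = name.substring(1, name.length() - 1);              // strip '&' and ';' (an assumption about what is reported)
    String vocab = context.getUri("");                        // in-scope default namespace URI
    java.util.Map table = (java.util.Map) entityTables.get(vocab);   // hypothetical per-vocabulary table
    if (table != null && table.containsKey(name)) {
        return new StringBuffer((String) table.get(name));    // substitute the replacement text
    }
    return content;                                           // leave unknown references untouched
}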

Ents [Ents] provides an alternate mechanism to DTDs for working with character references and entities composed only of characters. While the initial release of Ents only provided support for character entities, Ripper’s foundations are flexible enough that applications can summon a new instance of Ripper to parse an external entity and integrate it with the existing document, if desired. If the application takes care to preserve context objects, those can be combined to provide support for complex cases like nested entities which rely on namespace declarations from the parent. This integration of Ents and Ripper is proceeding and will be available by August 2003.
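
Again purely as a sketch, an entity handler could hand an external entity to a fresh Ripper instance that shares the parent document's context; the Ripper constructor, its parse method, the externalEntities table, and the GorilleException constructor used here are assumptions about wiring that the Gorille distribution may expose differently.

public StringBuffer entRef(StringBuffer content) throws GorilleException {
    // context and externalEntities are fields of the enclosing handler (not shown)
    String name = content.toString();
    name = name.substring(1, name.length() - 1);
    String systemId = (String) externalEntities.get(name);    // hypothetical table of external entities
    if (systemId != null) {
        try {
            // Reusing this document's context means the entity's content
            // sees the parent's namespace declarations and scoped attributes.
            Ripper nested = new Ripper(context, this);          // assumed constructor
            nested.parse(new java.io.InputStreamReader(new java.net.URL(systemId).openStream()));
        } catch (java.io.IOException e) {
            throw new GorilleException("Could not read " + systemId);   // assumed constructor signature
        }
        return new StringBuffer();       // the reference itself contributes no further text
    }
    return content;
}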

Ripper also makes it possible to revisit entity references as an inclusion mechanism, offering the prospect of alternative entity declaration mechanisms. While the W3C has layered XInclude [XInclude] on top of XML 1.0, XInclude is both a superset and a subset of entity processing, supporting only external references and requiring separate processing. XInclude is unsuitable for character references and notes that “Well-formed XML entities that do not have defined infosets (e.g., an external entity with multiple top-level elements) are outside the scope of this specification”, making it only a partial replacement for entity processing. Ents will provide a more complete alternate mechanism for entity declaration which will be integrated with the existing entity reference infrastructure, with these features complete by the end of 2003.

Valuable sugar

Another use case involves situations where otherwise unimportant details of an XML document are used to mark content which needs special treatment — a transformation that only applies to attributes with single quotes, for example. While such work doesn’t necessarily accord with an Infoset view of XML, and has often “flown under the radar”, it can still be a practical means of combining and massaging information from different sources.

Unfortunately for the “Desperate Perl Hacker”, XML is not particularly conducive to simple manipulation with regular expressions. Entity references and namespace prefixes both serve as abbreviations for information declared elsewhere, and default attributes can also make such processing difficult. Also, while transformations are a critical part of XML processing, such transformations may throw away information, notably comments, processing instructions, and whitespace — which are actually useful to developers manipulating documents as text.

NOTE:

There are other approaches to processing XML as text, notably [REX], which uses regular expressions for a shallow XML parse. Unfortunately, these approaches do not gather scoped information like namespace declarations or xml:base, xml:lang, or xml:space. Giving applications access to both lexical information and structural context requires a more complex approach.

Ripper doesn’t solve these problems automatically, but it provides a framework within which developers can combine textual processing and an understanding of the markup context. Everything is exposed, and everything is reported as a series of StringBuffer objects. In addition to giving programs that read XML documents access to those lexical details, Ripper also gives them the ability to control the output of those events precisely. Instead of attributes coming back reserialized in arbitrary order after processing, Ripper provides the opportunity to serialize attributes in the order readers want to see them. Different readers can have different filters that present documents the way they want them — CDATA sections or entity references, entity references expanded or not, single or double quotes, attributes on their own lines, etc.
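
As one small sketch of such a filter, a reader who prefers double-quoted attributes could get them with something like the following; the DoubleQuoteFilter class and its EchoingBase pass-through parent are assumptions, while the two overridden methods come from the DocProcI excerpt above.

public class DoubleQuoteFilter extends EchoingBase {    // hypothetical pass-through base class
    public StringBuffer attStartQuote(StringBuffer content) throws GorilleException {
        return new StringBuffer("\"");     // emit a double quote regardless of the original
    }

    public StringBuffer attEndQuote(StringBuffer content) throws GorilleException {
        return new StringBuffer("\"");
    }

    // A complete filter would also escape any literal double quotes that
    // appear inside originally single-quoted attribute values.
}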

Minimal transformations

While total control over every character isn’t always necessary, the general notion of processing which makes as few changes as possible reduces the cost of each transformation. At present, SAX filters, XSLT, and most XML-based APIs work on completely processed inputs, potentially losing enormous detail. While APIs may well want to provide abstractions which go beyond Ripper’s minimal offerings, Ripper also permits applications to have more detail lurking under their abstractions.

Reducing the impact of transformations promises to make it easier to integrate human convenience with computer efficiency. Telling humans that computers don’t care about their details rarely makes humans happy, especially when the humans rely on those details for their own style of processing. Again, Ripper doesn’t solve this problem completely by any means, but it does provide a foundation for such work.

Custom contexts

The separation of context from parsing logic means that it is possible to configure context and then process document fragments within that context. This permits the processing of external entities which lack DOCTYPE declarations or namespace declarations, for example. It may also help in the processing of document fragments which are missing their namespace declarations. Unlike XML 1.0 parsers, which generally expect the document to be complete and provide only a few hooks for information (typically, entity resolution support for catalogs), Ripper lets applications provide as much context information as their creators deem appropriate, both before and during the parse.
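
A sketch of that configuration might look like the following; SimpleContext, the Ripper constructor, and its parse method are assumptions, while the ContextI methods are from the interface shown earlier.

ContextI context = new SimpleContext();                          // hypothetical implementation
context.setOrigin("http://example.com/fragment.xml");
context.declarePrefix("xhtml", "http://www.w3.org/1999/xhtml");  // a declaration the fragment itself lacks
context.setEntity("nbsp", "\u00A0");                             // an entity the fragment uses but never declares
Ripper ripper = new Ripper(context, handler);                    // assumed wiring of context and handler
ripper.parse(new java.io.FileReader("fragment.xml"));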

Custom parsing

Developers can also build their own parsing or storage logic on top of this API. Because the entire document will be reported verbatim, it’s possible to use this information to report an XML document to an application while preserving its original form more precisely than is possible with approaches like SAX or DOM. (Encoding issues may still keep it from being byte-for-byte identical, but character-for-character is plausible.)

There are a number of uses for this kind of processing. Environments like MOE, which support more features than are provided by SAX or DOM, have always been limited by the kinds of parsers available to them. MOE provides an object structure which can preserve entity information and other aspects of lexical XML. A Ripper application could easily create MOE events, effectively building a MOE parser. Lexical analysis programs might also use Ripper as a front end to their own work, and some kinds of diff tools may find Ripper a useful source of detailed information.

Conclusions

While this approach may not appeal to every developer, I hope that it will find a useful place in many developers’ toolkits, helping to solve problems that require knowledge of the document both as text and as marked-up structure and content. Markup as a field has long valued keeping information accessible to humans with minimal intercession by tools. Preserving the lexical details which make XML human-friendly over the course of computer processing offers the chance to build systems that recognize that humans are XML processors as well, as worthy of direct access to information as the tools they sometimes use.


Acknowledgments

Thanks to Walter Perry, Rick Jelliffe, Gavin Thomas Nicol, John Cowan, Paul Prescod, and the xml-dev list generally for various sparks. Additional thanks to the xmlhack editors for providing an informal support group for various demented XML adventures.


Bibliography

[DOM] W3C DOM Working Group. Document Object Model. http://www.w3.org/DOM/DOMTR.

[Ents] St.Laurent, Simon. Ents. http://simonstl.com/projects/ents/.

[Gorille] St.Laurent, Simon. Gorille. http://simonstl.com/projects/gorille/.

[Infoset] Cowan, John, and Tobin, Richard. XML Information Set. http://www.w3.org/TR/xml-infoset/.

[Layered] St.Laurent, Simon. Toward A Layered Model for XML. http://simonstl.com/articles/layering/layered.htm.

[MathML] Carlisle, David, et al. Mathematical Markup Language (MathML) 2.0. http://www.w3.org/TR/MathML2/.

[MOE] St.Laurent, Simon. Markup Object Events (MOE). http://simonstl.com/projects/moe/.

[REX] Cameron, Robert. REX: XML Shallow Parsing with Regular Expressions. http://www.cs.sfu.ca/~cameron/REX.html.

[SAX2] Megginson, David, Brownell, David, et al. The Simple API for XML (SAX). http://saxproject.org.

[SOAP] Gudgin, Martin, et al. SOAP Version 1.2 Part 1: Messaging Framework. http://www.w3.org/TR/soap12-part1/.

[XInclude] Marsh, Jonathan, and Orchard, David. XML Inclusions (XInclude) Version 1.0. http://www.w3.org/TR/xinclude/.

[XML 1.0] Bray, Tim, et al. Extensible Markup Language 1.0 (Second Edition). http://www.w3.org/TR/REC-xml.

[XML 1.1] Cowan, John. Extensible Markup Language 1.1. http://www.w3.org/TR/xml11/.

[XMLNS] Bray, Tim, et al. Namespaces in XML. http://www.w3.org/TR/REC-xml-names.

[XPath 1.0] Clark, James, and DeRose, Steven. XML Path Language (XPath) 1.0. http://www.w3.org/TR/xpath.


