General Parsed entities: Unfinished Business

Simon St.Laurent
simonstl@simonstl.com

Abstract

General parsed entities seemed like one of the simplest parts of XML when it first appeared, easy to create and use, and more predictable than their SGML counterparts. In the years since, however, parsed entities have become a malingering reminder that DTDs are not yet dead, continuing to demonstrate periodically that XML hasn't even resolved all of the interoperability issues at the markup level. Entities solved a problem, and then became a problem themselves. The world is still waiting for a solution to the problems that entities create, though many of the solutions proposed (including my own) are either partial or create new problems. Fortunately, years of implementation experience in this area, in both SGML and XML, may provide guidance for ways forward.

Keywords: Entities; Unicode; Namespaces; Validating; Parsing; Interoperability

Simon St.Laurent

Simon St. Laurent is an Editor with O'Reilly and Associates. Prior to that, he'd been a web developer, network administrator, computer book author, and XML troublemaker. He lives in Ithaca, NY. His books include XML: A Primer, XML Elements of Style, and Office 2003 XML, with Evan Lenz and Mary McRae.

Simon St.Laurent [Editor; O'Reilly & Associates]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004 Simon St.Laurent. Reproduced with permission.

Introduction

When I first started using XML, parsed general entities seemed like one of its friendliest aspects. The declarations were simple, and referencing them from documents was easy to do. In explaining XML, parsed general entities were extremely simple things, especially when compared to their unparsed cousins or to the tricks you can do with parameter entities. In all my time in markup, I've never seen anything done with general entities that resembled the brilliant perversity of Modularization in XHTML's use of parameter entities. General entities seemed polite and dull.

While general entities are, for the most part, polite and dull, they fit very poorly with some of the decisions made in XML 1.0 about the distinctions between validating and non-validating parsers. The fact that these entities are declared in Document Type Definitions at a time when DTDs are seen as passé by large sectors of the XML-using public has made it very difficult to integrate entity processing with many kinds of XML pipelines. The DTD-based nature of entity declarations also appears to have led the creators of Namespaces in XML to sidestep the question of what namespaced entities might mean, sparing entities some complexity but locking them further out of the mainstream of XML processing.

The details involved are, unfortunately, critical to understanding both where entities have fallen short and where alternative approaches have found their limitations. Many developers, including the W3C itself, have tried to solve some parts of the problem set addressed by entities, but the problems and the convenience and interoperability issues they create have not yet gone away.

Problems and Solutions: Parsed Entities from SGML to XML 1.0

SGML supported a wider range of entity possibilities than XML, both in the syntax used for referencing entities and in the kinds of entities that could be declared. This reflects both SGML's generally broader range of options and SGML's need to describe more relations directly, as the Web-based mechanisms commonly used in XML had not yet been developed.

SGML entity types

One of the goals of SGML was to create "a non-system specific technique for referring to content located outside the mainstream of the document, such as separately-written chapters, pi characters, photographs, etc." [Goldfarb 1990] In order to accomplish this lofty goal, SGML provided a variety of mechanisms for associating content of various types with SGML documents, some of which integrated the data with the parsed flow of the document and some of which did not. SGML's definition of entity was quite generic: "A collection of characters that can be referenced as a unit," and included SGML documents themselves.

The simplest entity referencing mechanism in SGML looked and worked much like its XML counterpart, though with fewer restrictions. A simple entity declaration might look like:

<!ENTITY NY "New York">

External entities could reference external resources through three different mechanisms. The simplest assumes the processor would understand what was meant simply because of the entity name:

<!ENTITY NY SYSTEM>

A more precise mechanism provided additional information about where the processor should find the contents of the entity, though in a system-specific way:

<!ENTITY NY SYSTEM "ny.text">

The third mechanism also gives directions, though in a more abstract manner, using a public identifier:

<!ENTITY NY PUBLIC "-//States//TEXT New York//EN">

SGML supported a variety of entity variants, including SUBDOC, SDATA, CDATA, and Processing Instruction entities. (NDATA was another option, though it relates to unparsed entities, a slightly different topic than the one at hand.) SUBDOC refers to SGML sub-documents (complete with their own DOCTYPE declarations), while CDATA entities are Character Data Entities, and SDATA entities are Specific Character Data Entities. Processing Instruction entities were specific to the use of processing instructions in documents, though the use of processing instructions generally was controversial.

This variety of options made it easy for developers to integrate SGML processing with existing systems and data. CDATA and SDATA entities allowed easy reuse of information in existing systems without major modification, though using SDATA reliably did require application intelligence about character sets. SUBDOC is a particularly interesting variant, as the contents of a SUBDOC could have their own supporting DTD for validation and entity processing. (Bracketed text, while not exactly an entity type, was a convenience for entering marked-up content.)

While complex, the SGML entity processing model took full advantage of the sophisticated framework SGML's DTD support provided. While some parts of it may feel baroque to developers used to pointing at a URI and saying "that one", it provided flexibility that still hasn't been matched generally by its XML successors, despite periodic attempts at reinvention.

The Transition to XML

XML 1.0 substantially simplified the set of entity options. While XML retained the DTD-based declaration mechanisms of SGML entities and their simplest form of reference, it removed many of the options that had previously been available and constrained the contents of various pieces.

XML's combination of Unicode foundations with declarations for encodings largely removed the need for SDATA. The encoding declaration on external parsed entities supports some of its functionality for mixing data stored in different character sets, though information about the nature of character sets is no longer part of the entity declaration. CDATA entities have also been removed. In practice, they can often be replaced with the use of CDATA sections in internal entities (unless ]]> is involved) but there is no easy equivalent for external entities. SUBDOC entities, Processing Instruction entities, and bracketed text have all disappeared, as has the use of attributes for entities.

XML also regulated the contents and use of entities more tightly. The contents of external entities must be synchronous: all start- or end-tags in the entity must have matching end- or start-tags to form complete elements. Attributes can no longer reference external entities. It also required references to entities to be closed with a delimiter, removing the SGML option of skipping the delimiter if a non-name character followed the reference. [Comparison 1997]
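The surviving internal-entity mechanism and the mandatory closing delimiter can be seen in a minimal sketch, assuming Python's standard xml.etree.ElementTree module (built on the expat parser). The document content and the helper names parse_text and is_well_formed are illustrative only.

```python
import xml.etree.ElementTree as ET

# Internal subset declaring one internal parsed general entity.
DECLS = '<!DOCTYPE doc [<!ENTITY NY "New York">]>'

def parse_text(body):
    """Parse a small document and return the text of its root element."""
    return ET.fromstring(DECLS + body).text

def is_well_formed(body):
    """Report whether the parser accepts the document."""
    try:
        ET.fromstring(DECLS + body)
        return True
    except ET.ParseError:
        return False

# The reference is closed with ";" and is expanded by the parser.
print(parse_text("<doc>Ithaca, &NY;</doc>"))

# The SGML shortcut of omitting the delimiter is rejected in XML.
print(is_well_formed("<doc>Ithaca, &NY </doc>"))
```

Because the declaration lives in the internal subset, this case works in any conforming parser; the trouble described in the following paragraphs begins once declarations move outside the document.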

While XML 1.0 tinkered with the entity mechanisms, however, it broke dramatically new ground with regard to the DTD infrastructure that had historically supported those mechanisms. Section 5, Conformance, made it clear that non-validating parsers were not obligated to read either the external DTD subset or external entities referenced from the internal DTD subset. [Comparison 1997]

While the purpose of this was to encourage the creation of small and lightweight parsers with less I/O and network overhead, it set XML loose from its DTD moorings, quite likely permanently. For entities, the impact has been especially difficult: the processing of external entities became unreliable, varying by parser, as did the use of external sets of entities used for characters.

Complications of "SGML for the Web"

While most of the emphasis in creating "SGML for the Web", also known as XML, was on slimming down SGML to fit more readily into a networked environment of (mostly) interoperable applications, there wasn't much attention given to how the Web's own architecture might affect XML rather than the other way around. It's taken a few years for the implications of Web architecture to become obvious for XML, but many of those implications affect entity processing and related technologies.

David Megginson raised some of these issues at XTech 2000, in a presentation called "When XML Gets Ugly" [Megginson 2000]. Megginson pointed out that the unreliability of web resources - a key factor in making the Web take off as mightily as it had - could easily cause problems for developers who naively assumed that these resources would always be as available as a file on the same system. Various attacks on those resources could also expose vulnerabilities in systems which were otherwise secure. Megginson cited external DTD subsets and external entities as well as XSLT and CSS stylesheets as potential risks.

More recently, some aspects of Web architecture have surfaced as having the potential to disrupt communication while leaving no obvious trace that something has gone wrong. Content negotiation, a feature long built into HTTP, is a key part of the separation between resources and representations in web architecture. While content-negotiation use has been fairly rare until recently, XML-based technologies like [Cocoon] make it easy to generate multiple representations of the same information at the same URI. This can mean, for instance, that a parser requesting a document over HTTP might well get a different document than a browser requesting a document from the same URI. It also means that if a single URI provides text, HTML, and SVG (or English, French, and Cherokee) versions of a document based on content-negotiation, there is no way to specify - based purely on the URI - which one a parser would actually like.

These issues are hardly insoluble, but they do reflect ways in which XML's Web-based environment has different assumptions from the ones which held in the SGML work.

Developments since XML 1.0

XML's relaxation of SGML's commandment that "thou shalt process the DTD" has left entities in an anomalous position. Many processing environments won't collect external DTD or entity information by default, and aren't required to report the omission as an error. (This has become less common than it once was, but remains common in high-efficiency and small-scale systems.) Perhaps even more importantly, the share of developers whose skills include DTDs and entity handling has declined relative to the number of people using XML. Where being a DTD expert was an important skill for SGML developers, it has become a secondary skill for XML developers. Even when skills are available, the difficulty of creating an internal subset (and preserving it through transformations!) is enough to keep many people from taking the trouble.

As DTDs have faded, schemas have risen. Most of those schema languages (with the notable exception of the original XML-Data [XML Data] proposal) have avoided dealing with entities altogether. Most have focused exclusively on describing document structures, not on document content. Similarly, Namespaces in XML [XMLNS] explicitly applied namespaces to XML structures, not to entities.

XInclude

There has been one notable exception to this lack of effort, which sprang from the [XLink] work. XLink, despite some surface similarities (notably the embed value of the xlink:show attribute), was about connecting resources, not about embedding content from one document in another. [XInclude] set out to build a framework aimed explicitly at embedding content using the general foundations laid out by XLink.

Like XLink, XInclude defines an XML vocabulary in its own namespace for addition to other XML vocabularies. Like earlier drafts of XLink, it uses an element, xi:include, to represent an inclusion point. Also like XLink, it assumes a Web-based framework, where resources are identified by Uniform Resource Identifiers (URIs), not environment-specific system identifiers or the different abstraction of formal public identifiers.

XInclude only focuses on one half of the problem set created by the unreliable nature of entity processing in XML 1.0, inclusion of external resources. It doesn't address the issues of internal entities used for special characters or repeated text at all, and would be a rather verbose mechanism for doing so in any case.

Of the SGML inclusion mechanisms, XInclude is most akin to SUBDOC, though it doesn't require validation of the included document. (Indeed, it leaves such choices entirely up to implementations.) Like the documents included by SUBDOC, documents included by XInclude may have their own DOCTYPE declaration, and that might even include entity declarations and processing. XInclude is very specific in that it focuses on Infoset [Infoset] inclusion when an XML document is included in another XML document. XInclude itself specifies nothing about how included documents should be parsed, just that it only wants to handle a parsed result. (The entity processing of an included document still depends on the parser to do it, so once again the unreliability of entity processing may cause problems.)

XInclude also has capabilities much like those of SGML's CDATA entities, supporting a text value for its parse attribute which casts the included content as text, whether or not it contains markup. In this case there is no Infoset processing, except insofar as the text of the referenced document is included as text in the Infoset of the receiving document, replacing the xi:include element.
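The two inclusion modes can be tried with a minimal sketch, assuming Python's standard xml.etree.ElementInclude module, which implements a subset of XInclude (it does not handle xi:fallback or the accept attributes). The in-memory RESOURCES table, its href, and the loader function are hypothetical stand-ins for real retrieval.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "web": the href and its content are made up.
RESOURCES = {
    "ch1.xml": "<chapter><title>One</title></chapter>",
}

def loader(href, parse, encoding=None):
    """Return an Element for parse="xml", or a plain string for parse="text"."""
    data = RESOURCES[href]
    return ET.fromstring(data) if parse == "xml" else data

book = ET.fromstring(
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="ch1.xml"/>'
    '<listing><xi:include href="ch1.xml" parse="text"/></listing>'
    '</book>'
)

# Replace each xi:include element in place with what the loader returns.
ElementInclude.include(book, loader=loader)

chapter, listing = book[0], book[1]
print(chapter.tag)    # the included document arrived as elements
print(listing.text)   # the same document arrived as character data
```

In the first case the receiving tree gains chapter and title elements; in the second the markup of the referenced document survives only as text inside listing, much as it would have with an SGML CDATA entity.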

Moving beyond the capabilities of previous inclusion mechanisms, the most recent draft of XInclude also supports content-negotiation through its accept and accept-language attributes. Acknowledging the unreliability of the Web, XInclude also supports xi:fallback as a mechanism for handling cases where resource representations are not delivered as expected. XInclude thus represents the first XML inclusion mechanism to be thoroughly integrated with the Web.

Character Entities: Missing in Action

While whole-document entities have received a lot of attention, character entities have had little attention paid them. The W3C XML Core Working Group's 2002 statement on the subject [Character Entities] more or less declares the issues with character entities not to be real problems. Despite the rules permitting non-validating parsers to ignore external resources, not to mention the near-total lack of internal subset use by anyone save the upper echelons of XML developers or the unreliability of those resources in a variety of contexts, this document blithely claims that "Placing lists of character entity declarations in separate files, and then referencing them from the internal subset as external parameter entities, is the appropriate way to specify multiple sets of character entities." It goes on to say that "there is absolutely no need to introduce a new mechanism into XML to declare them. The existing mechanism, DTDs, is entirely adequate to the purpose." That's a nice thought, though practically useless given situations like external entities which lack their own DOCTYPE declarations, not to mention that it keeps a now generally obsolete mechanism alive for one purpose only.

Later work [XML Entity Declarations], apparently led by David Carlisle of the MathML working group, goes so far as to collect and list commonly used entity sets, and to provide XSLT 2.0 character maps for them. Character maps, a new feature in [XSLT 2.0], provide support for the serialization of particular characters as entity references, but there is no support in XSLT 2.0 for their use as entity resolvers on import.

Creating Richer Entity Mechanisms for XML

So far, the W3C appears to have addressed one set of issues - document inclusion of self-contained documents - in depth, while leaving the smaller entities, typically used for single characters, adrift. It's one thing to say that a problem doesn't exist, another thing to deal with a problem as it's encountered. This partial problem-solving also leaves XML with a basic problem: its mechanisms for managing content are still neither as robust as SGML's (for character entities), nor as convenient (for document inclusion). There is room for improvement on both fronts, while still maintaining syntactic compatibility within documents with XML 1.0.

The element which is not one

XInclude's designers made a deliberate choice to create a model based exclusively on processing after parsing is complete. Their retreat to the safe haven of the Infoset meant that they could ignore questions about how included documents should be processed, and also made it simpler for them to use a namespace-qualified element as their point of inclusion. There is, however, a large problem with this approach. It requires a stack of processors layered on top of the parser. Documents using XInclude legitimately present two very different Infosets to applications, depending on whether or not the processor included that XInclude layer. It may or may not be a problem to receive a document with XInclude entirely unprocessed, but it's yet another gotcha for developers to watch for.
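The two Infosets are easy to observe. The following is a minimal sketch, again assuming Python's xml.etree.ElementInclude as the optional XInclude layer; the document, the href, and the stub loader are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
SOURCE = (
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="ch1.xml"/>'
    '</book>'
)

def stub_loader(href, parse, encoding=None):
    """Hypothetical loader that always returns an empty chapter element."""
    return ET.fromstring("<chapter/>")

def child_tags(root):
    return [child.tag for child in root]

# Parser alone: the inclusion point is just another element, with no warning.
tags_without_layer = child_tags(ET.fromstring(SOURCE))

# Parser plus XInclude layer: the same bytes yield a different tree.
processed = ET.fromstring(SOURCE)
ElementInclude.include(processed, loader=stub_loader)
tags_with_layer = child_tags(processed)

print(tags_without_layer)
print(tags_with_layer)
```

Neither result is an error as far as the parser is concerned, which is exactly the gotcha: an application has to know to look for the unprocessed element.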

XInclude's decision to take this path seems to be in odd contradiction to other design decisions in the specification, notably those which add robustness to the handling of information retrieval on the Web. Perhaps it can be justified on the grounds that non-validating parsers are also permitted to simply report "an entity named XX was here" when they don't perform entity processing, but there seems to be a large difference in the warning sign presented by something which is marked as unresolved and something which happens to be an element in a particular namespace.

Verbosity and reuse

XInclude can be tremendously convenient as a mechanism for assembling documents from component parts. It tends to become less convenient, however, when the same parts are used repeatedly, whether in the same document or across a library of documents. XInclude's relatively straightforward approach provides a direct path from element to the target for inclusion, but lacks the indirection that makes entities convenient.

While it is possible, to take a perverse example, to use XInclude for character entities, just the XML for accomplishing this is many times less efficient than the entity equivalent, even before retrieval overhead is taken into account. Even in more probable cases, such as using XInclude to incorporate boilerplate, it is hard to argue that xi:include elements with multiple attributes and a URI identifying the target of inclusion are as readable as an entity reference.

XInclude has solved some parts of the entity processing puzzle well, and identified some key issues that need to be handled by any mechanism which purports to improve on XML's entity processing model. At the same time, however, it is hard to prescribe XInclude as a general solution to entity-processing issues.

Getting underneath the problem

For all the concerns about separating content mechanisms from structure descriptions, the SGML model for content inclusions worked very nicely in practice, if not in theory. Unfortunately, XML 1.0 defined the parsing process in such a way that applications had little control over how entity processing was handled. While some APIs provided capabilities like the [SAX2] EntityResolver interface, much XML processing is done in environments that offer no such configurability.
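Where such an interface is available, it looks roughly like the following minimal sketch, which assumes Python's xml.sax binding of SAX2 (its expat driver skips external general entities unless the feature_external_ges feature is switched on). The InMemoryResolver class, the TextCollector handler, and the canned content for ny.text are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

DOC = b"""<!DOCTYPE doc [
<!ENTITY NY SYSTEM "ny.text">
]>
<doc>&NY;</doc>"""

class TextCollector(ContentHandler):
    """Gather all character data reported by the parser."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

class InMemoryResolver(EntityResolver):
    """Hypothetical resolver mapping system identifiers to canned bytes."""
    def __init__(self, table):
        self.table = table

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self.table[systemId]))
        return source

def collect_text(read_external):
    """Parse DOC and return its text, with external entities on or off."""
    parser = xml.sax.make_parser()
    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.setEntityResolver(InMemoryResolver({"ny.text": b"New York"}))
    parser.setFeature(feature_external_ges, read_external)
    parser.parse(io.BytesIO(DOC))
    return "".join(collector.chunks)

print(repr(collect_text(True)))
print(repr(collect_text(False)))
```

The same document yields different text depending on a parser setting, and the second call completes without raising an error, which is the unreliability at issue. The resolver only helps an application that is in a position to install one.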

Even where similar facilities are available, many XML developers are more interested in creating and using definitions for entities than in writing code that links them to particular processors. The lack of a standard vocabulary for defining entities outside of the confines of (the declining field of) DTD processing has meant that many developers and content creators revert to numeric character references for character entities or the use of a few standard sets.

The most convenient mechanism seems to be one where developers can declare content for inclusion and reference it elsewhere. XML's existing entity reference supports the reference end of this perfectly; its only flaw is the unreliability of the content which it references, as well as a cultural shift away from DTDs and toward other mechanisms for defining structures. The easiest route for creating systems which are easy to author and reliable to process would seem to be through creating a system which processes entity references in a more robust way than is currently provided by DTDs.

(There have been occasional suggestions that developers use specialized elements as a replacement for entity references. I have some sympathy for this view, but it breaks down completely as far as entities in attributes are concerned.)

Creating an entity reference processor

Processing entity references in documents is not particularly difficult, provided that the material to which the entities refer is readily available. (Parameter entities used in general entity declaration contents can cause difficulties for applications which want to focus exclusively on general entities.) General entity processing is more or less a search-and-replace mechanism. The only challenge, and a minor one, is remembering that this search-and-replace has to be performed on content being added to a document as well as to the document itself. Too many layers of such recursive processing can make error reporting a challenge, though hardly an impossible one.

A simple substitution approach has already been deployed for character entities, in the [Ents] project. Ents takes an XML-based list of entity names and values and performs a search-and-replace on entity references in a given document. It replaces the entity references for which it has a corresponding entry with the appropriate value, and leaves entities for which it has no entry in place. Ents in its current form supports only character entities, so the process can be easily made to go both directions. Tweaking Ents to use other XML-based formats, like XSLT 2.0 character maps, should be relatively simple.
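The substitution described above can be approximated in a few lines. This is a sketch of the general technique rather than the Ents code itself: the function names expand and collapse are invented here, the regular expression only approximates XML's name rules, and the naive scan makes no attempt to skip CDATA sections or comments.

```python
import re

# Approximation of an entity reference: "&", a name, ";".
# Numeric character references such as &#160; do not match and are untouched.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")

# XML's built-in entities are always left for the parser.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

def expand(text, table):
    """Replace known entity references with their characters."""
    def swap(match):
        name = match.group(1)
        if name in PREDEFINED or name not in table:
            return match.group(0)   # no entry: leave the reference in place
        return table[name]
    return ENTITY_REF.sub(swap, text)

def collapse(text, table):
    """Go the other direction: replace known characters with references."""
    reverse = {char: "&" + name + ";" for name, char in table.items()}
    return "".join(reverse.get(ch, ch) for ch in text)

table = {"nbsp": "\u00a0", "cent": "\u00a2"}
source = "5&cent;&nbsp;&amp;&nbsp;&unknown;"
expanded = expand(source, table)
print(collapse(expanded, table) == source)
```

The round trip is only this clean because each entry maps one name to one character, which is why the reverse direction stops being attractive once entities hold longer text.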

Ents is most commonly run as a pre-processor, processing XML documents before they reach the XML parser, though it can also be used as the source for a SAX2 entity resolver. The Ents approach makes it easy, for instance, to work with chapters in DTD-based DocBook work, themselves external parsed entities and therefore forbidden (thanks to XML's lack of SUBDOC entities!) to carry their own DOCTYPE declaration. As XML processing applications go, its implementation logic is trivial and easily recreated in a variety of environments.

In its current form, however, Ents is as guilty of solving half a problem as is XInclude. Its narrow focus on character entities means that it provides no support for the richer possibilities that XInclude supports. There does seem to be a path forward, however: extend Ents-style entity reference processing by allowing it to support XInclude-like external resource inclusion, complete with the metadata and fallback information XInclude provides. It may or may not make sense to provide additional information about how this data should interact with validation processing.

There are several paths forward for doing this work, though their complexity grows the more completely one attempts to solve the problem. While it isn't difficult to build a replacement for DTD entity processing, provided that parsers provide applications access to do so, it is very difficult to use an XML format to store entities which themselves contain markup. To demonstrate, the next few sections create a vocabulary which can be used to describe entities for use in document processing, moving from relatively simple to extremely difficult.

The Easiest Part: Character Entities

Defining a vocabulary which describes the relationship between entity names and character references is very easy. The single-character nature of the references also makes it simple to convert character references (or even characters) back into character entities if desired. One format for doing this, used by the Ents processor, looks like:

<rules xmlns="http://simonstl.com/ns/refsEnts/">
<description href="http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities">Rules from Modularization of XHTML</description>

<definitions href="http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_module_XHTML_Latin_1_Character_Entities">
<ent name="nbsp" char="#160">no-break space = non-breaking space, U+00A0 ISOnum</ent>
<ent name="iexcl" char="#161">inverted exclamation mark, U+00A1 ISOnum</ent>
<ent name="cent" char="#162">cent sign, U+00A2 ISOnum</ent>
</definitions>
</rules>

This format is kind enough to provide descriptive content for the references, but the action takes place in the name and char attributes of the ent elements. They define equivalence between entities and characters, and the equivalence is simple enough that replacing one with the other is a simple search-and-replace operation: everything between an & and a semicolon is checked for one and replaced with the other as appropriate.

It is important to note that creating entities named lt, gt, amp, apos, or quot is prohibited. The processor leaves handling of those entities exclusively to the parser.
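Reading such a rules file into a lookup table might look like the following minimal sketch. The function name load_char_rules is invented for illustration, the sample is a trimmed copy of the rules shown above, and accepting a hexadecimal form such as "#xA0" in the char attribute is an assumption (the examples only show decimal values).

```python
import xml.etree.ElementTree as ET

NS = "{http://simonstl.com/ns/refsEnts/}"
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

RULES = """<rules xmlns="http://simonstl.com/ns/refsEnts/">
<definitions>
<ent name="nbsp" char="#160">no-break space</ent>
<ent name="iexcl" char="#161">inverted exclamation mark</ent>
<ent name="cent" char="#162">cent sign</ent>
</definitions>
</rules>"""

def load_char_rules(xml_text):
    """Build a name -> character table from ent elements."""
    table = {}
    for ent in ET.fromstring(xml_text).iter(NS + "ent"):
        name = ent.get("name")
        if name in PREDEFINED:
            # These five are reserved for the parser's own handling.
            raise ValueError("entity name is reserved: " + name)
        code = ent.get("char").lstrip("#")
        # Decimal as in the examples; hexadecimal "x..." is an assumption.
        if code.lower().startswith("x"):
            table[name] = chr(int(code[1:], 16))
        else:
            table[name] = chr(int(code))
    return table

table = load_char_rules(RULES)
print(sorted(table))
```

The descriptive text inside each ent element is ignored here; only the two attributes take part in replacement.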

Another Easy Part: Pure Text Entities

The character entity replacement described above can be easily extended to support multiple characters, though the processing needs to change slightly and it may be less attractive to re-convert from complete text to a version containing entities. The rules for such a system, building on the previous example, might look like:

<rules xmlns="http://simonstl.com/ns/refsEnts/">
<definitions>
<ent name="GLW">Corning</ent>
<ent name="IBM">IBM</ent>
<ent name="mySlightlyDullPara">This is an entire paragraph of slightly dull text
which will be referenced by an entity named "mySlightlyDullPara" if and
when I get around to it.  I hope you've enjoyed this brief and slightly
dull paragraph.</ent>
</definitions>
</rules>

So far, so good - these kinds of entity replacements can be easily defined, and everything's easy.

Getting Harder: External Entities

Referencing content outside of the document specifying rules introduces a number of complications. Some of those complications, having to do with URIs and content negotiation, can be sidestepped easily with some extra information, in the same manner as recent drafts of XInclude. Unfortunately, the XInclude approach, which is designed to add Infosets to other Infosets, doesn't work as well in a context where a processor is actually replacing entities before the parse - before the Infoset is created. While perhaps not as efficient, it still works.

For convenience, the demonstration application uses the same set of attributes with the same meanings as does XInclude. (It does not presently support the xpointer attribute or the fallback mechanism, but there is no technical reason preventing this.) This approach results in declarations which look like:

<rules xmlns="http://simonstl.com/ns/refsEnts/">
<definitions>
<ent name="chapter1" 
    href="http://simonstl.com/lovelyBook/ch1.xml" 
    accept="application/xml"
    accept-language="en-US"
    parse="xml"
    />
<ent name="chapter2" 
    href="http://simonstl.com/lovelyBook/ch2.xml" 
    accept="application/xml"
    accept-language="en-US"
    parse="xml"
    />
</definitions>
</rules>

These entities can function like ordinary XIncludes. When the processor encounters an entity reference of the form &chapter1; it will retrieve the document requested, parse it, and add the parsed version of the document to the original document. There is a conversion from text to Infoset and back to text again, but that is unavoidable if using existing entity syntax is desired. If the cost of that overhead (or the compatibility problem of requiring the preprocessor itself) is unacceptable, it may make more sense to use XInclude and XInclude post-processing.

There are two larger challenges here as well, however. The first arises if, using assumptions derived from ordinary XML 1.0 DTD processing, the included document expects its entity declarations to have been made for it in the parent document. In that (hardly unusual) case, the processor needs to perform entity processing on the document to be included. The rules for such a thing may either be the set currently in use or a different set. (The latter approach adds great flexibility but is not supported by DTDs.) Rules for using documents which themselves need entity processing performed on them might look like:

<rules xmlns="http://simonstl.com/ns/refsEnts/">
<definitions>
<ent name="chapter1" 
    href="http://simonstl.com/lovelyBook/ch1.xml" 
    accept="application/xml"
    accept-language="en-US"
    parse="xml"
    entRules="."
    />
<ent name="chapter2" 
    href="http://simonstl.com/lovelyBook/ch2.xml" 
    accept="application/xml"
    accept-language="en-US"
    parse="xml"
    entRules="http://simonstl.com/lovelyBook/entities.xml"
    />
</definitions>
</rules>

If entRules is not specified, the processor does no entity expansion on the document to be included. If entRules is specified and the value is ".", then it will perform entity expansion using the rules in the current document. (As in XML 1.0, circular references are prohibited.) If entRules is specified and the value is a URI, the processor will load the rules identified by that URI and apply them to entity expansion for the document being loaded.
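The three entRules cases can be sketched as a small recursive function. This is an illustration of the selection logic only, under several assumptions: rules are represented as plain dictionaries rather than parsed from the XML vocabulary, the fetch and load_rules callables are hypothetical hooks supplied by the caller, the "text" key for inline entities is invented for the example, and the parse-and-reserialize step for parse="xml" is omitted.

```python
import re

ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")

def expand(text, rules, fetch, load_rules, active=frozenset()):
    """Expand entity references in text before it reaches the parser.

    rules      -- dict: entity name -> {"href": ..., "entRules": ...} or {"text": ...}
    fetch      -- callable: href -> document text (hypothetical hook)
    load_rules -- callable: URI -> rules dict (hypothetical hook)
    active     -- hrefs currently being expanded, for circularity checks
    """
    def swap(match):
        name = match.group(1)
        rule = rules.get(name)
        if rule is None:
            return match.group(0)              # unknown: leave for the parser
        if "text" in rule:
            return rule["text"]                # inline entity, used as-is
        href = rule["href"]
        if href in active:
            raise ValueError("circular reference to " + href)
        body = fetch(href)
        ent_rules = rule.get("entRules")
        if ent_rules is None:
            return body                        # no expansion inside the inclusion
        # "." means the rules already in use; anything else is a URI to load.
        nested = rules if ent_rules == "." else load_rules(ent_rules)
        return expand(body, nested, fetch, load_rules, active | {href})
    return ENTITY_REF.sub(swap, text)

# In-memory stand-ins for retrieval; the chapter contents are made up.
BASE = "http://simonstl.com/lovelyBook/"
DOCS = {
    BASE + "ch1.xml": "<chapter>&pub;</chapter>",
    BASE + "ch2.xml": "<chapter>&pub;</chapter>",
}
OTHER_RULES = {BASE + "entities.xml": {"pub": {"text": "Someone Else"}}}
RULES = {
    "pub": {"text": "O'Reilly"},
    "chapter1": {"href": BASE + "ch1.xml", "entRules": "."},
    "chapter2": {"href": BASE + "ch2.xml", "entRules": BASE + "entities.xml"},
    "chapter3": {"href": BASE + "ch1.xml"},
}

result = expand("<book>&chapter1;&chapter2;&chapter3;</book>",
                RULES, DOCS.__getitem__, OTHER_RULES.__getitem__)
print(result)
```

The third chapter comes back with its reference untouched, which is what leaving entRules unspecified asks for, and which a downstream parser will reject unless something else declares the entity.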

(I still need to consider the possibilities for accept and accept-language in the case of rules files. It seems likely that they will need to be implemented as well.)

In the final inclusion case, it is possible that a document author will want to include documents as text, whether or not they happen to include XML markup or characters which resemble XML markup. XInclude can do this relatively easily because it addresses the Infoset, and doesn't have to consider further parsing of its content. In the entity-replacement context, this is somewhat more complicated. The easy answer - wrapping it in a CDATA section - fails because CDATA sections can't be nested. Fortunately, there is a simpler answer: replacing all markup characters in the document with XML's built-in entities for markup characters. A rule for this kind of inclusion might look like:

<rules xmlns="http://simonstl.com/ns/refsEnts/">
<definitions>
<ent name="sampleDoc" 
    href="http://simonstl.com/lovelyBook/sampleDoc.xml" 
    accept="application/xml"
    accept-language="en-US"
    parse="text"
    />
</definitions>
</rules>

In some ways, this processing is the reverse of what the rest of the processor is doing, but it works well at making a particular kind of inclusion operate smoothly.
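The difference between the two answers shows up as soon as the included text happens to contain ]]>. A minimal sketch, assuming Python's standard xml.sax.saxutils.escape for the replacement step; the sample text and the listing wrapper element are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text to include verbatim. It resembles markup, is not well-formed,
# and contains the one sequence a CDATA section cannot hold.
sample = '<sample note="a &amp; b">1 < 2 ]]> done</sample>'

def include_as_text(raw):
    """Replace &, < and > with XML's built-in entity references."""
    return escape(raw)

def parses(document):
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Wrapping in a CDATA section: the section ends early at the embedded "]]>".
cdata_version = "<listing><![CDATA[" + sample + "]]></listing>"

# Escaping markup characters: the parser hands back the original text.
escaped_version = "<listing>" + include_as_text(sample) + "</listing>"

print(parses(cdata_version))
print(ET.fromstring(escaped_version).text == sample)
```

Escaping is applied to the ampersand first, so references already present in the included text (such as the &amp; in the sample) come back out of the parse exactly as they went in.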

The Hard Part: Internal Entities with Markup

Because DTDs don't themselves use the same syntax as documents, they can freely include markup within their declarations. There are limits - DOCTYPE declarations can't be included in an internal entity, for instance - but it is simple to create entities containing things like:

<givenName>Irwin</givenName><nickName>Irwy</nickName>

Creating an entity like that in a context which is itself an XML document creates confusion:

<rules xmlns="http://simonstl.com/ns/refsEnts/">
<definitions>
<ent name="ir"><givenName>Irwin</givenName><nickName>Irwy</nickName></ent>
</definitions>
</rules>

There is little differentiation here between the elements used to define the entity structures and the elements contained in those structures. A processor could keep track of which was which, of course, and convert them back into text for inclusion after the parse. (This gets more complicated as more components are added to the included section.) The lack of separation between the two kinds of elements also creates potential for namespace confusion, as what had seemed like an inclusion has now fallen into the namespace used by the entity rules vocabulary itself. That is fixable with an additional declaration on the included content - and provides another use case for Namespaces in XML 1.1 [XMLNS] if the default namespace needs to be undeclared to hold content in no namespace.

Those problems are solvable, but there's a much larger problem lurking. In classic DTD style, entities may themselves use entities, and such processing needs to be performed once a complete list of entities is compiled, not during a single pass of parsing. An XML parser will rightly choke on code like:

<rules xmlns="http://simonstl.com/ns/refsEnts/">
<definitions>
<ent name="ir" entRules=".">
<givenName>Irwin</givenName><nickName>&irwy;</nickName>
</ent>
<ent name="irwy">Irwin doesn't like to use a nickname.</ent>
</definitions>
</rules>

Even this simple situation is complicated enough, but it's also possible that people will want to go beyond the capabilities of DTDs and include complete documents in the file rather than reference them externally. Fixing the simple case is difficult, but a fix also happens to open up that larger possibility.
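
The requirement that expansion happen only after the complete list of entities is compiled can be sketched as two passes. This is an illustrative Python sketch rather than the Ents code: the expand function, the regular expression used to find references, and the dictionary of definitions are assumptions, and the check for circular references simply follows the XML 1.0 prohibition.

```python
import re
from typing import Dict, Tuple

# Named references such as &irwy;. Numeric character references do not match.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")
BUILT_IN = {"lt", "gt", "amp", "apos", "quot"}


def expand(text: str, definitions: Dict[str, str], active: Tuple[str, ...] = ()) -> str:
    """Expand entity references in text, recursing into the definitions."""

    def replace(match):
        name = match.group(1)
        if name in BUILT_IN:
            return match.group(0)  # leave XML's built-in entities for the parser
        if name in active:
            chain = " -> ".join(active + (name,))
            raise ValueError("circular entity reference: " + chain)
        if name not in definitions:
            raise KeyError("undefined entity: " + name)
        return expand(definitions[name], definitions, active + (name,))

    return ENTITY_REF.sub(replace, text)


# Pass 1: every definition is collected before any expansion happens,
# so "ir" may refer to "irwy" even though "irwy" is declared later.
definitions = {
    "ir": "<givenName>Irwin</givenName><nickName>&irwy;</nickName>",
    "irwy": "Irwin doesn't like to use a nickname.",
}

# Pass 2: expand references against the complete table.
print(expand("&ir;", definitions))
```

Tracking the chain of entities currently being expanded, rather than a global set of visited names, lets the same entity be used more than once in a document while still rejecting a definition that refers back to itself.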

I considered several possible answers to this problem. The first, simply banning it, seemed draconian. The second, changing the delimiter rules, creates its own problems. Whether the delimiters change for the container or for the contained content, there's likely to be a collision with content that matters to someone. Every key available on an ordinary keyboard in any country, or through combinations, is likely already in use. Control characters are not a user-friendly option. Occasional discussion of marking text up with colors sounded very attractive, but unfortunately coloration is not yet a standard feature of text files.

In the end, the conclusion I reached was to use standard delimiters (< and >), but double them. This chokes XML processors - they were going to be choked anyway - but it distinguishes the container from the contained and doesn't require complicated input or stealing another character from common use. It may also prove a reasonable solution for other situations where documents need to contain XML documents of various kinds - even XML fragments that may not be well-formed. In any event, the resulting document looks like:

<<rules xmlns="http://simonstl.com/ns/refsEnts/">>
<<definitions>>
<<ent name="ir" entRules=".">>
<givenName>Irwin</givenName><nickName>&irwy;</nickName>
<</ent>>
<<ent name="irwy">>Irwin doesn't like to use a nickname.<</ent>>
<</definitions>>
<</rules>>

Modifying a parser to use double delimiters is not difficult, and the Ripper parser, part of the Ents package, can cope with the change transparently. Elements and attributes still behave as expected, namespaces are untouched, and the contained XML requires no modification at all. As a container format, the only thing it modifies is the containing elements themselves. This approach can be reserved for use only when necessary, as the Ents processor now accepts either regular XML or the DDXML (Double-Delimited XML) form. DDXML doesn't itself support entities, but additional rules files can be linked through element-based inclusion.
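
To give a feel for how little is needed to separate container from content, here is a rough Python sketch that pulls entity definitions out of a double-delimited rules file. It is not the Ripper parser and makes no claim about how Ripper works: the function name read_ddxml_definitions and the regular expressions are assumptions, and the sketch handles only the paired <<ent ...>>...<</ent>> form, with attribute values that contain no ">" character.

```python
import re
from typing import Dict

# Matches <<ent ...>> raw content <</ent>>. Single-delimited markup inside
# the content is ordinary text as far as this pattern is concerned.
DD_ENT = re.compile(
    r"<<ent\b(?P<attrs>[^>]*)>>(?P<content>.*?)<</ent>>",
    re.DOTALL,
)
NAME_ATTR = re.compile(r'\bname="([^"]*)"')


def read_ddxml_definitions(source: str) -> Dict[str, str]:
    """Collect entity name -> raw replacement text from a DDXML rules file."""
    definitions = {}
    for match in DD_ENT.finditer(source):
        name = NAME_ATTR.search(match.group("attrs"))
        if name is None:
            continue  # this sketch skips ent elements without a name
        definitions[name.group(1)] = match.group("content").strip()
    return definitions


source = """<<rules xmlns="http://simonstl.com/ns/refsEnts/">>
<<definitions>>
<<ent name="ir" entRules=".">>
<givenName>Irwin</givenName><nickName>&irwy;</nickName>
<</ent>>
<<ent name="irwy">>Irwin doesn't like to use a nickname.<</ent>>
<</definitions>>
<</rules>>"""

for name, text in read_ddxml_definitions(source).items():
    print(name, "=", text)
```

The contained markup and the reference to &irwy; pass through untouched, which is the point of the doubled delimiters: the container never has to interpret its contents.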

At this point, the processor supports every vaguely reasonable possibility for inclusions into an XML document, and can do so using a single, reasonably consistent processing model. A mechanism for unparsed entities might eventually be interesting, if likely academic, and I'm still working on utilities that convert DTDs to this format. I don't expect this solution to take the world by storm, but it does at least demonstrate what's possible.

Continuing issues in entity processing

Even if an approach like the one described above works well for controlled situations, it has limits as a general solution for entity processing. While those limits may fairly be described as political, they are nonetheless substantial.

The simplest way to dismiss a solution like this one is to suggest that it's just another layer in a technical world already overburdened with layers. Not only that, but it is a layer inserted under what hundreds of PowerPoint slides and other presentations have asserted was a stable foundation.

Even if an extra layer of processing can be considered, the problem remains that XML processing is largely considered a completed system. XML 1.0 (and 1.1) are widely deployed and it seems unlikely that organizations already using them will be particularly excited about reopening those foundations to solve a problem many of them don't realize that they have.

These two issues are partially addressable by this approach, which by itself requires no syntactic change to XML 1.0 or 1.1. Entity references remain entity references, and this kind of pre-processing should only increase the reliability of the XML parsing that follows its work. Systems which don't need this kind of processing are plainly under no obligation to use it, and systems which might need it occasionally could use mechanisms like entity resolvers to integrate it with minimal disruption to existing code bases.
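
The pre-processing arrangement described above can be illustrated with an unmodified parser from the Python standard library. This is a sketch under simplifying assumptions: the preprocess function and the flat replacements table are hypothetical, references are found with a regular expression, and no attempt is made to skip comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict

ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")
BUILT_IN = {"lt", "gt", "amp", "apos", "quot"}


def preprocess(document: str, replacements: Dict[str, str]) -> str:
    """Replace known entity references with text; leave everything else alone."""

    def replace(match):
        name = match.group(1)
        if name in BUILT_IN or name not in replacements:
            return match.group(0)
        return replacements[name]

    return ENTITY_REF.sub(replace, document)


document = "<author>&ir;</author>"
replacements = {"ir": "<givenName>Irwin</givenName><nickName>Irwy</nickName>"}

# After pre-processing, the document parses without any DTD.
root = ET.fromstring(preprocess(document, replacements))
print([child.tag for child in root])
# prints: ['givenName', 'nickName']
```

The parser itself never sees an entity reference it cannot resolve, so existing XML 1.0 code downstream of the pre-processing step needs no changes.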

There is a larger problem, however, that seems worth consideration. While these systems are designed to facilitate both the sharing of entity processing and the integration of such processing with the Web, they do very little by themselves to address the management of these kinds of resources. It may well be that a broader approach, perhaps something like the work Eliot Kimber has done on XIndirect [XIndirect], is necessary to make this kind of system more flexible. Integration with resource libraries is also something worth considering, though that needs to be done in a portable form.

That last issue is perhaps the most important. While it's relatively simple to create these systems, and years of experience have illustrated the issues involved, making this work in a genuinely portable way is going to require both the rush of initial implementation and the long slog toward making it ordinary.


Acknowledgments

Thanks to Walter Perry, Wendell Piez, Rick Jelliffe, John Cowan, and the xml-dev list generally for various sparks. Additional thanks to the xmlhack editors for providing an informal support group for various demented XML adventures.


Bibliography

[Character Entities] Character Entities: An XML Core WG View http://www.w3.org/XML/Core/2002/10/charents-20021023

[Cocoon] Cocoon http://cocoon.apache.org/

[Comparison 1997] Comparison of SGML and XML http://www.w3.org/TR/NOTE-sgml-xml-971215

[DITA] Darwin Information Typing Architecture (DITA) http://www.oasis-open.org/committees/dita

[DOM] W3C DOM Working Group. Document Object Model. http://www.w3.org/DOM/DOMTR.

[Ents] St.Laurent, Simon. Ents. http://simonstl.com/projects/ents/.

[Goldfarb 1990] Goldfarb, Charles. The SGML Handbook. Oxford University Press: 1990

[Infoset] Cowan, John, and Tobin, Richard. XML Information Set. http://www.w3.org/TR/xml-infoset/.

[Lease 2000] External entities and alternatives. http://www.infoloom.com/gcaconfs/WEB/paris2000/S14-02.HTM

[MathML] Carlisle, David, et al. Mathematical Markup Language (MathML) 2.0. http://www.w3.org/TR/MathML2/.

[Megginson 2000] When XML Gets Ugly http://www.xml.com/pub/a/2000/02/xtech/megginson.html

[SAX2] The Simple API for XML, 2.0 http://saxproject.org

[XInclude] Marsh, Jonathan, and Orchard, David. XML Inclusions (XInclude) Version 1.0. http://www.w3.org/TR/xinclude/.

[XIndirect] Kimber, W. Eliot. XIndirect: Indirect addressing for XML ../../2003/Kimber01/EML2003Kimber01-toc.html

[XLink] DeRose, Steven, Maler, Eve, and Orchard, David. XML Linking Language (XLink) 1.0. http://www.w3.org/TR/xlink.

[XML 1.0] Bray, Tim, et al. Extensible Markup Language 1.0 (Third Edition). http://www.w3.org/TR/REC-xml.

[XML 1.1] Cowan, John. Extensible Markup Language 1.1. http://www.w3.org/TR/xml11/.

[XML Data] XML-Data http://www.w3.org/TR/1998/NOTE-XML-data-0105/

[XML Entity Declarations] XML Entity Declarations for Characters http://www.w3.org/2003/entities/

[XMLNS] Bray, Tim, et al. Namespaces in XML. http://www.w3.org/TR/REC-xml-names.

[XMLNS] Bray, Tim, et al. Namespaces in XML 1.1. http://www.w3.org/TR/xml-names11.

[XPath 1.0] Clark, James, and DeRose, Steven. XML Path Language (XPath) 1.0. http://www.w3.org/TR/xpath.

[XSLT 2.0] XSL Transformations (XSLT) Version 2.0 http://www.w3.org/TR/2003/WD-xslt20-20031112/#character-maps


