Declarative specification of XML document fixup

Henry Swift Thompson

Abstract

Fixing broken XML is a problem interesting in its own right, and fixing broken (X)HTML is a particularly timely topic. This paper introduces a declarative approach to specifying the repair of ill-formed XML, based on the TagSoup work of John Cowan. A prototype implementation, called PYXup, is described, which operates on the PYX output of the scanner phase of TagSoup to fix-up all well-formedness and some structural problems.

Keywords: HTML; XHTML

Henry Swift Thompson

Henry S. Thompson divides his time between the School of Informatics at the University of Edinburgh, where he is Reader in Artificial Intelligence and Cognitive Science, based in the Language Technology Group of the Human Communication Research Centre, and the World Wide Web Consortium (W3C), where he works in the XML Activity.

He received his Ph.D. in Linguistics from the University of California at Berkeley in 1980. His university education was divided between Linguistics and Computer Science, in which he holds an M.Sc. While still at Berkeley he was affiliated with the Natural Language Research Group at the Xerox Palo Alto Research Center, where he participated in the GUS and KRL projects. His research interests have ranged widely, including natural language parsing, speech recognition, machine translation evaluation, modelling human lexical access mechanisms, the fine structure of human-human dialogue, language resource creation and architectures for linguistic annotation. His current research is focussed on the semantics of markup, XML pipelines and more generally articulating and extending the architectures of XML.

He was a member of the SGML Working Group of the World Wide Web Consortium which designed XML, a major contributor to the core concepts of XSLT and W3C XML Schema and is currently a member of the XML Core, XML Schema and XML Processing Model Working Groups of the W3C. He has been elected twice to the W3C TAG (Technical Architecture Group). He was lead editor of the Structures part of the XML Schema W3C Recommendation, for which he co-wrote the first publicly available implementation, XSV. He has presented many papers and tutorials on SGML, DSSSL, XML, XSLT, XML Schema, XML Pipelines and Web Architecture in both industrial and public settings over the last ten years.

Declarative specification of XML document fixup

Henry Swift Thompson [University of Edinburgh, HCRC Language Technology Group]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Henry Swift Thompson. Reproduced with permission.

Introduction

The historical and social dimensions of the current situation and future trajectory of the HTML family of languages are so complex as to defy easy analysis (see for example [TAG07]), but one issue at least stands out: should 'the next HTML' be an XML language, and should it have an explicit schema? One major constituency has specifically rejected the use of any form of schema, and is describing their candidate for 'the next HTML' only using (somewhat stylised) natural language [Hickson07].

The primary motivation for the 'no schema' approach is the perceived necessity of specifying the language in a way which accommodates the brutal reality of current practice: most HTML on the Web today is invalid, but browsers process it pretty consistently none-the-less. The differences between HTML and XHTML are at best poorly understood, and rarely respected. The complex interplay of the relevant technologies and human propensities are judged likely to mean that this situation will persist indefinitely. In [TAG07] the W3C's Technical Architecture Group asks

"Is the indefinite persistence of 'tag soup' HTML consistent with a sound architecture for the Web? If so, (and the going-in assumption is that it is so), what changes, if any, to fundamental Web technologies are necessary to integrate 'tag soup' with SGML-valid HTML and well-formed XML?

. . .

"Can we [treat tag soup] 'as if' it was the serialization of the DOM produced by (some formalization of) common browser error recovery strategies?"

This paper explores one possible route to implementing the 'as if' suggestion, particularly the "some formalization of" parenthetical. There is one outstanding precedent for this, namely John Cowan's TagSoup [Cowan04]. Although to a certain extent TagSoup's behaviour is driven by declarative specifications, there are significant aspects which cannot be affected easily, or at all, without modifying the code. Preliminary exploration of matching the behaviour of TagSoup to the error-correcting behaviour of current browsers ran into difficulties for this reason. The PYXup system described here is an attempt to reconstruct part of TagSoup in a way which exposes more behavioural options to external declarative control.

Architecture and Methodology

PYXup is closely based on [Cowan04], which has two distinct components, both implemented in Java:

  1. A table-driven scanner;
  2. A schema-driven 'rectifier'.

The first component, the scanner, which parses input into a stream of low-level tokens, is already pretty much completely customisable via the external table which controls it. This table is expressed in XML.

The second component, the 'rectifier', which deals with, for example, missing end tags, is controlled in part by an external schema, again expressed in XML, but the degree of customisation this provides for is fairly limited.

Accordingly, PYXup uses the scanner component of [Cowan04] unchanged, but implements its own 'rectifier', operating on the PYX [McGrath00] output of the scanner to produce a fixed-up stream of PYX events.

The fixup component of PYXup is driven by two declarative inputs:

  1. A simplified grammar;
  2. A set of fixup rules.

The simplified grammar is expressed in XML, and can be derived automatically from a W3C XML Schema schema (and thus from other grammar-orientated schema languages such as DTD and Relax NG). It consists entirely of immediate dominance rules, and is interpreted as specifying a very regular push-down automaton. The fixup rules are likewise expressed in XML, and are interpreted as a kind of production system, with patterns expressed in terms of problem type and parsing state and actions expressed in terms of output events and operations on parser state.

For example, the following broken XML document:

<c>
will produce the following PYX stream from the TagSoup scanner:
(c
-\n
Given a schema corresponding to the following DTD:
<!DOCTYPE a [
<!ELEMENT a (b,d)>
<!ELEMENT b (#PCDATA|c)*>
<!ELEMENT c EMPTY>
]>
PYXup will fix up the above stream to produce:
(a
(b
(c
)c
-\n
)b
)a

Schema details

[Cowan04] defines a simple schema language, TSSL [Tag Soup Schema Language], which represents primarily immediate dominance information, with no sequencing, disjunction or cardinality information. PYXup uses a similar approach, with even less detail. A PYXup schema is a set of rules, each of which specifies only

nonTerm

An element name

children

A set of element names

mixed

A boolean

maybeRoot

A boolean

preferredParent

(optional) An element name

A PYXup schema also may specify a preferred parent for otherwise not-allowed text.

A PYXup schema is interpreted as a push-down automaton, in which each rule is corresponds to a finite state automaton, whose name is given by its nonTerm, as follows (for an element named nt):

[Link to open this graphic in a separate page]

The )) transition is to accommodate the PYX event generated for empty tags and the anachronistic </> markup.

Furthermore, for every child a transition from and to state 2 is added, labelled with the name of the corresponding FSA. If the element allows text content, a transation labelled -* is added likewise. For example, for the b element as defined above, that is <!ELEMENT b (#PCDATA|c)*>, the result would be

[Link to open this graphic in a separate page]

Such an automaton will accept the beginning and end PYX events for a b element, with any number of balanced c element begin/end pairs, mixed with any number of text events, in between.

Finally, the collection of automata is transformed to simplify processing and improve efficiency, by as it were promoting the opening transition from each embedded automaton, so that for example the automaton for the a element as defined above, that is <!ELEMENT a (b,d)> as actually used looks like this:

[Link to open this graphic in a separate page]

This change implies that there is must also be a root automaton which dispatches on the open event for all possible document elements to the appropriate named automaton.

The automata act as recognisers and identity transducers, that is, as each event is accepted it is echoed.

Problem taxonomy and fixup details

What can go wrong, and how can it be fixed? At one level, there are only four things that can go wrong:

  1. There is no input, but we're not done;
  2. There is more input, but we are done;
  3. We accept an open event, but there is no corresponding automaton;
  4. The current event cannot be accepted in the current state.

Case (2) is currently not detected by PYXup. This means that strictly speaking PYXup's output is a well-formed XML entity, that is, it may contain one or more elements. This case is discussed further in “Further work”.

Case (4) covers a wide range of well-formedness and basic validity problems. PYXup currently handles the following subcases, as well as (1) and (3), based on the type of event, the current state and the contents of the pushdown stack (examples are based on the 'abcd' doctype above):

upEnd

The event is a close event, and there is an automaton higher up the pushdown stack which could accept it, that is, a matching open even has occurred, as in (a (b )a.

The default fixup is to output a close event for the current automaton (in the example, a )b) and return from it.

badEnd

The event is a close event, and there is no automaton higher up the pushdown stack which could accept it, as in (a )d.

The default fixup is to discard the event.

upChild

The event is an open event, and there is an automaton higher up the pushdown stack which could accept it, that is, an open event for an element which could accept this as a child has occurred, as in (a (b (d.

The default fixup is to output a close event for the current automaton (in the example, a )b) and return from it.

badChild

The event is an open event, there is no automaton higher up the pushdown stack which could accept it, but the event's tag is known to be allowed as the child of at least one other element, as in (a (c.

The default fixup is to postpone the current event and process an open event for the preferred parent element (if specified in the grammar) or an arbitrary allowed parent element (otherwise) instead (in the example, this would be a (b).

badOrphan

The event is an open event but no known element allows it as a child, as in (a (b (a.

The default fixup is to force acceptance, that is, to act as if the current automaton were constructed from an element which did have the relevant element as a child.

upText

The event is a text event, and there is an automaton higher up the pushdown stack which could accept it, that is, an open event for an element which allows text has occurred, as in (a (b (c -text.

The default fixup is to output a close event for the current automaton (in the example, a )c) and return from it.

orphanText

The event is a text event, there is no automaton higher up the pushdown stack which could accept it, but at least one element is known to accept text, as in (a -text.

The default fixup is to postpone the current event and process an open event for the preferred parent element for text (if specified in the grammar) or an arbitrary text-allowing parent element (otherwise) instead (in the example, this would be a (b).

badText

The event is a text event, but there is no automaton higher up the pushdown stack which could accept it, and no text-allowing element in the grammar at all.

The default fixup is to force acceptance, that is, to act as if the current automaton were constructed from an element which did allow text.

overrun

The token stream ends, but the pushdown stack is not empty, as in the case where (a (b were the entire input.

The default fixup is to output a close event for the current automaton (in the example, a )b) and return from it.

unknown

The start tag for a tag with no definition is seen, as in (a (x.

The automatic fixup is to synthesise an automaton for the tag which allows all known tags as children (but not text).

The abstract syntax for the fixup language represents the defaults given above as follows:

upEnd: end("%o") & pop
badEnd: ignore
upChild: end("%o") & pop
badChild: splice("%p") & retry
badOrphan: force
upText: end("%o") & pop
orphanText: splice("%t") & retry
badText: force
overrun: end("%o") & pop
where %o is the current automaton's tag, %p is the tag of the preferred parent for the current event and %t is the tag of the preferred parent for orphan text.

Comparison with TagSoup

PYXup's default fixup as described above comes very close to reconstructing TagSoup's built-in 'rectification'. The major differences I'm aware of are as follows:

  1. PYXup has no support for what TagSoup calls 'restartable' tags. Adding provision for recording restartability information to the schema language and adding a restart action to the fixup language should be possible.
  2. Similarly, PYXup has no support for what TagSoup calls 'uncloseable' tags, that is, form and table. Exactly what the right generalisation is for this class is unclear to me.
  3. PYXup has no support for what TagSoup identifies as 'cdata' tags, that is, script and style.
  4. PYXup currently treats unknown elements as if they allowed any known content, whereas TagSoup provides command-line control of whether such elements are treated that way, or as if they allowed no content. It would be straightforward to add this to PYXup.
  5. TagSoup does some namespace fixup, but its results are not always ideal. PYXup currently doesn't do any kind of namespace fixup, so its output is not guaranteed to be namespace-well-formed.
  6. PYXup currently does not detect or eliminate multiple document elements, as it were, whereas TagSoup guarantees to produce a single document element, by treating the initial start tag specially: all end tags which match the initial start tag are ignored, thereby capturing all the input as children of the original start tag, and requiring a final end tag for it to be synthesised at the end of the token stream. It would be straightforward to add this to PYXup as a command-line option.

At the moment, when a given problem (e.g. non-allowed start tag) may need differing fixups depending on context, PYXup handles this by dispatching on a finite number of different keys which encode the context differences (e.g. upChild, badChild, badOrphan). It might be possible to address the 'restartable', 'unclosable' and 'cdata' issues described in the list above by further articulation of such keys, but it probably makes more sense to revise the rule language to allow for richer patterns which could test for (boolean combinations of) a range of properties of the current event and the events on the pushdown stack.

Further work

From a much more theoretical perspective, there is a body of work in formal language theory about error correction, see e.g. [Fischer77], which deserves to be explored for possible relevance.

For the time being PYXup discards comment, processing instruction and attribute events -- obviously this needs to be fixed! A systematic approach to namespaces is also needed.

The most important next step is to look at the kinds of fixup described in the prose of [Hickson07] to see which ones are covered by the default rules given above, which ones could be covered with new rules in the existing language or the extensions proposed above in section “Comparison with TagSoup, which ones would require other changes to the language and, of course, which ones cannot be accommodated by the PYXup paradigm at all.

Concrete syntax

The XML languages for specifying the schema and the fixup rules are both very simple. Only two elements are involved for the schema:

<idgram
  textParent? = IDREF
  namespace? = anyURI>
    (element)*
</idgram>

<element
  name = ID
  mixed? = true|false
  maybeRoot? = true|false
  parent? = IDREF
  children? = IDREFS/>

The fixup rules require a bit more:

<rules>
    rule*
</rules>

<rule
  match = string>
    (pop|ignore|force|retry|splice|end|error)*
</rule>

<pop/>
<ignore/>
<force/>
<retry/>

<splice>
   string
</splice>

<end>
   string
</end>

<error>
   string
</error>

Availability

The current status of PYXup, including material for download, is available from http://www.ltg.ed.ac.uk/PYXup/.


Acknowledgments

Needless-to-say, I owe a substantial debt of gratitude to John Cowan, not only for the design of TagSoup, but for making its implementation available. He is not, of course, responsible for any of the changes and additions reported here.


Bibliography

[Cowan04] Cowan, J., TagSoup: A SAX parser in Java for nasty, ugly HTML, 2004. Available online at http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf, see also the TagSoup project home page http://home.ccil.org/~cowan/XML/tagsoup/.

[Fischer77] Fischer, C. N., D. R. Milton, S. B. Quiring, "An efficient insertion-only error-corrector for LL(1) parsers", Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 97--103, ACM, Los Angeles, 1977.

[Hickson07] Hickson, I. ed., Web Applications 1.0, 2007. Available online as http://www.whatwg.org/specs/web-apps/current-work/.

[McGrath00] McGrath, S., XML Processing with Python, Prentice Hall, 0-13-021119-2, 2000. See also online http://www.xml.com/pub/2000/03/15/feature/index.html.

[TAG07] W3C Technical Architecture Group, Draft description of new TAG issue TagSoupIntegration-54, W3C, 2007. Available online at http://lists.w3.org/Archives/Public/www-tag/2006Oct/0062.html.



Declarative specification of XML document fixup

Henry Swift Thompson [University of Edinburgh, HCRC Language Technology Group]