Rabbit/duck grammars: a validation method for overlapping structures

C. M. Sperberg-McQueen

Abstract

Recent work on markup has emphasized the importance of overlapping structures, but little progress has been made toward validation of such structures: the state of the art in validation of overlap remains where it was in 1986 with the CONCUR feature of SGML.

Like CONCUR, rabbit/duck grammars manage validation by means of several context-free grammars, but rabbit/duck grammars offer some improvements over CONCUR: within a document grammar written for one hierarchical subset of the markup vocabulary, the start- and end-tags of other hierarchies may be made visible; rabbit/duck grammars thus make it possible to constrain the interactions of different hierarchies in ways not possible with CONCUR. Rabbit/duck grammars also provide a more principled account of self-overlap (i.e. overlap of two elements with the same generic identifier) than CONCUR.

Documents can be validated against rabbit/duck grammars either by validating individually against each grammar or by a single pass through the document which validates against all grammars simultaneously. An implementation of this single-pass approach using Brzozowski derivatives is sketched out.

Keywords: Validating; Modeling; Concurrent Markup/Overlap

C. M. Sperberg-McQueen

C.M. Sperberg-McQueen is a member of the technical staff at the World Wide Web Consortium; he serves on the W3C XML Schema Working Group and chairs the XML Coordination Group.

Rabbit/duck grammars: a validation method for overlapping structures

C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and Artificial Intelligence Laboratory]

Extreme Markup Languages 2006® (Montréal, Québec)

Copyright © 2006 C. M. Sperberg-McQueen. Reproduced with permission.

This document introduces rabbit/duck grammars, a method of validating overlapping structures.1

Introduction

Overlap

One of the first things taught to students in courses on SGML or XML is that elements nest within other elements, thus forming a hierarchical structure or tree. One of the first questions asked by alert students is: but what about non-hierarchical structures in the information? What about overlapping structures?

It is thus no surprise that the development and spread of SGML and XML have been accompanied from the beginning by a vigorous discussion of the limits of tree structures and the best ways of dealing with non-hierarchical information; see, for example, [Barnard et al. 1988], [Renear et al. 1993], [Barnard et al. 1995], [Huitfeldt 1995], [Murata 1995], [Durand et al. 1996]. Out-of-line or standoff markup [Dybkjær et al. 1998], fragmentation of elements, milestone elements, virtual elements (e.g. [Barnard et al. 1988], [ACH/ACL/ALLC 1994], [Barnard et al. 1995]), concurrent structures [ISO 1986] [Sperberg-McQueen / Huitfeldt 1999], MECS [Huitfeldt 1999], parallel encoding (e.g. [Witt 2004], [Hilbert et al. 2005], [Dekhtyar / Iacob 2005]), bottom-up virtual hierarchies [Durusau / O'Donnell 2002a], layered markup and annotation (LMNL) [Tennison / Piez 2002], range algebra [Nicol 2002], just-in-time trees [Durusau / O'Donnell 2002b], multi-colored trees [Jagadish et al. 2004], tables [Durusau / O'Donnell 2004], TexMecs [Huitfeldt / Sperberg-McQueen 2001], Goddag structures [Sperberg-McQueen / Huitfeldt 2000], Trojan horses [DeRose 2004] — the world has not been starved of proposals for markup systems that handle overlap more or less gracefully (for some sense of the word handle). The discussion has been vigorous both with respect to the serial form to be assumed by the data and with respect to the abstractions or data structures around which access to and operations on the data are to be organized.

On the topic of validation, however, the discussion has been remarkably quiet. The state of the art appears to be today just what it was twenty years ago, when ISO 8879 defined the validity conditions for markup using the SGML feature CONCUR. Concurrent markup defines multiple trees over the same frontier2, each obeying its own DTD; ISO 8879 does not go into details on the subject, but the document may conveniently be regarded as valid if and only if it is valid against each of its DTDs.

In general, discussions of overlap say nothing at all about validation, but those mechanisms (e.g. multi-colored trees, XCONCUR) which like CONCUR treat overlap as the interplay of multiple trees (possibly over the same frontier, as in CONCUR, and possibly over different frontiers, as in multi-colored trees) can in general be validated by validating each of their trees in isolation against an appropriate document grammar.

Some problems with tree-by-tree validation

Tree-by-tree validation provides no mechanism for constraining the interactions of the trees: each tree is validated in majestic and lonely isolation. Three forms of interaction in particular are worth noting.

  • First, the vocabulary designer may wish to require that certain nodes in one tree also be nodes in another tree. In an encoding of verse drama, for example, one will normally wish to stipulate that the act and scene elements of the dramatic hierarchy must in all their instances be identical to instances of the act and scene elements of the verse hierarchy.
  • Second, the designer may wish to restrict the ways in which element boundaries in different trees line up. It is not possible, in any traditional verse drama, for lines of verse to begin or end in the middle of a stage direction or speaker indication, but if the element for verse lines and the elements for stage directions and speaker attributions are assigned to different hierarchies, this constraint is impossible to state. More prosaically, one might wish to perform some simple sanity checks on a data stream with both a logical and a formatted view of the document, such as: there should be a page break immediately before each top-level division; there should be no page breaks in the middle of any heading of any kind; etc.
  • Third, it may be desirable to make some elements in the document effectively invisible to some trees. A hierarchical view of a text in terms of pages, blocks, and other formatting objects has relatively little use for a metadata header full of non-printing information relevant for controlling document work flows. The formatted view can be defined to include such information and assign it the property of being invisible, but the formatted-view grammar is less cluttered if the metadata can be invisible to the grammar as well as being invisible on the page. Similarly, users attempting to write a document grammar to describe verse drama are sometimes unpleasantly surprised to discover that they cannot simply ignore the stage directions and speaker attributions, which play a crucial role in the dramatic hierarchy but which seem to have no business in the verse hierarchy. When using CONCUR, however, if stage directions are omitted from the verse hierarchy, their tags are invisible to the verse tree, but their content is visible, so that a line of verse may appear to include extraneous prose material inserted at what look like arbitrary points. The tagging for stage directions must be visible in the verse hierarchy in order that verse-oriented applications can successfully ignore the stage directions.

In the search for a method of validating overlapping structures, several goals may be identified:

  • The method should be powerful enough to do useful work.
  • It need not, however, be universal or Turing-complete. In the interests of being able to reason about the mechanism, Turing completeness should be avoided if possible.
  • It should provide a clear comprehensible method of expressing constraints on overlapping structures. This is not easy to judge at first exposure to any formalism; the ease with which an idea is first grasped is not the same as the clarity given by that idea when it is understood.
  • It should be built around simple core ideas; the algorithms involved should be simple; the notation should be simple. Again, this is not easy to judge without more experience.
  • Ideally, it should be possible to validate both data streams and graph data structures representing the overlapping structures of interest. DTDs and most XML schema languages have this property: they can be used to check both an XML document in serialized form and a tree derived from such an XML document.

The method outlined in this paper for writing document grammars and validating documents against them, called rabbit/duck grammars, is designed to address these issues.

Some related ideas

Some precursors of the idea are worth noting; they may help make it clearer.

The first idea is language intersection. Formal language theory treats languages as sets of strings; the intersection of two languages L₁ and L₂ is just the set of strings which appear both in L₁ and in L₂. Other set operations (e.g. union and complementation) are also possible.

Discussions of formal language theory routinely refer to the important results which establish that regular languages are closed under the operations of union, intersection, and complementation, and that the set of context-free languages is not closed under intersection. An example often cited is this: consider the languages L₁ and L₂, where L₁ is

S → B | a S c
B → ε | b B
and L₂ is
S → a S | B
B → ε | b B c
If we write aⁿ to denote n occurrences of a, then L₁ is the language aⁿb*cⁿ, which has some occurrences of a, then of b and c, with the same number of a as of c. L₂ is a*bⁿcⁿ, which has the same number of b as of c. The intersection is the language aⁿbⁿcⁿ, which has the same number of a, b, and c and is one of the best known of non-context-free languages.

This example can be (and, I think, normally is) taken as an indication that set operations are ‘safe’ when performed on regular languages but unsafe on context-free languages. Safe in the regular case, since intersection cannot produce a result which is not itself a regular language; unsafe in the context-free case, since intersection clearly increases the expressive power of the formalism.

A few years ago, however, Lars Johnsen (a collaborator on the MLCD project) suggested in a workshop that the facts here could be given a more positive interpretation. While it is in general difficult to tell in reasonably bounded time whether a given string is a member of an arbitrary context-sensitive language, the example just given demonstrates that there are at least some context-sensitive languages for which the test can be performed in the same time required to parse context-free languages, just by parsing the string twice against different context-free grammars; if both grammars accept it, the string is a member of the context-sensitive language. Rabbit/duck grammars apply this idea to document grammars: while in general languages like TexMecs require more than context-free power, they can be factored into a finite set (often a small one) of mostly conventional document grammars.3
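To make the point concrete, the following small Python sketch (an illustration added here, not part of the original discussion) decides membership in the non-context-free language aⁿbⁿcⁿ by checking membership in the two context-free languages separately and accepting only if both checks succeed. Simple counting checks stand in for full context-free parsers; the function names are invented for the example.

import re

def in_L1(s: str) -> bool:
    """aⁿ b* cⁿ : a's, then b's, then c's, with as many a's as c's."""
    return bool(re.fullmatch(r'a*b*c*', s)) and s.count('a') == s.count('c')

def in_L2(s: str) -> bool:
    """a* bⁿ cⁿ : a's, then b's, then c's, with as many b's as c's."""
    return bool(re.fullmatch(r'a*b*c*', s)) and s.count('b') == s.count('c')

def in_intersection(s: str) -> bool:
    """aⁿ bⁿ cⁿ : accept only if both context-free checks accept."""
    return in_L1(s) and in_L2(s)

assert in_intersection('aabbcc')
assert not in_intersection('aabcc')   # in L1 but not in L2
assert not in_intersection('abbcc')   # in L2 but not in L1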

A second related idea is the SGML feature CONCUR, which rabbit/duck grammars resemble in some ways. In SGML using CONCUR,4 overlap is handled by grouping elements into trees. Each tag is marked with the identifiers of the trees it belongs to; tags belonging to other trees are ignored.5

Rabbit/duck grammars resemble CONCUR in analysing overlap for the most part as involving distinct trees which share frontiers and some internal structures, and in using a set of several context-free grammars as the basis of validation. They differ from CONCUR in allowing each grammar to refer to tags for elements which are not part of the tree being validated, and in providing for elements which overlap other elements with the same generic identifier.

The closest analogue I am aware of to rabbit/duck grammars, however, is an idea formulated in February 1992 by David Megginson on the Netnews discussion list comp.text.sgml [Megginson 1992].

Megginson's idea was to use end-tag omission, together with the fact that in SGML empty elements look the same as start-tags, to allow the same SGML data stream to be parsed in either of two ways. He provides two DTDs. In one, the elements recto, verso, line, and clause are empty while the elements poem, fit, verse, averse, and bverse have content. In the other DTD, the roles are reversed. Whichever grammar is used for parsing and processing, one hierarchical view is dominant and the other recessive, but Megginson's technique allows information about (the start of) elements in the recessive view to be visible. Whether the tag recto denotes an empty milestone or the beginning of an element with content depends on the grammar; changing grammars performs a sort of Gestalt switch.

Rabbit/duck grammars use a similar sort of Gestalt switch (although the technical details are different), and are named for the well known image used in many discussions of Gestalt perception.6

Figure 1: Rabbit / Duck

A figure which could be the head of a rabbit facing right or of a duck facing left.

Rabbit/duck grammars resemble Megginson grammars in using multiple document grammars and treating the same tag as a start-tag in some and a milestone in others; they differ in some ways made necessary by syntactic differences between the SGML reference concrete syntax and XML. Megginson also did not contemplate restricting the interactions of the two grammars; he used inclusion exceptions to allow milestones for recessive elements to appear anywhere.

Rabbit/duck grammars

The basic idea of rabbit/duck grammars is simple: define a set of distinct document grammars whose intersection describes the language to be accepted. In each grammar, each element type is assigned to one of four classes.7

  • First-class elements (‘normal’ elements) are treated as non-terminals of the grammar, just as in DTDs and conventional schema languages.
  • Second-class elements (‘milestones’) are also visible in the grammar, in a limited way. When second-class elements are encountered, their start- and end-tags are treated for purposes of parsing as if they were empty milestone elements; thus the elements themselves are not visible, but their boundaries are.
  • Start- and end-tags of third-class (‘transparent’) elements are ignored completely (like the tags of other document types in CONCUR). The content of third-class elements, however, is visible.
  • Fourth-class (‘opaque’) elements are ignored completely; neither their tags nor their content is checked against the grammar.
In a complete set of rabbit/duck grammars, each element type in the vocabulary is a first-class element in at least one grammar. A document is valid against a set of rabbit/duck grammars if and only if it is valid against each grammar in the set; the language defined by the set of grammars (viewed as a set of data streams) is thus the intersection of the languages defined by the members of the set.

The remainder of the paper is organized as follows. Section 2 provides a more complete definition of rabbit/duck grammars, defines a simple notation for writing them, and shows a simple example, concentrating on the simple case of languages without self-overlapping elements (i.e. languages in which any two overlapping elements are required to have different generic identifiers). Section 3 extends the discussion to languages with self-overlap. Section 4 addresses implementation issues. Section 5 describes some simple applications of rabbit/duck grammars.

Vocabularies without self-overlap

I first consider rabbit/duck grammars for languages like Mecs [Huitfeldt 1999], which allow elements with different generic identifiers to overlap each other but prohibit overlap between elements with the same generic identifier.

Description

Informally, one can describe the application of rabbit/duck grammars to such vocabularies thus. First, form groups of elements by assigning each element in the vocabulary to one or more groups, such that no two elements in the same group should ever overlap in a valid document instance. For example, if the vocabulary includes the elements doc (document), ch (chapter), sec (section), h (heading, title of chapter or section), p (paragraph), hi (highlighted phrase), vol (volume), page, and tl (typographic line), with the obvious meanings, then vol, page, and tl might go into one group, and the other elements into a second.

Second, define a document grammar for each group. For purposes of this paper, I will use an extension of DTD notation in which content models can contain primitive tokens of the following forms (where g is a generic identifier):

  • #stag( g ) matches a start-tag for g
  • #etag( g ) matches an end-tag for g
  • #tag( g ) matches either a start- or an end-tag for g
These are all used for constraining the locations where the start- and end-tags of second-class elements may appear in valid documents.

In addition, by analogy with XML Schema, I allow occurrence indicators to take the form of integer ranges, so that for any expression E one can write E { n , m }, where n is a non-negative integer and m is either a positive integer or the character “∞”.8 The usual occurrence indicators of DTDs are also used.

Further extensions allow second-, third-, and fourth-class elements to be defined: instead of content models, they have the keywords MILESTONE-TAGS, IGNORE-TAGS, and IGNORE, respectively. In the examples given here, I adopt the convention that any element g mentioned in a primitive token of the form #stag( g ), #etag( g ), or #tag( g ) is a second-class element and need not be declared explicitly. Elements neither declared nor mentioned in a content model will be assumed to be third-class elements.

For our simple illustration, the page-oriented grammar might be

<!GRAMMAR vol [
<!ELEMENT vol  (#stag(doc)?, page+, #etag(doc)?) >
<!ELEMENT page (tl+) >
<!ELEMENT tl   (#PCDATA) >
]>
The content model for vol allows the start-tag for the doc element to appear at the beginning of the vol element and the end-tag at the end, but nowhere else. Other tags from the logical structure markup (ch, sec, h, p, hi) are all ignored, because they are implicitly declared third-class.

The logical-structure grammar might be

<!GRAMMAR doc [
<!ELEMENT doc  (#stag(vol)?, 
               (p, #tag(page)*)+, 
               (ch, #tag(page)+)+, 
               #etag(vol)?) >
<!ELEMENT ch   (#tag(page)*, 
               h, 
               (((p, #tag(page)*)+, 
                 (sec, #tag(page)*)*) 
                | (sec, #tag(page)*)+)) >
<!ELEMENT sec  (#tag(page)*, 
               h, 
               (((p, #tag(page)*)+, 
                 (sec, #tag(page)*)*) 
                | (sec, #tag(page)*)+)) >
<!ELEMENT h    (#PCDATA | hi | #tag(tl))* >
<!ELEMENT p    (#PCDATA | hi | #tag(tl) 
               | #tag(page))* >
<!ELEMENT hi   (#PCDATA | #tag(tl) 
               | #tag(page))* >
]>
Here, the interactions between the trees are more tightly constrained. The content model for doc ensures that the volume can begin and end only at the beginning and end of the document, and that page breaks, but not line breaks, may appear between paragraphs and chapters. The expression (ch, #tag(page)+)+ requires that at least one page tag appear after each ch element, thus enforcing the rule that each chapter begins on a new page.9 The identical declarations for ch and sec forbid page breaks between the chapter heading and the first paragraph, but allow page breaks between paragraphs and sections. The declarations for h and p show different degrees of tolerance for typographic boundaries: in headings, only line breaks are allowed, while in paragraphs, line breaks and page breaks may occur.

Formally, a rabbit/duck grammar can be viewed as a triple (G, D, S), where

  • G is a set of generic identifiers (e.g. QNames)
  • S ∈ G is a start-symbol, the generic identifier of the document element
  • D is a set of declarations of the abstract form g → E, where g is a generic identifier and E is either a content keyword (one of EMPTY, MILESTONE-TAGS, IGNORE-TAGS, or IGNORE) or a content-model expression. Content-model expressions may take the usual forms:10
    content-model ::= group | counted
    group   ::= seq | or | all | conj
    counted ::= group counter
    expr    ::= token | group | counted
    seq     ::= '(' expr (',' expr)* ')'
    or      ::= '(' expr ('|' expr)+ ')'
    all     ::= '(' expr ('&' expr)+ ')'
    conj    ::= '(' expr ('∧' expr)+ ')'
    counter ::= '+'   // '+' = {1,∞}
              | '*'   // '*' = {0,∞}
              | '?'   // '?' = {0,1}
              | '{' min ',' max '}' 
              | '{' count '}'
    min     ::= '0' | count
    count   ::= POSITIVE_INTEGER
    max     ::= count | '∞'
    token   ::= basic-token 
              | counted-token
    counted-token ::= basic-token counter
    
    The primitive content tokens include generic identifiers, the keyword #PCDATA, and the tag-expressions noted above:
    basic-token  ::= first_class
                   | second_class
                   | '#PCDATA' 
    first_class  ::= GI
    second_class ::= '#stag(' GI ')'
                   | '#etag(' GI ')'
                   | '#tag(' GI ')'
The terminal symbol GI denotes anything that can be interpreted as a generic identifier; in practice this will typically be a QName.

In a clean rabbit/duck grammar, every element type is declared once (explicitly or implicitly), every generic identifier used in a right-hand side as a first-class token is that of a first-class element, and every generic identifier used in a second-class token is that of a second-class element (except in overlap-groups, as specified below in section 3).

Undefined element types, unused element types, non-productive declarations, and so on can be defined for rabbit/duck grammars in the same way as for context-free grammars in general.11 In what follows, I will without loss of generality restrict my attention to clean grammars in which none of these deficiencies occur.

In a full set of rabbit/duck grammars, each grammar is defined over the same vocabulary G, and each generic identifier in G is first-class in at least one grammar.

Some technical commentary is in order, on the four ways a grammar can treat elements and their start- and end-tags.

  • The treatment of first-class elements ensures that conventional DTDs and schemas are a simple subcase of the rabbit/duck mechanism: they are rabbit/duck grammars in which all elements are first-class.
  • Second-class elements make it possible to constrain the interrelation of grammars: the grammar author can ensure that lines of verse can begin within or between speeches in a verse drama, and can cross speech boundaries, but lines cannot cross act or scene boundaries, or begin in the middle of a stage direction.
  • Third-class elements make it convenient to say that the start and end points of a particular element type are completely unconstrained by a given grammar, and that the contents of the element should be visible to the grammar. The same effect could be achieved by declaring the element second-class and allowing its start- and end-tags absolutely everywhere in the grammar, but that is error-prone and rather boring. Note that the treatment of recessive tags in CONCUR is the same as the treatment of third-class elements in rabbit/duck grammars.
  • Fourth-class elements make a similar convenience possible: the location of a particular element type is unconstrained by the grammar, and the entire element is invisible to the grammar.

Example: Peer Gynt

An example may help clarify things. Consider the opening lines of Peer Gynt used as a running example in [Sperberg-McQueen / Huitfeldt 2000]:

Åse: Peer, du lyver!

Peer Gynt: [uten å stanse] Nei, jeg gjør ei!

Åse: Nå, så bann på det er sant!

Peer Gynt: Hvorfor banne?

Åse: Tvi, du tør ei! Alt i hop er tøv og tant!

In English:

AASE: Peer, you're lying!

PEER GYNT: [without stopping] No, I'm not!

AASE: Well then, swear to me it's true.

PEER GYNT: Swear? why should I?

AASE: See, you dare not! Every word of it's a lie.

In the dramatic hierarchy, we want a hierarchy of play / act / scene, with scenes containing speeches and stage directions (speech, stage).

In the metrical hierarchy, we want play / act / scene, with scenes containing verse lines (l).

The dramatic hierarchy can be defined thus:

<!--* Grammar D *-->
<!GRAMMAR play [
<!ELEMENT play  (act+) >
<!ELEMENT act   (scene+) >
<!ELEMENT scene (sp | stage | #tag(l))* >
<!ELEMENT sp    (#PCDATA | stage | #tag(l))* >
<!ATTLIST sp    
          who CDATA #REQUIRED >
<!ELEMENT stage (#PCDATA) >
]>
This explicitly ensures that line breaks occur only as children of speeches or scenes, and never within stage directions or between scenes or acts.12 If line breaks should occur only within speeches and not between them, the #tag(l) can be dropped from the declaration for scene.

The verse structure can be defined this way:

<!--* Grammar V *-->
<!GRAMMAR play [
<!ELEMENT play  (act+) >
<!ELEMENT act   (scene+) >
<!ELEMENT scene (l | stage | #tag(sp) )* >
<!ELEMENT l     (#PCDATA | stage | #tag(sp) )* >
<!ELEMENT stage (#PCDATA) >
]>

In this view, a scene of a verse drama is a sequence of verse lines interspersed with stage directions. If we wish to ignore the stage directions entirely, we can do that, too:

<!ELEMENT scene (l | #tag(sp) )* >
<!ELEMENT stage IGNORE >

In TexMecs notation, this fragment can be marked up this way:13

<play| <* Peer Gynt *>
<act n="1"|
<scene n="i"|
<sp who="Aase"|<l|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping|stage> 
    No, I'm not!|l>|sp>
<sp who="Aase"|<l| Well then, swear to me it's true! |l>|sp>
<sp who="Peer"|<l| Swear? Why should I? |sp>
<sp who="Aase"| See, you dare not! |l>
<l| Every word of it's a lie! |l> |sp>

<* ... more dialogue ... *>
|scene>
<* ... more scenes ... *>
|act>
<* ... more acts ... *>
|play>

If we associate a view of the document (in the database sense) with each grammar, the difference between the two variants of the verse grammar just given is the difference between a view in which stage directions are present:

<l| Peer, you're lying! 
  <stage|without stopping|stage> No, I'm not! |l>
<l| Well then, swear to me it's true! |l>
<l| Swear? Why should I? See, you dare not! |l>
<l| Every word of it's a lie! |l> 
and one in which they are absent entirely:
<l| Peer, you're lying! No, I'm not! |l>
<l| Well then, swear to me it's true! |l>
<l| Swear? Why should I? See, you dare not! |l>
<l| Every word of it's a lie! |l> 
(These transcriptions are a bit casual about white space, for the sake of legibility).

For metrical studies, the second view may be more convenient than the first. It's easy enough to transform the first into the second, but some vocabulary designers may also find it easier to write grammars like the second, in which they can completely ignore elements in the other hierarchies. (When all elements in a grammar are first- or third-class, rabbit/duck grammars are very similar to CONCUR.)

Two general conditions

Two important conditions are established and maintained by the rules given above for all documents valid against a clean set of rabbit/duck grammars.

  1. Every grammar rule is satisfied each time it applies. That is, every declaration in every grammar is satisfied by every element in the document instance to which it applies.
  2. Every element in the document is covered by a declaration in at least one grammar, and (unless it is the outermost element in some grammar) matches a first-class token on the right-hand side of at least one declaration in at least one grammar (namely, the declaration applicable to its parent in that grammar).

For the cases covered by the current rules, it seems plausible to believe (1) that any document for which the two general conditions are true should count as valid for a given set of rabbit/duck grammars, and (2) that no document for which the general conditions do not hold should count as valid. The conditions, that is, seem to be both sufficient and necessary conditions for validity against a full set of clean rabbit/duck grammars.

The rules defined so far handle all overlap as the overlaying of the tree defined by one grammar with the tree defined by another. They cannot handle self-overlap, because in self-overlap both element instances are first-class elements in the same grammars. But the rules do suffice to cover most cases of concurrent hierarchies in which overlap stems from the tagging of multiple analytic perspectives, and when applied to such cases, the rules given so far appear to produce intuitively plausible results. So the conditions may serve to guide us in attempting to extend the mechanism to cover cases of self-overlap: what is needed is a way of handling self-overlap that satisfies both of them.

Vocabularies with self-overlap

If within a given language, instances of some particular element type b can overlap other instances of element type b, then we wish to write one or more of the grammars in a set of rabbit/duck grammars in such a way as to ensure that

  • When an instance of element type b occurs and does not overlap any other b, it is matched by a token in a grammar rule in the usual way.
  • When two b elements overlap each other, the grammar is ambiguous and provides two ways to parse the overlap, one parse in which the first b element is a first-class element and the second is a second-class element, and another parse in which the second b element is first-class and the first b is second-class.
  • When three b elements overlap each other, three parses are generated, one with the first b element as first-class, one with the second, and one with the third; in each parse the other two instances of b are treated as second-class elements.
  • And so on: for n overlapping b elements, the grammar will generate n parses. In each parse a different instance of b will be first-class.

For example, consider the following XML document, written with Trojan Horse milestones for b (later examples will use a more compact notation):

<doc>
 aaa
 <b sID="b1"/>
 bbb
 <b sID="b2"/>
 ccc
 <b eID="b1"/>
 ddd
 <b eID="b2"/>
 eee
</doc>
We wish to validate both element b1 and element b2. That is, we wish this document to be validated both as if it were written
<doc>
 aaa
 <b>
 bbb
 <b sID="b2"/>
 ccc
 </b>
 ddd
 <b eID="b2"/>
 eee
</doc>
and also as if it were written
<doc>
 aaa
 <b sID="b1"/>
 bbb
 <b>
 ccc
 <b eID="b1"/>
 ddd
 </b>
 eee
</doc>
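The effect of the rules just listed can be sketched in a few lines of Python (an illustration, not part of the proposal): given a token stream in which b elements are marked with Trojan-Horse-style start and end milestones, each reading promotes one b instance to an ordinary element and leaves the others as milestones. The token representation and the function name readings are assumptions made for the example.

def readings(tokens):
    """Yield one reading per b instance in a stream that marks (possibly
    overlapping) b elements with milestones ('b-start', id) / ('b-end', id).
    In each reading one instance becomes a real start-/end-tag pair; the
    others remain milestones, i.e. second-class elements."""
    ids = [tok[1] for tok in tokens if tok[0] == 'b-start']
    for promoted in ids:
        yield [('stag', 'b') if tok == ('b-start', promoted)
               else ('etag', 'b') if tok == ('b-end', promoted)
               else tok
               for tok in tokens]

doc = [('stag', 'doc'), ('text', 'aaa'),
       ('b-start', 'b1'), ('text', 'bbb'),
       ('b-start', 'b2'), ('text', 'ccc'),
       ('b-end', 'b1'), ('text', 'ddd'),
       ('b-end', 'b2'), ('text', 'eee'),
       ('etag', 'doc')]

for reading in readings(doc):
    print(reading)    # two readings, corresponding to the two rewritten documents above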

It is desirable to have the same control over overlap as we have for other items in a content model, so we wish to be able to control where it can and cannot occur. The following sections outline a way to accomplish this.

Description

To handle languages in which elements with the same generic identifier can overlap each other, a few modest changes suffice. We need (1) a syntactic mechanism for matching start- and end-tags, (2) a new kind of token which we will call an overlap-token, and (3) a new kind of group for use in content models, the overlap-group.

First, since elements don't necessarily nest, some syntactic mechanism is needed for matching start- to end-tags. The generic identifier alone no longer suffices. The examples shown here will use the co-indexing syntax of TexMecs: a tilde and an arbitrary string of characters will decorate the generic identifier, and decorated start- and end-tags match only other tags with the same decoration. So in the document

<doc|<s~1|aaa<s~2|bbb|s~1>ccc|s~2>|doc>
the first s element (decorated with “~1”) contains the string “aaabbb” and the second contains “bbbccc”. As shown above, in an XML-based system Trojan Horse elements [DeRose 2004] could be used instead. The example shown above in Trojan Horse notation looks like this in TexMecs notation:
<doc|
 aaa
 <b~b1|
 bbb
 <b~b2|
 ccc
 |b~b1>
 ddd
 |b~b2>
 eee
|doc>

Second, a new kind of token is required, namely an overlap-token:

overlap-token ::= '~' GI '~'
Within an overlap-group, an overlap-token indicates a location where an instance of the given element type is allowed to overlap with another instance of the same element type. When there is no overlap, an overlap-token has the same meaning as a first-class token with the same generic identifier: it matches one element instance of the type named. When there is overlap in the document instance, the overlap-group containing the overlap-token (see next paragraph) is matched against the document instance once for each overlapping element, each time yielding one reading or parse of the relevant portion of the document instance. In each reading, the overlap-token matches a different one of the overlapping element instances; the other overlapping element instances are treated as second-class elements and must match second-class tokens.

Third, a new type of content-model expression is required, an overlap group:

group ::= '#overlap(' expr ')'
If there are no instances of overlap in the document, the overlap group will have the same meaning as the expression it contains. If there are instances of (self-)overlap in the document, then the overlap group will be matched against the relevant part of the document instance one time for each overlapping element. The expr of an overlap expression differs from other expressions in two ways: first, it may contain overlap-tokens (for a single generic identifier), and second, it may (and always will, since otherwise it's unsatisfiable) contain second-class tokens (#tag( g ) and so on) for the first-class element named in the overlap-tokens.

In any normal overlap-group, there will be at least one overlap-token (~ g ~) and at least one #tag( g ) (or at least one #stag( g ) and at least one #etag( g )) token for the same generic identifier. Together, these show the two ways in which every occurrence of g within the scope of the overlap-group must be read. Every occurrence of g must be a first-class element in (at least) one reading, and a pair of milestones (i.e. a second-class element) in all other readings.14

If #overlap(α) is an overlap-group with an overlap-token for g, then we can use α1 to denote the reading of the expression in which the first g is first-class and any others are second-class, α2 to denote the reading in which the second g is first-class, and so on. Then if there is just one g in the relevant part of the document instance, the meaning of the overlap-group is (α1). If there are two overlapping g elements, the meaning of the overlap-group is (α1 ∧ α2), where the connector “∧” denotes conjunction. When F and G are expressions, the content-model expression (F ∧ G) denotes the set of sequences which are members both of F and of G. When F and G are readings, the expression (F ∧ G) denotes the set of sequences which can be read both as F and as G. So (α1 ∧ α2) denotes the set of expressions which can be read both as α1 and as α2. For three overlapping g elements, the meaning is (α1 ∧ α2 ∧ α3). And so on. The meaning of an overlap-group, formally, is thus the infinite disjunction ((α1) ∨ (α1 ∧ α2) ∨ (α1 ∧ α2 ∧ α3) ∨ ...)

The two general conditions identified above (section 2-3) are preserved by this interpretation of overlap-groups: the grammar rules for the parent apply separately for each overlapping element matching the overlap-token, as do those for the overlapping element itself. And each overlapping instance is covered by a declaration and produces a reading in which it is a first-class element.

Examples of self-overlap

A simple example of overlap might involve pages (tagged p), deletions (d), and other elements (a, b). Deletions can overlap other deletions (i.e. element d may self-overlap), but no other overlaps occur in this group of element types. We can define the grammar thus:

<!GRAMMAR p [
<!ELEMENT p #overlap( (
              (#PCDATA | a | #tag(d))*, 
              ~d~, 
              (#PCDATA | b | #tag(d))* )? ) >
<!ELEMENT d (#PCDATA | a | b | #tag(d))* >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
]>

Given a document instance

<p|
 <a><d~1|<a><d~2|<a><b><a>|d~1><b>|d~2><b>
|p>
The rules for interpreting overlap-groups produce two readings of the document. In one of these, element d~1 is the first-class element, and d~2 is second-class:
  • p
    • a
    • d~1
      • a
      • stag(d~2)
      • a
      • b
      • a
    • b
    • etag(d~2)
    • b
In the other, the roles are reversed:
  • p
    • a
    • stag(d~1)
    • a
    • d~2
      • a
      • b
      • a
      • etag(d~1)
      • b
    • b
Note that the content model for p must contain both the overlap-token ~d~ and the second-class tokens #tag(d). Without the latter, neither reading would be valid, since there would be unmatched milestones. The content model for d must similarly contain locations where start- and end-tags for other instances of d may occur (if they did not occur within d, the elements would not be overlapping).15

A second example illustrates slightly tighter constraints: d elements contain not an unordered mix of a and b elements, but an ordered sequence:

<!GRAMMAR p [
<!ELEMENT p (h*, 
            #overlap((a | #stag(d))*, 
                     ~d~, 
                     (c | #etag(d))*),
             f*) >
<!ELEMENT d ((a | #tag(d) | ~d~)*, 
             b,
             (b | #tag(d) | ~d~)*, 
             (c | #tag(d) | ~d~)*) >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
<!ELEMENT c EMPTY >
<!ELEMENT f (#PCDATA) >
<!ELEMENT h (#PCDATA) >
]>
The instance given above is not valid according to this grammar, but the following is:
<p|
 <a><d~1|<a><d~2|<b>|d~1><c>|d~2><c>
|p>

Implementation issues

In the preceding sections I have endeavored to give a purely declarative account of rabbit/duck grammars and their meaning, without any appeal to particular parsing techniques. But of course a grammatical formalism is more likely to be useful in practice if the constraints it expresses can actually be enforced at reasonable implementation and runtime cost.

One implementation technique is straightforward for languages without self-overlap:

  1. Use a simple tokenizer to read the input document and produce a stream of tags and data-content chunks.
  2. For each grammar, create a filter which passes through all tags for first-class elements, translated to XML form. Second-class tags are translated into XML milestone elements. Third-class tags are suppressed entirely; fourth-class elements are suppressed entirely (both tags and content).16
  3. Translate each grammar in the set into any conventional grammar-based schema language.
  4. Validate the resulting XML document against the schema created in the previous step.
Repeat this process once for each grammar in the full set of rabbit/duck grammars. If no grammatical errors are found, the input document is valid.
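The filtering step (step 2) might be sketched in Python as follows; the sketch assumes a token stream of the kind produced by step 1 and an assignment of each generic identifier to one of the four classes. The names ElemClass and filter_for_grammar are invented for the illustration and are not part of the proposal.

from enum import Enum

class ElemClass(Enum):
    FIRST = 1    # 'normal': tags and content visible to the grammar
    SECOND = 2   # 'milestone': tags visible as empty milestone elements
    THIRD = 3    # 'transparent': tags dropped, content still visible
    FOURTH = 4   # 'opaque': tags and content both dropped

def filter_for_grammar(tokens, classes):
    """Filter a token stream for one grammar of the set.

    tokens  -- iterable of ('stag', gi), ('etag', gi), or ('text', data) tuples
    classes -- dict mapping generic identifier -> ElemClass for this grammar
    Yields only first-class tags, milestones for second-class tags, and text
    outside opaque elements."""
    open_opaque = 0                      # nesting depth of fourth-class elements
    for tok in tokens:
        kind = tok[0]
        if open_opaque:                  # inside an opaque element: track nesting only
            if kind in ('stag', 'etag') and classes.get(tok[1]) is ElemClass.FOURTH:
                open_opaque += 1 if kind == 'stag' else -1
            continue
        if kind == 'text':
            yield tok
            continue
        gi = tok[1]
        cls = classes.get(gi, ElemClass.THIRD)    # undeclared: third-class by default
        if cls is ElemClass.FIRST:
            yield tok                             # pass the tag through unchanged
        elif cls is ElemClass.SECOND:
            # rewrite the tag as an empty milestone element, e.g. page-start, page-end
            yield ('milestone', f"{gi}-{'start' if kind == 'stag' else 'end'}")
        elif cls is ElemClass.FOURTH and kind == 'stag':
            open_opaque = 1                       # suppress the whole element
        # third-class tags (and any stray fourth-class end-tags) are dropped

The filtered stream can then be serialized as XML and handed to a conventional validator, once for each grammar in the set.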

Alternatively, one can dispense with the filtering step by building a parser for each grammar, in which the lexical scanner takes care of distinguishing among first- and second-class tags, suppressing third-class tags, and suppressing fourth-class elements. The document can be validated against a set of rabbit/duck grammars by parsing it once for each grammar.

It is also possible to validate an input document against a set of rabbit/duck grammars in a single pass; one way to do that is described below. It involves some unconventional extensions of Brzozowski derivatives; these are introduced piecemeal, in an attempt to ease comprehension. The exposition begins with a description of one way to check the well-formedness and validity of XML (section 4-1) using Brzozowski derivatives. These are relatively simple and straightforward illustrations of the mechanism. Section 4-2 discusses validation of TexMecs languages without self-overlap, and finally section 4-3 describes validation of languages which allow self-overlap.

Checking XML using Brzozowski derivatives

Brzozowski derivatives were introduced in 1964 by Janusz Brzozowski [Brzozowski 1964]. The derivative D of a set L with respect to some input string s, written D s L, is the set of strings which can follow s in words of the language L. Brzozowski showed, given a regular expression E denoting L, how to construct a regular expression for D s E, which makes possible an algebraic approach to recognizing members of the language. Less formally, given a regular expression and a (prefix) string, the derivative (of the expression, with respect to the string) tells us what can come next. In what follows, it will be important to recall that if one of the things which can come next is ε, the empty string, then the string s is itself a member of the language defined by E. An expression E whose language includes ε is said to be nullable, and the function ν(E) is true or false accordingly. In what follows, successful parses will normally be those which end with a derivative which is nullable.17

Well-formedness checking

Brzozowski derivatives are conventionally used to recognize or reason about regular languages. But with a couple of extensions, they can also be used to parse well-formed XML.

The first extension is to define an expression language slightly different from regular expressions as defined either conventionally or by Brzozowski. I shall call these language expressions rather than regular expressions, because the languages they denote are not necessarily regular. Each expression denotes a set of strings, and may be said to match any string in that language. The language expressions of interest here may take any of the following forms:

  • elem( g ) matches an element with generic identifier g.
  • stag( g ) matches a start-tag for an element with generic identifier g.
  • etag( g ) matches an end-tag for an element with generic identifier g.
  • #PCDATA matches a sequence of zero or more characters. It is intrinsically nullable.
  • W denotes a string of well-balanced XML; i.e., it is (#PCDATA | elem( g ))*, where g need not be the same for all elements matched. When parsing well-formed XML, W can be used to represent the expected content of any element.
  • X matches an arbitrary XML document; i.e., it is elem( g ) for any generic identifier g.
  • ε matches the empty string.
  • ∅ matches nothing (it denotes the empty set).
  • If F and G are expressions, then (F G) is an expression denoting the set of strings formed by concatenating a string in F with one in G.
  • If F and G are expressions, then (F | G) is an expression denoting the union of F and G.
  • If F and G are expressions, then (F & G) is an expression denoting the set of strings formed by interleaving a string in F with a string in G.18
  • If F and G are expressions, then (F ∧ G) is an expression denoting the intersection of F and G.

The second extension is to define an infinite input alphabet. Assume that the input stream is a sequence of symbols each of which is one of the following:

  • sole( g ) denotes a sole-tag for an empty element with generic identifier g. For purposes of calculating derivatives, this is treated as if the input were the two-symbol string “stag( g ) etag( g )”.
  • stag( g ) denotes a start-tag for an element with generic identifier g.
  • etag( g ) denotes an end-tag for an element with generic identifier g.
  • pcdata() denotes a sequence of one or more characters.
Since there are an infinite number of legal generic identifiers, there are an infinite number of possible input symbols. Note that for simplicity I ignore comments, processing instructions, and insignificant white space, imagining that they are handled at another level of the processor. If it is desired, they can be handled explicitly by adding the rule that if x is a comment, processing instruction, or insignificant white space, D x E = E for all E.
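A tokenizer of the required kind can be quite small. The following Python sketch is an illustration only; it handles none of the finer points of real XML (attributes containing '>', comments, processing instructions, CDATA sections) and simply turns an XML-like string into the four kinds of input symbols just listed.

import re

# One token per tag or text run; attributes are skipped, not parsed.
TOKEN = re.compile(r'<(?P<etag>/)?(?P<gi>[^\s/>]+)[^>]*?(?P<sole>/)?>|(?P<text>[^<]+)')

def tokenize(document: str):
    """Yield ('sole', gi), ('stag', gi), ('etag', gi), and ('pcdata',) symbols."""
    for m in TOKEN.finditer(document):
        if m.group('text') is not None:
            if m.group('text').strip():       # treat pure whitespace as insignificant
                yield ('pcdata',)
        elif m.group('sole'):
            yield ('sole', m.group('gi'))
        elif m.group('etag'):
            yield ('etag', m.group('gi'))
        else:
            yield ('stag', m.group('gi'))

print(list(tokenize('<tree><leaf>42</leaf></tree>')))
# [('stag', 'tree'), ('stag', 'leaf'), ('pcdata',), ('etag', 'leaf'), ('etag', 'tree')]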

As usual, the derivative of an expression E with respect to a sequence of symbols x 1, x 2, x 3, ..., x n is found by finding the derivative of E with respect to x 1, then the derivative of D x 1 E with respect to x 2, and so on. Formally,

D s E = D <x 1, x 2, ... x n > E = D x n (... (D x 3 (D x 2 (D x 1 E))) ...)

A sequence of input symbols d is well-formed XML if and only if D d X is nullable, when the rules for derivatives are as described below.

The third extension is that the derivative of a primitive language expression with respect to an input symbol may be something other than ε or ∅. Derivatives of compound expressions (concatenation, disjunction, conjunction) are given in [Brzozowski 1964] and need not be repeated here. [Sperberg-McQueen 2005] defines derivatives for interleave.

When x = stag( h ), then

  • D x elem( g ) = (W etag( g )) if h = g, otherwise ∅.
  • D x stag( g ) = ε if h = g, otherwise ∅.
  • D x etag( g ) = D x #PCDATA = ∅.
  • D x W = (W etag( h ) W).
  • D x X = (W etag( h )).
  • D x ε = D x ∅ = ∅.

When x = etag( h ), then

  • D x elem( g ) = D x stag( g ) = ∅.
  • D x etag( g ) = ε if h = g, otherwise ∅.
  • D x #PCDATA = D x W = D x X = D x ε = D x ∅ = ∅.

When x = pcdata(), then

  • D x elem( g ) = D x stag( g ) = D x etag( g ) = ∅.
  • D x #PCDATA = ε.
  • D x W = W.
  • D x X = D x ε = D x ∅ = ∅.

Examples of using these rules to check XML well-formedness are given in appendix 7-1.

The reader will note that the derivative, which represents the expected sequence of tokens to come, is being used as a stack: when we encounter a start-tag in the input which matches an elem( g ) in the expression being matched, we push (W etag( g )) onto the stack by placing it at the front of the derivative.
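The rules above translate fairly directly into code. The following Python sketch (an illustration under the simplifying assumptions of this section, not a reference implementation) represents language expressions as tuples, defines nullability, and applies the well-formedness rules; the derivatives of concatenation and disjunction follow Brzozowski's standard rules.

# Expressions: ('elem', g), ('stag', g), ('etag', g), ('pcdata',),
# ('W',) for well-balanced content, ('X',) for an arbitrary document,
# ('eps',), ('null',), ('seq', F, G), ('alt', F, G).
EPS, NULL, W, X = ('eps',), ('null',), ('W',), ('X',)

def nullable(e):
    kind = e[0]
    if kind in ('eps', 'pcdata', 'W'):
        return True
    if kind == 'seq':
        return nullable(e[1]) and nullable(e[2])
    if kind == 'alt':
        return nullable(e[1]) or nullable(e[2])
    return False                          # elem, stag, etag, X, null

def seq(f, g):                            # smart constructors simplify as they build
    if NULL in (f, g): return NULL
    if f == EPS: return g
    if g == EPS: return f
    return ('seq', f, g)

def alt(f, g):
    if f == NULL: return g
    if g == NULL: return f
    return ('alt', f, g)

def deriv(x, e):
    """Brzozowski derivative of expression e with respect to input symbol x."""
    kind = e[0]
    if kind == 'seq':                     # D x (F G) = (D x F) G | nu(F).(D x G)
        d = seq(deriv(x, e[1]), e[2])
        return alt(d, deriv(x, e[2])) if nullable(e[1]) else d
    if kind == 'alt':
        return alt(deriv(x, e[1]), deriv(x, e[2]))
    if x[0] == 'stag':
        h = x[1]
        if kind == 'elem': return seq(W, ('etag', h)) if e[1] == h else NULL
        if kind == 'stag': return EPS if e[1] == h else NULL
        if kind == 'W':    return seq(W, seq(('etag', h), W))
        if kind == 'X':    return seq(W, ('etag', h))
        return NULL                       # etag, pcdata, eps, null
    if x[0] == 'etag':
        return EPS if kind == 'etag' and e[1] == x[1] else NULL
    # x[0] == 'pcdata'
    if kind == 'pcdata': return EPS
    if kind == 'W':      return W
    return NULL

def well_formed(tokens, start=X):
    e = start
    for tok in tokens:
        if tok[0] == 'sole':              # <g/> behaves like <g></g>
            e = deriv(('etag', tok[1]), deriv(('stag', tok[1]), e))
        else:
            e = deriv(tok, e)
    return nullable(e)

doc = [('stag', 'tree'), ('stag', 'leaf'), ('pcdata',), ('etag', 'leaf'), ('etag', 'tree')]
assert well_formed(doc)
assert not well_formed([('stag', 'a'), ('stag', 'b'), ('etag', 'a'), ('etag', 'b')])

The smart constructors simplify expressions as they are built, which keeps the derivative (the 'stack' of the preceding paragraph) from growing without bound.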

Validation

This method can be applied to validation as well, by changing just one rule. When stag( g ) in the input matches the expression elem( g ), the rules just given make the derivative be (W etag( g )): in well-formed XML, what follows the start-tag of any element is well-formed XML followed by the end-tag of that element. To validate against a DTD, all we need to do is to replace W in the derivative with the content model of the element. If we write the expression “cm( g )” to denote the content model of element type g, and “[[cm( g )]]” to denote the value of that expression (i.e. to indicate that the expression should be replaced by its value), the rule becomes:

  • D stag(h) elem( g ) = ([[cm( g )]] etag( g )) if h = g, otherwise ∅.

For example, consider a vocabulary for serializing binary trees whose definition (in DTD syntax) is:

<!ELEMENT tree ((tree, tree) | leaf) >
<!ELEMENT leaf (#PCDATA) >
Let us validate the document
<tree><leaf>42</leaf></tree>

We start with the expression elem(tree) and validate the input stream “stag(tree), stag(leaf), pcdata(), etag(leaf), etag(tree)”:

E 0 = elem(tree)
E 1 = D stag(tree) E 0 = [[cm(tree)]], etag(tree) = ((elem(tree), elem(tree)) | elem(leaf)), etag(tree)
E 2 = D stag(leaf) E 1 = ((∅, elem(tree)) | ([[cm(leaf)]], etag(leaf))), etag(tree) = ((∅) | ([[cm(leaf)]], etag(leaf))), etag(tree) = [[cm(leaf)]], etag(leaf), etag(tree) = #PCDATA, etag(leaf), etag(tree)
E 3 = D pcdata() E 2 = ε, etag(leaf), etag(tree) = etag(leaf), etag(tree)
E 4 = D etag(leaf) E 3 = ε, etag(tree) = etag(tree)
E 5 = D etag(tree) E 4 = ε
The final expression is nullable, and we can therefore conclude that the input was valid against the grammar.

Further examples of validation using Brzozowski derivatives are given in appendix 7-2.
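Continuing the well-formedness sketch given earlier (and reusing its deriv, seq, alt, nullable, and NULL helpers), validation requires changing only the rule for elem( g ): the content model replaces W. The CONTENT table below encodes the binary-tree declarations of this section in the same tuple notation; the function names are invented for the illustration.

CONTENT = {
    'tree': ('alt', ('seq', ('elem', 'tree'), ('elem', 'tree')), ('elem', 'leaf')),
    'leaf': ('pcdata',),
}

def deriv_valid(x, e):
    """Like deriv() above, but start-tags expand to the declared content model
    rather than to W: the single changed rule of this subsection."""
    kind = e[0]
    if kind == 'seq':
        d = seq(deriv_valid(x, e[1]), e[2])
        return alt(d, deriv_valid(x, e[2])) if nullable(e[1]) else d
    if kind == 'alt':
        return alt(deriv_valid(x, e[1]), deriv_valid(x, e[2]))
    if x[0] == 'stag' and kind == 'elem':
        return seq(CONTENT[e[1]], ('etag', e[1])) if e[1] == x[1] else NULL
    return deriv(x, e)                    # every other rule is unchanged

def valid(tokens, root):
    e = ('elem', root)
    for tok in tokens:
        e = deriv_valid(tok, e)
    return nullable(e)

doc = [('stag', 'tree'), ('stag', 'leaf'), ('pcdata',), ('etag', 'leaf'), ('etag', 'tree')]
assert valid(doc, 'tree')
assert not valid(doc[:-1], 'tree')        # missing etag(tree): final derivative not nullable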

Parsing TexMecs without self-overlap

The technique just described can be used without major changes to validate documents with overlap against individual rabbit/duck grammars which have no overlap-groups. The only change is that the implementation must take care to treat fourth-class elements and tags for third-class elements appropriately.

We can validate a document against a whole set of rabbit/duck grammars, rather than just individual grammars, with a parsing mechanism which keeps track of the current derivative in each grammar and for each token calculates the next derivative for each grammar. Perhaps the simplest way to acquire such a mechanism is to connect the derivatives for the different languages using logical conjunction. Thus if the current validation state of grammar 1 is denoted E 1, and the state of grammar 2 is E 2, etc., then for n grammars we will have a single expression of the form

E 1E 2 ∧ ... ∧ E n

It is necessary, however, to keep track of which expressions come from which grammars, so as to treat third- and fourth-class elements appropriately. One way to do this is to assign a number to each grammar and extend the expression language so as to handle the necessary bookkeeping. The foundation of the expression language is the set of content-model expressions defined above in section 2-1; for purposes of validation we extend that expression language as follows.

First, we subscript or mark each reference to a first-class element in a grammar with the number of that grammar. In written form, this can be represented as a subscript to the generic identifier, so that instead of being written (a, (b | #tag(c)), d), a content model in grammar 1 would be written (a 1, (b 1 | #tag(c)), d 1), and similarly for grammar 2, 3, and so on. The rule for start-tags matching elements can then be rewritten

  • D stag(h) elem( g , n ) = ([[cm( g , n )]] etag( g )) if h = g, otherwise ∅.
where “cm(g,n)” denotes the content model of element g in grammar n.

Second, we add two new types of group expressions. Neither of these will normally occur in user-written content models; they will be used only as the result of taking a derivative from some other expression.

group   ::= g-tag | spinner
g-tag   ::= '#gr(' POSITIVE_INTEGER ')' content-model
spinner ::= '#spin(' POSITIVE_INTEGER ')' 
            '(' list ')' content-model
list    ::= empty | GI ' / ' list
empty   ::= ε // nothing, the empty sequence
It's also convenient to be able to use symbols for the empty string and the empty set:
token ::= 'ε'
        | '∅'

An expression of the form #gr( n )( E ) labels expression E as having come from grammar number n. The rules for taking the derivative D x #gr( n )( E ) are these:

  • If x is stag( g ), etag( g ), or sole( g ), and g is a third-class element in grammar n, then D x #gr( n )( E ) = #gr( n )( E ). Informally: third-class tags are ignored.
  • If x is stag( g ) and g is a fourth-class element in grammar n, then D x #gr( n )( E ) = #spin( n )( g )( E ). Informally: fourth-class start-tags put the grammar into a ‘spinner’ state, in which it spins its wheels until the matching end-tag is found.
  • Otherwise, D x #gr( n )( E ) = #gr( n )( D x E ). Informally: normally, the grammar labeling has no particular effect.

Derivatives for spinner expressions are mostly concerned with keeping track of how many fourth-class elements, and of what kinds, are currently open. Only when they are all closed again will the grammar in question leave its spinner state and return to normal.

  • If x is stag( g ) and g is a fourth-class element in grammar n, then D x #spin( n )( L )( E ) = #spin( n )( g / L )( E ). Informally: if a grammar is currently blocked and we see another blocking element, we add it to the list.
  • If x is etag( g ), and g is a fourth-class element in grammar n, and L 1 is a list of generic identifiers for fourth-class elements in grammar n, containing at least one g, then
    • If L 1 contains two or more items, then D x #spin( n )( L 1 )( E ) = #spin( n )( L 2 )( E ), where L 2 is L 1 with one entry for g removed. That entry need not be at the beginning of L 1: L 1 is a list, not a stack.
    • If L 1 contains just one item (i.e. L 1 = g), then D x #spin( n )( L 1 )( E ) = #gr( n )( E ).
    Informally: if a grammar is currently in a spinner state and we see an end-tag for a fourth-class element, then we remove it from the list. If the list is now empty, the grammar leaves the spinner state and goes back to normal.
  • Otherwise, D x #spin( n )( L 1 )( E ) = #spin( n )( L 1 )( E ). Informally: if a grammar is currently blocked and we see a token which is not a start- or end-tag for a fourth-class element, we ignore it.

With these rules it is possible to handle any set of rabbit/duck grammars which has no overlap-groups. Examples of validation without self-overlap are given in appendix 7-3.
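The bookkeeping just described can be sketched as follows (an illustration, not part of the formal apparatus): a single-pass validator keeps one state object per grammar, holding that grammar's current derivative together with the multiset of currently open fourth-class elements; these play the roles of #gr and #spin. The sketch reuses the nullable function from the earlier sketches, assumes a per-grammar derivative function that also matches the rewritten milestone tokens against #stag and #etag tokens in content models, and invents the names GrammarState and validate_all for the example.

from collections import Counter

class GrammarState:
    """State for one grammar of the set: the current derivative (the role of
    #gr(n)(E)) plus the open fourth-class elements (the role of the #spin list)."""
    def __init__(self, expr, classes, deriv_fn):
        self.expr = expr          # current derivative for this grammar
        self.classes = classes    # gi -> 'first' | 'second' | 'third' | 'fourth'
        self.deriv = deriv_fn     # derivative function for this grammar's expressions
        self.open = Counter()     # open fourth-class elements, by generic identifier

    def step(self, tok):
        kind = tok[0]
        gi = tok[1] if len(tok) > 1 else None
        cls = self.classes.get(gi, 'third') if gi is not None else None
        if self.open:                               # spinner state
            if cls == 'fourth' and kind in ('stag', 'etag'):
                if kind == 'stag':
                    self.open[gi] += 1              # add another entry to the list
                else:
                    self.open[gi] -= 1              # remove one entry for gi
                    if self.open[gi] <= 0:
                        del self.open[gi]           # empty list: back to normal
            return                                  # all other tokens are ignored
        if kind in ('stag', 'etag'):
            if cls == 'third':                      # transparent: tag invisible
                return
            if cls == 'fourth':
                if kind == 'stag':
                    self.open[gi] = 1               # enter the spinner state
                return
            if cls == 'second':                     # milestone: rewrite the tag so that
                tok = ('milestone-' + kind, gi)     # it matches #stag/#etag tokens
        self.expr = self.deriv(tok, self.expr)      # ordinary derivative step

def validate_all(tokens, states):
    """Single pass over the document: advance every grammar on every token."""
    for tok in tokens:
        for state in states:
            state.step(tok)
    return all(nullable(state.expr) for state in states)

A document is then reported valid exactly when every grammar's final derivative is nullable, the single-pass counterpart of validating once per grammar.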

Handling overlap-groups and overlap-tokens

To handle self-overlap, it is convenient to add two more new types of group to the expression language. Like the g-tag and spinner groups introduced in the previous section, these will not appear in user-written content models, but serve internal book-keeping purposes. Each time a start-tag symbol in the input matches an overlap-token, we need to produce two readings, one in which the start-tag is treated as first-class, and one in which it is treated as second-class. To force the one interpretation or the other, we introduce R- and S-groups.

group   ::= r-group | s-group
r-group ::= '#r(' GI ')' expr
s-group ::= '#s(' GI ')' expr
In an R-group, a start-tag which matches an overlap-token must be interpreted as a first-class tag. In an S-group, it must be treated as second-class.

The rules for derivatives are these.

First, the rules for derivatives of overlap-expressions. When #overlap( E ) is an overlap-group whose overlap-token(s) are for generic identifier g, then

  • D stag(g) #overlap( E ) = D stag(g) #r( g )(E) | (D stag(g) #r( g )(E) ∧ #overlap( D stag(g) #s( g )(E))).
    Note that the R-group must be matched, while the S-group is in some sense optional: if there is no overlap but only a single instance of element g, only the R-group match is important.
  • D x #overlap( E ) = #overlap( D x E ) for all x other than start-tags for the overlap-token in E.

Next, the rules for R-groups:

  • D stag(g) #r( g )(stag( g )) = ∅.
  • D stag(g) #r( g )(~ g ~) = [[cm( g , n )]] etag( g ).
  • D stag(g) #r( g )(elem( g )) = [[cm( g , n )]] etag( g ).
  • D x #r( g )(E) = D x E if E is a token and x is not stag( g ).
  • D x #r( g )(F | G) = (D x #r( g )(F) | D x #r( g )(G)), and similarly for all the other compound expressions.

Finally, the rules for S-groups (in which the element is treated as second-class).

  • D stag(g) #s( g )(stag( g )) = ε.
  • D stag(g) #s( g )(~ g ~) = ∅.
  • D stag(g) #s( g )(elem( g )) = ∅.
  • D x #s( g )(E) = D x E if E is a token and x is not stag( g ).
  • D x #s( g )(F | G) = (D x #s( g )(F) | D x #s( g )(G)), and similarly for all the other compound expressions.

Examples of validating self-overlap are given in appendix 7-4.

Applications

This section describes two simplified versions of real-world markup scenarios and illustrates the use of rabbit/duck grammars to solve problems in those scenarios.

Example: Multiple structures

Consider a markup vocabulary in which several different and overlapping hierarchical structures are mingled. The vocabulary provides element types for marking up manuscript documents with information about

  1. their compositional structure, using text, head, div, p, and s (sentence-unit) elements
  2. the physical organization of the book, for both the original manuscript and later print publications, using vol, page, col, block, and line elements
  3. the details of the manuscript inscription, using add, del, restore, and hi elements
  4. hypertext structures, using a elements modeled on HTML and link and end elements modeled on XLink (one link element contains one or more end elements; each end element points to a location in a document; the link element as a whole asserts a link among those end points; the link element need not be physically near any end point — its location is without any significance at all)

The names by themselves should (it is hoped) make the intended semantics clear enough for purposes of this example.

The constraints to be enforced include some which are fairly easy to express in single hierarchies:

  • A text is partitioned by the heading, paragraphs, and divisions which occur directly within it.
  • Similarly, div elements are partitioned by the heading, paragraphs, and sub-divisions which occur directly within them.
  • A p element is partitioned by its child s elements.
  • A restore element must be a child of a del element (it indicates the cancellation of that part of the deletion).
  • When volumes, pages, columns, and lines are marked, a text element is partitioned by the set of vol elements, and also by the sets of page, col, and line elements, respectively.
  • A vol element is partitioned by the set of page elements which are its children.
  • If col elements or line elements are used, they partition the page elements which directly contain them.

Some constraints involve cross-view interactions:

  • Nothing may overlap with any text element.
  • No physical boundaries (pages, columns, blocks, lines) may overlap with any element of type head. That is, the markup language requires that headings of chapters be on lines by themselves.
  • Nothing may overlap with any vol element.
  • link elements may appear anywhere in the document but must not overlap other elements.
  • a elements must nest within sentences but may span most other boundaries.
  • Like the phenomena they represent, the elements add, del, and hi should be allowed to overlap virtually anything, except the element types which may not be overlapped at all. In particular,
    • add may not overlap with add. (What would it mean?)
    • del may overlap with del. (This is semantically tricky but may be interpreted either as redundant deletion by the author, as uncertainty on the part of the encoder which deletion occurred first, or as involving an implicit cancellation of the first deletion, before the second deletion.)
    • hi may overlap hi; in practice, this will happen most often when the overlapping highlightings have different attributes.
    • None of add, del, or hi may overlap page. We are transcribing manuscripts, and to delete material on separate pages requires multiple marks (i.e. multiple deletions). The case for additions and highlighting is slightly less clear, but we will take a page-oriented view of them, as well as of deletion.

The following subsections define a set of rabbit/duck grammars to validate this language.

The extended-link structure

From the point of view of the extended link markup, the structure of a document is quite simple: a text is a series of link elements intermixed with character data.

<!GRAMMAR text [
<!ELEMENT text (#PCDATA | link)* >
<!ELEMENT link (end+) >
<!ATTLIST link
          type CDATA     #REQUIRED >
<!ELEMENT end  EMPTY >
<!ATTLIST end
          role CDATA     #REQUIRED
          loc  IDREFS    #REQUIRED >
]>
Note that this grammar does nothing to enforce the prohibition on overlaps between links and other elements. That rule could be enforced by explicitly constraining where other tags may occur, thus:
<!GRAMMAR text [
<!ENTITY % tags.comp "#tag(div) | #tag(head) 
  | #tag(p) | #tag(s)" >
<!ENTITY % tags.phys "#tag(vol) | #tag(page) 
  | #tag(col) | #tag(block) | #tag(line)" >
<!ENTITY % tags.inscr "#tag(add) | #tag(del) 
  | #tag(restore) | #tag(hi)" >
<!ENTITY % tags.other "%tags.comp; 
  | %tags.phys; | %tags.inscr;">

<!ELEMENT text (#PCDATA | link | %tags.other;)* >
<!ELEMENT link (end+) >
<!ATTLIST link
          type CDATA     #REQUIRED >
<!ELEMENT end  EMPTY >
<!ATTLIST end
          role CDATA     #REQUIRED
          loc  IDREFS    #REQUIRED >
]>
But the hyperlink grammar can be kept to the simpler form if the rules are enforced by the other grammars in the set, as shown below.

The compositional hierarchy

The compositional hierarchy has a structure whose general outline will be familiar from many document markup languages.

<!GRAMMAR text [
<!ELEMENT text (head, p*, div*) >
<!ELEMENT div  (head?, p*, div*) >
<!ELEMENT p    (s+) >
<!ELEMENT head (#PCDATA) >
<!ELEMENT s    (#PCDATA | a)* >
<!ELEMENT a    (#PCDATA)* >

<!ELEMENT link IGNORE >
]>

The declaration of link as IGNORE makes link elements and their content invisible to this grammar. It follows that no start- or end-tags for text, div, etc., will be seen inside any link elements, and thus that link elements cannot overlap any first-class elements declared in this grammar. We need not declare end elements as fourth-class here: since they occur only within link elements, they are guaranteed invisible to this grammar.
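
The effect of the IGNORE declaration in a stream-oriented validator can be thought of as a pre-filter which removes the ignored elements and everything inside them before derivatives are taken for this grammar. The following Python fragment is a minimal sketch of that idea; the event-tuple representation of the data stream is an assumption made for the sketch, not part of the grammar notation.

def drop_ignored(events, ignored=frozenset({"link"})):
    """Yield the event stream with every element declared IGNORE, and its
    entire content, removed.  Events are assumed to be pairs such as
    ('stag', 'link'), ('etag', 'link'), or ('pcdata', text)."""
    depth = 0                      # nesting depth inside ignored elements
    for kind, name in events:
        if kind == "stag" and name in ignored:
            depth += 1
        elif kind == "etag" and name in ignored and depth > 0:
            depth -= 1
        elif depth == 0:
            yield (kind, name)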

One might be tempted to try to enforce the prohibition on overlaps with headings by a similar maneuver, and declare head as IGNORE in all the other grammars. But this would prevent multi-line headings, as well as preventing highlighting from occurring in headings. To enforce the prohibition effectively, we will need a more complicated version of the compositional grammar, which explicitly allows tags from the physical hierarchy everywhere except in headings:

<!GRAMMAR text [
<!ENTITY % tags.phys "#tag(vol) | #tag(page) 
  | #tag(col) | #tag(block) | #tag(line)" >
<!ENTITY % x "(%tags.phys;)*">

<!ELEMENT text (%x;, head, %x;, 
                     (p, %x;)*, 
                     (div, %x;)*) >
<!ELEMENT div  (%x;, (head, %x;)?, 
                     (p, %x;)*, 
                     (div, %x;)*) >
<!ELEMENT p    (%x;, (s, %x;)+) >
<!ELEMENT head (#PCDATA)* >
<!ELEMENT s    (#PCDATA | a | %tags.phys;)* >
<!ELEMENT a    (#PCDATA | %tags.phys;)* >

<!ELEMENT link IGNORE >
]>

Note that this grammar does nothing to enforce the constraint forbidding any of these elements from overlapping with volumes. That will be taken up in section 5-1-5.

The physical hierarchy

The physical grammar defines a fairly straightforward hierarchy of volumes, pages, columns, text blocks, and lines. Some of the levels of hierarchy are of interest only for particularly detailed markup or particularly complex physical organization of the source; they can be omitted when not necessary. Thus col should be used only if there are at least two columns on the page, block only if there are at least two text blocks in the column, etc.

<!GRAMMAR text [
<!ELEMENT text  (page+ | vol{2,∞}) >
<!ELEMENT vol   (page+) >
<!ATTLIST vol   n CDATA #IMPLIED >
<!ELEMENT page  (line+ 
                   | block{2,∞} 
                   | col{2,∞}) >
<!ATTLIST page  n CDATA #IMPLIED >
<!ELEMENT col   (line+ 
                   | block{2,∞}) >
<!ATTLIST col   n CDATA #IMPLIED >
<!ELEMENT block (line+) >
<!ATTLIST block n CDATA #IMPLIED >
<!ELEMENT line  (#PCDATA) >
<!ATTLIST line  n CDATA #IMPLIED >

<!ELEMENT link IGNORE >
]>

The inscription hierarchies

Since the elements related to inscription can overlap each other, they cannot all be in the same grammar; we will require one grammar each for additions, deletions, and highlighting. Since the elements are defined as not spanning page boundaries, we can conveniently include the page element in each of the inscription grammars.

Since additions do not self-overlap, their grammar is very straightforward:

<!GRAMMAR text [
<!ELEMENT text (page+) >
<!ELEMENT page (#PCDATA | add)* >
<!ELEMENT add  (#PCDATA | add)* >

<!ELEMENT link IGNORE >
]>
This grammar requires only that additions nest within each other and within pages (and incidentally that they not overlap any links), nothing more.

Highlighting is slightly more complex, since highlighted phrases can both nest within each other and overlap each other:

<!GRAMMAR text [
<!ELEMENT text (page+) >
<!ELEMENT page #overlap((#PCDATA 
                       | ~hi~ | #tag(hi))*) >
<!ELEMENT hi   #overlap((#PCDATA 
                       | ~hi~ | #tag(hi))*) >

<!ELEMENT link IGNORE >
]>

Deletions add the complication that restore elements may occur within deletions and may be overlapped by deletions other than their parent(s):

<!GRAMMAR text [
<!ELEMENT text (page+) >
<!ELEMENT page #overlap((#PCDATA 
                       | ~del~ | #tag(del))*) >
<!ELEMENT del  #overlap((#PCDATA | restore
                       | ~del~ | #tag(del))*) >
<!ELEMENT restore #overlap((#PCDATA
                       | ~del~ | #tag(del))*) >

<!ELEMENT link IGNORE >
]>

Miscellaneous constraints

The grammars thus far given have enforced all of the constraints outlined except one: that nothing may overlap any vol element.

This could probably be accomplished by suitably careful inclusion of #tag(vol) tokens at the right places in the right content models, but it is simpler just to ensure that vol is described as containing (directly or indirectly) each element other than text from all of the other hierarchies. Part of this has already been done: it follows from the physical and inscription grammars given above that the first-class elements in those grammars all nest within vol elements.19 The physical grammar also guarantees that vol does not overlap link. The remainder of the elements in the vocabulary can be handled with the following grammar, which requires only that the first-class elements all nest properly:

<!GRAMMAR text [
<!ELEMENT text (vol | head | div | p | s | a)* >
<!ELEMENT vol  (vol | head | div | p | s | a)* >
<!ELEMENT head (vol | head | div | p | s | a)* >
<!ELEMENT div  (vol | head | div | p | s | a)* >
<!ELEMENT p    (vol | head | div | p | s | a)* >
<!ELEMENT s    (vol | head | div | p | s | a)* >
<!ELEMENT a    (vol | head | div | p | s | a)* >
]>

XML vocabularies with rabbit/duck grammars

Rabbit/duck grammars may also usefully be applied in some cases not involving overlap. Consider the following example, based (very loosely) on an active vocabulary design project called to my attention by Fabio Vitali of the University of Bologna.

The vocabulary is to support documents about sales transactions of various kinds. The overall structure of the document is perfectly conventional; it differs from the compositional grammar given above in section 5-1-2 primarily in adding more phrase-level elements, including some semantically specialized elements which allow crucial information about the sales agreement to be captured.

Somewhere in the document, exactly one seller must be identified, either as a corporation or as an individual; similarly one buyer must be identified. For corporations, a VAT number must be given; for individuals, a personal identity number (PIN). Start- and end-dates of the agreement, amount to be paid, address of the buyer and seller, a description of the goods being sold, and a statement of the controlling jurisdiction in cases of dispute, must also be indicated. For a variety of reasons, it proves impractical to require that this information come in a prescribed order or at a prescribed location in the document: the accepted form of these documents allows the information to occur pretty much anywhere in the document. It would be possible to record the information both in the running text and in a header, but that would involve redundancy and raise the likelihood of inconsistency. It would be possible to record the information just once, in a header, and use hyperlinks to refer to it from the text body; for purposes of the example, we assume that this solution was also found impractical for the project.

In a conventional schema, it will prove difficult to ensure that exactly one individual-buyer element or corp-buyer element occurs in the document, and so forth.

A pair of rabbit/duck grammars can, however, impose the constraints without much difficulty.

The main document structure grammar resembles the one given above, with additional phrase-level elements:

<!GRAMMAR text [
<!ENTITY % hypertext "a" >
<!ENTITY % hilites "it | bd | tt" >
<!ENTITY % buyer "individual-buyer | corp-buyer 
  | buyer-PIN | buyer-VATnum | buyer-address" >
<!ENTITY % seller "individual-seller | corp-seller 
  | seller-PIN | seller-VATnum | seller-address" >
<!ENTITY % misc "start-date | end-date | amount
  | description-of-goods | jurisdiction" >

<!ENTITY % phrases "%buyer; | %seller; 
  | %hypertext; | %hilites; | %misc;">

<!ELEMENT text (head, p*, div*) >
<!ELEMENT div  (head?, p*, div*) >
<!ELEMENT p    (#PCDATA | %phrases;)* >
<!ELEMENT head (#PCDATA | %phrases;)* >
<!ELEMENT a    (#PCDATA | %hilites;)* >
<!ELEMENT it   (#PCDATA | %hilites;)* >
<!ELEMENT bd   (#PCDATA | %hilites;)* >
<!ELEMENT tt   (#PCDATA | %hilites;)* >

<!ELEMENT individual-buyer  (#PCDATA | %hilites;)* >
<!ELEMENT corp-buyer        (#PCDATA | %hilites;)* >
<!ELEMENT buyer-PIN         (#PCDATA | %hilites;)* >
<!ELEMENT buyer-VATnum      (#PCDATA | %hilites;)* >
<!ELEMENT buyer-address     (#PCDATA | %hilites;)* >
<!ELEMENT individual-seller (#PCDATA | %hilites;)* >
<!ELEMENT corp-seller       (#PCDATA | %hilites;)* >
<!ELEMENT seller-PIN        (#PCDATA | %hilites;)* >
<!ELEMENT seller-VATnum     (#PCDATA | %hilites;)* >
<!ELEMENT seller-address    (#PCDATA | %hilites;)* >
<!ELEMENT start-date        (#PCDATA | %hilites;)* >
<!ELEMENT end-date          (#PCDATA | %hilites;)* >
<!ELEMENT amount            (#PCDATA | %hilites;)* >
<!ELEMENT description-of-goods (#PCDATA | %hilites;)* >
<!ELEMENT jurisdiction      (#PCDATA | %hilites;)* >
]>

The constraints on the semantically rich phrase-level tags are enforced in a second grammar. For purposes of illustration, we assume that the & connector in the content model is interpreted as denoting interleave rather than denoting the same relation as the & connector in SGML.

<!--* We must have one buyer, with address and either 
    * PIN or VAT number, also a seller with similar details.
    * They can appear anywhere, and must appear exactly
    * once each.
    *-->
<!--* N.B. interleave interpretation of & needed here, 
    * not the SGML interpretation *-->
<!ELEMENT doc  (#PCDATA 
               & ( (individual-buyer & buyer-PIN) 
                 | (corp-buyer & buyer-VATnum)
                 )
               & ( (individual-seller & seller-PIN) 
                 | (corp-seller & seller-VATnum)
                 )
               & buyer-address
               & seller-address
               & start-date
               & end-date
               & amount
               & description-of-goods
               & jurisdiction) >
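
How a derivative-based validator should handle the interleave connector is not spelled out above; one standard treatment, offered here only as a Python sketch, derives an interleave by deriving either operand and leaving the other untouched. The class names are inventions of the sketch; deriv and nullable stand for the usual derivative operations.

from dataclasses import dataclass

@dataclass(frozen=True)
class Interleave:   # F & G, unrestricted interleave (see note 18)
    left: object
    right: object

@dataclass(frozen=True)
class Alt:          # F | G
    left: object
    right: object

def deriv_interleave(token, e, deriv):
    # D x (F & G) = (D x F & G) | (F & D x G)
    return Alt(Interleave(deriv(token, e.left), e.right),
               Interleave(e.left, deriv(token, e.right)))

def nullable_interleave(e, nullable):
    # (F & G) matches the empty sequence iff both F and G do
    return nullable(e.left) and nullable(e.right)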

Conclusion

This paper has presented rabbit/duck grammars, which are a modest advance on CONCUR as a mechanism for validating overlapping structures. The essential difference from CONCUR is that rabbit/duck grammars allow the boundaries of elements in one hierarchy to be visible in document grammars for other hierarchies; this makes it possible to constrain interactions between hierarchies in ways not feasible with CONCUR.

In addition, Brzozowski derivatives have been extended to apply to the parsing of well-formed XML or MECS, to the validation of XML against one or more DTDs, and to the validation of markup with overlapping structures against rabbit/duck grammars.

Some of the goals with which the work was begun have been met, at least in part. Rabbit/duck grammars are powerful enough to be useful, without being Turing-complete. They are built around well understood ideas, notably the representation of overlapping structures as sets of tangled trees and the validation of trees using context-free grammars. There are simple validation algorithms using Brzozowski derivatives. Some goals have been met only incompletely. The requirement that no single grammar have two first-class elements which overlap each other means that grammars cannot always correspond 1:1 with the sub-groupings of a vocabulary which seem on conceptual grounds to belong together. The fact that in case of need a grammar may contain just one document element and one self-overlapping element (as in the case of the deletion grammar in section 5-1-4) serves as a reminder that every graph may be factored into trees, but may also suggest to some observers that that exercise is not necessarily an enlightening one.

One goal remains unmet: the rabbit/duck mechanism outlined here is well suited for validating data streams (TexMecs, or XML with Trojan horses, or concurrent XML in the style of XCONCUR), but there are at least some graph structures proposed in the literature (Goddag structures and multi-colored trees, to name two) which cannot be validated against a set of rabbit/duck grammars in any obvious way, since they have no structural analogue to the second-class tags in the data stream.

Most important, it is not clear whether rabbit/duck grammars can handle all of the kinds of constraints which one might naturally wish to impose on overlap, mostly because (in the absence of any serious discussion of validation) there is no consensus view in the community about what constitutes a natural constraint on overlap. The development of better intuitions on this subject remains a task for further work in the community. So does the elucidation of what classes of constraints can, and which classes of constraints cannot, be expressed by sets of rabbit/duck grammars.

Appendix: Worked examples

The examples in this section are intended to illustrate the operations described in the main text. They need not be consulted unless the reader is interested in detailed information.

Checking XML well-formedness

A well-formed document

A simple example may serve to illustrate the process. Let the input stream d be

<a>www<b><c/>xxx</b>yyy<d/>zzz</a>
The process of checking well-formedness then looks like this:
E 0 = X
E 1 = D stag(a) E 0 = W etag(a)
E 2 = D pcdata() E 1 = E 1
E 3 = D stag(b) E 2 = W etag(b) W etag(a)
E 4 = D stag(c) E 3 = W etag(c) W etag(b) W etag(a)
E 5 = D etag(c) E 4 = W etag(b) W etag(a) = E 2
E 6 = D pcdata() E 5 = E 5
E 7 = D etag(b) E 6 = W etag(a) = E 1
E 8 = D pcdata() E 7 = E 7
E 9 = D stag(d) E 8 = W etag(d) W etag(a)
E 10 = D etag(d) E 9 = W etag(a) = E 1
E 11 = D pcdata() E 10 = E 10
E 12 = D etag(a) E 11 = ε
So D d X = E 12 = ε, and we can conclude that the input d matched the expression X, i.e. that the document was well-formed.

An ill-formed data stream

If we parse an ill-formed stream, however, we should get an error. Let the input stream d be

<a>xxx<b>yyy</a>zzz</b>
Then the parse is:
E 0 = X
E 1 = D stag(a) E 0 = W etag(a)
E 2 = D pcdata() E 1 = E 1
E 3 = D stag(b) E 2 = W etag(b) W etag(a)
E 4 = D pcdata() E 3 = E 3
E 5 = D etag(a) E 4 = ∅
E 6 = D pcdata() E 5 = ∅
E 7 = D etag(b) E 6 = ∅
That is, D d X = E 7 = ∅, from which we can conclude that d is not well-formed XML.20
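
As the two traces suggest, for the well-formedness expression the derivative at any point is simply a stack of pending end-tags: each start-tag pushes an expectation, character data changes nothing, and each end-tag must match and pop the most recent expectation. A minimal Python sketch of that behaviour follows; the event-tuple representation is an assumption of the sketch, not part of the formalism.

def well_formed(events):
    """Check well-formedness of a stream of ('stag', g), ('etag', g), and
    ('pcdata', text) events by tracking the end-tags still expected."""
    pending = []                          # stack of generic identifiers
    for kind, name in events:
        if kind == "stag":
            pending.append(name)          # expect a matching end-tag later
        elif kind == "etag":
            if not pending or pending.pop() != name:
                return False              # the derivative has gone to ∅
        # 'pcdata' leaves the expression unchanged
    return not pending                    # ε only if nothing is still open

Applied to the first stream above, this returns True; applied to the second, it returns False at the end-tag for a, just as note 20 observes of the derivative.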

Validating XML

A valid instance

Let the DTD be

<!DOCTYPE a [
<!ELEMENT a (b, (c | d | e){2,∞}, b*) >
<!ELEMENT b EMPTY >
<!ELEMENT c EMPTY >
<!ELEMENT d EMPTY >
<!ELEMENT e EMPTY >
]>
and let the input d be
<a><b/><d/><e/></a>
Then the sequence of derivatives is as follows. (For brevity, we will also add the rule that D sole(e) elem(e) = ε, for cases where we know the element in question was declared EMPTY.)
E 0 = X
E 1 = D stag(a) E 0
= cm(a) etag(a)
= (elem(b) (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))*) etag(a)
E 2 = D sole(b) E 1
= D sole(b)(elem(b) (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))* etag(a))
= (D sole(b)(elem(b)) (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))* etag(a))
= (ε (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))* etag(a))
= ( (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))* etag(a))
E 3 = D sole(d) E 2
= D sole(d)((elem(c) | elem(d) | elem(e)){2,∞} (elem(b))* etag(a))
= (D sole(d)((elem(c) | elem(d) | elem(e)){2,∞}) (elem(b))* etag(a))
= ((D sole(d)(elem(c) | elem(d) | elem(e)) (elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= (((D sole(d) elem(c) | D sole(d) elem(d) | D sole(d) elem(e)) (elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= (((∅ | ε | ∅) (elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= (((ε)) (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
= ( (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
E 4 = D sole(e) E 3
= D sole(e)( (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
= (D sole(e)((elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= (((D sole(e)(elem(c) | elem(d) | elem(e)) (elem(c) | elem(d) | elem(e)){0,∞})) (elem(b))* etag(a))
= ((((D sole(e) elem(c) | D sole(e) elem(d) | D sole(e) elem(e)) (elem(c) | elem(d) | elem(e)){0,∞})) (elem(b))* etag(a))
= ((((∅ | ∅ | ε) (elem(c) | elem(d) | elem(e)){0,∞})) (elem(b))* etag(a))
= ((((ε) (elem(c) | elem(d) | elem(e)){0,∞})) (elem(b))* etag(a))
= ( (elem(c) | elem(d) | elem(e)){0,∞} (elem(b))* etag(a))
E 5 = D etag(a) E 4
= D etag(a)((elem(c) | elem(d) | elem(e)){0,∞} (elem(b))* etag(a))
= (D etag(a)((elem(c) | elem(d) | elem(e)){0,∞}) (elem(b))* etag(a)) | (D etag(a)(elem(b))* etag(a)) | (D etag(a) etag(a))
= (∅ (elem(b))* etag(a)) | (∅ etag(a)) | (D etag(a) etag(a))
= (∅ | ∅ | D etag(a) etag(a))
= D etag(a) etag(a)
= ε
Since D d X is the empty string, the document is valid.
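
The steps in this trace use only a handful of derivative rules for the compound expressions (sequence, choice, and bounded repetition). Purely as an illustration, a minimal derivative engine for such content models can be sketched in Python as follows; the class names and token tuples are inventions of the sketch, and the distinct sole, elem, and tag tokens used above are collapsed into a single token kind, a simplification the sketch assumes.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Empty: pass                 # ∅: matches nothing
@dataclass(frozen=True)
class Eps: pass                   # ε: matches the empty sequence
@dataclass(frozen=True)
class Tok:                        # a single token, e.g. ('elem', 'b')
    sym: tuple
@dataclass(frozen=True)
class Seq:                        # F, G
    left: object
    right: object
@dataclass(frozen=True)
class Alt:                        # F | G
    left: object
    right: object
@dataclass(frozen=True)
class Rep:                        # F{lo, hi}; hi = None means ∞
    body: object
    lo: int
    hi: Optional[int] = None

def nullable(e):
    if isinstance(e, Eps): return True
    if isinstance(e, (Empty, Tok)): return False
    if isinstance(e, Seq): return nullable(e.left) and nullable(e.right)
    if isinstance(e, Alt): return nullable(e.left) or nullable(e.right)
    if isinstance(e, Rep): return e.lo == 0 or nullable(e.body)
    raise TypeError(e)

def deriv(x, e):
    if isinstance(e, (Empty, Eps)): return Empty()
    if isinstance(e, Tok): return Eps() if e.sym == x else Empty()
    if isinstance(e, Alt): return Alt(deriv(x, e.left), deriv(x, e.right))
    if isinstance(e, Seq):
        first = Seq(deriv(x, e.left), e.right)
        return Alt(first, deriv(x, e.right)) if nullable(e.left) else first
    if isinstance(e, Rep):
        if e.hi == 0: return Empty()
        rest = Rep(e.body, max(e.lo - 1, 0),
                   None if e.hi is None else e.hi - 1)
        return Seq(deriv(x, e.body), rest)
    raise TypeError(e)

def matches(expr, tokens):
    for t in tokens:
        expr = deriv(t, expr)
    return nullable(expr)

# Content model of a from the example DTD: (b, (c | d | e){2,∞}, b*)
cde = Alt(Tok(('elem', 'c')), Alt(Tok(('elem', 'd')), Tok(('elem', 'e'))))
cm_a = Seq(Tok(('elem', 'b')), Seq(Rep(cde, 2), Rep(Tok(('elem', 'b')), 0)))

assert matches(cm_a, [('elem', 'b'), ('elem', 'd'), ('elem', 'e')])   # the valid instance
assert not matches(cm_a, [('elem', 'b'), ('elem', 'd')])              # cf. the invalid instance shown next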

An invalid document

If we try an invalid document, we produce ∅, not a nullable expression. Let d be

<a><b/><d/></a>
Then the sequence of derivatives is:
E 0 = X
E 1 = D stag(a) E 0
= cm(a) etag(a)
= (elem(b) (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))*) etag(a)
E 2 = D sole(b) E 1
= D sole(b)((elem(b) (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))*) etag(a))
= ( (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))* etag(a))
E 3 = D sole(d) E 2
= D sole(d)( (elem(c) | elem(d) | elem(e)){2,∞} (elem(b))* etag(a))
= (D sole(d)(elem(c) | elem(d) | elem(e)) (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
= (ε (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
= ( (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
E 4 = D etag(a) E 3
= D etag(a)( (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
= (D etag(a)((elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= ((D etag(a)(elem(c) | elem(d) | elem(e)) (elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= (((D etag(a) elem(c) | D etag(a) elem(d) | D etag(a) elem(e)) (elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= (((∅ | ∅ | ∅) (elem(c) | elem(d) | elem(e)){1,∞}) (elem(b))* etag(a))
= (∅ (elem(c) | elem(d) | elem(e)){1,∞} (elem(b))* etag(a))
= ∅

Validating documents in languages without self-overlap

A valid instance

Let grammar 1 be the grammar given in section 2-2 for the dramatic hierarchy; let grammar 2 be the (first) grammar given for the verse hierarchy.

Let the input be the fragment of Peer Gynt given in the text, with the addition of start- and end-tags for play, act, and scene.

<play|<act|<scene|
<sp who="Aase"|<L|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping|stage> 
    No, I'm not!|L>|sp>
<sp who="Aase"|<L| Well then, swear to me it's true! |L>|sp>
<sp who="Peer"|<L| Swear? Why should I? |sp>
<sp who="Aase"| See, you dare not! |L>
<L| Every word of it's a lie! |L>|sp>
|scene>|act>|play>

A trace generated by a validator for rabbit/duck grammars (and slightly cleaned up for presentation here) contains the following information. Each derivative takes the form E = F ∧ G; the sub-expression before the ∧ connector is from grammar 1 (dramatic hierarchy), the one after the ∧ is from grammar 2 (verse hierarchy).

  • E 1 = D stag(play) (play 1 ∧ play 2)
    = (act 1+, #etag(play)) ∧ (act 2+, #etag(play))
  • E 2 = D stag(act) E 1
    = (scene 1+, #etag(act), act 1*, #etag(play)) ∧ (scene 2+, #etag(act), act 2*, #etag(play))
  • E 3 = D stag(scene) E 2
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 4 = D stag(sp) E 3
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 5 = D stag(L) E 4
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 6 = D pcdata() E 5
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 7 = D etag(sp) E 6
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 8 = D stag(sp) E 7
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 9 = D stag(stage) E 8
    = (#PCDATA, #etag(stage), (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ (#PCDATA, #etag(stage), (#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 10 = D pcdata() E 9
    = (#PCDATA, #etag(stage), (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ (#PCDATA, #etag(stage), (#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 11 = D etag(stage) E 10
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 12 = D pcdata() E 11
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 13 = D etag(L) E 12
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 14 = D etag(sp) E 13
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 15 = D stag(sp) E 14
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 16 = D stag(L) E 15
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 17 = D pcdata() E 16
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 18 = D etag(L) E 17
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 19 = D etag(sp) E 18
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 20 = D stag(sp) E 19
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 21 = D stag(L) E 20
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 22 = D pcdata() E 21
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 23 = D etag(sp) E 22
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 24 = D stag(sp) E 23
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 25 = D pcdata() E 24
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 26 = D etag(L) E 25
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 27 = D stag(L) E 26
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 28 = D pcdata() E 27
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 29 = D etag(L) E 28
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 30 = D etag(sp) E 29
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 31 = D etag(scene) E 30
    = (scene 1*, #etag(act), act 1*, #etag(play)) ∧ (scene 2*, #etag(act), act 2*, #etag(play))
  • E 32 = D etag(act) E 31
    = (act 1*, #etag(play)) ∧ (act 2*, #etag(play))
  • E 33 = D etag(play) E 32
    = ε ∧ ε
The final derivative shows that the document satisfies both grammars.
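
The single-pass behaviour illustrated by the trace is just a matter of carrying one current expression per grammar through the same event stream and requiring every final derivative to be nullable. A minimal Python sketch of that outer loop follows; deriv, nullable, and the start expressions are assumed to be supplied by machinery of the kind sketched earlier, and the names are inventions of the sketch.

def validate_concurrently(events, starts, deriv, nullable):
    """Validate one event stream against several grammars at once.
    'starts' maps a grammar name to its start expression; deriv and
    nullable are the derivative operations for the expression language."""
    state = dict(starts)                  # one current expression per grammar
    for event in events:
        state = {name: deriv(event, expr) for name, expr in state.items()}
    # the document satisfies exactly those grammars whose final derivative is nullable
    return {name: nullable(expr) for name, expr in state.items()}

For the fragment just validated this reports both grammars satisfied; for the fragment in the next example it reports grammar 2 unsatisfied.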

An invalid instance

If we make the document invalid against the verse hierarchy by attempting to embed a line within another line, the input is:

<play|<act|<scene|
<sp who="Aase"|<L|<L|Peer, you're lying!|L>|sp>
<sp who="Peer"|<stage|without stopping|stage> 
    <L|No, I'm not!|L>|L>|sp>
<sp who="Aase"|<L| Well then, swear to me it's true! |L>|sp>
<sp who="Peer"|<L|<L| Swear? Why should I? |L>|sp>
<sp who="Aase"|<L| See, you dare not! |L>|L>
<L| Every word of it's a lie! |L> |sp>
|scene>|act>|play>

A validation trace for this element (cleaned up slightly for presentation) follows:

  • E 1 = D stag(play) (play 1 ∧ play 2)
    = (act 1+, #etag(play)) ∧ (act 2+, #etag(play))
  • E 2 = D stag(act) E 1
    = (scene 1+, #etag(act), act 1*, #etag(play)) ∧ (scene 2+, #etag(act), act 2*, #etag(play))
  • E 3 = D stag(scene) E 2
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 4 = D stag(sp) E 3
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 5 = D stag(L) E 4
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ((#PCDATA | stage 2 | #tag(sp))*, #etag(L), (L 2 | stage 2 | #tag(sp))*, #etag(scene), scene 2*, #etag(act), act 2*, #etag(play))
  • E 6 = D stag(L) E 5
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 7 = D pcdata() E 6
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 8 = D etag(L) E 7
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 9 = D etag(sp) E 8
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 10 = D stag(sp) E 9
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 11 = D stag(stage) E 10
    = (#PCDATA, #etag(stage), (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 12 = D pcdata() E 11
    = (#PCDATA, #etag(stage), (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 13 = D etag(stage) E 12
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 14 = D stag(L) E 13
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 15 = D pcdata() E 14
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 16 = D etag(L) E 15
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 17 = D etag(L) E 16
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 18 = D etag(sp) E 17
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 19 = D stag(sp) E 18
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 20 = D stag(L) E 19
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 21 = D pcdata() E 20
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 22 = D etag(L) E 21
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 23 = D etag(sp) E 22
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 24 = D stag(sp) E 23
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 25 = D stag(L) E 24
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 26 = D stag(L) E 25
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 27 = D pcdata() E 26
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 28 = D etag(L) E 27
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 29 = D etag(sp) E 28
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 30 = D stag(sp) E 29
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 31 = D stag(L) E 30
    = ((#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 32 = D pcdata() E 31
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 33 = D etag(L) E 32
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 34 = D etag(L) E 33
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 35 = D stag(L) E 34
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 36 = D pcdata() E 35
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 37 = D etag(L) E 36
    = (#PCDATA, (#PCDATA | stage 1 | #tag(L))*, #etag(sp), (sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 38 = D etag(sp) E 37
    = ((sp 1 | stage 1 | #tag(L))*, #etag(scene), scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 39 = D etag(scene) E 38
    = (scene 1*, #etag(act), act 1*, #etag(play)) ∧ ∅
  • E 40 = D etag(act) E 39
    = (act 1*, #etag(play)) ∧ ∅
  • E 41 = D etag(play) E 40
    = ε ∧ ∅

As may be seen, the result makes clear that the document is valid against grammar 1 but invalid against grammar 2.

Validating documents with self-overlap

Let grammar 1 be the first grammar of section 3-2.

The sample document

<p|<a><d~1|<a><d~2|<a><b><a>|d~1><b>|d~2><b>|p>
produces the following trace:
  • E 0 = elem(p)
  • E 1 = D stag(p) E 0
    = (#overlap(((#PCDATA | a 1 | #tag(d))*, ~d~, (#PCDATA | b 1 | #tag(d))*)?), #etag(p))
  • E 2 = D sole(a) E 1
    = (#overlap(((#PCDATA | a 1 | #tag(d))*, ~d~, (#PCDATA | b 1 | #tag(d))*)), #etag(p))
  • E 3 = D stag(d~1) E 2
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(((#PCDATA | a 1 | #tag(d))*, ~d~, (#PCDATA | b 1 | #tag(d))*)))), #etag(p))
  • E 4 = D sole(a) E 3 = E 3
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(((#PCDATA | a 1 | #tag(d))*, ~d~, (#PCDATA | b 1 | #tag(d))*)))), #etag(p))
  • E 5 = D stag(d~2) E 4
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) ∧ (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(((#PCDATA | a 1 | #tag(d))*, ~d~, (#PCDATA | b 1 | #tag(d))*)))))), #etag(p))
  • E 6 = D sole(a) E 5 = E 5
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) ∧ (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(((#PCDATA | a 1 | #tag(d))*, ~d~, (#PCDATA | b 1 | #tag(d))*)))))), #etag(p))
  • E 7 = D sole(b) E 6
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) ∧ (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(∅))))), #etag(p))
  • E 8 = D sole(a) E 7 = E 7
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) ∧ (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(∅))))), #etag(p))
  • E 9 = D etag(d~1) E 8
    = ((((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1)) | ε), (#PCDATA | b 1 | #tag(d))*) | (((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1)) | ε), (#PCDATA | b 1 | #tag(d))*) ∧ (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(∅))))), #etag(p))
  • E 10 = D sole(b) E 9
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))* | ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))*) ∧ (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(∅))))), #etag(p))
  • E 11 = D etag(d~2) E 10
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))* | ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))*) ∧ (((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2)) | ε), (#PCDATA | b 1 | #tag(d))*) | (((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2)) | ε), (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(∅))))), #etag(p))
  • E 12 = D sole(b) E 11
    = ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))* | ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~1), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))*) ∧ (((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))* | ((((#PCDATA | a 1 | b 1 | #tag(d))*, #etag(d~2), (#PCDATA | b 1 | #tag(d))*) | (#PCDATA | b 1 | #tag(d))*) ∧ #overlap(∅))))), #etag(p))
  • E 13 = D etag(p) E 12
    = ε

Notes

1.

The work reported here was undertaken in the context of the MLCD Project (Markup languages for complex documents) at the University of Bergen and supported in part by a grant from the Meltzer Foundation (L. Meltzers Høyskolefond). Earlier drafts have been read by Steven J. DeRose, Patrick Durusau, Claus Huitfeldt, and Andreas Witt, as well as by the anonymous reviewers who reviewed the paper for this conference, and the paper has benefited from their comments. My thanks to all of them.

2.

This is a slight oversimplification: in principle, general entities appearing in the document might have different replacement text in different DTDs, and a record-end which was suppressed in one DTD by the record-end rules of ISO 8879 might count as data in a different tree. These complications are ignored here, as they were in all the work I am aware of that used concurrent markup.

3.

I am not aware of formal work on the classes of context-sensitive languages which may be recognized as the intersection of two, or of three, or of n context-free grammars. I conjecture, without being able to offer proof, that there are context-sensitive languages which can be recognized as the intersection of three context-free grammars, but not as the intersection of two, and more generally that as the number n increases, so does the set of languages expressible as the intersection of n context-free languages.

4.

Also in the use of CONCUR being worked out in Bielefeld by Andreas Witt and his colleagues and students, which I refer to here as XCONCUR or as ‘Bielefeld CONCUR’.

5.

As a convenience, in ISO 8879 the omission of explicit labels serves to mark a tag as belonging to every tree.

6.

The image was first made an object of scholarly inquiry by Joseph Jastrow [Jastrow 1899]; the image used here is a reproduction of that used by Jastrow, from [Kihlstrom 2004]. The same ambiguous figure occupied Ludwig Wittgenstein and it is sometimes erroneously attributed to him, although he appears to have learned of it from Jastrow.

7.

My colleague Liam Quin suggests that orthogonality would require a fifth class of element, in which the start- and end-tags would be visible but the contents of the element invisible. I have not yet seen an application for this class, so it has not been incorporated into the exposition below.

8.

Systems for practical use may prefer to use the keyword unbounded instead of ∞.

9.

This constraint is not intended to replace the stylesheet code which tells the formatter to begin each chapter on a new page; it merely provides a convenient way of making sure that the formatter actually did so, and that the output has not been garbled since.

10.

Because it is used in later discussion, the connector “∧” is also defined here, although it is not conventional in schema languages. When F and G are expressions, the content-model expression (F ∧ G) denotes the set of sequences which are members both of F and of G.

11.

The reader in search of clear definitions of these should consult [Grune/Jacobs 1990].

12.

One can imagine that an experimental writer like Perec might try to write a verse drama in which the stage directions are part of the verse structure, but let us assume our grammar is for Shakespeare, Goethe, and Ibsen, who do not do this.

13.

The reader unfamiliar with TexMecs should be told that <e| is a start-tag with generic identifier e, and |e> is the corresponding end-tag. If the element is empty, it may be written as a sole-tag: <e>. Text enclosed between <* and *> is a comment. For further details see [Huitfeldt / Sperberg-McQueen 2001].

14.

That is (as already noted), a normal overlap group will be ambiguous: the same element type will be treated both as a first- and as a second-class element. One purpose of introducing an explicit overlap-group is to allow the scope of the ambiguity to be marked explicitly. In a system for practical use, it would be convenient to allow overlap-tokens to occur without an explicit overlap-group, and to assume that the content model as a whole is the overlap-group. In such a system, explicit overlap groups would need to be written only when it was important to note that the scope of the ambiguity was less than the entire content model.

15.

As the reader will have noted, the overlap rules just described treat a self-overlapping element as first-class in one reading and second-class in the others. The grammars are simpler to read and write, though less fine control is available, if self-overlap is handled as first-class in one reading and third-class in the others. The third-class tags need not be provided for in the grammar. This alternative is only a convenience feature, though: it does not increase the formal power of the grammatical formalism, and so it is not illustrated here. [Probably it should be, though.]

16.

Since the content of fourth-class elements is suppressed, no first-class element can overlap a fourth-class element.

17.

For further information, the reader is directed to [Brzozowski 1964], which is quite accessible to non-mathematical readers, or to [Sperberg-McQueen 2005], which applies Brzozowski derivatives to the processing of XML documents.

18.

The operator denoted by “&” is unrestricted interleave: the interleave (s 1 & s 2) of two strings s 1 and s 2 is any string which contains both the elements of s 1, in order, and the elements of s 2, in order, with no element of the interleave being assigned both to s 1 and to s 2. Applied to sets, the interleave (F & G) of F and G denotes the set of strings s formed by the interleave of some s 1 in F and some s 2 in G. This is not the place to provide a detailed motivation for an interleave operator; it is included here partly because some current schema language proposals have it, and partly because the example in section 5-2 uses it.

19.

The grammars do go beyond the explicit formulation of the constraint, which requires only that nothing overlap with vol. The constraint does not forbid vol from nesting within a section or a paragraph. But either such a volume would be unnaturally small, or else such a paragraph unnaturally long.

20.

As the sequence of derivatives makes clear, the ill-formedness is visible as soon as the end-tag for a is encountered.


Bibliography

[ACH/ACL/ALLC 1994] Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.

[Barnard et al. 1988] Barnard, David, Ron Hayter, Maria Karababa, George Logan, and John McFadden. 1988. “SGML-based markup for literary texts: Two problems and some solutions”. Computers and the Humanities 22: 265-276.

[Barnard et al. 1995] Barnard, David, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. M. Sperberg-McQueen, and Giovanni Battista Varile. 1995. “Hierarchical encoding of text: Technical problems and SGML solutions”. Computers and the Humanities 29: 211-231.

[Brzozowski 1964] Brzozowski, Janusz A. 1964. “Derivatives of regular expressions”. Journal of the ACM 11.4 (1964): 481-494.

[Dekhtyar / Iacob 2005] Dekhtyar, Alex, and Ionut Emil Iacob. 2005. “A Framework For Management of Concurrent XML Markup”. Data and Knowledge Engineering 52.2: 185-215.

[DeRose 2004] DeRose, Steven. 2004. “Markup overlap: A review and a Horse”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/DeRose01/EML2004DeRose01.html.

[Durand et al. 1996] Durand, David G., Elli Mylonas, and Steven J. DeRose. 1996. “What should markup really be? Applying theories of text to the design of markup systems.” ALLC-ACH'96, Joint Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. Bergen. Available on the Web at http://cs-people.bu.edu/dgd/ach96_talk/Redefining_long.html.

[Durusau / O'Donnell 2002a] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. “Concurrent markup for XML documents”. Paper given at XML Europe 2002, Barcelona, sponsored by IDEAlliance. Available on the Web at http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-03-07/03-03-07.html.

[Durusau / O'Donnell 2002b] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. “Coming down from the trees: Next step in the evolution of markup?” Late-breaking paper given at Extreme Markup Languages 2002, Montréal, sponsored by IDEAlliance. Slides (but no full text) available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/author-pkg/2002/Durusau01/EML2002Durusau01.zip.

[Durusau / O'Donnell 2004] Durusau, Patrick, and Matthew Brook O'Donnell. 2004. “Tabling the overlap discussion”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Durusau01/EML2004Durusau01.html.

[Dybkjær et al. 1998] Dybkjær, Laila, Niels Ole Bernsen, Hans Dybkjær, David McKelvie, and Andreas Mengel. 1998. The MATE Markup Framework. MATE Deliverable D1.2. 30 November 1998. Available on the Web at http://mate.nis.sdu.dk/information/d12/

[Grune/Jacobs 1990] Grune, Dick, and Ceriel J. H. Jacobs. 1990. Parsing techniques: a practical guide. New York, London: Ellis Horwood, 1990. Postscript of the book is available from the first author's Web site at http://www.cs.vu.nl/~dick/PTAPG.html. A second edition is due to appear in 2006 from Springer Verlag.

[Hilbert et al. 2005] Hilbert, Mirco, Oliver Schonefeld, and Andreas Witt. 2005. “Making CONCUR work”. Paper given at Extreme Markup Languages 2005, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2005/Witt01/EML2005Witt01.xml.

[Huitfeldt / Sperberg-McQueen 2001] Huitfeldt, Claus, and C. M. Sperberg-McQueen. 2001. “TexMECS: An experimental markup meta-language for complex documents”. http://helmer.aksis.uib.no/claus/mlcd/papers/texmecs.html

[Huitfeldt 1995] Huitfeldt, Claus. 1995. “Multi-dimensional texts in a one-dimensional medium”. Computers and the Humanities 28: 235-241.

[Huitfeldt 1999] Huitfeldt, Claus. 1999. MECS—A multi-element code system. Working papers of the Wittgenstein Archives at the University of Bergen.

[ISO 1986] International Organization for Standardization (ISO). 1986. ISO 8879-1986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.

[Jagadish et al. 2004] Jagadish, H. V., Laks V. S. Lakshmanan, Monica Scannapieco, Divesh Srivastava, and Nuwee Wiwatwattana. 2004. “Colorful XML: One hierarchy isn't enough”. Proceedings of the 2004 ACM SIGMOD International conference on management of data, Paris, sponsored by the Association for Computing Machinery Special Interest Group on Management of Data. New York: ACM Press.

[Jastrow 1899] Jastrow, Joseph. 1899. “The mind's eye”. Popular Science Monthly 54 (1899): 299-312.

[Kihlstrom 2004] Kihlstrom, John F. 2004. “Joseph Jastrow and his duck — Or is it a rabbit?” Originally written 16 November 2004 as a letter to the editor of Trends in cognitive sciences; available on Kihlstrom's Web site at http://ist-socrates.berkeley.edu/~kihlstrm/JastrowDuck.htm

[Megginson 1992] Megginson, David. 1992. Re: Minimization, nesting, multiple types, etc. Contribution to thread in comp.text.sgml, 14 February 1992. Available on the Web at http://groups.google.com/group/comp.text.sgml/browse_thread/thread/e108fbd9e92fa833/82b3ff7b20f42651?lnk=st&q=&rnum=9#82b3ff7b20f42651

[Murata 1995] Murata Makoto. 1995. “File format for documents containing both logical structures and layout structures”. Electronic publishing 8: 295-317.

[Nicol 2002] Nicol, Gavin. 2002. “Core range algebra: Toward a formal theory of markup”. Paper given at Extreme Markup Languages 2002, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2002/Nicol01/EML2002Nicol01.html.

[Renear et al. 1993] Renear, Allen, Elli Mylonas, and David Durand. 1993. “Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies”. In Research in Humanities Computing, ed. Nancy Ide and Susan Hockey. Oxford: Oxford University Press. Available on the Web at: http://www.stg.brown.edu/resources/stg/monographs/ohco.html

[Sasaki 2004] Sasaki, Felix. 2004. “Secondary information structuring: A methodology for the vertical interrelation of information resources”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Sasaki01/EML2004Sasaki01.html.

[Sperberg-McQueen / Huitfeldt 1999] Sperberg-McQueen, C. M., and Claus Huitfeldt. 1999. “Concurrent document hierarchies in MECS and SGML”. Literary & Linguistic Computing 14.1: 29-42. http://www.w3.org/People/cmsmcq/2000/poddp2000.html

[Sperberg-McQueen / Huitfeldt 2000] Sperberg-McQueen, C. M., and Claus Huitfeldt. 2000. “GODDAG: A Data Structure for Overlapping Hierarchies”. In DDEP-PODDP 2000, ed. P. King and E.V. Munson, Lecture Notes in Computer Science 2023 (Berlin: Springer, 2004), pp. 139-160. Available on the Web at http://www.springerlink.com/index/98J1VBU5NBY73UL3 and http://www.w3.org/People/cmsmcq/2000/poddp2000.html.

[Sperberg-McQueen 2005] Sperberg-McQueen, C. M. 2005. “Applications of Brzozowski derivatives to XML Schema processing”. Paper given at Extreme Markup Languages 2005, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2005/SperbergMcQueen01/EML2005SperbergMcQueen01.html, http://www.w3.org/People/cmsmcq/2005/abdxsp.unicode.html, and http://www.w3.org/People/cmsmcq/2005/abdxsp.ascii.html.

[Tennison / Piez 2002] Tennison, Jeni, and Wendell Piez. 2002. “LMNL Syntax”. Available on the Web at http://www.lmnl.net/prose/syntax/index.html.

[Witt 2004] Witt, Andreas. 2004. “Multiple hierarchies: new aspects of an old solution”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html


