Recent work on markup has emphasized the importance of overlapping structures, but little progress has been made toward validation of such structures: the state of the art in validation of overlap remains where it was in 1986 with the CONCUR feature of SGML.
Like CONCUR, rabbit/duck grammars manage validation by means of several context-free grammars, but rabbit/duck grammars offer some improvements over CONCUR: within a document grammar written for one hierarchical subset of the markup vocabulary, the start- and end-tags of other hierarchies may be made visible; rabbit/duck grammars thus make it possible to constrain the interactions of different hierarchies in ways not possible with CONCUR. Rabbit/duck grammars also provide a more principled account of self-overlap (i.e. overlap of two elements with the same generic identifier) than CONCUR.
Documents can be validated against rabbit/duck grammars either by validating individually against each grammar or by a single pass through the document which validates against all grammars simultaneously. An implementation of this single-pass approach using Brzozowski derivatives is sketched out.
Keywords: Validating; Modeling; Concurrent Markup/Overlap
| XML Source | PDF (for print) | Author Package | Typeset PDF |
This document introduces rabbit/duck grammars, a method of validating overlapping structures.1
One of the first things taught to students in courses on SGML or XML is that elements nest within other elements, thus forming a hierarchical structure or tree. One of the first questions asked by alert students is: but what about non-hierarchical structures in the information? What about overlapping structures?
It is thus no surprise that the development and spread of SGML and XML have been accompanied from the beginning by a vigorous discussion of the limits of tree structures and the best ways of dealing with non-hierarchical information; see, for example, [Barnard et al. 1988], [Renear et al. 1993], [Barnard et al. 1995], [Huitfeldt 1995], [Murata 1995], [Durand et al. 1996]. Out-of-line or standoff markup [Dybkjær et al. 1998], fragmentation of elements, milestone elements, virtual elements (e.g. [Barnard et al. 1988], [ACH/ACL/ALLC 1994], [Barnard et al. 1995]), concurrent structures [ISO 1986] [Sperberg-McQueen / Huitfeldt 1999], MECS [Huitfeldt 1999], parallel encoding (e.g. [Witt 2004], [Hilbert et al. 2005], [Dekhtyar / Iacob 2005]), bottom-up virtual hierarchies [Durusau / O'Donnell 2002a], layered markup and annotation (LMNL) [Tennison / Piez 2002], range algebra [Nicol 2002], just-in-time trees [Durusau / O'Donnell 2002b], multi-colored trees [Jagadish et al. 2004], tables [Durusau / O'Donnell 2004], TexMecs [Huitfeldt / Sperberg-McQueen 2001], Goddag structures [Sperberg-McQueen / Huitfeldt 2000], Trojan horses [DeRose 2004] — the world has not been starved of proposals for markup systems that handle overlap more or less gracefully (for some sense of the word handle). The discussion has been vigorous both with respect to the serial form to be assumed by the data and with respect to the abstractions or data structures around which access to and operations on the data are to be organized.
On the topic of validation, however, the discussion has been remarkably quiet. The state of the art appears to be today just what it was twenty years ago, when ISO 8879 defined the validity conditions for markup using the SGML feature CONCUR. Concurrent markup defines multiple trees over the same frontier2, each obeying its own DTD; ISO 8879 does not go into details on the subject, but the document may conveniently be regarded as valid if and only if it is valid against each of its DTDs.
In general, discussions of overlap say nothing at all about validation, but those mechanisms (e.g. multi-colored trees, XCONCUR) which like CONCUR treat overlap as the interplay of multiple trees (possibly over the same frontier, as in CONCUR, and possibly over different frontiers, as in multi-colored trees) can in general be validated by validating each of their trees in isolation against an appropriate document grammar.
Tree-by-tree validation provides no mechanism for constraining the interactions of the trees: each tree is validated in majestic and lonely isolation. Three forms of interaction in particular are worth noting.
In the search for a method of validating overlapping structures, several goals may be identified:
The method outlined in this paper for writing document grammars and validating documents against them, called rabbit/duck grammars, is designed to address these issues.
Some precursors of the idea are worth noting; they may help make it clearer.
The first idea is language intersection. Formal language theory treats languages as sets of strings; the intersection of two languages L 1 and L 2 is just the set of strings which appear both in L 1 and in L 2. Other set operations (e.g. union and complementation) are also possible.
Discussions of formal language theory routinely refer to the important results which establish that regular languages are closed under the operations of union, intersection, and complementation, and that the set of context-free languages is not closed under intersection. An example often cited is this: consider the languages L 1 and L 2, where L 1 is
This example can be (and, I think, normally is) taken as an indication that set operations are ‘safe’ when performed on regular languages but unsafe on context-free languages. Safe, since they cannot have a result which is not itself a regular language; unsafe, since they clearly increase the expressive power of the formalism.
A few years ago, however, Lars Johnsen (a collaborator on the MLCD project) suggested in a workshop that the facts here could be given a more positive interpretation. While it is in general difficult to tell in reasonably bounded time whether a given string is a member of an arbitrary context-sensitive language, the example just given demonstrates that there are at least some context-sensitive languages for which the test can be performed in the same time required to parse context-free languages, just by parsing the string twice against different context-free grammars; if both accept it, then the context-sensitive language also accepts it. Rabbit/duck grammars apply this idea to document grammars: while in general languages like TexMecs require more than context-free power, they can be factored into a finite set (often a small one) of mostly conventional document grammars.3
A second related idea is the SGML feature CONCUR, which rabbit/duck grammars resemble in some ways. In SGML using CONCUR,4 overlap is handled by grouping elements into trees. Each tag is marked with the identifiers of the trees it belongs to; tags belonging to other trees are ignored.5
Rabbit/duck grammars resemble CONCUR in analysing overlap for the most part as involving distinct trees which share frontiers and some internal structures, and in using a set of several context-free grammars as the basis of validation. They differ from CONCUR in allowing each grammar to refer to tags for elements which are not part of the tree being validated, and in providing for elements which overlap other elements with the same generic identifier.
The closest analogue I am aware of to rabbit/duck grammars, however, is an idea formulated in February 1992 by David Megginson on the Netnews discussion list comp.text.sgml [Megginson 1992].
Megginson's idea was to use end-tag omission, together with the fact that in SGML empty elements look the same as start-tags, to allow the same SGML data stream to be parsed in either of two ways. He provides two DTDs. In one, the elements recto, verso, line, and clause are empty while the elements poem, fit, verse, averse, and bverse have content. In the other DTD, the roles are reversed. Whichever grammar is used for parsing and processing, one hierarchical view is dominant and the other recessive, but Megginson's technique allows information about (the start of) elements in the recessive view to be visible. Whether the tag recto denotes an empty milestone or the beginning of an element with content depends on the grammar; changing grammars performs a sort of Gestalt switch.
Rabbit/duck grammars use a similar sort of Gestalt switch (although the technical details are different), and are named for the well known image used in many discussions of Gestalt perception.6

A figure which could be the head of a rabbit facing right or of a duck facing left.
Rabbit/duck grammars resemble Megginson grammars in using multiple document grammars and treating the same tag as a start-tag in some and a milestone in others; they differ in some ways made necessary by syntactic differences between the SGML reference concrete syntax and XML. Megginson also did not contemplate restricting the interactions of the two grammars; he used inclusion exceptions to allow milestones for recessive elements to appear anywhere.
The basic idea of rabbit/duck grammars is simple: define a set of distinct document grammars whose intersection describes the language to be accepted. In each grammar, each element type is assigned to one of four classes.7
The remainder of the paper is organized as follows. Section 2 provides a more complete definition of rabbit/duck grammars, defines a simple notation for writing them, and shows a simple example, concentrating on the simple case of languages without self-overlapping elements (i.e. languages in which any two overlapping elements are required to have different generic identifiers). Section 3 extends the discussion to languages with self-overlap. Section 4 addresses implementation issues. Section 5 describes some simple applications of rabbit/duck grammars.
I first consider rabbit/duck grammars for languages like Mecs [Huitfeldt 1999], which allow elements with different generic identifiers to overlap each other but prohibit overlap between elements with the same generic identifier.
Informally, one can describe the application of rabbit/duck grammars to such vocabularies thus. First, form groups of elements by assigning each element in the vocabulary to one or more groups, such that no two elements in the same group should ever overlap in a valid document instance. For example, if the vocabulary includes the elements doc (document), ch (chapter), sec (section), h (heading, title of chapter or section), p (paragraph), hi (highlighted phrase), vol (volume), page, and tl (typographic line), with the obvious meanings, then vol, page, and tl might go into one group, and the other elements into a second.
Second, define a document grammar for each group. For purposes of this paper, I will use an extension of DTD notation in which content models can contain primitive tokens of the following forms (where g is a generic identifier):
In addition, by analogy with XML Schema, I allow occurrence indicators to take the form of integer ranges, so that for any expression E one can write E { n , m }, where n is a non-negative integer and m is either a positive integer or the character “∞”8 The usual occurrence indicators of DTDs are also used.
Further extensions allow second-, third-, and fourth-class elements to be defined: instead of content models, they have the keywords MILESTONE-TAGS, IGNORE-TAGS, and IGNORE, respectively. In the examples given here, I adopt the convention that any element g mentioned in a primitive token of the form #stag( g ), #etag( g ), or #tag( g ) is a second-class element and need not be declared explicitly. Elements neither declared nor mentioned in a content model will be assumed to be third-class elements.
For our simple illustration, the page-oriented grammar might be
<!GRAMMAR vol [ <!ELEMENT vol (#stag(doc)?, page+, #etag(doc)?) > <!ELEMENT page (tl+) > <!ELEMENT tl (#PCDATA) > ]>
The logical-structure grammar might be
<!GRAMMAR doc [
<!ELEMENT doc (#stag(vol)?,
(p, #tag(page)*)+,
(ch, #tag(page)+)+,
#etag(vol)?) >
<!ELEMENT ch (#tag(page)*,
h,
(((p, #tag(page)*)+,
(sec, #tag(page)*)*)
| (sec, #tag(page)*)+)) >
<!ELEMENT sec (#tag(page)*,
h,
(((p, #tag(page)*)+,
(sec, #tag(page)*)*)
| (sec, #tag(page)*)+)) >
<!ELEMENT h (#PCDATA | hi | #tag(tl))* >
<!ELEMENT p (#PCDATA | hi | #tag(tl)
| #tag(page))* >
<!ELEMENT hi (#PCDATA | #tag(tl)
| #tag(page))* >
]>Formally, a rabbit/duck grammar can be viewed as a triple (G, D, S), where
content-model ::= group | counted
group ::= seq | or | all | conj
counted ::= group counter
expr ::= token | group | counted
seq ::= '(' expr (',' expr)* ')'
or ::= '(' expr ('|' expr)+ ')'
all ::= '(' expr ('&' expr)+ ')'
and ::= '(' expr ('∧' expr)+ ')'
counter ::= '+' // '+' = {1,∞}
| '*' // '*' = {0,∞}
| '?' // '?' = {0,1}
| '{' min ',' max '}'
| group '{' count '}'
min ::= '0' | count
count ::= POSITIVE_INTEGER
max ::= count | '∞'
token ::= basic-token
| counted-token
counted-token ::= basic-token counter
token ::= first_class
| second_class
| '#PCDATA'
first_class ::= GI
second_class ::= '#stag(' GI ')'
| '#etag(' GI ')'
| '#tag(' GI ')'In a clean rabbit/duck grammar, every element type is declared once (explicitly or implicitly), every generic identifier used in a right-hand side as a first-class token is that of a first-class element, and every generic identifier used in a second-class token is that of a second-class element (except in overlap-groups, as specified below in section 3).
Undefined element types, unused element types, non-productive declarations, and so on can be defined for rabbit/duck grammars in the same way as for context-free grammars in general.11 In what follows, I will without loss of generality restrict my attention to clean grammars in which none of these deficiencies occur.
In a full set of rabbit/duck grammars, each grammar is defined over the same vocabulary G, and each generic identifier in G is first-class in at least one grammar.
Some technical commentary is in order, on the four ways a grammar can treat elements and their start- and end-tags.
An example may help clarify things. Consider the opening lines of Peer Gynt used as a running example in [Sperberg-McQueen / Huitfeldt 2000]:
Åse: Peer, du lyver!
Peer Gynt : [uten å stanse] Nei, jeg gjør ei!
Åse: Nå, så bann på det er sant!
Peer Gynt: Hvorfor banne?
Åse: Tvi, du tør ei! Alt i hop er tøv og tant!
In English:
AASE: Peer, you're lying!
PEER GYNT : [without stopping] No, I'm not!
AASE: Well then, swear to me it's true.
PEER GYNT: Swear? why should I?
AASE: See, you dare not! Every word of it's a lie.
In the dramatic hierarchy, we want a hierarchy of play / act / scene, with scenes containing speeches and stage directions (speech, stage).
In the metrical hierarchy, we want play / act / scene, with scenes containing verse lines (L).
The dramatic hierarchy can be defined thus:
<!--* Grammar D *-->
<!GRAMMAR play [
<!ELEMENT play (act+) >
<!ELEMENT act (scene+)
<!ELEMENT scene (sp | stage | #tag(L))* >
<!ELEMENT sp (#PCDATA | stage | #tag(L))* >
<!ATTLIST sp
who CDATA #REQUIRED >
<!ELEMENT stage (#PCDATA) >
]>
The verse structure can be defined this way:
<!--* Grammar V *--> <!GRAMMAR play [ <!ELEMENT play (act+) > <!ELEMENT act (scene+) <!ELEMENT scene (L | stage | #tag(sp) )* > <!ELEMENT L (#PCDATA | stage | #tag(sp) )* > <!ELEMENT stage (#PCDATA) > ]>
In this view, a scene of a verse drama is a sequence of verse lines interspersed with stage directions. If we wish to ignore the stage directions entirely, we can do that, too:
<!ELEMENT scene (L | #tag(sp) )* > <!ELEMENT stage IGNORE >
In TexMecs notation, this fragment can be marked up this way:13
<play| <* Peer Gynt *>
<act n="1"|
<scene n="i"|
<sp who="Aase"|<l|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping|stage>
No, I'm not!|l>|sp>
<sp who="Aase"|<l| Well then, swear to me it's true! |l>|sp>
<sp who="Peer"|<l| Swear? Why should I? |sp>
<sp who="Aase"| See, you dare not! |l>
<l| Every word of it's a lie! |l> |sp>
<* ... more dialogue ... *>
|scene>
<* ... more scenes ... *>
|act>
<* ... more acts ... *>
|play>
If we associate a view of the document (in the database sense) with each grammar, the difference between the two variants of the verse grammar just given is the difference between a view in which stage directions are present:
<l| Peer, you're lying! <stage|without stopping|stage> No, I'm not! |l> <l| Well then, swear to me it's true! |l> <l| Swear? Why should I? See, you dare not! |l> <l| Every word of it's a lie! |l>
<l| Peer, you're lying! No, I'm not! |l> <l| Well then, swear to me it's true! |l> <l| Swear? Why should I? See, you dare not! |l> <l| Every word of it's a lie! |l>
For metrical studies, the second view may be more convenient than the first. It's easy enough to transform the first into the second, but some vocabulary designers may also find it easier to write grammars like the second, in which they can completely ignore elements in the other hierarchies. (When all elements in a grammar are first- or third-class, rabbit/duck grammars are very similar to CONCUR.)
Two important conditions are established and maintained by the rules given above for all documents valid against a clean set of rabbit/duck grammars.
For the cases covered by the current rules, it seems plausible to believe (1) that any document for which the two general conditions are true should count as valid for a given set of rabbit/duck grammars, and (2) that no document for which the general conditions do not hold should count as valid. The conditions, that is, seem to be both sufficient and necessary conditions for validity against a full set of clean rabbit/duck grammars.
The rules defined so far handle all overlap as the overlaying of the tree defined by one grammar with the tree defined by another. They cannot handle self-overlap, because in self-overlap both element instances are first-class elements in the same grammars. But the rules do suffice to cover most cases of concurrent hierarchies in which overlap stems from the tagging of multiple analytic perspectives, and when applied to such cases, the rules given so far appear to produce intuitively plausible results. So the conditions may serve to guide us in attempting to extend the mechanism to cover cases of self-overlap: what is needed is a way of handling self-overlap that satisfies both of them.
If within a given language, instances of some particular element type b can overlap other instances of element type b, then we wish to write one or more of the grammars in a set of rabbit/duck grammars in such a way as to ensure that
For example, consider the following XML document, written with Trojan Horse milestones for b (later examples will use a more compact notation):
<doc> aaa <b sID="b1"/> bbb <b sID="b2"/> ccc <b eID="b1"/> ddd <b eID="b2"/> eee </doc>
<doc> aaa <b> bbb <b sID="b2"/> ccc </b> ddd <b eID="b2"/> eee </doc>
<doc> aaa <b sID="b1"/> bbb <b> ccc <b eID="b1"/> ddd </b> eee </doc>
It is desirable to have the same control over overlap as we have for other items in a content model, so we wish to be able to control where it can and cannot occur. The following sections outline a way to accomplish this.
To handle languages in which elements with the same generic identifier can overlap each other, a few modest changes suffice. We need (1) a syntactic mechanism for matching start- and end-tags, (2) a new kind of token which we will call an overlap-token, and (3) a new kind of group for use in content models, the overlap-group.
First, since elements don't necessarily nest, some syntactic mechanism is needed for matching start- to end-tags. The generic identifier alone no longer suffices. The examples shown here will use the co-indexing syntax of TexMecs: a tilde and an arbitrary string of characters will decorate the generic identifier, and decorated start- and end-tags match only other tags with the same decoration. So in the document
<doc|<s~1|aaa<s~2|bbb|s~1>ccc|s~2>|doc>
<doc| aaa <b~b1| bbb <b~b2| ccc |b~b1> ddd |b~b2> eee |doc>
Second, a new kind of token is required, namely an overlap-token:
overlap-token ::= '~' GI '~'
Third, a new type of content-model expression is required, an overlap group:
group ::= '#overlap(' expr ')'In any normal overlap-group, there will be at least one overlap-token (~ g ~) and at least one #tag( g ) (or at least one #stag( g ) and at least one #etag( g )) token for the same generic identifier. Together, these show the two ways in which every occurrence of g within the scope of the overlap-group must be read. Every occurrence of g must be a first-class element in (at least) one reading, and a pair of milestones (i.e. a second-class element) in all other readings.14
If #overlap(α) is an overlap-group with an overlap-token for g, then we can use α1 to denote the reading of the expression in which the first g is first-class and any others are second-class, α2 to denote the reading in which the second g is first-class, and so on. Then if there is just one g in the relevant part of the document instance, the meaning of the overlap-group is (α1). If there are two overlapping g elements, the meaning of the overlap-group is (α1 ∧ α2), where the connector “∧” denotes conjunction. When F and G are expressions, the content-model expression (F ∧ G) denotes the set of sequences which are members both of F and of G. When F and G are readings, the expression (F ∧ G) denotes the set of sequences which can be read both as F and as G. So (α1 ∧ α2) denotes the set of expressions which can be read both as α1 and as α2 For three overlapping g elements, the meaning is (α1 ∧ α2 ∧ α3). And so on. The meaning of an overlap-group, formally, is thus the infinite disjunction ((α1) ∨ (α1 ∧ α2) ∨ (α1 ∧ α2 ∧ α3) ∨ ...)
The two general conditions identified above (section 2-3) are preserved by this interpretation of overlap-groups: the grammar rules for the parent apply separately for each overlapping element matching the overlap-token, as do those for the overlapping element itself. And each overlapping instance is covered by a declaration and produces a reading in which it is a first-class element.
A simple example of overlap might involve pages (tagged p) deletions (d), and other elements (a, b). Deletions can overlap other deletions (i.e. element d may self-overlap), but no other overlaps occur in this group of element types. We can define the grammar thus:
<!GRAMMAR p [
<!ELEMENT p #overlap( (
(#PCDATA | a | #tag(d))*,
~d~,
(#PCDATA | b | #tag(d))* )? ) >
<!ELEMENT d (#PCDATA | a | b | #tag(d))*) >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
]>Given a document instance
<p| <a><d~1|<a><d~2|<a><b><a>|d~1><b>|d~2><b> |p>
A second example illustrates slightly tighter constraints: d elements contain not an unordered mix of a and b elements, but an ordered sequence:
<!GRAMMAR p [
<!ELEMENT p (h*,
#overlap((a | #stag(d))*,
~d~,
(c | #etag(d))*),
f*) >
<!ELEMENT d ((a | #tag(d) | ~d~)*,
b,
(b | #tag(d) | ~d~)+,
(c | #tag(d) | ~d~)*) >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
<!ELEMENT c EMPTY >
<!ELEMENT f (#PCDATA) >
<!ELEMENT h (#PCDATA) >
]><p| <a><d~1|<a><d~2|<b>|d~1><c>|d~2><c> |p>
In the preceding sections I have endeavored to give a purely declarative account of rabbit/duck grammars and their meaning, without any appeal to particular parsing techniques. But of course a grammatical formalism is more likely to be useful in practice if the constraints it expresses can actually be enforced at reasonable implementation and runtime cost.
One implementation technique is straightforward for languages without self-overlap:
Alternatively, one can dispense with the filtering step by building a parser for each grammar, in which the lexical scanner takes care of distinguishing among first- and second-class tags, suppressing third-class tags, and suppressing fourth-class elements. The document can be validated against a set of rabbit/duck grammars by parsing it once for each grammar.
It is also possible to validate an input document against a set of rabbit/duck grammars in a single pass; one way to do that is described below. It involves some unconventional extensions of Brzozowski derivatives; these are introduced piecemeal, in an attempt to ease comprehension. The exposition begins with a description of one way to check the well-formedness and validity of XML (section 4-1) using Brzozowski derivatives. These are relatively simple and straightforward illustrations of the mechanism. Section 4-2 discusses validation of TexMecs languages without self-overlap, and finally section 4-3 describes validation of languages which allow self-overlap.
Brzozowski derivatives were introduced in 1964 by Janusz Brzozowski [Brzozowski 1964]. The derivative D of a set L with respect to some input string s, written D s L, is the set of strings which can follow s in words of the language L. Brzozowski showed, given a regular expression E denoting L, how to construct a regular expression for D s E, which makes possible an algebraic approach to recognizing members of the language. Less formally, given a regular expression and a (prefix) string, the derivative (of the expression, with respect to the string) tells us what can come next. In what follows, it will be important to recall that if one of the things which can come next is ε, the empty string, then the string s is itself a member of the language defined by E. An expression E whose language includes ε is said to be nullable, and the function ν(E) is true or false accordingly. In what follows, successful parses will normally be those which end with a derivative which is nullable.17
Brzozowski derivatives are conventionally used to recognize or reason about regular languages. But with a couple of extensions, they can also be used to parse well-formed XML.
The first extension is to define an expression language slightly different from regular expressions as defined either conventionally or by Brzozowski. I shall call these language expressions rather that regular expressions, because the languages they denote are not necessarily regular. Each expression denotes a set of strings, and may be said to match any string in that language. The language expressions of interest here may take any of the following forms:
The second extension is to define an infinite input alphabet. Assume that the input stream is a sequence of symbols each of which is one of the following:
As usual, the derivative of an expression E with respect to a sequence of symbols x 1, x 2, x 3, ..., x n ,is found by finding the derivative of E with respect to x 1, then the derivative of D x 1 E with respect to x 2, and so on. Formally,
D s E = D <x 1, x 2, ... x n > E = D x n (... (D x 3 (D x 2 (D x 1 E))) ...)
A sequence of input symbols d is well-formed XML if and only if D d X is nullable, when the rules for derivatives are as described below.
The third extension is the the derivative of a primitive language expression with respect to an input symbol may be something other than ε or ∅. Derivatives of compound expressions (concatenation, disjunction, conjunction) are given in [Brzozowski 1964] and need not be repeated here. [Sperberg-McQueen 2005] defines derivatives for interleave.
When x = stag( h ), then
When x = etag( h ), then
When x = pcdata(), then
Examples of using these rules to check XML well-formedness are given in appendix 7-1.
The reader will note that the derivative, which represents the expected sequence of tokens to come, is being used as a stack: when we encounter a start-tag in the input which matches an elem( g ) in the expression being matched, we push (W etag( g )) onto the stack by placing it at the front of the derivative.
This method can be applied to validation as well, by changing just one rule. When stag( g ) in the input matches the expression elem( g ), the rules just given make the derivative be (W etag( g )): in well-formed XML, what follows the start-tag of any element is well-formed XML followed by the end-tag of that element. To validate against a DTD, all we need to do is to replace W in the derivative with the content model of the element. If we write the expression “cm( g )” to denote the content model of element type g, and “[[cm( g )]]” to denote the value of that expression (i.e. to indicate that the expression should be replaced by its value), the rule becomes:
For example, consider a vocabulary for serializing binary trees whose definition (in DTD syntax) is:
<!ELEMENT tree ((tree, tree) | leaf) > <!ELEMENT leaf (#PCDATA) >
<tree><leaf>42</leaf></tree>
We start with the expression elem(tree) and validate the input stream “stag(tree), stag(leaf), pcdata(), etag(leaf), etag(tree)”:
Further examples of validation using Brzozowski derivatives are given in appendix 7-2.
The technique just described can be used without major changes to validate documents with overlap against individual rabbit/duck grammars which have no overlap-groups. The only change is that the implementation must take care to treat fourth-class elements and tags for third-class elements appropriately.
We can validate a document against a whole set of rabbit/duck grammars rather than just individual grammars with parsing mechanism which keeps track of the current derivative in each grammar and for each token calculates the next derivative for each grammar. Perhaps the simplest way to acquire such a mechanism is to connect the derivatives for the different languages using logical conjunction. Thus if the current validation state of grammar 1 is denoted E 1, and the state of grammar 2 is E 2, etc., then for n grammars we will have a single expression of the form
E 1 ∧ E 2 ∧ ... ∧ E n
It is necessary, however, to keep track of which expressions come from which grammars, so as to treat third- and fourth-class elements appropriately. One way to do this is to assign a number to each grammar and extend the expression language so as to handle the necessary bookkeeping. The foundation of the expression language are the content model expressions defined above in section 2-1; for purposes of validation we extend that expression language as follows.
First, we subscript or mark each reference to a first-class element in a grammar with the number of that grammar. In written form, this can be represented as a subscript to the generic identifier, so that instead of being written (a, (b | #tag(c)), d), a content model in grammar 1 would be written (a 1, (b 1 | #tag(c)), d 1), and similarly for grammar 2, 3, and so on. The rule for start-tags matching elements can then be rewritten
Second, we add two new types of group expressions. Neither of these will normally occur in user-written content models; they will be used only as the result of taking a derivative from some other expression.
group ::= g-tag | spinner
g-tag ::= '#gr(' POSITIVE_INTEGER ')' content-model
spinner ::= '#spin(' POSITIVE_INTEGER ')'
'(' list ')' content-model
list ::= empty | GI ' / ' list
empty ::= ε // nothing, the empty sequence
token ::= 'ε'
| '∅'
An expression of the form #gr( n )( E ) labels expression E as having come from grammar number n. The rules for taking the derivative D x #gr( n )( E ) are these:
Derivatives for spinner expressions are mostly concerned with keeping track of how many fourth-class elements, and of what kinds, are currently open. Only when they are all closed again will the grammar in question leave its spinner state and return to normal.
With these rules it is possible to handle any set of rabbit/duck grammars which has no overlap-groups. Examples of validation without self-overlap are given in appendix 7-3.
To handle self-overlap, it is convenient to add two more new types of group to the expression language. Like the g-tag and spinner groups introduced in the previous section, these will not appear in user-written content models, but serve internal book-keeping purposes. Each time a start-tag symbol in the input matches an overlap-token, we need to produce two readings, one in which the start-tag is treated as first-class, and one in which it is treated as second-class. To force the one interpretation or the other, we introduce R- and S-groups.
group ::= r-group | s-group
r-group ::= '#r(' GI ')' expr
s-group ::= '#s(' GI ')' expr
The rules for derivatives are these.
First, the rules for derivatives of overlap-expressions. When #overlap( E ) is an overlap-group whose overlap-token(s) are for generic identifier g, then
Next, the rules for R-groups:
Finally, the rules for S-groups (in which the element is treated as second-class).
Examples of validating self-overlap are given in appendix 7-4.
This section describes two simplified versions of real-world markup scenarios and illustrates the use of rabbit/duck grammars to solve problems in those scenarios.
Consider a markup vocabulary in which several different and overlapping hierarchical structures are mingled. The vocabulary provides element types for marking up manuscript documents with information about
The names by themselves should (it is hoped) make the intended semantics clear enough for purposes of this example.
The constraints to be enforced include some which are fairly easy to express in single hierarchies:
Some constraints involve cross-view interactions:
The following subsections define a set of rabbit/duck grammars to validate this language.
From the point of view of the extended link markup, the structure of a document is quite simple: a text is a series of link elements intermixed with character data.
<!GRAMMAR text [
<!ELEMENT text (#PCDATA | link)* >
<!ELEMENT link (end+) >
<!ATTLIST link
type CDATA #REQUIRED >
<!ELEMENT end EMPTY >
<!ATTLIST end
role CDATA #REQUIRED
loc IDREFS #REQUIRED >
]><!GRAMMAR text [
<!ENTITY % tags.comp "#tag(div) | #tag(head)
| #tag(p) | #tag(s)" >
<!ENTITY % tags.phys "#tag(vol) | #tag(page)
| #tag(col) | #tag(block) | #tag(line)" >
<!ENTITY % tags.inscr "#tag(add) | #tag(del)
| #tag(restore) | #tag(hi)" >
<!ENTITY % tags.other "%tags.comp;
| %tags.phys; | %tags.inscr;">
<!ELEMENT text (#PCDATA | link | %tags.other;)* >
<!ELEMENT link (end+) >
<!ATTLIST link
type CDATA #REQUIRED >
<!ELEMENT end EMPTY >
<!ATTLIST end
role CDATA #REQUIRED
loc IDREFS #REQUIRED >
]>The compositional hierarchy has a structure whose general outline will be familiar from many familiar document markup languages.
<!GRAMMAR text [ <!ELEMENT text (head, p*, div*) > <!ELEMENT div (head?, p*, div*) > <!ELEMENT p (s+) > <!ELEMENT head (#PCDATA) > <!ELEMENT s (#PCDATA | a)* > <!ELEMENT a (#PCDATA)* > <!ELEMENT link IGNORE > ]>
The declaration of link as IGNORE makes link elements and their content invisible to this grammar. It follows that no start- or end-tags for text, div, etc., will be seen inside any link elements, and thus that link elements cannot overlap any first-class elements declared in this grammar. We need not declare end elements as fourth-class here: since they occur only within link elements, they are guaranteed invisible to this grammar.
One might be tempted to try to enforce the prohibition on overlaps with headings by a similar maneuver, and declare head as IGNORE in all the other grammars. But this would prevent multi-line headings, as well as preventing highlighting from occurring in headings. To enforce the prohibition effectively, we will need a more complicated version of the compositional grammar, which explicitly allows tags from the physical hierarchy everywhere except in headings:
<!GRAMMAR text [
<!ENTITY % tags.phys "#tag(vol) | #tag(page)
| #tag(col) | #tag(block) | #tag(line)" >
<!ENTITY % x "(%tags.phys;)*">
<!ELEMENT text (%x;, head, %x;,
(p, %x;)*,
(div, %x;)*) >
<!ELEMENT div (%x;, (head, %x;)?,
(p, %x;)*,
(div, %x;)*) >
<!ELEMENT p (%x;, (s, %x;)+) >
<!ELEMENT head (#PCDATA)* >
<!ELEMENT s (#PCDATA | a | %tags.phys;)* >
<!ELEMENT a (#PCDATA | %tags.phys;)* >
<!ELEMENT link IGNORE >
]>Note that this grammar does nothing to enforce the constraint forbidding any of these elements from overlapping with volumes. That will be taken up in section 5-1-5.
The physical grammar defines a fairly straightforward hierarchy of volumes, pages, columns, text blocks, and lines. Some of the levels of hierarchy are of interest only for particularly detailed markup or particularly complex physical organization of the source; they can be omitted when not necessary. Thus col should be used only if there are at least two columns on the page, block only if there are at least two text blocks in the column, etc.
<!GRAMMAR text [
<!ELEMENT text (page+ | vol{2,∞}) >
<!ELEMENT vol (page+) >
<!ATTLIST vol n CDATA #IMPLIED >
<!ELEMENT page (line+
| block{2,∞}
| col{2,∞}) >
<!ATTLIST page n CDATA #IMPLIED >
<!ELEMENT col (line+
| block{2,∞}) >
<!ATTLIST col n CDATA #IMPLIED >
<!ELEMENT block (line+) >
<!ATTLIST block n CDATA #IMPLIED >
<!ELEMENT line (#PCDATA) >
<!ATTLIST line n CDATA #IMPLIED >
<!ELEMENT link IGNORE >
]>Since the elements related to inscription can overlap each other, they cannot all be in the same grammar; we will require one grammar each for additions, deletions, and highlighting. Since the elements are defined as not spanning page boundaries, we can conveniently include the page element in each of the inscription grammars.
Since additions do not self-overlap, their grammar is very straightforward:
<!GRAMMAR text [ <!ELEMENT text (page+) > <!ELEMENT page (#PCDATA | add)* > <!ELEMENT add (#PCDATA | add)* > <!ELEMENT link IGNORE > ]>
Highlighting is slightly more complex, since highlighted phrases can both nest within each other and overlap each other:
<!GRAMMAR text [
<!ELEMENT text (page+) >
<!ELEMENT page #overlap((#PCDATA
| ~hi~ | #tag(hi))*) >
<!ELEMENT hi #overlap((#PCDATA
| ~hi~ | #tag(hi))*) >
<!ELEMENT link IGNORE >
]>Deletions add the complication that restore elements may occur within deletions and may be overlapped by deletions other than their parent(s):
<!GRAMMAR text [
<!ELEMENT text (page+) >
<!ELEMENT page #overlap((#PCDATA
| ~del~ | #tag(del))*) >
<!ELEMENT del #overlap((#PCDATA | restore
| ~del~ | #tag(del))*) >
<!ELEMENT restore #overlap((#PCDATA
| ~del~ | #tag(del))*) >
<!ELEMENT link IGNORE >
]>The grammars thus far given have enforced all of the constraints outlined except one: that nothing may overlap any vol element.
This could probably be accomplished by suitably careful inclusion of #tag(vol) tokens at the right places in the right content models, but it is simpler just to ensure that vol is described as containing (directly or indirectly) each element other than text from all of the other hierarchies. Part of this has already been done: it follows from the physical and inscription grammars given above that the first-class elements in those grammars all nest within vol elements.19 The physical grammar also guarantees that vol does not overlap link. The remainder of the elements in the vocabulary can be handled with the following grammar, which requires only that the first-class elements all nest properly:
<!GRAMMAR text [ <!ELEMENT text (vol | head | div | p | s | a)* > <!ELEMENT vol (vol | head | div | p | s | a)* > <!ELEMENT head (vol | head | div | p | s | a)* > <!ELEMENT div (vol | head | div | p | s | a)* > <!ELEMENT p (vol | head | div | p | s | a)* > <!ELEMENT s (vol | head | div | p | s | a)* > <!ELEMENT a (vol | head | div | p | s | a)* > ]>
Rabbit/duck grammars may also usefully be applied in some cases not involving overlap. Consider the following example, based (very loosely) on an active vocabulary design project called to my attention by Fabio Vitale of the University of Bologna.
The vocabulary is to support documents about sales transactions of various kinds. The overall structure of the document is perfectly conventional; it differs from the compositional grammar given above in section 5-1-2 primarily in adding more phrase-level elements, including some semantically specialized elements which allow crucial information about the sales agreement to be captured.
Somewhere in the document, exactly one seller must be identified, either as a corporation or as an individual; similarly one buyer must be identified. For corporations, a VAT number must be given; for individuals, a personal identity number (PIN). Start- and end-dates of the agreement, amount to be paid, address of the buyer and seller, a description of the good being sold, and a statement of the controlling jurisdiction in cases of dispute, must also be indicated. For a variety of reasons, it proves impractical to require that this information come in a prescribed order or at a prescribed location in the document: the accepted form of these documents allows the information to occur pretty much anywhere in the document. It would be possible to record the information both in the running text and in a header, but that would involve redundancy and raise the likelihood of inconsistency. It would be possible to record the information just once, in a header, and use hyperlinks to refer to it from the text body; for purposes of the example, we assume that this solution was also found impractical for the project.
In a conventional schema, it will prove difficult to ensure that exactly one individual-buyer element or corp-buyer element occurs in the document, and so forth.
A pair of rabbit/duck grammars can, however, impose the constraints without much difficulty.
The main document structure grammar resembles the one given above, with additional phrase-level elements:
<!GRAMMAR text [ <!ENTITY % hypertext "a" > <!ENTITY % hilites "it | bd | tt" > <!ENTITY % buyer "individual-buyer | corp-buyer | buyer-PIN | buyer-VATnum | buyer-address" > <!ENTITY % seller "individual-seller | corp-seller | seller-PIN | seller-VATnum | seller-address" > <!ENTITY % misc "start-date | end-date | amount | description-of-goods | jurisdiction" > <!ENTITY % phrases "%buyer; | %seller; | %hypertext; | %hilites; | %misc;"> <!ELEMENT text (head, p*, div*) > <!ELEMENT div (head?, p*, div*) > <!ELEMENT p (#PCDATA | %phrases;)* > <!ELEMENT head (#PCDATA | %phrases;)* > <!ELEMENT a (#PCDATA | %hilites;)* > <!ELEMENT it (#PCDATA | %hilites;)* > <!ELEMENT bd (#PCDATA | %hilites;)* > <!ELEMENT tt (#PCDATA | %hilites;)* > <!ELEMENT individual-buyer (#PCDATA | %hilites;)* > <!ELEMENT corp-buyer (#PCDATA | %hilites;)* > <!ELEMENT buyer-PIN (#PCDATA | %hilites;)* > <!ELEMENT buyer-VATnum (#PCDATA | %hilites;)* > <!ELEMENT buyer-address (#PCDATA | %hilites;)* > <!ELEMENT individual-seller (#PCDATA | %hilites;)* > <!ELEMENT corp-seller (#PCDATA | %hilites;)* > <!ELEMENT seller-PIN (#PCDATA | %hilites;)* > <!ELEMENT seller-VATnum (#PCDATA | %hilites;)* > <!ELEMENT seller-address (#PCDATA | %hilites;)* > <!ELEMENT start-date (#PCDATA | %hilites;)* > <!ELEMENT end-date (#PCDATA | %hilites;)* > <!ELEMENT amount (#PCDATA | %hilites;)* > <!ELEMENT description-of-goods (#PCDATA | %hilites;)* > <!ELEMENT jurisdiction (#PCDATA | %hilites;)* > ]>
The constraints on the semantically rich phrase-level tags are enforced in a second grammar. For purposes of illustration, we assume that the & connector in the content model is interpreted as denoting interleave rather than denoting the same relation as the & connector in SGML.
<!--* We must have one buyer, with address and either
* SSN or VATnumber, also a seller with similar details.
* They can appear anywhere, and must appear exactly
* once each.
*-->
<!--* N.B. interleave interpretation of & needed here,
* not the SGML interpretation *-->
<!ELEMENT doc (#PCDATA
& ( (individual-buyer & buyer-PIN)
| (corp-buyer & buyer-VATnum)
)
& ( (individual-seller & seller-PIN)
| (corp-seller & seller-VATnum)
)
& buyer-address
& seller-address
& start-date
& end-date
& amount
& description-of-goods
& jurisdiction) >
This paper has presented rabbit/duck grammars, which are a modest advance on CONCUR as a mechanism for validating overlapping structures. The essential difference from CONCUR is that rabbit/duck grammars allow the boundaries of elements in one hierarchy to be visible in document grammars for other hierarchies; this makes it possible to constrain interactions between hierarchies in ways not feasible with CONCUR.
In addition, Brzozowski derivatives have been extended to apply to the parsing of well-formed XML or MECs, to the validation of XML against one or more DTDs, and to the validation of markup with overlapping structures against rabbit/duck grammars.
Some of the goals with which the work was begun have been met, at least in part. Rabbit/duck grammars are powerful enough to be useful, without being Turing-complete. They are built around well understood ideas, notably the representation of overlapping structures as sets of tangled trees and the validation of trees using context-free grammars. There are simple validation algorithms using Brzozowski derivatives. Some goals have been met only incompletely. The requirement that no single grammar have two first-class elements which overlap each other means that grammars cannot always correspond 1:1 with the sub-groupings of a vocabulary which seem on conceptual grounds to belong together. The fact that in case of need a grammar may contain just one document element and one self-overlapping element (as in the case of the deletion grammar in section 5-1-4) serves as a reminder that every graph may be factored into trees, but may also suggest to some observers that that exercise is not necessarily an enlightening one.
One goal remains unmet: the rabbit/duck mechanism outlined here is well suited for validating data streams (TexMecs, or XML with Trojan horses, or concurrent XML in the style of XCONCUR), but there are at least some graph structures proposed in the literature (Goddag structures and multi-colored trees, to name two) which cannot be validated against a set of rabbit/duck grammars in any obvious way, since they have no structural analogue to the second-class tags in the data stream.
Most important, it is not clear whether rabbit/duck grammars can handle all of the kinds of constraints which one might naturally wish to impose on overlap, mostly because (in the absence of any serious discussion of validation) there is no consensus view in the community about what constitutes a natural constraint on overlap. The development of better intuitions on this subject remains a task for further work in the community. So does the elucidation of what classes of constraints can, and which classes of constraints cannot, be expressed by sets of rabbit/duck grammars.
The examples in this section are intended to illustrate the operations described in the main text. They need not be consulted unless the reader is interested in detailed information.
A simple example may serve to illustrate the process. Let the input stream d be
<a>www<b><c/>xxx</b>yyy<d/>zzz</a>
If we parse an ill-formed stream, however, we should get an error. Let the input stream d be
<a>xxx<b>yyy</a>zzz</b>
Let the DTD be
<!DOCTYPE a [
<!ELEMENT a (b, (c | d | e){2,∞}, b*) >
<!ELEMENT b EMPTY >
<!ELEMENT c EMPTY >
<!ELEMENT d EMPTY >
<!ELEMENT e EMPTY >
]><a><b/><d/><e/></a>
If we try an invalid document, we produce ∅, not a nullable expression. Let d be
<a><b/><d/></a>
Let grammar 1 be the grammar given in section 2-2 for the dramatic hierarchy; let grammar 2 be the (first) grammar given for the verse hierarchy.
Let the input be the fragment of Peer Gynt given in the text, with the addition of start- and end-tags for play, act, and scene.
<play|<act|<scene|
<sp who="Aase"|<L|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping|stage>
No, I'm not!|L>|sp>
<sp who="Aase"|<L| Well then, swear to me it's true! |L>|sp>
<sp who="Peer"|<L| Swear? Why should I? |sp>
<sp who="Aase"| See, you dare not! |L>
<L| Every word of it's a lie! |L>|sp>
|scene>|act>|play>
A trace generated by a validator for rabbit/duck grammars (and slightly cleaned up for presentation here) contains the following information. Each derivative takes the form E = F ∧ G; the sub-expression before the ∧ connector is from grammar 1 (dramatic hierarchy), the one after the ∧ is from grammar 2 (verse hierarchy).
If we make the document invalid against the verse hierarchy by attempting to embed a line within another line, the input is:
<play|<act|<scene|
<sp who="Aase"|<L|<L|Peer, you're lying!|L>|sp>
<sp who="Peer"|<stage|without stopping|stage>
<L|No, I'm not!|L>|L>|sp>
<sp who="Aase"|<L| Well then, swear to me it's true! |L>|sp>
<sp who="Peer"|<L|<L| Swear? Why should I? |L>|sp>
<sp who="Aase"|<L| See, you dare not! |L>|L>
<L| Every word of it's a lie! |L> |sp>
|scene>|act>|play>
A validation trace for this element (cleaned up slightly for presentation) follows:
As may be seen, the result makes clear that the document is valid against grammar 1 but invalid against grammar 2.
Let grammar 1 be the first grammar of section 3-2.
The sample document
<p|<a><d~1|<a><d~2|<a><b><a>|d~1><b>|d~2><b>|p>
|
The work reported here was undertaken in the context of the MLCD Project (Markup languages for complex documents) at the University of Bergen and supported in part by a grant from the Meltzer Foundation (L. Meltzers Høyskolefond). Earlier drafts have been read by Steven J. Derose, Patrick Durusau, Claus Huitfeldt, and Andreas Witt, as well as by the anonymous reviewers who reviewed the paper for this conference, and the paper has benefited from their comments. My thanks to all of them. |
|
|
This is a slight oversimplification: in principle, general entities appearing in the document might have different replacement text in different DTDs, and a record-end which was suppressed in one DTD by the record-end rules of ISO 8879 might count as data in a different tree. These complications are ignored here, as they were in all the work I am aware of that used concurrent markup. |
|
|
I am not aware of formal work on the classes of context-sensitive languages which may be recognized as the intersection of two, or of three, or of n context-free grammars. I conjecture, without being able to offer proof, that there are context-sensitive languages which can be recognized as the intersection of three context-free grammars, but not as the intersection of two, and more generally that as the number n increases, so does the set of languages expressible as the intersection of n context-free languages. |
|
|
Also in the use of CONCUR being worked out in Bielefeld by Andreas Witt and his colleagues and students, which I refer to here as XCONCUR or as ‘Bielefeld CONCUR’. |
|
|
As a convenience, in ISO 8879 the omission of explicit labels serves to mark a tag as belonging to every tree. |
|
|
The image was first made an object of scholarly inquiry by Joseph Jastrow [Jastrow 1899]; the image used here is a reproduction of that used by Jastrow, from [Kihlstrom 2004]. The same ambiguous figure occupied Ludwig Wittgenstein and it is sometimes erroneously attributed to him, although he appears to have learned of it from Jastrow. |
|
|
My colleague Liam Quin suggests that orthogonality would require a fifth class of element, in which the start- and end-tags would be visible but the contents of the element invisible. I have not yet seen an application for this class, so it has not been incorporated into the exposition below. |
|
|
Systems for practical use may prefer to use the keyword unbounded instead of ∞. |
|
|
This constraint is not intended to replace the stylesheet code which tells the formatter to begin each chapter on a new page; it merely provides a convenient way of making sure that the formatter actually did so, and that the output has not been garbled since. |
|
|
Because it is used in later discussion, the connector “∧” is also defined here, although it is not conventional in schema languages. When F and G are expressions, the content-model expression (F ∧ G) denotes the set of sequences which are members both of F and of G. |
|
|
The reader in search of clear definitions of these should consult [Grune/Jacobs 1990]. |
|
|
One can imagine that an experimental writer like Perec might try to write a verse drama in which the stage directions are part of the verse structure, but let us assume our grammar is for Shakespeare, Goethe, and Ibsen, who do not do this. |
|
|
The reader unfamiliar with TexMecs should be told that <e| is a start-tag with generic identifier g, and |e> is the corresponding end-tag. If the element is empty, it may be written as a sole-tag: <e>. Text enclosed between <* and *> is a comment. For further details see [Huitfeldt / Sperberg-McQueen 2001]. |
|
|
That is (as already noted), a normal overlap group will be ambiguous: the same element type will be treated both as a first- and as a second-class element. One purpose of introducing an explicit overlap-group is to allow the scope of the ambiguity to be marked explicitly. In a system for practical use, it would be convenient to allow overlap-tokens to occur without an explicit overlap-group, and to assume that the content model as a whole is the overlap-group. In such a system, explicit overlap groups would need to be written only when it was important to note that the scope of the ambiguity was less than the entire content model. |
|
|
As the reader will have noted, the overlap rules just described treat a self-overlapping element as first-class in one reading and second-class in the others. The grammars are simpler to read and write, though less fine control is available, if self-overlap is handled as first-class in one reading and third-class in the others. The third-class tags need not be provided for in the grammar. This alternative is only a convenience feature, though: it does not increase the formal power of the grammatical formalism, and so it is not illustrated here. [Probably it should be, though.] |
|
|
Since the content of fourth-class elements is suppressed, no first-class element can overlap a fourth-class element. |
|
|
For further information, the reader is directed to [Brzozowski 1964], which is quite accessible to non-mathematical readers, or to [Sperberg-McQueen 2005], which applies Brzozowski derivatives to the processing of XML documents. |
|
|
The operator denoted by “&” is unrestricted interleave: the interleave (s 1 & s 2) of two strings s 1 and s 2 is any string which contains both the elements of s 1, in order, and the elements of s 2, in order, with no element of the interleave being assigned both to s 1 and to s 2. Applied to sets, the interleave (F & G) of F and G denotes the set of strings s formed by the interleave of some s 1 ∈ F and some s 2 in G. This is not the place to provide a detailed motivation for an interleave operator; it is included here partly because some current schema language proposals have it, and partly because the example in section 5-2 uses it. |
|
|
The grammars do go beyond the explicit formulation of the constraint, which requires only that nothing overlap with vol. The constraint does not forbid vol from nesting within a section or a paragraph. But either such a volume would be unnaturally small, or else such a paragraph unnaturally long. |
|
|
As the sequence of derivatives makes clear, the ill-formedness is visible as soon as the end-tag for a is encountered. |
[ACH/ACL/ALLC 1994] Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
[Barnard et al. 1988] Barnard, David, Ron Hayter, Maria Karababa, George Logan, and John McFadden. 1988. “SGML-based markup for literary texts: Two problems and some solutions”. Computers and the Humanities 22: 265-276.
[Barnard et al. 1995] Barnard, David, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. M. Sperberg-McQueen, and Giovanni Battista Varile. 1995. 1995. “Hierarchical encoding of text: Technical problems and SGML solutions”. Computers and the Humanities 29: 211-231.
[Brzozowski 1964] Brzozowski, Janusz A. 1964. “Derivatives of regular expressions”. Journal of the ACM 11.4 (1964): 481-494.
[Dekhtyar / Iacob 2005] Dekhtyar, Alex, and Ionut Emil Iacob. 2005. “A Framework For Management of Concurrent XML Markup”. Data and Knowledge Engineering 52.2: 185-215.
[DeRose 2004] DeRose, Steven. 2004. “Markup overlap: A review and a Horse”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/DeRose01/EML2004DeRose01.html.
[Durand et al. 1996] Durand, David G., Elli Mylonas, and Steven J. Derose. 1996. “What should markup really be? Applying theories of text to the design of markup systems.” ALLC-ACH'96, Joint Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. Bergen. Available on the Web at http://cs-people.bu.edu/dgd/ach96_talk/Redefining_long.html.
[Durusau / O'Donnell 2002a] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. “Concurrent markup for XML documents”. Paper given at XML Europe 2002, Barcelona, sponsored by IDEAlliance. Available on the Web at http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-03-07/03-03-07.html.
[Durusau / O'Donnell 2002b] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. “Coming down from the trees: Next step in the evolution of markup?” Late-breaking paper given at Extreme Markup Languages 2002, Montréal, sponsored by IDEAlliance. Slides (but no full text) available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/author-pkg/2002/Durusau01/EML2002Durusau01.zip.
[Durusau / O'Donnell 2004] Durusau, Patrick, and Matthew Brook O'Donnell. 2004. “Tabling the overlap discussion”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Durusau01/EML2004Durusau01.html.
[Dybkjær et al. 1998] Dybkjær, Laila, Niels Ole Bernsen, Hans Dybkjær, David McKelvie, and Andreas Mengel. 1998. The MATE Markup Framework. MATE Deliverable D1.2. 30 November 1998. Available on the Web at http://mate.nis.sdu.dk/information/d12/
[Grune/Jacobs 1990] Grune, Dick, and Ceriel J. H. Jacobs. 1990. Parsing techniques: a practical guide. New York, London: Ellis Horwood, 1990. Postscript of the book is available from the first author's Web site at http://www.cs.vu.nl/~dick/PTAPG.html. A second edition is due to appear in 2006 from Springer Verlag.
[Hilbert et al. 2005] Hilbert, Mirco, Oliver Schonefeld, and Andreas Witt. 2005. “Making CONCUR work”. Paper given at Extreme Markup Languages 2005, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2005/Witt01/EML2005Witt01.xml.
[Huitfeldt / Sperberg-McQueen 2001] Huitfeldt, Claus, and C. M. Sperberg-McQueen. 2001. “TexMECS: An experimental markup meta-language for complex documents”. http://helmer.aksis.uib.no/claus/mlcd/papers/texmecs.html
[Huitfeldt 1995] Huitfeldt, Claus. 1995. “Multi-dimensional texts in a one-dimensional medium”. Computers and the Humanities 28: 235-241.
[Huitfeldt 1999] Huitfeldt, Claus. 1999. MECS—A multi-element code system. Working papers of the Wittgenstein Archives at the University of Bergen.
[ISO 1986] International Organization for Standardization (ISO). 1986. ISO 8879-1986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.
[Jagadish et al. 2004] Jagadish, H. V., Laks V. S. Lakshmanan, Monica Scannapieco, Divesh Srivastava, and Nuwee Wiwatwattana. 2004. “Colorful XML: One hierarchy isn't enough”. Proceedings of the 2004 ACM SIGMOD International conference on management of data, Paris, sponsored by the Association for Computing Machinery Special Interest Group on Management of Data. New York: ACM Press.
[Jastrow 1899] Jastrow, Joseph. 1899. “The mind's eye”. Popular Science Monthly 54 (1899): 299-312.
[Kihlstrom 2004] Kihlstrom, John F. 2004. “Joseph Jastrow and his duck — Or is it a rabbit?” Originally written 16 November 2004 as a letter to the editor of Trends in cognitive sciences; available on Kihlstrom's Web site at http://ist-socrates.berkeley.edu/~kihlstrm/JastrowDuck.htm
[Megginson 1992] Megginson, David. 1992. Re: Minimization, nesting, multiple types, etc. Contribution to thread in comp.text.sgml, 14 February 1992. Available on the Web at http://groups.google.com/group/comp.text.sgml/browse_thread/thread/e108fbd9e92fa833/82b3ff7b20f42651?lnk=st&q=&rnum=9#82b3ff7b20f42651
[Murata 1995] Murata Makoto. 1995. “File format for documents containing both logical structures and layout structures”. Electronic publishing 8: 295-317.
[Nicol 2002] Nicol, Gavin. 2002. “Core range algebra: Toward a formal theory of markup”. Paper given at Extreme Markup Languages 2002, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2002/Nicol01/EML2002Nicol01.html.
[Renear et al. 1993] Renear, Allen, Elli Mylonas, and David Durand. 1993. “Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies”. In Research in Humanities Computing, ed. Nancy Ide and Susan Hockey. Oxford: Oxford University Press. Available on the Web at: http://www.stg.brown.edu/resources/stg/monographs/ohco.html
[Sasaki 2004] Sasaki, Felix. 2004. “Secondary information structuring: A methodology for the vertical interrelation of information resources”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Sasaki01/EML2004Sasaki01.html.
[Sperberg-McQueen / Huitfeldt 1999] Sperberg-McQueen, C. M., and Claus Huitfeldt. 1999. “Concurrent document hierarchies in MECS and SGML”. Literary & Linguistic Computing 14.1: 29-42. http://www.w3.org/People/cmsmcq/2000/poddp2000.html
[Sperberg-McQueen / Huitfeldt 2000] Sperberg-McQueen, C. M., and Claus Huitfeldt. 2000. “GODDAG: A Data Structure for Overlapping Hierarchies”. In DDEP-PODDP 2000, ed. P. King and E.V. Munson, Lecture Notes in Computer Science 2023 (Berlin: Springer, 2004), pp. 139-160. Available on the Web at http://www.springerlink.com/index/98J1VBU5NBY73UL3 and http://www.w3.org/People/cmsmcq/2000/poddp2000.html.
[Sperberg-McQueen 2005] Sperberg-McQueen, C. M. 2005. “Applications of Brzozowski derivatives to XML Schema processing”. Paper given at Extreme Markup Languages 2005, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2005/SperbergMcQueen01/EML2005SperbergMcQueen01.html, http://www.w3.org/People/cmsmcq/2005/abdxsp.unicode.html, and http://www.w3.org/People/cmsmcq/2005/abdxsp.ascii.html.
[Tennison / Piez 2002] Tennison, Jeni, and Wendell Piez. 2002. “LMNL Syntax”. Available on the Web at http://www.lmnl.net/prose/syntax/index.html.
[Witt 2004] Witt, Andreas. 2004. “Multiple hierarchies: new aspects of an old solution”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html