Representation of overlapping structures

C. M. Sperberg-McQueen

Abstract

Markup of overlapping structures is, depending on one's point of view, either a perpetual hot topic or a negligeable edge case in the theory of markup. Starting from the belief that overlapping structures are not just common but important, not only in the analysis of literary works but also in the management of changing content, we explore ways to represent overlapping structures in tractable ways. The requirements for representing overlapping structures differ, depending on what kinds of structures we expect to represent. The similarities and differences among existing proposals for overlapping structures are illuminated by defining ways to represent them in relational form. The experiment shows that surprising variations in structure can be created by very restricted variations on a schema for the relational representation of XML.

Keywords: Concurrent Markup/Overlap

C. M. Sperberg-McQueen

C.M. Sperberg-McQueen is a member of the technical staff at the World Wide Web Consortium; he serves on the W3C XML Schema Working Group, the XSL Working Group, the XML Processing Model Working Group, and the Service Modeling Language (SML) Working Group, and chairs the XML Coordination Group.

Representation of overlapping structures

C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and Artificial Intelligence Laboratory]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 C. M. Sperberg-McQueen. Reproduced with permission.

Introduction

Recent work on overlapping structures has described a number of different models for representing and thinking about textual structures: concurrent trees, multi-colored trees, LMNL documents, MultiX documents, and Goddag structures among them.

One way to improve understanding of a data structure is to consider different ways of representing it. This paper outlines relational representations of several of these data structures, as a way of getting a slightly different angle on them and allowing more or less direct comparisons.

As a side effect, relational representation of these structures also provides a simple way to build proof-of-concept systems for searching and manipulating them. It should be noted, however, that the relations described here have been designed for simplicity in exhibiting the similarities and differences among the different data structures, and not to optimize performance of existing relational systems.

Background

It is assumed that the reader is familiar with SGML and XML markup and the kinds of tree data structures commonly used to represent SGML and XML documents.

The other data structures mentioned here have been well described in their original publications, or in reviews (see, for example, [Witt 2004] and [DeRose 2004]); the notes which follow provide only a cursory introduction to each.

Concurrent markup

Although originally introduced largely with the thought of providing both ‘logical’ and ‘formatted’ views of documents, the SGML feature called CONCUR was specifed with enough generality and abstraction that it is not limited to that original application. Multiple concurrent views of the same document can be provided, without restriction as to number or kind of view. Each view takes the form of a conventional document tree, and all views share the same frontier of character data.

Although seldom exploited by SGML users, concurrent markup has seen something of a revival in recent years, both in linguistic annotation (see for example [Witt 2004], [Hilbert et al. 2005], and [Schonefeld 2007]) and manuscript studies ([Dekhtyar / Iacob 2005]).

A simple example from [Schonefeld / Witt 2006] may illustrate the mechanism:1

<(L1)div type="dialog" org="uniform">
   <(L2)text>
     <(L1)u who="Peter">
       <(L2)s>Hey Paul! </(L2)s>
       <(L2)s>Would you give me
     </(L1)u>
     <(L1)u who="Paul">
       the hammer? </(L2)s>
     </(L1)u>
   </(L2)text>
</(L1)div>

This presents two trees. First, the L1 tree, whose conventional XML representation would be:

<div type="dialog" org="uniform">
     <u who="Peter">
       Hey Paul!
       Would you give me
     </u>
     <u who="Paul">
       the hammer?
     </u>
</div>
And second, the L2 tree:
  <text>
       <s>Hey Paul! </s>
       <s>Would you give me
       the hammer? </s>
   </text>

Multi-colored trees

Multi-colored trees are collections of trees, each a more or less conventional XML tree when viewed individually, but stored in such a way as to share structure wherever possible [Jagadish et al. 2004]. A collection of information may thus be addressed in several different XML forms, without having to store the information multiple times (which wastes storage space as well as causing problems when updating the material).

A collection of information about movies, directors, actors, and so on, for example, might in conventional XML be organized either by actor (with <movie> elements appearing as children of <actor> elements), or by movie (with <actor> as child of <movie>), or in some other way. Almost any organizational principle may involve redundancy and may be inconvenient for certain kinds of searches, while making other searches very simple (to find the movies in which an actor has appeared, just find the appropriate <actor> element and then select all of the children named <movie>). The use of multi-colored trees allows the information to be stored and addressed not as a single XML document, but as several documents, each optimized for certain kinds of query and retrieval. In XML databases where traversal of the parent-child or next-sibling links is fast, but value-based lookups of other elements are slow), multi-colored trees can help provide faster retrieval.

Multi-colored trees resemble CONCUR in providing multiple tree structures over the same collection of data. They differ from CONCUR in that they do not require or guarantee that the different trees have the same sequence of characters at the leaves, nor that all character data appear in every tree, nor that shared structures occur in the same sequence in different trees. They do not forbid these things, either, so the example given above for concurrent markup could also be stored as a pair of multi-colored trees.

A more characteristic example, however, might be the movie database described in [Jagadish et al. 2004] (as shown here, the example has been augmented slightly in the interests of verisimilitude with information from the Internet Movie Database). The user may view the database as a set of three different XML documents, one showing actors and their roles:

<actors>
   <actor>
     <name>Bette Davis</name>
     <movie-role id="RB023">
       <!--* Payment on Demand (1951) *-->
       <name id="RB024">Joyce Ramsey (nee Jackson)</name>
     </movie-role>
     <movie-role id="RB117">
       <!--* All about Eve (1950) *-->
       <name id="RB125">Margo Channing</name>
     </movie-role>
     <movie-role id="RB927">
       <!--* Beyond the Forest (1949) *-->
       <name id="RB925">Rosa Moline</name>
     </movie-role>
     <!--* ... *-->
   </actor>
   <actor>
     <name>Audrey Hepburn</name>
     <!--* ... *-->
     <movie-role id="RB097">
       <!--* The Children's Hour, 1961 *-->
       <name id="RB475">Karen Wright</name>
     </movie-role>
     <movie-role id="RB157">
       <!--* Breakfast at Tiffany's, 1961 *-->
       <name id="RB175">Holly Golightly</name>
     </movie-role>
     <movie-role id="RB837">
       <!--* The Nun's Story, 1959 *-->
       <name id="RB975">Sister Luke</name>
     </movie-role>
     <!--* ... *-->
   </actor>
</actors>
A second document shows movies, grouped by genre, and with the names of the parts in the movie:
<movie-genre>
   <name>all</name>
   <movie-genre>
     <name>comedy</name>
     <movie id="RG012">
       <name id="RG015">All about Eve</name>
       <movie-role id="RB117">
         <!--* Bette Davis *-->
         <name id="RB125">Margo Channing</name>
       </movie-role>
       <movie-role>
         <!--* Anne Baxter *-->
         <name>Eve Harrington</name>
       </movie-role>
       <movie-role>
         <!--* George Sanders *-->
         <name>Addison DeWitt</name>
       </movie-role>
       <!--* ... *-->
     </movie>
     <movie-genre>
       <name>slapstick</name>
       <!--* ... *-->
     </movie-genre>
   </movie-genre>
   <!--* ... *-->
   <movie-genre>
     <name>drama</name>
     <movie>
       <name>Breakfast at Tiffany's</name>
       <movie-role id="RB157">
         <!--* Audrey Hepburn *-->
         <name id="RB175">Holly Golightly</name>
       </movie-role>
       <movie-role>
         <!--* George Peppard *-->
         <name>Paul 'Fred' Varjak</name>
       </movie-role>
       <movie-role>
         <!--* Patricia Neal *-->
         <name>2-E (Mrs. Fallenson)</name>
       </movie-role>
       <movie-role>
         <!--* Buddy Ebsen *-->
         <name>Doc Golightly</name>
       </movie-role>
       <!--* ... *-->
     </movie>
   </movie-genre>
</movie-genre>

A third document shows awards won by movies:

<movie-award>
   <name>Oscar</name>
   <movie-award>
     <name>1950s Oscars</name>
     <movie id="RG012">
       <name id="RG015">All about Eve</name>
       <votes>14</votes>
     </movie>
     <!--* ... *-->
   </movie-award>
</movie-award>

The nodes shown with id attributes are shared among the different trees: the movie-role element whose ID is “RB117”, for example, appears both in the actors tree and in the movie-genre tree, as does its name child.2 This redundancy makes queries against the XML easier to specify (exploiting the fact that XML query languages typically make specification of parent : child or ancestor : descendant relations especially convenient), and (at least in the test implementation) also faster (presumably because traversal of parent : child links is highly optimized).

When a node is shared between trees, it may have different children in the different trees; in the example, element RG012 illustrates this phenomenon. When several trees of different colors do share a great deal of structure, with elements in different trees having the same children in the same sequence, then (notionally, at least) multi-colored trees will store the information about the parent : child and previous : next sibling relationships several times.

Goddag structures

Goddag structures are directed acyclic graphs in which the directed arcs represent a parent/child relation. They were developed to aid in the understanding, markup, and processing of documents with overlapping structures; they are rooted especially in work on manuscript transcription, where important phenomena (deletions, marks in the margin, paragraphs) often overlap each other. As in the trees used to represent XML documents, the children of any node in a Goddag are ordered (or typically so; there are exceptions). Unlike XML trees, however, Goddag structures allow any node to have multiple parents.

Several forms of Goddag structure may be distinguished:

  • Normal Goddags are those in which the containment relation (calculated on leaf nodes) and the dominance relation (the transitive closure of the parent/child relation) are almost entirely the same: if the leaf nodes contained by a node N1 are a proper superset of the leaf nodes contained by another node N2, then N1 dominates N2 directly (as parent/child) or indirectly (as ancestor/descendant). In a normal Goddag structure, the children of a node are either totally ordered (this is the usual case) or not ordered at all (this is the case when the element corresponds to an element written <|e|| ... ||e|> in TexMecs notation).
    To any non-normal Goddag structure, there correspond one or more normal Goddag structures, which can be constructed in a relatively straightforward way. For any node x, let “leaves(x)” denote the set of leaves dominated by x. Now, for any pair of nodes (a, b),
    • If leaves(b) ⊂ leaves(a) and leaves(b) ≠ leaves(a), then ensure that a dominates b, directly or indirectly.
    • If leaves(a) ⊂ leaves(b) and leaves(b) ≠ leaves(a), then ensure that b dominates a, directly or indirectly.
    • If leaves(a) = leaves(b), then ensure that one of them dominates the other, directly or indirectly.
  • Colored Goddags are Goddags in which each parent/child arc is associated with one or more colors. Taking the subset of the graph including just the arcs labeled with a given color, with the nodes incident to them, results in a normal Goddag. Colored Goddags are similar, though not quite identical, to the multi-colored trees of [Jagadish et al. 2004].
  • Mecs graphs or restricted Goddags are normal Goddag structures in which nodes which are children of multiple parents are ordered in the same way by all parents; there is thus a global ordering of leaf nodes and any node N1 adjacent to another node N2 among the children of any parent is adjacent to N2 according to any parent which has both N1 and N2 as children.

For example, consider the following TexMecs document, which has two roots, one for the scene (the dramatic view) and one for the embedded haiku.

<* The dramatic view *>
<scene|
<sp who="HUGHIE"|<p|How did that translation go?|p>
<lg type="haiku"|<l|da de dum de dum,|l>
<l@frog|gets a new frog,|l>
<l|...|l>|lg>
|sp>
<sp who="LOUIS"|<p|Er ...|p>
<lg|<l@new|it's a new pond.|l>|lg>
|sp>
<sp who="DEWEY"|
<p|Ah ...|p>
<lg|<l@pond|When the old pond|l>|lg>
<p|Right.  That's it.|p>
|sp>
|scene>

<* The embedded poem *>
<haiku|
<lg|<^l^pond><^l^frog><^l^new>|lg>
|haiku>

For further details, the reader is directed to [Sperberg-McQueen / Huitfeldt 2000].

Relations

This section describes simple relational representations of XML, concurrent XML, multi-colored trees, and Goddag structures.

The representations have two goals: primarily, they provide a clear and explicit definition of the data structures which exhibits clearly ways in which they are the same and ways in which they differ. As a secondary matter it provides an obvious mapping to widely available technology which can be used to support search and retrieval operations on overlapping structures. The requirements to be met by the representations defined here include:

  • They must capture all the information in the data structure and allow the structure to be recreated.
  • It should be possible to represent various important and useful operations on the data structures as operations upon the relations described here.
  • They need not be easy or fast to update.

Notation

Each relation is specified in the form

relation-name ( column-list )
where relation-name is the name of the relation and column-list is a comma-separated list of column specifiers. A column is specified by giving the name of the column, optionally preceded by an indication of its type. Parts of the relation's key are given in italics. The type names used are described below in section 3-2.

For example

nodes(node_id nID,
node-type type)
denotes the relation nodes, whose tuples have two parts: one named nID, of type node_id, which is the key, and the other named type, of type node-type.

Some conventional column names are used repeatedly with the same meaning:

  • nID (of type node_id) is a node identifier and serves as the key, or part of the key, of the relation
  • nPar, nReuben, nBenjamin, nL, nR are node IDs (values of type node_id) which are the parent, first child, last child, left sibling, and right sibling, respectively, of the current node (the one whose ID is the key of the relation).

For most column names, a simple form of ‘Hungarian notation’ is used: some names have a short prefix indicating the domain from which the values are drawn:

  • n = node_id
  • ln = list of node_id
  • nm = xsd:NCName
  • ns = xsd:anyURI (‘namespace name’)
  • s = xsd:string

When the type of a column is clearly indicated by its name, the explicit type indicator is often omitted. For example, the following two expressions specify the same relation:

child_parent(node_id
nCh, node_id nPar)
child_parent(nCh, nPar)

Domains

The values in the relations sketched here are drawn from the following domains:

  • the primitive simple types of XML Schema 1.0 (for the moment, only a few of these are actually used)
  • ns: a URI used as a namespace name (restriction of anyURI to require scheme name; intent is to require full URIs)
  • node_id: a unique value used to identify a node within a document. Not further analyzable; the only operation allowed is testing for equality. The initial implementation uses positive integers for this purpose.
  • node_list: a list of node_ids.
  • node_type: one of document, element, attribute, pcdata, comment, or pi (‘processing instruction’).

Individual nodes

The data structures described here have in common that they represent the document as a graph, i.e. as a set of nodes and a set of arcs connecting those nodes (with nodes and arcs possibly ordered in different ways). We postulate several kinds of node: documents, elements, attributes, comments, pcdata (‘text nodes’), and processing instructions.

The representation of these basic nodes is the same for XML, concurrent XML, colored XML, and Goddag structures. It uses the following relations:

  • node(nID): identifies nID as a node in the document. (Strictly speaking, redundant with the following relations.)
  • node_type(nID, node_type type): indicates that nID is a node of the given type. Redundant with the relations document, element, attribute, comment, pcdata, and pi.
  • document(nID): indicates that nID is a document node.
  • element(nID, ns, nmLocal): indicates that nID is an element, whose element type name is the expanded name (ns, nmLocal).
  • attribute(nID, nElement, ns, nmAtt, sValue): indicates that nID is an attribute, on element nElement, with name nmAtt in namespace ns, and with lexical form sValue. The normal rules for attributes have as a consequence that the columns nelement, ns, nmAtt form a second candidate key.
  • comment(nID, sData): indicates that nID is a comment whose contents are sData.
  • pcdata(nID, sData): indicates that nID is a text node whose value is sData.
  • pi(nID, nmTarget, sData): indicates that nID is a processing instruction with target notation is nmTarget and whose value is sData.

For example, the nodes in the example of section 2-1 could be represented thus (trimming whitespace, for convenience, using “-” to indicate null values, and omitting the uninformative node relation for brevity):

node_type(L1,document).
node_type(L2,document).
node_type(n1,element).
node_type(n2,attribute).
node_type(n3,attribute).
node_type(n4,element).
node_type(n5,element).
node_type(n6,attribute).
node_type(n7,element).
node_type(n8,pcdata).
node_type(n9,element).
node_type(n10,pcdata).
node_type(n11,element).
node_type(n12,attribute).
node_type(n13,pcdata).

element(n1,-,div).
element(n4,-,text).
element(n5,-,u).
element(n7,-,s).
element(n9,-,s).
element(n11,-,u).

attribute(n2,n1,-,type,"dialog").
attribute(n3,n1,-,org,"uniform").
attribute(n6,n5,-,who,"Peter").
attribute(n12,n11,-,who,"Paul").

pcdata(n10, " the hammer? ").
pcdata(n10, "Would you give me ").
pcdata(n6,"Hey Paul! ").
There are two document nodes (L1 and L2) because the example has two document structures.

XML

The different data structures described here differ in the graph structures in which they embed the nodes of the document. This and the following sections describe the representation of those structures.

XML defines a single tree over the document. Conventionally, n-ary trees like those of XML are often described as trees in which each node as a variable-length sequence of children. A straightforward translation of this idea into our notation would give us

  • parent_children(nID, node_list lnChildren): identifies nID as a node whose children are the nodes in lnChildren.

For our purposes here, however, it is more convenient to factor the information into two relations, each simpler:

  • node_firstchild(nID, nReuben): identifies nID as a node whose first child is nReuben.
  • node_next(nID, nR): identifies nID as a node whose next sibling is nR.
Both of these relations are injections: their inverse is also a function.

Cycles involving any mixture of these two relations are forbidden. And there must be exactly one document node (per XML document).

Various refinements and controlled redundancies are possible, which may make different operations or traversals of the tree more convenient. For example

  • node_relations(nID, nPar, nL, nR, nReuben, nBenjamin): identifies nID as a node with the specified parent, left sibling, right sibling, first child, and last child.

For example, given the node identities specified above for the examples in section 2-1, the L1 tree could be described thus using node_firstchild and node_next:

node_firstchild(L1,n1).
node_firstchild(n1,n5).
node_firstchild(n5,n8).
node_firstchild(n8,-).
node_firstchild(n10,-).
node_firstchild(n11,n13).
node_firstchild(n13,-).

node_next(L1,-).
node_next(n1,-).
node_next(n10,-).
node_next(n11,-).
node_next(n13,-).
node_next(n5,n11).
node_next(n8,n10).
Or thus using node_relations:
node_relations(L1,-,-,-,n1,-).
node_relations(n1,L1,-,-,n5,n11).
node_relations(n5,n1,-,n11,n8,n10).
node_relations(n8,n5,-,n10,-,-)
node_relations(n10,n5,n8,-,-,-)
node_relations(n11,n1,n5,-,n13,n13).
node_relations(n13,n11,-,-,-,-).

Concurrent documents

Concurrent XML constructs several trees over the same character data. Each of those trees may be described in the same way as any XML tree; for simplicity and regularity we assume here that all pcdata nodes in the document are shared among all trees, and that no other nodes are shared.3 The first change needed is to change some of the constraints on the relations by specifying that:

  • There may be more than one document node. The nodes reachable from a given document node by traversing the node_firstchild and node_next relations are said to appear in the given document.
  • Every pcdata node appears in every document.
  • Each other node appears in exactly one document.
From these rules, it follows that in a set of n concurrent documents over the same data, each pcdata node has n parent elements; each other non-document node has, as usual, exactly one parent.

The second necessary change is to specify, in the node_next relation, an additional column to indicate which tree the relation applies to. For convenience, we use the identifier used in the start- and end-tags of elements in that example. (In SGML, this is the identifier of the root element of that tree; in XCONCUR it is the arbitrary identifier used for the document node.):

  • node_next(nID, nmTree, nR): identifies nID as a node whose next sibling, in tree nmTree, is nR.
The additional column is redundant in rows representing non-PCDATA nodes. And even for PCDATA nodes, two entries for the same node in different trees may differ only in one having a value for the nR column and one not having it; if both have a value, the value must be the same, since the frontier of the tree is the same for all trees.

No similar change is needed for the node_first relation, since the only nodes which appear in more than one tree are PCDATA nodes, which have no children and thus no first children, in any tree.

The concurrent trees for the example in section 2-1 are represented thus:

/* document L1 */
node_firstchild(L1,n1).
node_firstchild(n1,n5).
node_firstchild(n5,n8).
node_firstchild(n8,-).
node_firstchild(n10,-).
node_firstchild(n11,n13).
node_firstchild(n13,-).

node_next(L1,L1,-).
node_next(n1,L1,-).
node_next(n5,L1,n11).
node_next(n8,L1,n10).
node_next(n10,L1,-).
node_next(n11,L1,-).
node_next(n13,L1,-).

/* document L2 */
node_firstchild(L2,n4).
node_firstchild(n4,n7).
node_firstchild(n7,n8).
node_firstchild(n9,n10).

node_next(L2,L2,-).
node_next(n4,L2,-).
node_next(n7,L2,n9).
node_next(n8,L2,-).
node_next(n9,L2,-).
node_next(n10,L2,n13).
node_next(n13,L2,-).

Colored XML

Like CONCUR, colored XML defines multiple trees over the document, distinguished by color so that a collection of data may have (for example) a red, a green, and a blue tree sharing nodes. The n-ary representation of multi-colored trees is almost the same as the n-ary representation of XML trees, but adds a color column as part of the key:

  • parent_children(nID, color, sequence of nChild): identifies nID as a node whose children, in the tree colored color, are the nChild nodes in the sequence.

The simpler relations are similarly augmented:

  • node_firstchild(nID, color nReuben): identifies nID as a node whose first child, in the tree colored color, is nReuben.
  • node_next(nID, color, nR): identifies nID as a node whose next sibling in the tree colored color is nR.
In both of these relations, no two tuples of the same color share any final column value; the pair (color, nReuben) is thus a candidate key for the node_firstchild relation, and the pair (color, nR) for the node_next relation.

The more general description of a node and all of its neighbors is again similar to that for XML trees, with the addition of a color column:

  • node_relations(nID, color, nPar, nL, nR, nReuben, nBenjamin): identifies nID as a node in the tree colored color with the specified parent, left sibling, right sibling, first child, and last child in that tree.

It follows from the rules of construction that in a set of n colored trees, any node may have up to n parents, no two of which share a color.

The relational representation of the example of section 2-1 as a set of multi-colored trees would be very similar to the one given above for concurrent trees, if we substitute “red” for “L1” and “blue” for “L2” in the node_next relation, and add a column with the appropriate value to the rows of the node_first relation. The individual node relations remain unchanged.

A more interesting example, however, would be the actors / movies / awards example of section 2-2. It is left as an exercise for the reader.

Goddag structures

Like multi-colored trees, Goddag structures allow nodes to have multiple parents. But there is no requirement analogous to the rule that the parents of a node must have different colors. The n-ary representation of Goddag structures is thus the same as the n-ary representation of XML trees:

  • parent_children(nID, sequence of nChild): identifies nID as a node whose children are the nChild nodes in the sequence.
The only difference from XML is that there is no rule saying that a node may appear among the children of at most one other node.

Restricted Goddags have the same parent_children relation, with the constraint that parents who share children are required to have them in the same sequence.4

In colored Goddags, as might be expected from the name, the relation is augmented with a color column:

  • parent_children(nID, color, sequence of nChild): identifies nID as a node whose children, in the Goddag colored color, are the nChild nodes in the sequence.

The node_firstchild relation is, for normal and restricted Goddags, the same as in XML.

  • node_firstchild(nID, nReuben): identifies nID as a node whose first child is nReuben.
For colored Goddags, it gets an additional color column.
  • node_firstchild(nID, color nReuben): identifies nID as a node whose first child, in the graph colored color, is nReuben.

The node_next relation for restricted Goddags is the same as for XML: for any node, there is at most one right sibling:

  • node_next(nID, nR): identifies nID as a node whose next sibling is nR.
In normal Goddags, though, different parents can assign different sequences to their shared children. So the node_next relation for any node varies with which parent node we take into account:
  • node_next(nID, nPar, nR): identifies nID as a node whose next sibling among the children of nPar is nR.
Finally, in colored Goddags, the successor to any node depends both on the parent in question and on the color of the graph:
  • node_next(nID, nPar, color, nR): identifies nID as a node whose next sibling among the children of nPar is nR in the graph colored color.

The more general node_relations relation described above for the other data structures has no single analog for Goddag structures. We could specify

  • node_relations(nID, nPar, [color,] nL, nR, nReuben, nBenjamin): identifies nID as a node [in the graph colored color] which has parent nPar, first child nReuben, last child nBenjamin, and whose left and right siblings, among the children of nPar [in the graph of that color] are nL and nR.
But this would be problematic, since the values of nL and nR depend upon the entire key, but the values of nReuben and nBenjamin depend only on part of it (on nID). It would be less confusing to factor the relation appropriately:
  • node_reuben_benjamin(nID, [color,] nReuben, nBenjamin): identifies nID as a node which has first child nReuben and last child nBenjamin [in the graph of the indicated color].
  • node_parent_siblings(nID, nPar, [color,] nL, nR): identifies nID as a node [in the graph colored color] which has parent nPar and whose left and right siblings, among the children of that parent [in the graph of that color] are nL and nR.

It is useful, although not strictly speaking essential, to record the identifiers used to identify certain nodes:

  • id_node(nm id, nID): indicates that the TexMecs notation of the Goddag structure specified id as a unique identifier for node nID.

The nodes of the sample Goddag of section 2-3 could be represented thus, using the relations for normal Goddags. I omit the node_type relation as redundant:

document(DD).
document(DV).

element(n01,-,"scene").
element(n02,-,"sp").
element(n04,-,"p").
element(n06,-,"lg").
element(n08,-,"l").
element(n10,-,"l").
element(n12,-,"l").
element(n14,-,"sp").
element(n16,-,"p").
element(n18,-,"lg").
element(n19,-,"l").
element(n21,-,"sp").
element(n23,-,"p").
element(n25,-,"lg").
element(n26,-,"l").
element(n28,-,"p").
element(n99,-,"haiku").
element(n98,-,"lg").
element(n97,-,"l").
element(n96,-,"l").
element(n95,-,"l").

attribute(n03,n02,-,who,"HUGHIE").
attribute(n07,n06,-,type,"haiku").
attribute(n15,n14,-,who,"HUGHIE").
attribute(n22,n21,-,who,"HUGHIE").

pcdata(n05,"How did that translation go?").
pcdata(n09,"da de dum de dum,").
pcdata(n11,"gets a new frog,").
pcdata(n13,"...").
pcdata(n17,"Er ...").
pcdata(n20,"it's a new pond.").
pcdata(n24,"Ah ...").
pcdata(n27,"When the old pond").
pcdata(n29,"Right.  That's it.").

id_node("frog",n10).
id_node("new", n08).
id_node("pond",n26).

The graph structure of the example can be represented thus. For brevity I omit rows for nodes which lack children, or which lack following siblings (except for the PCDATA nodes which have multiple parents):

node_firstchild(DD,n01).
node_firstchild(n01,n02).
node_firstchild(n02,n04).
node_firstchild(n04,n05).
node_firstchild(n06,n08).
node_firstchild(n08,n09).
node_firstchild(n10,n11).
node_firstchild(n12,n13).
node_firstchild(n14,n16).
node_firstchild(n16,n17).
node_firstchild(n18,n19).
node_firstchild(n19,n20).
node_firstchild(n21,n23).
node_firstchild(n23,n24).
node_firstchild(n25,n26).
node_firstchild(n26,n27).
node_firstchild(n28,n29).
node_firstchild(n99,n98).
node_firstchild(n98,n97).
node_firstchild(n97,n27).
node_firstchild(n96,n11).
node_firstchild(n95,n20).

node_next(n02,n01,n14).
node_next(n04,n02,n06).
node_next(n08,n06,n10).
node_next(n10,n06,n12).
node_next(n11,n10,-).
node_next(n11,n96,-).
node_next(n14,n01,n21).
node_next(n16,n14,n18).
node_next(n20,n19,-).
node_next(n20,n95,-).
node_next(n23,n21,n25).
node_next(n25,n21,n28).
node_next(n27,n26,-).
node_next(n27,n97,-).
node_next(n97,n98,n96).
node_next(n96,n98,n95).

Other approaches to overlap

It may be worth comparing the representations given here to some other approaches to overlap.

Tabling Overlap

First, there is the rather different relational representation of overlapping structures proposed by [Durusau / O'Donnell 2004]. Durusau and O'Donnell postulate no graph structure, but they do define a relation for representing documents:

  • segments(positiveInteger ID, ns NameSpace, nm GI, nm sID, nm eID, positiveInteger CL, string PCDATA): describes one tag in the input document, with the following text node if any, with sequence number ID, expanded name (NameSpace, GI), start- and end-ID values (for Trojan Horse markup [DeRose 2004]) sID and eID, and following text node PCDATA. All tags which share the same CL value occur between the same two text nodes, in the order specified by the sequence number.
The rows in this relation can be thought of as representing parser events: each row corresponds to one tag followed optionally by one text node.

The Trojan Horse markup used by Durusau and O'Donnell specifies start- and end-points for elements. The elements thus identified may be taken as nodes in a graph. But Durusau and O'Donnell explicitly refrain from specifying rules for determining the arcs of the graph. “The table here simply records the information required to resolve any ambiguity and position within a hierarchy, deferring the resolution of any ambiguity to [...] later processing.”

The markup postulated by Durusau and O'Donnell corresponds fairly directly to that of MECS; it follows that documents represented in the segments relation can be represented as restricted Goddag structures, and that any restricted Goddag structure can be represented using the segments relation.

The segments relation can also clearly be used to represent concurrent trees over the same frontier, so it can also be used to generate XConcur documents, and vice versa. For any one sequence of segments, there may be many corresponding sets of concurrent trees, but some representation as concurrent trees is guaranteed to be possible [Sperberg-McQueen / Huitfeldt 1999].

LMNL

The Layered Markup and Annotation Language (LMNL) proposed by Jeni Tennison and Wendell Piez and their collaborators specifies a data structure for overlap, but not one whose relationship to the graph structures described elsewhere in this paper is obvious [Tennison / Piez 2002], [Anonymous 2004], [Anonymous 2006], [Cowan 2006].

A naive translation of the data model described by [Anonymous 2006], removing some of the more inconvenient redundancies, might use the following domains:

  • id: unique identifier for a limen or document.
  • range_set: sequence of ranges.
On these domains (plus some of those already used above), LMNL defines the following relations. First, there are two kinds of layers:
  • document(id id, ns base_uri, limen_id value): identifiers the base_uri + value pair as a LMNL document and assigns it the identifier id. The base_uri + value pair is a candidate key.
  • limen(id id, id owner, string content, range_set ranges): identifiers id as a limen whose content is the sequence of atoms given in content (for simplicity, we use the type string instead of attempting to model the infinite regress of the LMNL data model, which defines atoms in terms of other atoms).
  • range(id id, id owner, ns namespace, nm name, nonNegativeInteger start, nonNegativeInteger end): identifiers id as a range over the atoms in the content of the owner limen, beginning at start and ending at end (with endstart).
The annotations property of ranges is omitted here for simplicity.

Using these relations, a naive representation of the example in section 2-1 might be as follows (attributes are omitted for simplicity).

document(D0,'http://www.example.com/lmnl/hammer',L0).

limen(L0,D0,
   "Hey Paul! Would you give me the hammer?",
   {d1,s1,s2,t1,u1,u2},).

range(u1,L0,-,u,0,28).
range(u2,L0,-,u,28,39).
range(d1,L0,-,div,0,39).
range(s1,L0,-,s,0,10).
range(s2,L0,-,s,10,39).
range(t1,L0,-,text,0,30).

The LMNL data model does not explicitly define documents as graphs; there may be some way of viewing instances of the LMNL data model usefully as graphs; such a view might help elucidate the similarities and differences between LMNL and the graph-based data structures described elsewhere in this paper. It seems plausible that a graph may be defined by taking atoms and ranges as vertices, and some relations among them as edges. But it is not clear what relations among ranges, if any, may best be taken as corresponding to the parent : child and previous : next relations in the other data structures described here; that remains work for the future.

Conclusion and further work

A surprising variety of variations on XML can be described by simple modifications to the sequence relation on nodes of the document.

If we allow multiple document nodes, and make it possible for PCDATA nodes to have following siblings in some but not all documents, then we have concurrent XML.

If instead we add a color value as part of the key for the node_firstchild and node_next relations, then the result is multi-colored trees.

If instead of color we add to the node_next relation a new nPar (parent node) column, again as part of the key, then the result is normal Goddag structures.

Further consideration of the expressive power afforded (and not afforded) by these variations may shed light on the nature of the overlap problem, and may help analyse the problem further. Many discussions of overlapping structures have assumed that overlap is the name of a single problem (or at least have passed in silence over the possible differences among examples of overlap); but if each of the data structures described here successfully handles a different set of problems (not necessarily disjoint), with different degrees of naturalness or redundancy, then it seems natural to suppose that there are different kinds of overlapping structures, and that different mechanisms may be appropriate to different problems.

While that outcome may deal a blow to any dream of a unified solution to ‘The Overlap Problem’, it may also mark a stage in the maturation of our understanding of the problem area.

Notes

1.

The identifiers L1 and L2 have been uppercased to avoid confusion with the numbers eleven and twelve.

2.

The reader may find it strange that the movie-role element seems to indicate neither the movie the role is in, nor the actor who played the role.

It is quite true that an XML structure intended as a stand-alone representation either of movies or of actors and their careers would represent this information differently, using perhaps a structure of the form

<actor>
   <name>Bette David</name>
   <role>
     <movie>All about Eve</movie>
     <character>Margo Channing</character>
   </role>
   <!--* ... *-->
</actor>
But Jagadish and his collaborators are not designing such an XML representation; they are designing a database from which such a representation might be generated.

In the actor context, the name of the actor is already known, given by the movie-role element's name sibling; it may be found the traversing the XPath “parent::actor/child::name” (or more simply “../name”). Similarly, in the movie-genre context, the name of the movie is also given by a name sibling and is the value of the expression “../name”. In each case, the relevant information is stored in a parent-child (actor/movie-role or movie/movie-role) link and can be retrieved by traversing the XML tree in a predicable way. In the example database, in which the actor tree is colored red and the movie tree blue, the two pieces of information can be retrieved from any movie-role element by traversing the two extended XPath expressions “{red}parent::actor/{red}child::name” and “{blue}parent::movie/{blue}child::name”.

3.

Sharing structures among trees may be possible, but the effect on the relational representation may vary depending on what constraints are assumed to hold among the trees. If two trees share a div element, for example, are they required, or not required, to share the same children for the div, in the same order? To use traditional SGML terminology: do they share the element? or just the tags? Concurrent trees which share elements will tend to resemble the Goddag structures described below; concurrent trees which share tags but not necessarily elements will tend to resemble multi-colored trees.

4.

This constraint is hard to check in relations of this form; that is one disadvantage of this formulation.


Bibliography

[Anonymous 2004] Anonymous [Jeni Tennison? and Wendell Piez?]. 2002 (rev. 2004?). “LMNL Data Model”. Available on the Web at http://www.lmnl.net/prose/data-model/data-model-spec.html.

[Anonymous 2006] Anonymous [Jeni Tennison? and Wendell Piez?]. 2006. “LMNL Data Model”. Available on the Web at http://www.lmnl.org/wiki/index.php/LMNL_data_model.

[Cowan 2006] Cowan, John. 2006. “LOM”. Available on the Web at http://www.lmnl.org/wiki/index.php/LOM.

[Dekhtyar / Iacob 2005] Dekhtyar, Alex, and Ionut Emil Iacob. 2005. “A Framework For Management of Concurrent XML Markup”. Data and Knowledge Engineering 52.2: 185-215.

[DeRose 2004] DeRose, Steven. 2004. “Markup overlap: A review and a Horse”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/DeRose01/EML2004DeRose01.html.

[Durusau / O'Donnell 2002a] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. “Concurrent markup for XML documents”. Paper given at XML Europe 2002, Barcelona, sponsored by IDEAlliance. Available on the Web at http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-03-07/03-03-07.html.

[Durusau / O'Donnell 2002b] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. “Coming down from the trees: Next step in the evolution of markup?” Late-breaking paper given at Extreme Markup Languages 2002, Montréal, sponsored by IDEAlliance. Slides (but no full text) available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/author-pkg/2002/Durusau01/EML2002Durusau01.zip.

[Durusau / O'Donnell 2004] Durusau, Patrick, and Matthew Brook O'Donnell. 2004. “Tabling the overlap discussion”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Durusau01/EML2004Durusau01.html.

[Eckart / Teich 2007] Eckart, Richard, and Elke Teich. 2007. “An XML-based data model for flexible representation and query of linguistically interpreted corpora”. In Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007, ed. Georg Rehm, Andreas Witt, Lothar Lemnitzer. Tübingen: Gunter Narr Verlag. Pp. 327-336.

[Hilbert et al. 2005] Hilbert, Mirco, Oliver Schonefeld, and Andreas Witt. 2005. “Making CONCUR work”. Paper given at Extreme Markup Languages 2005, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2005/Witt01/EML2005Witt01.xml.

[Huitfeldt / Sperberg-McQueen 2001] Huitfeldt, Claus, and C. M. Sperberg-McQueen. 2001. “TexMECS: An experimental markup meta-language for complex documents”. 2001, rev. 2003. http://decentius.aksis.uib.no/mlcd/2003/Papers/texmecs.html

[Huitfeldt 1999] Huitfeldt, Claus. 1999. MECS—A multi-element code system. Working papers of the Wittgenstein Archives at the University of Bergen.

[Jagadish et al. 2004] Jagadish, H. V., Laks V. S. Lakshmanan, Monica Scannapieco, Divesh Srivastava, and Nuwee Wiwatwattana. 2004. “Colorful XML: One hierarchy isn't enough”. Proceedings of the 2004 ACM SIGMOD International conference on management of data, Paris, sponsored by the Association for Computing Machinery Special Interest Group on Management of Data. New York: ACM Press.

[Schonefeld / Witt 2006] Schonefeld, Oliver, and Andreas Witt. 2006. “Towards validation of concurrent markup”. Extreme Markup Languages 2006.

[Schonefeld 2007] Schonefeld, Oliver. 2007. “XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup”. In Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007, ed. Georg Rehm, Andreas Witt, Lothar Lemnitzer. Tübingen: Gunter Narr Verlag. Pp. 347-356.

[Sperberg-McQueen / Huitfeldt 1999] Sperberg-McQueen, C. M., and Claus Huitfeldt. 1999. “Concurrent document hierarchies in MECS and SGML”. Literary & Linguistic Computing 14.1: 29-42.

[Sperberg-McQueen / Huitfeldt 2000] Sperberg-McQueen, C. M., and Claus Huitfeldt. 2000. “GODDAG: A Data Structure for Overlapping Hierarchies”. In DDEP-PODDP 2000, ed. P. King and E.V. Munson, Lecture Notes in Computer Science 2023 (Berlin: Springer, 2004), pp. 139-160. Available on the Web at http://www.springerlink.com/index/98J1VBU5NBY73UL3 and http://www.w3.org/People/cmsmcq/2000/poddp2000.html.

[Tennison / Piez 2002] Tennison, Jeni, and Wendell Piez. 2002. “LMNL Syntax”. Available on the Web at http://www.lmnl.net/prose/syntax/index.html.

[Witt 2004] Witt, Andreas. 2004. “Multiple hierarchies: new aspects of an old solution”. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html



Representation of overlapping structures

C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and Artificial Intelligence Laboratory]