Tabling the Overlap Discussion

Patrick Durusau
Matthew Brook O'Donnell

Abstract

The problem of overlapping markup has often been addressed using variant syntaxes, milestones, standoff markup and other means. This proposal combines the use of co-indexed milestones with a table structure in a database, plus CONCUR namespaces and "common location" (markup tokens that have no intervening PCDATA) information to illustrate one way to "parse" XML with more information than is available using standard markup. The result is a data structure that can be explored to determine the best data structure for overlapping hierarchies.

Keywords: Concurrent Markup/Overlap; Parsing

Patrick Durusau

Patrick Durusau is the Director of Research and Development at the Society of Biblical Literature. He is currently the Chair of the US National Technical Advisory Group (V1) to ISO SC34; technical lead for the Open Scriptural Information System (OSIS) project, an effort supported by the SBL and the American Bible Society, the United Bible Societies, SIL and others; co-editor of the Topic Maps Reference Model; Chair of the Topic Maps Published Subjects TC at OASIS, as well as chairing the Topic Maps GeoLang TC and the XML Voc TC; and is a member of the Text Encoding Initiative Board of Directors. His primary research interests are topic maps, overlapping markup, Unicode and markup systems for the encoding, display and analysis of Ancient Near Eastern texts in their original languages.

Matthew Brook O'Donnell

Matthew Brook O'Donnell is Director of Research and Development for OpenText.org, an initiative to develop XML-based tools and resources for linguistic analysis. His research interests include corpus linguistics, text encoding and the linguistic analysis of ancient Greek.

Tabling the Overlap Discussion

Patrick Durusau [Society of Biblical Literature]
Matthew Brook O'Donnell [OpenText.org]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004 Patrick Durusau and Matthew Brook O'Donnell. Reproduced with permission.

Introduction

Despite years of discussion and at best partially successful proposals, interest in the problem of overlapping markup persists. Interest in the overlapping hierarchy problem has persisted because a generalized solution is more than simply a convenience for logical or style oriented presentations. Solving the problem of overlapping markup will mean at a minimum, that true versioning of markup documents can be implemented in scientific-medical-technical as well as legal publishing. For the more academically minded, the ability to record complex and overlapping markup, sometimes at variance with each other by different scholars, holds the promise to make scholarship cumulative and not episodic. And at its core, hence the continued attraction of the problem, markup will be able to represent texts as we read them, not in a dumbed down version suitable for markup parsers.

The problem of overlapping markup is complex and multi-layered. Information entropy forms the first part of the problem that is addressed herein. Briefly stated, it is the problem of the parser not having sufficient information to parse a given syntax. (This is in addition to the known problem of overlap not presenting a tree to the parser.) Since markup is an artificial language, a syntax based markup is used that allows sets of markup to be distinguished (CONCUR's use of namespaces) and distinguish nested versus overlapping markup within a single set of markup for the investigative system proposed below.

Another aspect of the problem is that there is no commonly available system that allows experimentation with the impact of varying data structures for overlapping markup. A solution is proposed that stores both markup and PCDATA in a table structure that enables experimental investigation of strategies for processing overlapping markup.

This experimental system has led the authors to conclude that a true solution to the overlapping markup problem will need to be addressed both at the level of syntax along with development of non-tree data structures for the handling of such markup. The proposed system meets both of those requirements but should be considered as a means of investigation and not a production system to solve this problem.

NOTE:

Part of the dilemma in processing markup has been an almost cult-like status granted to tree based processing, reminiscent of the Druids in England prior to the arrival of the Roman Empire. Granted there have been a few hardy souls, DeRose, Huitfeltd, McQueen, Nichols, Tennison and Piez, among others, who have questioned the current orthodoxy, daring to ask for processing that meets their needs. That the problem persists is a reflection of the inertia of a system with a "good enough" result and not a commentary on the skill or cleverness of prior proposals.

There is no doubt that tree based processing of markup has been extremely useful in a variety of contexts. The question posed herein is whether that same utility can be matched by other data structures for processing markup.

Investigative Syntax

One of the problems of overlapping markup is distinguishing to what set a particular markup token belongs was solved by ISO 8879 by the use of what look like XML Namespaces. In ISO 8879, CONCUR prefixes (a forerunner to namespaces) were used to distinguish entire trees of markup. The use proposed herein, is more general in that no tree is required and markup may overlap even within a single set of markup tokens. The prefix for every markup token is either "namespace" or left empty, in which case the token belongs to all of the sets of markup recorded.

The other problem presented when one moves away from trees is how to treat the following case:

<div>

blah, blah

<div>

blah, blah

</div>

blah, blah

</div>

Does the first <div> element match the first or second </div>? Based solely on the information presented, that is, without the presumption of well-formedness supplied by XML, it is not possible to answer that question. That is to say that the markup tokens as represented above, lack the information required to provide an answer. The XML standards "supplies" the required information by simply declaring that the second <div> element must close the first </div> element that it meets. No exceptions.

Well-formedness works well enough for simplistic documents or those that can be encoded as though they were simplistic but it is a rather draconian solution. It leads to texts that are far poorer than our common experience with them would indicate.

The problem of a lack of information for recusive structures was noted as a shortcoming in our original JITTs (Just-In-Time-Trees) proposal by Marc De Scheemaecker (author of NanoXML, http://nanoxml.sourceforge.net/orig/). For purposes of this investigation, all elements in a text are encoded using milestone elements that have the namespace plus GI and either a sID or eID value that indicates its status as a start or end element along with its "start" or "end" element partner. Looking at the example recursive plus overlapping markup, we can distinguish:

<div sID="1">

blah, blah

<div sID="2">

blah, blah

</div eID="2">

blah, blah

</div eID="1">
from:
<div sID="1">

blah, blah

<div sID="2">

blah, blah

</div eID="1">

blah, blah

</div eID="2">
easily. In full, the syntax that is used herein appears as follows:
<ROOT
  xmlns:ln="www.jitts.org/ns/line"
  xmlns:sg="www.jitts.org/ns/segment"
  xmlns:div="www.jitts.org/ns/div"
>
  <sg:text sID="t1"/>
  <div:div sID="d1" id="d12"/>
  <ln:page sID="p1" n="135"/>
  <ln:line sID="l1" n="Bk.1.1"/>
  <sg:s sID="s1"/>
  <sg:seg sID="sg1"/>Of Man's first disobedience,<sg:seg eID="sg1"/>
  <sg:seg sID="sg2"/> and the fruit<ln:line eID="l1"/>
  <ln:line sID="l2" n="Bk.1.2"/>Of that forbidden tree whose mortal taste 
  <ln:line eID="l2"/>
  <ln:line sID="l3" n="Bk.1.3"/>Brought death into the World,<sg:seg eID="sg2"/>
  <sg:seg sID="sg3"/> and all our woe,<sg:seg eID="sg3"/>
  <sg:seg sID="sg4"/>
  <ln:line eID="l3"/>
  <ln:line sID="l4" n="Bk.1.4"/>With loss of Eden,<sg:seg eID="sg4"/>
  <sg:seg sID="sg5"/> till one greater Man <ln:line eID="l4"/>
  <ln:line sID="l5" n="Bk.1.5"/>Restore us,<sg:seg eID="sg5"/>
  <sg:seg sID="sg6"/> and regain the blissful seat,<sg:seg eID="sg6"/>
  <sg:seg sID="sg7"/>
  <ln:line eID="l5"/>
  <ln:line sID="l6" n="Bk.1.6"/>Sing,<sg:seg eID="sg7"/>
  <sg:seg sID="sg8"/> Heavenly Muse,<sg:seg eID="sg8"/>
  <sg:seg sID="sg9"/> that,<sg:seg eID="sg9"/>
  <sg:seg sID="sg10"/> on the secret top <ln:line eID="l6"/>
</ROOT>
This example is drawn from Book 1 of Milton's Paradise Lost, representing three separate hierarchies.

Investigative Structure

The syntax outlined above is tokenized to place it into the following relational database structure:

ID     NameSpace     GI     sID     eID     CL    PCDATA

The columns for each row are defined as:

ID: Auto-generated sequential database ID
NameSpace: The namespace portion of the element name
GI: The generic identifier for the element
sID: The value of the sID for the element. Stands for start-ID.
eID: The value of the eID for the element. Stands for end-ID.
CL: Common Location, autoassigned for element tokens, start or end, which have no intervening PCDATA
PCDATA: PCDATA up to the next markup token

The table that is constructed differs from Earley and following work in NLP systems that use chart parsing. Earley recorded the possible parsings for ambiguous natural language situations. That requires a possible resolution for each ambiguity. The table here simply records the information required to resolve any ambiguity and position within a hierarchy, deferring the resolution of any ambiguity to a later of processing. That deferral makes it possible to experiment with both the ambiguities and resolutions of those ambiguities.

It should be noted that at processing time, an element that lacks an sID or eID attribute value is automatically assigned a "null" value for that column in its row. Not only do the sID and eID represent the common reliance on syntax to indicate starting and ending syntax, they also serve to bind two markup tokens together, creating a containment of all that falls in between.

The CL attribute is automatically assigned to deal with the problem of forced hierarchies in markup trees. Perhaps the best known example is a portion of text that is marked as follows: <bold><italic>some text</italic></bold>, which could just as easily have been: <italic><bold>some text</bold></italic>, but for some abitrary rule for markup validation. In other words, tree based markup forces the creation of hierarchies when in fact a hierarchy may either not make any sense or is contrary to the relationship to be represented.

To What End?

Other than repackaging some old ideas and syntax workarounds, what does this solution offer that others don't?

Unlike traditional solutions, this approach provides a means to investigate the information required for parsing of overlapping hierarchies and data structures that lend themselves to the parsing of such hierarchies. While the effacacy of the traditional single stack with "pop" and "push" operations cannot be denied, neither can the artificial limitations of that approach. This proposal divorces the processing of markup from that artificial limitation and allows development of alternative data structures for the processing of markup.

It may well be the case that alternative data structures will not prove as effective for the general case as that of the single stack or will prove more effective for particular types of markup. The central point being that it is better to develop systems to investigate different data structures for the processing of markup as opposed to passive acceptance of one possible choice of data structures to be used.

For investigative purposes, the proposed system trivially handles overlapping markup of any degree of complexity, since it abandons any reliance upon traditional parsing of the source XML file. It should be borne in mind that it is not being suggested as a production system for such investigations, but as a means to represented alternative data structures for the processing of XML.

Observations

The building of the database tables to represent the markup and PCDATA in the sample files lead to a number of observations concerning the processing of markup. For purposes of illustration, consider the following fourteen lines extracted from the database:

Table 1
ID NameSpace GI sID eID CL PCDATA
1 NULL text t1 NULL 1 NULL
2 www.jitts.org/ns/div div d1 NULL 1 NULL
3 www.jitts.org/ns/line page p1 NULL 1 NULL
4 www.jitts.org/ns/line line l1 NULL 1 NULL
5 www.jitts.org/ns/segment s s1 NULL 1 NULL
6 www.jitts.org/ns/segment seg sg1 NULL 1 NULL
7 NULL NULL NULL NULL 1 Of Man's first disobedience,
8 www.jitts.org/ns/segment seg NULL sg1 8 NULL
9 www.jitts.org/ns/segment seg sg2 NULL 8 NULL
10 NULL NULL NULL NULL 8 and the fruit
11 www.jitts.org/ns/line line NULL 11 11 NULL
12 www.jitts.org/ns/line line 12 NULL 11 NULL
13 NULL NULL NULL NULL 11 Of that forbidden tree whose mortal taste
14 www.jitts.org/ns/line line NULL 12 14 NULL

What is a forest of trees?

When first viewing this result, there is an uneasy feeling that there is something quite familiar in the presentation but at first blush, it appears simply to be markup being presented in a table format. What is familiar is the forest of markup trees (and the PCDATA) is being presented via a "List" structure.

Knuth defines a "List" as follows:

A forest can, in turn, be regarded as a special case of what is commonly called a list structure. The word "list" is being used here in a very technical sense, and to distinguish the technical use of the word we will always capitalize it: "List." A List is defined (recursively) as a a finite sequence of zero or more atoms or Lists. Here "atom" is an undefined concept referring to elements from any universe of objects tht might be desired, so long as it is possible to distinguish an atom from a List.1

Knuth continues to describe the differences between a List and a tree: "The big difference between Lists and trees is that Lists may overlap (that is, sub-Lists need not be disjoint) and that they may even be recursive (may contain themselves)."2

Even though a closely related structure to a tree, the use of a List, allows for unlimited overlapping and provides the mechanism that supports the extraction of multiple trees from the list shown above. The separate trees are denoted by the NameSpace column and following the CONCUR mechanism of ISO 8879, rows where NameSpace is NULL, belong to all the trees.

Discovering Overlapping Markup

After developing even a means of investigating data structures for overlapping markup, it seemed a shame to not at least "kick the wheels" in terms of actually investigating overlapping markup. True, one can rely upon the reader or a tree-based markup processor to find overlapping markup (known as as well-formedness error in the latter case). But visual inspection is time consuming and error prone and using a standard markup parser for such a purpose would be even slower.

Fortunately, the table format used herein allows not only the reconstruction of the "container" aspect of the markup, but also the automatic discovery of markup that starts before and overlaps a container as well as markup that starts "inside" a container and then overlaps the end of the container. That analysis requires two steps, the first which isolates the start and ends of elements and the results of which are used in the second step, which isolates the cases of overlap.

To isolate the start and end points of elements, the following SQL query is used:

select m1.GI as GI, 
m1.id as startRow, m2.id as endRow 
from milton_ns3 as m1, milton_ns3 as m2 
      where m1.sid = m2. eid and m1.gi='line';
Where line is the markup element whose start and end row positions are returned as a result set by the query.

Taking the first result from this query, that of "line" have a start point at row (ID) 4 and an end point at row (ID) 11, the following SQL query is used:

select m1.GI as GI, 
m1.id as startRow, m2.id as endRow 
from milton_ns3 as m1, milton_ns3 as m2 
where m1.sid = m2.eid and ((m1.id > 4 and m1.id < 11 and m2.id > 11) 
or (m1.id<4 and m2.id<11and m2.id>4)) 
order by m1.id;
Note that the "id" values are the same as those that are part of the result set from the first SQL query. The results of using these values in this query are:

Table 2
GI startRow endRow
s 5 102
seg 9 17

Which indicates that both the <s> and <seg> elements, in normal markup appearance, begin inside of the >line< element and overlap its closing markup delimiter.

The second query will also find all instances of markup that begins prior to the element in question but that overlaps it to end within the element itself. Substituting row 106 (ID) for the start of a line element and row 114 (ID) for the end of that line element in the second query produces:

Table 3
GI startRow endRow
seg 104 108
seg 112 117

Which states that a <seg> element begins before the line in question but ends before the end of the line, but there is also a <seg> element that begins inside the line and overlaps the end of the line's markup.

And what of co-located markup?

The "CL" column allows for queries to discover cases of co-located markup, that is to say markup that co-exists at a point in the PCDATA. With the tree model of XML, markup is forced to be contain or be contained by other markup. Whether that containment relationship is meaningful to the encoder or actually represents the text as observed.

It should be noted that PCDATA bears the same CL value as the immediately preceeding cluster of markup. While not strictly necessary, it does make it easier to orient the user to the text string in question. For example, after querying the database for all cases where the ID of a row matches the CL value, the user has a list of all the clusters of markup in the text. This markup may be well-formed in the traditional XML sense or just as easily malformed.

Demonstration

While the system provides a way to query across overlapping markup, users are also interested in visual display of such texts. Since the experimental system described is not bothered by overlapping markup, the returned markup can be processed into any display, such as PostScript, PDF or SVG, that allows display of overlapping portions of text, even if the "markup" within such systems does not itself overlap.

The visualization we present here is a simple charting of the list structure in the database. A script queries the database to return all columns and rows (rows with a non-null sID are joined with the ID and CL values for the rows containing the matching eID value, as columns eCL and cID):

SELECT m1.*, m2.CL as eCL, m2.ID as cID
FROM milton_ns3 as m1
LEFT JOIN  milton_ns3 as m2 on m1.sID = m2.eID

Table 4
ID NameSpace GI sID eID CL PCDATA eCL cID
1 NULL text t1 NULL 1 NULL 174 179
2 www.jitts.org/ns/div div d1 NULL 1 NULL 174 178
3 www.jitts.org/ns/line page p1 NULL 1 NULL 174 177
4 www.jitts.org/ns/line line l1 NULL 1 NULL 11 11
5 www.jitts.org/ns/segment s s1 NULL 1 NULL 101 102
6 www.jitts.org/ns/segment seg sg1 NULL 1 NULL 8 8
7 NULL NULL NULL NULL 1 Of Man's first disobedience, NULL NULL
8 www.jitts.org/ns/segment seg NULL sg1 8 NULL NULL NULL
9 www.jitts.org/ns/segment seg sg2 NULL 8 NULL 17 17
10 NULL NULL NULL NULL 8 and the fruit NULL NULL
11 www.jitts.org/ns/line line NULL 11 11 NULL NULL NULL
12 www.jitts.org/ns/line line 12 NULL 11 NULL 14 14
13 NULL NULL NULL NULL 11 Of that forbidden tree whose mortal taste NULL NULL
14 www.jitts.org/ns/line line NULL 12 14 NULL NULL NULL
15 www.jitts.org/ns/line line 13 NULL 14 NULL 20 22
16 NULL NULL NULL NULL 14 Brought death into the World, NULL NULL
17 www.jitts.org/ns/segment seg NULL sg2 17 NULL NULL NULL
18 www.jitts.org/ns/segment seg sg3 NULL 17 NULL 20 20
19 NULL NULL NULL NULL 17 and all our woe, NULL NULL
20 www.jitts.org/ns/segment seg NULL sg3 20 NULL NULL NULL
21 www.jitts.org/ns/segment seg sg4 NULL 20 NULL 25 25
22 www.jitts.org/ns/line line NULL 13 20 NULL NULL NULL
23 www.jitts.org/ns/line line 14 NULL 20 NULL 28 28
24 NULL NULL NULL NULL 20 With loss of Eden, NULL NULL

The resulting table provides the coordinates for an SVG chart with 0,0 as top-left orientation. Values in the CL column provide the x-axis values and the ID values the y-axis values (scaled appropriately). Each namespace is represented by a different color: black = NULL, blue = www.jitts.org/ns/div, green = www.jitts.org/ns/line and red = www.jitts.org/ns/line. A bar is drawn for each row containing a non-null sID to show the extent of element (eCL - ID) the row represents, and then a line drawn down from this point to the point row containing the matching eID (from the cID value).

The following image shows the resulting chart for the 24 rows above:

[Link to open this graphic in a separate page]
Occurrences of markup are immediately apparent where ending lines cross bars, for example the end of the bar line : l1 (representing element <line sID="l1"/>) overlaps with both s : s1 and seg : sg1.

This form of visual representation offers an alternative view of markup from the usual tree and hierarchical-based visual representations used for example in XML editors. The method requires further development and refinement, but it provides opportunities to study overlap at both the micro or local and macro level. The following figure shows the whole section from the beginning of Book 1 of Paradise Lost.

[Link to open this graphic in a separate page]
Although the words are no longer readable, the repeated pattern of overlapp between the green and red bars stands out. These represent the conflict between the <line> and <seg> elements. Such visualizations could be applied to much larger selections of text and providing a visual analysis akin the the dotplot method used for detected repeated segments in text. (Church, 1993)

PS: It also does Trees

Just a quick aside before we conclude to assure any hard breathing XML advocates that the proposed system can also output valid and well-formed trees. Personally we don't understand the obsession but in our second concession to marketing (did you spot the first one?) we note that extraction of a well-formed tree results from a query that specifies a particular NameSpace and NULL for NameSpace (which retrieves all the elements common to all trees or PCDATA). The output includes the appropriate GI and the value of the sID or eID column allows trivial construction of a traditional markup tree.

It should also be noted that any result set can be tested for the occurrence of overlap and a decision made whether to avoid outputting a markup event or supplying a miletone equivalent to avoid overlapping markup. This is an application of the JITTs principle, not to parsing, but to the creation of output that is required to be well-formed. In such cases, the common location information can be used to reorder potential markup events to avoid syntactic overlap.

Conclusion

While the experimental system proposed is probably not robust enough for production use, it does provide a ready means for investigation of overlapping markup and data structures for its representation. It represents a blend of both syntax and data representation that divide and conquer the overlapping markup problem.

The results presented above suggest that using lists structures for some stages of processing of markup may bring advantages that are not possible with processing that is based upon trees. Such issues as native implementations of list processing for markup, as part of a processing pipeline, form the future direction of our research.

Notes

1.

Knuth, Donald. The Art of Computer Programming, volume 1, Fundamental Algorithms, Addison-Wesley, 1997, at page 315.

2.

Knuth, Donald. The Art of Computer Programming, volume 1, Fundamental Algorithms, Addison-Wesley, 1997, at page 316.


Bibliography

[Church/Helfman (1993)] Church, K.W., and Helfman, J.I. Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code. Journal of Computational and Graphical Statistics, V2, N2, pp. 153-174, June 1993

[Coombs/Renear/DeRose] Coombs, J.; Renear, Allen; and DeRose, Steve. Markup Systems and the Future of Scholarly Text Processing. http://www.oasis-open.org/cover/coombs.html

[Durusau/O'Donnell (2001)] Durusau, Patrick, and O'Donnell, Matthew. Implementing Concurrent Markup in XML (Extreme Markup 2001). http://www.sbl-site.org/TechResearch/files/Implementing_concur.pdf

[Durusau/O'Donnell (2002)] Durusau, Patrick, and O'Donnell, Matthew. Implementing JITTs (Just-in-Time-Trees) (XML 2002). http://www.sbl-site.org/TechResearch/files/Implementing_JITTS.pdf

[Earley (1970)] Earley, J. An Efficient Context-free Parsing Algorithm. Communication of the ACM 6(8):94-102, February, 1970

[Huitfeldt] Huitfeldt, Claus. Multi-Dimensional Texts in a One-Dimensional Medium. http://www.hit.uib.no/wab/el_texts/no5/huitfeldt.htm

[Knuth (1997)] Knuth, Donald R. The Art of Computer Programming, Fundamental Algorithms, 3rd Edition. Addison Wesley, 1997

[Nichol] Nichol, Gavin Thomas. Core Range Algebra: Toward a formal model of markup. http://www.mind-to-mind.com/library/papers/ara/core-range-algebra-03-2002.html

[Sperberg-McQueen/Burnard] Sperberg-McQueen, C.M., and Burnard, Lou (eds). Guidelines for Text Encoding and Interchange (P3) Chicago and Oxford: Text Encoding Initiative

[Tennison/Piez] Tennison, Jeni, and Piez, Wendell. The Layered Markup and Annotation Language (LMNL). http://xml.coverpages.org/LMNL-Abstract.html

[Tomita (1986)] Tomita, Masura. Efficient Parsing for Natural Language. Kluwer Academic Publishers, 1986



Tabling the Overlap Discussion

Patrick Durusau [Society of Biblical Literature]
Matthew Brook O'Donnell [OpenText.org]