The problem of overlapping markup has often been addressed using variant syntaxes, milestones, standoff markup and other means. This proposal combines the use of co-indexed milestones with a table structure in a database, plus CONCUR namespaces and "common location" (markup tokens that have no intervening PCDATA) information to illustrate one way to "parse" XML with more information than is available using standard markup. The result is a data structure that can be explored to determine the best data structure for overlapping hierarchies.
Despite years of discussion and at best partially successful proposals, interest in the problem of overlapping markup persists. Interest in the overlapping hierarchy problem has persisted because a generalized solution is more than simply a convenience for logical or style oriented presentations. Solving the problem of overlapping markup will mean at a minimum, that true versioning of markup documents can be implemented in scientific-medical-technical as well as legal publishing. For the more academically minded, the ability to record complex and overlapping markup, sometimes at variance with each other by different scholars, holds the promise to make scholarship cumulative and not episodic. And at its core, hence the continued attraction of the problem, markup will be able to represent texts as we read them, not in a dumbed down version suitable for markup parsers.
The problem of overlapping markup is complex and multi-layered. Information entropy forms the first part of the problem that is addressed herein. Briefly stated, it is the problem of the parser not having sufficient information to parse a given syntax. (This is in addition to the known problem of overlap not presenting a tree to the parser.) Since markup is an artificial language, a syntax based markup is used that allows sets of markup to be distinguished (CONCUR's use of namespaces) and distinguish nested versus overlapping markup within a single set of markup for the investigative system proposed below.
Another aspect of the problem is that there is no commonly available system that allows experimentation with the impact of varying data structures for overlapping markup. A solution is proposed that stores both markup and PCDATA in a table structure that enables experimental investigation of strategies for processing overlapping markup.
This experimental system has led the authors to conclude that a true solution to the overlapping markup problem will need to be addressed both at the level of syntax along with development of non-tree data structures for the handling of such markup. The proposed system meets both of those requirements but should be considered as a means of investigation and not a production system to solve this problem.
Part of the dilemma in processing markup has been an almost cult-like status granted to tree based processing, reminiscent of the Druids in England prior to the arrival of the Roman Empire. Granted there have been a few hardy souls, DeRose, Huitfeltd, McQueen, Nichols, Tennison and Piez, among others, who have questioned the current orthodoxy, daring to ask for processing that meets their needs. That the problem persists is a reflection of the inertia of a system with a "good enough" result and not a commentary on the skill or cleverness of prior proposals.
There is no doubt that tree based processing of markup has been extremely useful in a variety of contexts. The question posed herein is whether that same utility can be matched by other data structures for processing markup.
One of the problems of overlapping markup is distinguishing to what set a particular markup token belongs was solved by ISO 8879 by the use of what look like XML Namespaces. In ISO 8879, CONCUR prefixes (a forerunner to namespaces) were used to distinguish entire trees of markup. The use proposed herein, is more general in that no tree is required and markup may overlap even within a single set of markup tokens. The prefix for every markup token is either "namespace" or left empty, in which case the token belongs to all of the sets of markup recorded.
The other problem presented when one moves away from trees is how to treat the following case:
<div> blah, blah <div> blah, blah </div> blah, blah </div>
Does the first <div> element match the first or second </div>? Based solely on the information presented, that is, without the presumption of well-formedness supplied by XML, it is not possible to answer that question. That is to say that the markup tokens as represented above, lack the information required to provide an answer. The XML standards "supplies" the required information by simply declaring that the second <div> element must close the first </div> element that it meets. No exceptions.
Well-formedness works well enough for simplistic documents or those that can be encoded as though they were simplistic but it is a rather draconian solution. It leads to texts that are far poorer than our common experience with them would indicate.
The problem of a lack of information for recusive structures was noted as a shortcoming in our original JITTs (Just-In-Time-Trees) proposal by Marc De Scheemaecker (author of NanoXML, http://nanoxml.sourceforge.net/orig/). For purposes of this investigation, all elements in a text are encoded using milestone elements that have the namespace plus GI and either a sID or eID value that indicates its status as a start or end element along with its "start" or "end" element partner. Looking at the example recursive plus overlapping markup, we can distinguish:
<div sID="1"> blah, blah <div sID="2"> blah, blah </div eID="2"> blah, blah </div eID="1">
<div sID="1"> blah, blah <div sID="2"> blah, blah </div eID="1"> blah, blah </div eID="2">
<ROOT xmlns:ln="www.jitts.org/ns/line" xmlns:sg="www.jitts.org/ns/segment" xmlns:div="www.jitts.org/ns/div" > <sg:text sID="t1"/> <div:div sID="d1" id="d12"/> <ln:page sID="p1" n="135"/> <ln:line sID="l1" n="Bk.1.1"/> <sg:s sID="s1"/> <sg:seg sID="sg1"/>Of Man's first disobedience,<sg:seg eID="sg1"/> <sg:seg sID="sg2"/> and the fruit<ln:line eID="l1"/> <ln:line sID="l2" n="Bk.1.2"/>Of that forbidden tree whose mortal taste <ln:line eID="l2"/> <ln:line sID="l3" n="Bk.1.3"/>Brought death into the World,<sg:seg eID="sg2"/> <sg:seg sID="sg3"/> and all our woe,<sg:seg eID="sg3"/> <sg:seg sID="sg4"/> <ln:line eID="l3"/> <ln:line sID="l4" n="Bk.1.4"/>With loss of Eden,<sg:seg eID="sg4"/> <sg:seg sID="sg5"/> till one greater Man <ln:line eID="l4"/> <ln:line sID="l5" n="Bk.1.5"/>Restore us,<sg:seg eID="sg5"/> <sg:seg sID="sg6"/> and regain the blissful seat,<sg:seg eID="sg6"/> <sg:seg sID="sg7"/> <ln:line eID="l5"/> <ln:line sID="l6" n="Bk.1.6"/>Sing,<sg:seg eID="sg7"/> <sg:seg sID="sg8"/> Heavenly Muse,<sg:seg eID="sg8"/> <sg:seg sID="sg9"/> that,<sg:seg eID="sg9"/> <sg:seg sID="sg10"/> on the secret top <ln:line eID="l6"/> </ROOT>
The syntax outlined above is tokenized to place it into the following relational database structure:
ID NameSpace GI sID eID CL PCDATA
The columns for each row are defined as:
The table that is constructed differs from Earley and following work in NLP systems that use chart parsing. Earley recorded the possible parsings for ambiguous natural language situations. That requires a possible resolution for each ambiguity. The table here simply records the information required to resolve any ambiguity and position within a hierarchy, deferring the resolution of any ambiguity to a later of processing. That deferral makes it possible to experiment with both the ambiguities and resolutions of those ambiguities.
It should be noted that at processing time, an element that lacks an sID or eID attribute value is automatically assigned a "null" value for that column in its row. Not only do the sID and eID represent the common reliance on syntax to indicate starting and ending syntax, they also serve to bind two markup tokens together, creating a containment of all that falls in between.
The CL attribute is automatically assigned to deal with the problem of forced hierarchies in markup trees. Perhaps the best known example is a portion of text that is marked as follows: <bold><italic>some text</italic></bold>, which could just as easily have been: <italic><bold>some text</bold></italic>, but for some abitrary rule for markup validation. In other words, tree based markup forces the creation of hierarchies when in fact a hierarchy may either not make any sense or is contrary to the relationship to be represented.
Other than repackaging some old ideas and syntax workarounds, what does this solution offer that others don't?
Unlike traditional solutions, this approach provides a means to investigate the information required for parsing of overlapping hierarchies and data structures that lend themselves to the parsing of such hierarchies. While the effacacy of the traditional single stack with "pop" and "push" operations cannot be denied, neither can the artificial limitations of that approach. This proposal divorces the processing of markup from that artificial limitation and allows development of alternative data structures for the processing of markup.
It may well be the case that alternative data structures will not prove as effective for the general case as that of the single stack or will prove more effective for particular types of markup. The central point being that it is better to develop systems to investigate different data structures for the processing of markup as opposed to passive acceptance of one possible choice of data structures to be used.
For investigative purposes, the proposed system trivially handles overlapping markup of any degree of complexity, since it abandons any reliance upon traditional parsing of the source XML file. It should be borne in mind that it is not being suggested as a production system for such investigations, but as a means to represented alternative data structures for the processing of XML.
The building of the database tables to represent the markup and PCDATA in the sample files lead to a number of observations concerning the processing of markup. For purposes of illustration, consider the following fourteen lines extracted from the database:
|7||NULL||NULL||NULL||NULL||1||Of Man's first disobedience,|
|10||NULL||NULL||NULL||NULL||8||and the fruit|
|13||NULL||NULL||NULL||NULL||11||Of that forbidden tree whose mortal taste|
When first viewing this result, there is an uneasy feeling that there is something quite familiar in the presentation but at first blush, it appears simply to be markup being presented in a table format. What is familiar is the forest of markup trees (and the PCDATA) is being presented via a "List" structure.
Knuth defines a "List" as follows:
A forest can, in turn, be regarded as a special case of what is commonly called a list structure. The word "list" is being used here in a very technical sense, and to distinguish the technical use of the word we will always capitalize it: "List." A List is defined (recursively) as a a finite sequence of zero or more atoms or Lists. Here "atom" is an undefined concept referring to elements from any universe of objects tht might be desired, so long as it is possible to distinguish an atom from a List.1
Knuth continues to describe the differences between a List and a tree: "The big difference between Lists and trees is that Lists may overlap (that is, sub-Lists need not be disjoint) and that they may even be recursive (may contain themselves)."2
Even though a closely related structure to a tree, the use of a List, allows for unlimited overlapping and provides the mechanism that supports the extraction of multiple trees from the list shown above. The separate trees are denoted by the NameSpace column and following the CONCUR mechanism of ISO 8879, rows where NameSpace is NULL, belong to all the trees.
After developing even a means of investigating data structures for overlapping markup, it seemed a shame to not at least "kick the wheels" in terms of actually investigating overlapping markup. True, one can rely upon the reader or a tree-based markup processor to find overlapping markup (known as as well-formedness error in the latter case). But visual inspection is time consuming and error prone and using a standard markup parser for such a purpose would be even slower.
Fortunately, the table format used herein allows not only the reconstruction of the "container" aspect of the markup, but also the automatic discovery of markup that starts before and overlaps a container as well as markup that starts "inside" a container and then overlaps the end of the container. That analysis requires two steps, the first which isolates the start and ends of elements and the results of which are used in the second step, which isolates the cases of overlap.
To isolate the start and end points of elements, the following SQL query is used:
select m1.GI as GI, m1.id as startRow, m2.id as endRow from milton_ns3 as m1, milton_ns3 as m2 where m1.sid = m2. eid and m1.gi='line';
Taking the first result from this query, that of "line" have a start point at row (ID) 4 and an end point at row (ID) 11, the following SQL query is used:
select m1.GI as GI, m1.id as startRow, m2.id as endRow from milton_ns3 as m1, milton_ns3 as m2 where m1.sid = m2.eid and ((m1.id > 4 and m1.id < 11 and m2.id > 11) or (m1.id<4 and m2.id<11and m2.id>4)) order by m1.id;
Which indicates that both the <s> and <seg> elements, in normal markup appearance, begin inside of the >line< element and overlap its closing markup delimiter.
The second query will also find all instances of markup that begins prior to the element in question but that overlaps it to end within the element itself. Substituting row 106 (ID) for the start of a line element and row 114 (ID) for the end of that line element in the second query produces:
Which states that a <seg> element begins before the line in question but ends before the end of the line, but there is also a <seg> element that begins inside the line and overlaps the end of the line's markup.
The "CL" column allows for queries to discover cases of co-located markup, that is to say markup that co-exists at a point in the PCDATA. With the tree model of XML, markup is forced to be contain or be contained by other markup. Whether that containment relationship is meaningful to the encoder or actually represents the text as observed.
It should be noted that PCDATA bears the same CL value as the immediately preceeding cluster of markup. While not strictly necessary, it does make it easier to orient the user to the text string in question. For example, after querying the database for all cases where the ID of a row matches the CL value, the user has a list of all the clusters of markup in the text. This markup may be well-formed in the traditional XML sense or just as easily malformed.
While the system provides a way to query across overlapping markup, users are also interested in visual display of such texts. Since the experimental system described is not bothered by overlapping markup, the returned markup can be processed into any display, such as PostScript, PDF or SVG, that allows display of overlapping portions of text, even if the "markup" within such systems does not itself overlap.
The visualization we present here is a simple charting of the list structure in the database. A script queries the database to return all columns and rows (rows with a non-null sID are joined with the ID and CL values for the rows containing the matching eID value, as columns eCL and cID):
SELECT m1.*, m2.CL as eCL, m2.ID as cID FROM milton_ns3 as m1 LEFT JOIN milton_ns3 as m2 on m1.sID = m2.eID
|7||NULL||NULL||NULL||NULL||1||Of Man's first disobedience,||NULL||NULL|
|10||NULL||NULL||NULL||NULL||8||and the fruit||NULL||NULL|
|13||NULL||NULL||NULL||NULL||11||Of that forbidden tree whose mortal taste||NULL||NULL|
|16||NULL||NULL||NULL||NULL||14||Brought death into the World,||NULL||NULL|
|19||NULL||NULL||NULL||NULL||17||and all our woe,||NULL||NULL|
|24||NULL||NULL||NULL||NULL||20||With loss of Eden,||NULL||NULL|
The resulting table provides the coordinates for an SVG chart with 0,0 as top-left orientation. Values in the CL column provide the x-axis values and the ID values the y-axis values (scaled appropriately). Each namespace is represented by a different color: black = NULL, blue = www.jitts.org/ns/div, green = www.jitts.org/ns/line and red = www.jitts.org/ns/line. A bar is drawn for each row containing a non-null sID to show the extent of element (eCL - ID) the row represents, and then a line drawn down from this point to the point row containing the matching eID (from the cID value).
The following image shows the resulting chart for the 24 rows above:
This form of visual representation offers an alternative view of markup from the usual tree and hierarchical-based visual representations used for example in XML editors. The method requires further development and refinement, but it provides opportunities to study overlap at both the micro or local and macro level. The following figure shows the whole section from the beginning of Book 1 of Paradise Lost.
Just a quick aside before we conclude to assure any hard breathing XML advocates that the proposed system can also output valid and well-formed trees. Personally we don't understand the obsession but in our second concession to marketing (did you spot the first one?) we note that extraction of a well-formed tree results from a query that specifies a particular NameSpace and NULL for NameSpace (which retrieves all the elements common to all trees or PCDATA). The output includes the appropriate GI and the value of the sID or eID column allows trivial construction of a traditional markup tree.
It should also be noted that any result set can be tested for the occurrence of overlap and a decision made whether to avoid outputting a markup event or supplying a miletone equivalent to avoid overlapping markup. This is an application of the JITTs principle, not to parsing, but to the creation of output that is required to be well-formed. In such cases, the common location information can be used to reorder potential markup events to avoid syntactic overlap.
While the experimental system proposed is probably not robust enough for production use, it does provide a ready means for investigation of overlapping markup and data structures for its representation. It represents a blend of both syntax and data representation that divide and conquer the overlapping markup problem.
The results presented above suggest that using lists structures for some stages of processing of markup may bring advantages that are not possible with processing that is based upon trees. Such issues as native implementations of list processing for markup, as part of a processing pipeline, form the future direction of our research.
Knuth, Donald. The Art of Computer Programming, volume 1, Fundamental Algorithms, Addison-Wesley, 1997, at page 315.
Knuth, Donald. The Art of Computer Programming, volume 1, Fundamental Algorithms, Addison-Wesley, 1997, at page 316.
[Church/Helfman (1993)] Church, K.W., and Helfman, J.I. Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code. Journal of Computational and Graphical Statistics, V2, N2, pp. 153-174, June 1993
[Coombs/Renear/DeRose] Coombs, J.; Renear, Allen; and DeRose, Steve. Markup Systems and the Future of Scholarly Text Processing. http://www.oasis-open.org/cover/coombs.html
[Durusau/O'Donnell (2001)] Durusau, Patrick, and O'Donnell, Matthew. Implementing Concurrent Markup in XML (Extreme Markup 2001). http://www.sbl-site.org/TechResearch/files/Implementing_concur.pdf
[Durusau/O'Donnell (2002)] Durusau, Patrick, and O'Donnell, Matthew. Implementing JITTs (Just-in-Time-Trees) (XML 2002). http://www.sbl-site.org/TechResearch/files/Implementing_JITTS.pdf
[Earley (1970)] Earley, J. An Efficient Context-free Parsing Algorithm. Communication of the ACM 6(8):94-102, February, 1970
[Huitfeldt] Huitfeldt, Claus. Multi-Dimensional Texts in a One-Dimensional Medium. http://www.hit.uib.no/wab/el_texts/no5/huitfeldt.htm
[Knuth (1997)] Knuth, Donald R. The Art of Computer Programming, Fundamental Algorithms, 3rd Edition. Addison Wesley, 1997
[Nichol] Nichol, Gavin Thomas. Core Range Algebra: Toward a formal model of markup. http://www.mind-to-mind.com/library/papers/ara/core-range-algebra-03-2002.html
[Sperberg-McQueen/Burnard] Sperberg-McQueen, C.M., and Burnard, Lou (eds). Guidelines for Text Encoding and Interchange (P3) Chicago and Oxford: Text Encoding Initiative
[Tennison/Piez] Tennison, Jeni, and Piez, Wendell. The Layered Markup and Annotation Language (LMNL). http://xml.coverpages.org/LMNL-Abstract.html
[Tomita (1986)] Tomita, Masura. Efficient Parsing for Natural Language. Kluwer Academic Publishers, 1986