Overlap in markup occurs where some markup structures do not nest, such as where the structural division of the text into lists, sections, etc., differs from the syntactic division of the text into sentences and phrases. The Multiple Annotation solution to this problem (redundant encoding in multiple forms) has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. But it has the significant disadvantage of independence of the separate files. These multiply annotated files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) can be programmatically derived and used together for editing, for inference, or for unification of the multiply annotated documents.
Keywords: Concurrent Markup/Overlap
Markup expresses characteristics or interpretations of text. There is, at least potentially, more than one view of a given text. Often it is necessary to express these different or alternative views explicitly, i.e. by markup. At the moment, there seems to be a tendency to annotate more and more information for a given text. This development is clearly taking place in linguistics, where language data is associated with information from several linguistic levels of description (e.g. semantics, syntax, morphology, phonology), levels which are relatively independent of each other. But text simply published on the Web is also combined with more and more meta-information. Since markup expresses meta-information about text, the amount of markup will increase, especially if the Semantic Web emerges. And, of course, the more markup there is, the more likely it becomes to encounter multiple hierarchies.
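This paper deals with two different problems: the combination of heterogeneous tag sets within one annotation, and the representation of overlapping hierarchies, which SGML-based markup systems cannot handle.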
As a solution to both problems, the technique of annotating documents in multiple forms is proposed and described in detail. The paper also discusses the disadvantages of the approach, which are certainly the reason why many projects reject it: "An obvious and also simple solution would be to make a separate file for each transcription. However, this makes comparison between levels unnecessarily cumbersome, and it is notoriously difficult to keep track of revisions in parallel files." [Haugen (2004)]
This paper shows how it is possible and what is needed to overcome these problems.
Publishing, especially print publishing, was the driving force behind the development of markup languages. Text was viewed as an OHCO [ordered hierarchy of content objects]. Consequently most markup languages are based on the OHCO assumption. The term and the acronym were introduced by [DeRose et al. (1990)] and were further discussed by [Renear et al. 1996].1
From a formal point of view, SGML-based markup systems allow for the representation of exactly one hierarchy. Hence, in principle, only one structure can be represented in one document. In practice, this restriction often does not receive special attention, as different structures can often be expressed within one hierarchy. For example, the logical structure of a text, i.e. the division into captions, lists, sections etc., differs completely from its syntactic structure, i.e. the division of the text into sentences and phrases. In particular, none of the elements belonging to the different tag sets overlap. Hence, it is possible to project both structures into one hierarchy without problems. The disadvantage is, however, that this necessarily results in a mixture of these structures, in the annotated text as well as in the corresponding document grammar.
Nonetheless, the problem of multiple hierarchies is often discussed. The main reason for this might be the view of document engineers, who are faced with the fact that ranges of text marked up by SGML or XML elements must not overlap. The single-hierarchically structured text is a consequence of this restriction. If no such overlap occurs, the problem of combining heterogeneous tag sets is often ignored. Hence, a mixture of structures can be found quite often in text represented in one syntactic hierarchy. One example has already been given; another example is HTML. Even in its 'strict' version, different structures can be mixed, at least through the often promoted use of the elements span and div combined with a class attribute.
To avoid confusion when talking about multiply structured text and text ideally organized by multiple hierarchies, the terms 'level' or 'level of description' are used when referring to a logical unit, e.g. visual document structure or logical text structure. When referring to a structure organizing the text technically in a hierarchically ordered way, the terms 'layer' or 'tier' are used. A level can be expressed by means of one or more layers, and a layer can include markup information on one or more levels (see also [Bayerl et al. 1999]).2
The problem of representing multiple hierarchies has often been addressed and several solutions have been proposed, especially in the field of humanities computing, which is by nature concerned with text and its interpretation or its description. Consequently, the best collection of techniques is presented by the TEI [Text Encoding Initiative] ([ACH/ACL/ALLC 1994] and [Barnard et al. (1995)]). The TEI describes the techniques for using SGML for annotating multiple hierarchies.
Another technique not mentioned directly by the TEI guidelines is stand-off annotation, i.e. (new) layers of annotation are added by building a new tree whose nodes are SGML elements which do not contain textual content (#PCDATA in terms of the DTD [Document Type Definition] syntax), but links to another layer.
In some respects stand-off annotation is a generalization of the virtual joins, because not only the contents of elements are joined, but also ranges between points within the document. Sometimes these ranges make use of markup already contained in a layer; sometimes "special pointers are used in the annotation to refer to the specific text elements which are the object of the annotation" [Pianta and Bentivogli (2004)]. This second approach was already described when the concept was introduced [Thompson and McKelvie (1997)].
In practice, however, most often a layer already annotated is taken as a primary annotation tier, to which the stand-off annotation is linked. In the case of linguistic annotation often the annotation level 'word' is used as such a primary annotation layer.
In most of its applications, stand-off annotation makes use of one layer as the link target of the new tier, but it is also possible to link to several already existing layers [Carletta et al. 2003].
In any case, stand-off annotation results in new hierarchies established by new annotation layers linked to already existing annotations. Sometimes this new layer is included in the same document, and sometimes the layers are separated.
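The basic idea of stand-off annotation can be illustrated with a minimal sketch. It assumes that annotations point into the primary text by character offsets; the systems cited above typically link to element identifiers or use a pointer syntax instead. The names Span and resolve are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A stand-off annotation: a label plus a range of the primary text."""
    label: str
    start: int  # offset of the first character
    end: int    # offset one past the last character


def resolve(text: str, span: Span) -> str:
    """Return the stretch of primary text a span points to."""
    return text[span.start:span.end]


primary = "No cassette is loaded."

# Two independent layers pointing into the same primary text.
layout = [Span("li", 0, 22)]
words = [Span("w", 0, 2), Span("w", 3, 11), Span("w", 12, 14), Span("w", 15, 21)]

for span in layout + words:
    print(span.label, repr(resolve(primary, span)))
```

Because neither layer contains the text itself, a further layer can be added without touching the existing ones.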
This approach has the advantage that it is SGML/XML based and that different levels of description are separate. However, this approach has some drawbacks too:
The Namespaces standard provides a mechanism to specify where a specific element has been defined [Bray et al. 1999]. Elements are connected with their defining document grammars by adding a prefix to the element or attribute names. The prefix points, at least conceptually, to a document grammar in which the element or the attribute is defined. Thus the logical structure of a text can be marked up with e.g. (X)HTML elements for captions, sections, lists etc., and its syntactic structure can be marked up by using an adequate module of the DTD of the TEI [Text Encoding Initiative]. If a corresponding namespace has been defined, a caption belonging to the logical structure of the text can be referenced by html:h2 instead of only h2, whereas a word or a morph can be marked up by tei:w or tei:m instead of w or m. This enrichment of the annotation makes it easier to recognize the relation between the annotation and a specific level (here text structure and morphology).
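How a namespace-aware parser exposes this information can be sketched in a few lines of Python using the standard library module xml.etree.ElementTree. The sample document and the helper level_of are hypothetical; the XHTML namespace name is the official one, while the TEI namespace name is used here purely for illustration and is an assumption with respect to the DTD modules discussed in this paper.

```python
import xml.etree.ElementTree as ET

doc = """<doc xmlns:html="http://www.w3.org/1999/xhtml"
     xmlns:tei="http://www.tei-c.org/ns/1.0">
  <html:h2><tei:w>Problem</tei:w></html:h2>
</doc>"""


def level_of(element):
    """Return the namespace name of an element, or None if it has none."""
    tag = element.tag  # ElementTree stores qualified names as '{uri}local'
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


root = ET.fromstring(doc)
for element in root.iter():
    print(element.tag, "->", level_of(element))
```

The prefix itself is not retained by the parser; only the namespace name is, which matches the observation that prefixes are merely placeholders.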
Unfortunately, some problems remain. Sometimes a document grammar defines several different structures, possibly in a modular way. The document grammars defined by the TEI DTD are a good example of this. As an ad-hoc solution, one could try to define different namespaces for the same document grammar, e.g. a first prefix teins1 and a second prefix teins2. Because the prefixes are only placeholders for the expanded namespace names, it would be necessary to declare several different 'real' namespaces for one DTD. But this would definitely be against the intention of the standard.
Nonetheless, namespaces are an important help when using markup that belongs to different levels of description, since they provide a means to refer to an element not only by its name or its generic identifier but additionally by its defining document grammar.
A minor problem of namespaces might occur when using schema languages which allow for context-sensitive definitions of content models. With this technique it is possible to define different content models for regions marked up with elements of the same name. RELAX NG and XML Schema, for example, allow for such definitions. An often used example of this option is the (slightly) different definition of an element para in sections and of an element para in the context of a footnote, where further footnotes should be prohibited. But since the namespace points to the document grammar and not to the element definition, context-sensitively defined elements cannot be distinguished.
One problem has not been addressed by the namespace recommendation at all: the problem of overlapping hierarchies.
Some non-SGML-based markup languages have been proposed in the last few years. Examples of such markup languages are the Multi-Element Code System (MECS) [Sperberg-McQueen and Huitfeld 1999] and TexMECS [Huitfeldt and Sperberg-McQueen, 2001]. Their major extension with respect to SGML and XML is that overlapping ranges are admitted within documents.
In 2002 another markup language was proposed, called LMNL [Layered Markup and Annotation Language] [Tennison and Piez 2002]. LMNL not only allows overlapping elements to be annotated but also allows element names to be connected to the corresponding annotation levels. All structures modeled by XML can also be modeled by LMNL.
The problem of annotating multiple hierarchies can be divided into two different and relatively independent problems: (1) SGML-based markup systems cannot handle 'overlapping hierarchies' and (2) the tag sets used or needed for a certain annotation task are sometimes quite heterogeneous. The first problem is addressed by the solutions proposed in the TEI guidelines, by stand-off annotation, and by the TexMECS markup language, which does not conform to SGML. The second problem is addressed by the namespace recommendation.
LMNL provides a solution for both problems: regions marked up by different elements may overlap, and its layered annotation approach is specially designed for this task. But since LMNL does not conform to SGML, not to mention XML, it has, to my knowledge, not been applied up to now.
Another possibility mentioned is redundant encoding in multiple forms. This approach is rarely used by the markup community. The reasons for this seem to be clear: first, most people try to avoid redundancy. Second, and more important, multiple encodings in different forms are independent of each other, but those who want to deal with annotated text are only interested in an integrated format.
On the other hand it is also an advantage if one annotated document is not related to another document, because then the document is an independent unit of information. This leads to several more advantages.
We therefore conclude that this approach has a lot of advantages with respect to editing, maintenance, interchange, and reusability of XML-annotated data. What remains to be solved is the main drawback of independent annotations: how is it possible to connect these layers?
We also conclude that a special representation model for these data is needed, because of the redundancy in the data. This representation format is desired for storing and processing this information. From a theoretical point of view, LMNL would be an ideal format. From a practical viewpoint a stand-off annotation approach is most suited for these tasks and, in fact, is used most frequently.
Besides the advantages of annotation in multiple forms, the main problem of this approach has been addressed: the independence of the tiers. But the interrelations of annotation layers are of interest to many people concerned with the structuring and modeling of information. In this section a method is presented which complements the advantages of redundant encoding of information in multiple forms with possibilities to link these multiple forms and represent them uniformly. Furthermore, conversion tools for the annotation format and possible representation formats are described.
There is one obvious way to interrelate different annotations of the same textual data: the different annotations could be regarded as transformations of each other. Hence, the relations between the XML documents can be declared in an XSLT program or XSLT stylesheet. This stylesheet can be viewed as a description of the relations between two XML vocabularies. But composing such a stylesheet requires information on the relations between the elements defined in the different vocabularies. Moreover, this approach can only be successful if the relations between the elements can be stated unambiguously.
Another way to link the different forms was proposed by [Witt 2002]. The central idea of this approach is that the annotated text itself serves as the link. This is achieved by annotating exactly the same text several times.
This approach is described by means of a simple example. Below a part of a users' manual is given.
<xhtml>
  <h1>TROUBLESHOOTING</h1>
  ...
  <table border="1">
    <tr>
      <td align="center">Problem</td>
      <td align="center">Cause</td>
      <td align="center">Remedy</td>
    </tr>
    <tr>
      <td valign="top">Tape does not run.</td>
      <td valign="top"><ul>
        <li>Power cord is off.</li>
        <li>Tape is completely wound up.</li>
        <li>Tape is loose.</li>
        <li>Cassette is not loaded properly.</li>
        <li>Defective cassette.</li>
      </ul></td>
      <td valign="top"><ul>
        <li>Check power cord.</li>
        <li>Rewind tape.</li>
        <li>Tighten tape with a pencil, etc.</li>
        <li>Load cassette properly.</li>
        <li>Replace cassette.</li>
      </ul></td>
    </tr>
    <tr>
      <td valign="top">Tape is not recorded when recording button is pressed.</td>
      <td valign="top"><ul>
        <li>No cassette is loaded.</li>
        <li>Erase prevention tab is broken off.</li>
      </ul></td>
      <td valign="top"><ul>
        <li>Load cassette.</li>
        <li>Cover hole with plastic tape.</li>
      </ul></td>
    </tr>
  </table>
</xhtml>
<r>
  <h1>TROUBLESHOOTING</h1>
  ...
  <p-c-r>
    <description>
      <first>Problem</first>
      <second>Cause</second>
      <third>Remedy</third>
    </description>
    <case>
      <problem>Tape does not run.</problem>
      <potential_causes>
        <cause>Power cord is off.</cause>
        <cause>Tape is completely wound up.</cause>
        <cause>Tape is loose.</cause>
        <cause>Cassette is not loaded properly.</cause>
        <cause>Defective cassette.</cause>
      </potential_causes>
      <potential_remedies>
        <remedy>Check power cord.</remedy>
        <remedy>Rewind tape.</remedy>
        <remedy>Tighten tape with a pencil, etc.</remedy>
        <remedy>Load cassette properly.</remedy>
        <remedy>Replace cassette.</remedy>
      </potential_remedies>
    </case>
    <case>
      <problem>Tape is not recorded when recording button is pressed.</problem>
      <potential_causes>
        <cause>No cassette is loaded.</cause>
        <cause>Erase prevention tab is broken off.</cause>
      </potential_causes>
      <potential_remedies>
        <remedy>Load cassette.</remedy>
        <remedy>Cover hole with plastic tape.</remedy>
      </potential_remedies>
    </case>
  </p-c-r>
</r>
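Since the text serves as the link, the prerequisite that both files annotate exactly the same text can be checked mechanically. The following Python sketch strips the markup from two small fragments modeled on the examples above and compares the remaining text. The helper plain_text is hypothetical, and the decision to collapse runs of whitespace before comparing is a simplifying assumption that is not appropriate for every kind of data.

```python
import re
import xml.etree.ElementTree as ET


def plain_text(xml_source: str) -> str:
    """Strip all markup and collapse runs of whitespace to single spaces."""
    root = ET.fromstring(xml_source)
    return re.sub(r"\s+", " ", "".join(root.itertext())).strip()


layout = (
    "<td><ul><li>No cassette is loaded.</li> "
    "<li>Erase prevention tab is broken off.</li></ul></td>"
)
thematic = (
    "<potential_causes><cause>No cassette is loaded.</cause> "
    "<cause>Erase prevention tab is broken off.</cause></potential_causes>"
)

# Both layers must yield the same text, otherwise they cannot be interrelated.
print(plain_text(layout) == plain_text(thematic))
```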
The multiply annotated XML documents are the basis of the representations. For further processing of the text it is necessary to represent them uniformly. Two alternative representations are described in the next subsections.
[Sperberg-McQueen et al. 2001] discuss the meaning and interpretation of markup. To explain their approach, they represent annotated documents in the programming language Prolog. In their representation, every element, every attribute, and the content is stored as a so-called Prolog fact. This approach has been extended so that multiple annotations, as described in the previous section, can be represented. In this way all separate annotations can be combined in one fact base, which can then be used, e.g., for the automatic detection of relations between the annotation levels (see next section).
In the simplest setting for any element, attribute and text node of each annotation level a Prolog fact is built which contains the following information:
Some Prolog facts containing information from the two levels of the examples should serve as an illustration.
node('tape-xhtml.xml', 729, 786, [1, 5, 3, 2], element('td')).
node('tape-xhtml.xml', 729, 786, [1, 5, 3, 2, 1], element('ul')).
node('tape-xhtml.xml', 729, 751, [1, 5, 3, 2, 1, 1], element('li')).
node('tape-thema.xml', 729, 786, [1, 5, 3, 2], element('potential_causes')).
node('tape-thema.xml', 729, 751, [1, 5, 3, 2, 1], element('cause')).
Attributes are represented in a similar way, using the Prolog predicate attr:
attr('tape-xhtml.xml', 729, 786, [1, 5, 3, 2], 'valign', 'top').
pcdata_node(729, 730, 'N').
pcdata_node(730, 731, 'o').
pcdata_node(731, 732, ' ').
pcdata_node(732, 733, 'c').
pcdata_node(733, 734, 'a').
pcdata_node(734, 735, 's').
pcdata_node(735, 736, 's').
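How facts of this shape can be derived from an annotation layer is sketched below in Python. This is not the project's conversion program but a minimal illustration written for this purpose. It assumes that offsets count the characters of the text as delivered by the parser (without whitespace normalization) and that the position list counts element children only, starting at 1, as the facts above suggest. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def quote(value: str) -> str:
    """Write a string as a single-quoted Prolog atom."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'").replace("\n", "\\n")
    return f"'{escaped}'"


def text_of(element) -> str:
    """All character data inside an element, markup removed."""
    return "".join(element.itertext())


def xml_to_facts(layer: str, xml_source: str) -> list:
    """Return node, attr and pcdata_node facts for one annotation layer."""
    root = ET.fromstring(xml_source)
    facts = []

    def visit(element, path, start):
        end = start + len(text_of(element))
        # A Python list such as [1, 5, 3] prints exactly like a Prolog list.
        head = f"{quote(layer)}, {start}, {end}, {path}"
        facts.append(f"node({head}, element({quote(element.tag)})).")
        for name, value in element.attrib.items():
            facts.append(f"attr({head}, {quote(name)}, {quote(value)}).")
        cursor = start + len(element.text or "")
        # Assumption: the path counts element children only, starting at 1.
        for number, child in enumerate(element, start=1):
            visit(child, path + [number], cursor)
            cursor += len(text_of(child)) + len(child.tail or "")

    visit(root, [1], 0)
    # The text itself is stored once, character by character.
    for offset, char in enumerate(text_of(root)):
        facts.append(f"pcdata_node({offset}, {offset + 1}, {quote(char)}).")
    return facts


source = '<td valign="top"><ul><li>No.</li><li>Yes.</li></ul></td>'
for fact in xml_to_facts("demo.xml", source):
    print(fact)
```

Running the function on a second annotation of the same text with a different layer name yields node and attr facts that share the same offsets, which is what makes the later comparison of layers possible.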
Multiply annotated XML files can also be represented in an XML-based format. Such a representation could be achieved by transforming the Prolog facts into XML elements, e.g. the predicate node with its five arguments could be transformed into an empty XML element node with five attributes. However, such a Prolog-in-XML representation would not make too much sense.
A representation using the technique of virtual joins, or stand-off annotation, is more interesting, because this technique is used to represent multiple hierarchies. Moreover, most of the above-mentioned disadvantages of this technique do not arise when this format is only an add-on to the multiple annotation of XML layers.
The European language technology project NITE developed a format for representing heavily annotated data. This format is well suited for this task.
The NITE format [Carletta et al. 2003] is a collection of several interrelated files forming a corpus. One way to represent the two annotation layers tape-xhtml.xml and tape-thema.xml is given in the next examples. The NITE corpus consists of four separate files; in the example these could be:
<char nite:id="char_727">e</char>
<char nite:id="char_728">d</char>
<char nite:id="char_729">.</char>
<char nite:id="char_730">N</char>
<char nite:id="char_731">o</char>
<char nite:id="char_732"> </char>
<char nite:id="char_733">c</char>
<char nite:id="char_734">a</char>
<char nite:id="char_735">s</char>
<char nite:id="char_736">s</char>
The next example shows how the elements of the thematic annotation are linked to the text.
    <nite:child href="o1.stream.xml#id('char_727')" />
    <nite:child href="o1.stream.xml#id('char_728')" />
    <nite:child href="o1.stream.xml#id('char_729')" />
  </problem>
  <potential_causes nite:id="potential_causes_2" >
    <cause nite:id="cause_6" >
      <nite:child href="o1.stream.xml#id('char_730')" />
      <nite:child href="o1.stream.xml#id('char_731')" />
      <nite:child href="o1.stream.xml#id('char_732')" />
      <nite:child href="o1.stream.xml#id('char_733')" />
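The two excerpts suggest how the character stream and the links into it correspond to the character offsets used in the Prolog facts. The following Python sketch merely reproduces lines of the shape shown above; it is neither the conversion program of the project nor a complete rendering of the NITE format. The numbering convention (the identifier of a character is its start offset plus one) is inferred from the excerpts, and the function names are hypothetical.

```python
from xml.sax.saxutils import escape


def char_stream(text, first_offset=0):
    """One <char> element per character of the primary text."""
    return [
        f'<char nite:id="char_{first_offset + i + 1}">{escape(ch)}</char>'
        for i, ch in enumerate(text)
    ]


def child_links(start, end, stream_file="o1.stream.xml"):
    """Links from an annotation element to the characters in [start, end)."""
    return [
        f'<nite:child href="{stream_file}#id(\'char_{n}\')" />'
        for n in range(start + 1, end + 1)
    ]


# The element 'cause' covering offsets 729 to 751 starts with the letter 'N'.
for line in char_stream("No c", first_offset=729):
    print(line)
for line in child_links(729, 733):
    print(line)
```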
The conversion from XML to Prolog is implemented in Python. The program xml2prolog.py receives as an input one or more XML documents and outputs a collection of Prolog facts.4
The element <Root> is represented as the fact:
node(AnnotationLayer, 0, n, [1], element(Root)).
attr(AnnotationLayer, 0, n, [1], 'att1', 'val1').
attr(AnnotationLayer, 0, n, [1], 'att2', 'val2').
Some options for the transformation process are:
For the conversion of text which is annotated in multiple forms to the NITE format, another program has been developed.5 This program is called nexus.pl and is implemented in the Perl programming language. Its functionality is similar to that of xml2prolog.py. The input is n annotations of the same text. The program outputs a NITE corpus that consists of the n + 2 files described above.
It has been shown that the technique of annotating the same text in multiple forms has many advantages and that its main drawback can be avoided. For this, it is necessary to annotate exactly the same data several times. With this prerequisite the multiply annotated files can be regarded as a heavily interrelated unit, because the text serves as the implicit link.
After that, two different formats have been described. One format is an interrelated Prolog representation of the information contained in the multiple files. The other format is based on XML and was developed for the processing and the exchange of linguistic corpora annotated on several levels of description.
Furthermore, programs for the automatic transformation of multiply annotated text to the integrated formats have been introduced.
In this section, techniques and software implementations for editing separately annotated texts, for inferring relations between them, and for unifying them are presented.
Editing several copies of a text, each annotated separately, is certainly not an easy task. One way to do this is to annotate each file with the help of a standard XML editor. Since, at least in some scenarios, the text is given and must not be changed, this approach offers at least two advantages: standard XML-editing software is available, and the automatic comparison of the textual content (e.g. by the option 'compare' of the transformation program xml2prolog described above) allows quality assurance, since it is highly unlikely that exactly the same change of the textual data occurred twice (or even more often) in different files. Unfortunately, this approach also has several drawbacks. One of these is connected with the comparison of whitespace. Since whitespace sometimes matters, it makes no sense to collapse all whitespace. On the other hand, in most cases differences in whitespace should be ignored. Therefore a special whitespace normalization program has been implemented.6 The main practical problem occurs, however, if the textual data has to be changed: the textual content must then be changed several times, once in each file. This task requires special editing software.
At the time of writing this paper two master's thesis projects are concerned with implementing special editing software for this task.
One editor is web-based (implemented in PHP) and allows for typing and changing the textual content of multiply annotated files. The two screenshots give an impression of this program. The first figure shows how text can be included. As can be seen, the markup cannot be changed in this mode.
In a second master's thesis an editor is being implemented in the Java programming language, using the Eclipse platform. The aim of this project is an editor capable of associating several document grammars with one text. The insertion of elements is a two-step process: first, the annotator selects the document grammar the element should belong to and, second, (s)he chooses an element from a list of the elements that are allowed at this point according to the schema. When the document is saved, one file is saved for each associated schema, and each of these files is validated.
The markup within a single document is hierarchically structured. The structure, leaving aside cross-references, can be represented as a tree. Certain relations between the nodes of these trees exist, e.g. subordination, (direct) neighborhood, etc. These relations can be used to query structural characteristics within one layer. Such queries can be formulated in several ways, e.g. with [XSL Transformations], in query languages such as [XQuery, 2003], or (when using the appropriate library) in Prolog (cf. [Sperberg-McQueen et al. 2002]).
When regarding more than one annotated layer more relations can be found. The figure above depicts the two layers of the example annotation. This visualization shows some of these relations.
An aligned representation of both layers shows that an identical range in the primary data is marked up with different elements.
...<potential_causes><cause>No cassette is loaded.</cause>...
...<td valign="top"><ul><li>No cassette is loaded.</li>...
Start-tag identity
<a>..................................</a>
<b>............</b>

Full inclusion
<a>..................................</a>
          <b>.........</b>

Total identity
<a>..................................</a>
<b>..................................</b>

End-point identity
            <a>......................</a>
<b>..................................</b>

Ranges annotated by different elements overlap
<a>....................</a>
            <b>..............................</b>

The end-position of one element is shared by the start-tag of another element
<a>.................</a>
                        <b>................</b>

etc.
Within our project, the Prolog fact base is used as the basis for inferring these relations. For this purpose, special Prolog predicates have been implemented.7
Alternatively, the NITE XML search tools 8 could be used for the representation conforming to the NITE representation.
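The relations between two element instances can be decided from the start and end offsets alone. The following Python sketch illustrates this. It does not reproduce the Prolog predicates of the project; the function name relation and the labels 'shared boundary' and 'disjoint' are chosen for this illustration only, and ranges are assumed to be non-empty.

```python
def relation(a, b):
    """Classify how range a = (start, end) relates to range b = (start, end)."""
    (a_start, a_end), (b_start, b_end) = a, b
    if (a_start, a_end) == (b_start, b_end):
        return "total identity"
    if a_end <= b_start or b_end <= a_start:
        # The ranges share no text; they may still touch at one boundary.
        touching = a_end == b_start or b_end == a_start
        return "shared boundary" if touching else "disjoint"
    if a_start == b_start:
        return "start-tag identity"
    if a_end == b_end:
        return "end-point identity"
    a_contains_b = a_start < b_start and b_end < a_end
    b_contains_a = b_start < a_start and a_end < b_end
    if a_contains_b or b_contains_a:
        return "full inclusion"
    return "overlap"


# Offsets taken from the facts for 'li' (XHTML layer) and 'cause' (thematic layer).
print(relation((729, 751), (729, 751)))
# Offsets taken from the facts for 'td' and 'li'.
print(relation((729, 786), (729, 751)))
```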
General information on the relations between element classes, i.e. the sets of all instances of an element, in the annotation layers is more interesting than a comparison of the relations between single element instances. For this purpose, certain meta relations have been defined. A meta relation holds under certain conditions.
The meta relation identity between the element classes a and b holds, if for every occurrence of an element instance a the same range of text is annotated by an element instance b and vice versa.
Meta-relation identity:
<a>....................</a>
<b>....................</b>
The meta relation inclusion between the element classes a and b holds if, for every occurrence of an element instance a, the annotated range of text lies within a range annotated by an element instance b, and if the meta-relation identity does not hold, i.e. for all occurrences one of the following configurations can be found.
<a>..................</a>
<b>................................</b>

<a>....................</a>
<b>............................................</b>

<a>....................</a>
<b>.......................................</b>

<a>....................</a>
<b>....................</b>
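Under the reading given above, the two meta-relations can be checked over the sets of ranges annotated by two element classes. The following Python sketch is one possible formulation and makes no claim to mirror the project's implementation. The function names are hypothetical, and the example ranges are modeled on the offsets used in the Prolog facts.

```python
def meta_identity(a_ranges, b_ranges):
    """Every range annotated by a is also annotated by b, and vice versa."""
    return set(a_ranges) == set(b_ranges)


def meta_inclusion(a_ranges, b_ranges):
    """Every range annotated by a lies within some range annotated by b,
    and the two classes are not identical."""
    def contained(a):
        return any(b[0] <= a[0] and a[1] <= b[1] for b in b_ranges)

    return (
        bool(a_ranges)
        and all(contained(a) for a in a_ranges)
        and not meta_identity(a_ranges, b_ranges)
    )


li = [(729, 751), (751, 786)]     # layout layer
cause = [(729, 751), (751, 786)]  # thematic layer
td = [(729, 786)]                 # layout layer

print(meta_identity(li, cause))   # the classes annotate the same ranges
print(meta_inclusion(li, td))     # every li lies within a td
print(meta_inclusion(td, li))     # but not the other way round
```

A single counterexample in either list is enough to make a meta-relation fail, which is the strictness discussed in the following paragraphs.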
The inferred meta-relations indicate whether theoretical constructs modeled by (certain elements of) two document grammars are in some relation to each other. So it might be investigated whether certain constructs used by different linguistic theories (e.g. in traditional Japanese grammar and in 'modern' phrase structure grammars) are alphabetical variants of each other. Moreover, with these meta-relations, generalizations stated by researchers or inferred automatically on a small empirical basis can be falsified.
Unfortunately, however, the research conducted by projects of the DFG research group mentioned above showed that these meta-relations do not hold very often. The reason for this lies in the way they are defined: a meta relation between two elements holds only if certain conditions hold for all occurrences of these elements. It could be interesting to investigate whether certain meta relations hold under certain conditions.
One possibility for a refinement of the meta relations is a description of specific contexts where these relations do hold. Context specifications allow for expressing such a condition.
A context specification could be expressed by a set of XPath expressions, but XPath seems to be too powerful a language for context specifications. Therefore, an alternative format for expressing the structural properties, called "Context Specification Document" (CSD), has been developed [Sasaki and Pönninghaus (2003)].
Of course, sometimes an integrated XML representation is necessary. Therefore a unification of multiply annotated documents has been developed.9 With this Prolog program two document layers can be merged. The architecture of this program is visualized in the next figure.
Figure: Prolog Implementation of the Unification
The predicate semt receives four arguments:
If the unification results in a layer whose elements would not be properly nested, a second result layer (a difference list) is created. The result database is converted back to XML, again using a Python program.
If no difference list exists, the result of the merging of two layers can be linearised as an XML document straightforwardly. In case the result fact base contains a difference list, two different linearizations can be generated. The default processing uses milestone elements to mark the borders of incompatible elements. Alternatively, the technique of fragmentation of elements can be invoked.
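The milestone technique can be illustrated with a small Python sketch. It is not the project's unification program: it simply serializes one properly nested layer as ordinary elements and marks the borders of the ranges of a second layer with empty elements. The element name milestone and its attributes unit and type are chosen for this illustration, ranges are given as (start, end, name) tuples over character offsets, and all ranges are assumed to be non-empty.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def linearize_with_milestones(text, primary, secondary):
    """Serialize `primary` ranges as elements and `secondary` ranges as milestones.

    `primary` must be properly nested and listed in document order
    (outer elements before the elements they contain).
    """
    opens, closes, marks = {}, {}, {}
    for index, (start, end, name) in enumerate(primary):
        opens.setdefault(start, []).append((-end, index, name))
        closes.setdefault(end, []).append((-start, -index, name))
    for start, end, name in secondary:
        marks.setdefault(end, []).append((0, name, "end"))
        marks.setdefault(start, []).append((1, name, "start"))

    out = []
    for pos in range(len(text) + 1):
        # Close inner elements first, then emit milestones, then open outer first.
        out.extend(f"</{name}>" for _, _, name in sorted(closes.get(pos, [])))
        out.extend(
            f'<milestone unit="{name}" type="{kind}"/>'
            for _, name, kind in sorted(marks.get(pos, []))
        )
        out.extend(f"<{name}>" for _, _, name in sorted(opens.get(pos, [])))
        if pos < len(text):
            out.append(escape(text[pos]))
    return "".join(out)


# Element c (offsets 2 to 5) overlaps element b (offsets 0 to 3).
result = linearize_with_milestones(
    "abcdef",
    primary=[(0, 6, "a"), (0, 3, "b")],
    secondary=[(2, 5, "c")],
)
print(result)
ET.fromstring(result)  # raises ParseError if the output were not well-formed
```

Fragmentation, the alternative mentioned above, would instead split the element c into two elements at the end tag of b.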
In this paper it was argued that the problem of representing and processing multiply structured data should be subdivided into two separate problems. First, it is necessary to declare and/or apply to these data elements and attributes that are defined by different document grammars or belong to different tag sets, and it is desirable to be able to distinguish these elements according to their origins. Second, the elements of these tag sets may mark overlapping regions, which results in structures that are difficult to handle with SGML-based markup languages. Several proposed solutions for both problems have been discussed. It was argued that the simplest solution, i.e. the annotation of these multiple structures or hierarchies in multiple files, can be a way to overcome both problems and that this approach offers many benefits. However, it is necessary to ensure that the multiple files can be represented as a single unit. To achieve this, users of this approach have to accept some preconditions.
One of the reviewers of this paper pointed out that better evidence for this view can be found in the ODA [Open Document Architecture] specification, which is no longer generally available because ISO discards all electronic copies of standards when they expire. This specification states under '7.1.1 General principles': "The specific layout and specific logical structures of a document are hierarchical structures of objects."
Since we use the term annotation level to refer to an abstract level of analysis (such as the level of morphology in a linguistic grammar), we introduce the term annotation layer to refer to the actual realization of the annotation in e.g. XML.([Bayerl et al. 1999] p.163)
One of the reviewers of this paper noted: "Isn't rarely enough! Not sure what 'extremely rarely' would mean." Well, to my knowledge only one SGML parser has been implemented which accepts an SGML declaration containing the line CONCUR YES. This is not really surprising, since even the father of the SGML standard [SGML] discourages the use of this feature: "I therefore recommend that CONCUR not be used to create multiple logical views of a document, such as verse-oriented and speech-oriented views of poetry." ([Goldfarb (1990)], p. 304)
This program is mainly written and maintained by Daniel Naber and Oliver Schonefeld. It is available via the project Web pages (http://www.text-technology.de; 'Projekt Sekimo').
This program has been developed by Jan Frederik Maas. It is also available via the project Web pages.
This program is written and maintained Oliver Schonefeld. It is available via the project Web pages (http://www.text-technology.de; 'Projekt Sekimo').
This program was mainly written by Daniela Goecke. It is available via the project Web pages.
NXT Search is freely available (binaries, documentation, and source code) via http://www.ims.uni-stuttgart.de/projekte/nite/download.shtml
This program was mainly written by Daniela Goecke and is maintained by Harald Lüngen. It is called semt.pl, is also available via the project Web pages, and is described in [Witt et al. 2004].
Different aspects of this approach are used within several projects of the 'Research Group: Text-technological Modeling of Information', which is funded by the German Research Foundation (DFG).
I would like to thank Harald Lüngen and Neill Kipp for their help and all the reviewers of this paper for their helpful comments.
[ACH/ACL/ALLC 1994] Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
[Barnard et al. (1995)] Barnard, David; Burnard, Lou; Gaspart, Jean-Pierre; Price, Lynne A.; Sperberg-McQueen, C. M.; Varile, Giovanni Battista. "Hierarchical Encoding of Text: Technical Problems and SGML Solutions." The Text Encoding Initiative: Background and Contents, Guest Editors Nancy Ide and Jean Véronis = Computers and the Humanities 29/3 (1995) 211-231.
[Bayerl et al. 1999] Bayerl, Petra Saskia, Harald Lüngen, Daniela Goecke, Andreas Witt, and Daniel Naber. Methods for the semantic analysis of document markup. In: Roisin, C., E. Munson and C. Vanoirbeek (Eds.): Proceedings of the ACM Symposium on Document Engineering (DocEng 2003). pp. 161-170
[Bray et al. 1999] Bray, Tim, Dave Hollander and Andrew Layman (ed. 1999). Namespaces in XML. W3C Recommendation, World Wide Web Consortium.
[Carletta et al. 2003] Carletta, Jean, Jonathan Kilgour, Tim O'Donnell, Stefan Evert, and Holger Voormann, The NITE Object Model Library for Handling Structured Linguistic Annotation on Multimodal Data Sets. Proceedings of the EACL Workshop on Language Technology and the Semantic Web (3rd Workshop on NLP and XML, NLPXML-2003).
[DeRose et al. (1990)] DeRose, Steve, David Durand, Elli Mylonas, and Allen Renear. 'What is Text, Really?' Journal of Computing in Higher Education 1.2, 1990.
[Durand (1999)] Durand, David G., 1999. Palimpsest: Change-Oriented Concurrency Control for the Support of Collaborative Applications. Dissertation, Boston University.
[Durusau & O'Donnell (2002)] Durusau, Patrick and Matthew Brook O'Donnell. Concurrent Markup for XML Documents. XML Europe 2002.
[Goldfarb (1990)] Goldfarb, Charles F. (1990). The SGML handbook. Oxford: Clarendon Press.
[Haugen (2004)] Haugen, Odd Einar. Parallel Views: Multi-level Encoding of Medieval Nordic Primary Sources. In: Literary and Linguistic Computing (19.1). pp. 73-91
[Huitfeldt and Sperberg-McQueen, 2001] Huitfeldt, Claus and C. M. Sperberg-McQueen. (2001) TexMECS: An experimental markup meta-language for complex documents. (http://www.hit.uib.no/claus/mlcd/papers/texmecs.html.)
[Pianta and Bentivogli (2004)] Pianta, Emanuele and Luisa Bentivogli (2004). Annotating Discontinuous Structures in XML: the Multiword Case. In: [Witt et al. (Ed.)]. pp. 30-37
[Renear et al. 1996] Renear, Allen, Elli Mylonas, and David Durand (1996). 'Refining Our Notion of What Text Really Is: The Problem of Overlapping Hierarchies' In: International Association for Literary and Linguistic Computing: Selected papers from the ALLC, ACH Conference: Christ Church, Oxford, April 1992. Oxford: Clarendon Press 1996.
[Sasaki and Pönninghaus (2003)] Sasaki, Felix and Jens Pönninghaus. Testing structural properties in textual data: beyond document grammars. In: Literary and Linguistic Computing. (18.1)
[Sasaki et al. (2003)] Sasaki, Felix, Andreas Witt, and Dieter Metzing. Declarations of Relations, Differences and Transformations between Theory-specific Treebanks: A New Methodology. In: Nivre, Joakim (Ed.): Proceedings of the Second Workshop on Treebanks and Linguistic Theories, Växjö, pp. 141-152
[SGML] ISO 8879:1986. Information processing — Text and office systems — Standard Generalized Markup Language (SGML).
[Sperberg-McQueen and Huitfeld 1999] Sperberg-McQueen, C. M. and Huitfeldt, Claus (1999). Concurrent Document Hierarchies in MECS and SGML. In: Literary and Linguistic Computing (14.1). pp. 29-42
[Sperberg-McQueen et al. 2001] Sperberg-McQueen, C. M., Claus Huitfeldt, and Allen Renear. 2001. “Meaning and interpretation of markup.” Markup Languages: Theory & Practice 2.3 (2001): 215-234. http://www.w3.org/People/cmsmcq/2000/mim.html
[Sperberg-McQueen et al. 2002] Sperberg-McQueen, C. M., David Dubin, Claus Huitfeldt, and Allen Renear. Drawing inferences on the basis of markup. In Proceedings of Extreme Markup Languages 2002 (Montreal, Canada, August 2002), B. T. Usdin and S. R. Newcomb, eds.
[Tennison and Piez 2002] Tennison, Jeni and Wendell Piez (2002). The Layered Markup and Annotation Language. Extreme Markup 2002.
[Thompson and McKelvie (1997)] Thompson, Henry S. and David McKelvie. Hyperlink semantics for standoff markup of read-only documents. In: Proceedings of SGML Europe '97: The Next Decade — Pushing the Envelope (Barcelona, Spain, May 1997).
[Witt 2002] Witt, Andreas. Meaning and interpretation of concurrent markup. In: ALLCACH2002, Joint Conference of the ALLC and ACH, Tübingen, 2002.
[Witt et al. (Ed.)] Witt, Andreas, Ulrich Heid, Henry S. Thompson, Jean Carletta, and Peter Wittenburg. Proceedings of the LREC Satellite Workshop on XML-based Richly Annotated Corpora. Lisbon 2004.
[Witt et al. 2004] Witt, Andreas, Harald Lüngen, Felix Sasaki, Daniela Goecke. Unification of XML Documents with Concurrent Markup. In: ALLCACH2004, Joint Conference of the ALLC and ACH, Göteborg, 2004.
[XQuery, 2003] Boag, Scott, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, Jérôme Siméon (Ed.), XQuery 1.0: An XML Query Language, W3C Working Draft 12 November 2003.
[XSL Transformations] Clark, James (Ed.), XSL Transformations (XSLT) Version 1.0, W3C Recommendation 16 November 1999