Secondary Information Structuring - A Methodology for the Vertical Interrelation of Information Resources

Felix Sasaki
felix.sasaki@uni-bielefeld.de

Abstract

Secondary Information Structuring is the key part of an architecture, which is used for the knowledge-based, vertical interrelation of information resources: Primary Information Structuring (document grammars, marked-up instance documents) and abstract, conceptual resources (conceptual models, ontologies). Secondary Information Structuring encompasses a small set of predicates to select and interrelate declarations in document grammars and information items in - possibly multiple - instance documents. The selections are integrated in a conceptual hierarchy, which then can be used to conceptually validate and query document grammars and instance documents, to transform instance documents, and to relate the selected information from Primary Information Structuring to other conceptual resources. An integrated implementation of the approach, using information from document grammars and multiple instance documents, is currently under development. Two applications from the field of linguistics and multilingual documentation of document grammars have been developed so far.

Keywords: RDF; Semantics; Semantic Web

Felix Sasaki

Felix Sasaki is working in a research project which is concerned with markup languages and their application to the modeling of (linguistic) textual data. His main interest lies in the relation between different models, i.e. models for markup and for knowledge representation formats, e.g. formats developed in the Semantic Web initiative.


Felix Sasaki [University of Bielefeld]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004 Felix Sasaki. Reproduced with permission.

Introduction

This paper1 describes an architecture, its operationalization and an ongoing implementation for the vertical interrelation of information resources. Two kinds of information resources are used (see Figure 1): (1) Primary Information Structuring (in the lower part of the Figure) encompasses document grammars like the TEI [Text Encoding Initiative] ([Sperberg-McQueen et al. 1994]) or DocBook ([Walsh and Muellner 1999]) and the respective XML [eXtensible Markup Language] ([Bray et al. 2004]) instance documents. (2) Resources of an abstract, conceptual level (in the upper part of the Figure) are for example language-specific, lexical resources like WordNet ([Fellbaum 1998]) or general, ontological models like SUMO [Standard Upper Merged Ontology] ([Niles and Pease 2001]). Most of the conceptual resources can be represented with RDF [Resource Description Framework] ([Klyne and Carroll 2004], [Manola and Miller 2004]), RDF Schema ([Brickley et al. 2004]) or related formats.

Secondary Information Structuring2 (see the center of Figure 1) is a methodology to interrelate parts of Primary Information Structuring to each other, e.g. parts of the TEI and the DocBook document grammars (and the respective instance documents), and - optionally - Primary Information Structuring to the conceptual level, e.g. the TEI and DocBook to WordNet.

Figure 1

Outline of the architecture: Primary Information Structuring, Secondary Information Structuring and conceptual level.

Secondary Information Structuring allows the user to interrelate the information resources via a small set of predefined predicates within logical statements. Prototypical statements are shown in Figure 2.

Figure 2

sekStruk:Paragraph sekStruk2primStruk "element p { attribute type {'gloss'} ... }".
sekStruk:Paragraph-first-child-in-div sekStruk2primStruk "pathEx:'p' isFirst up 'div'".
sekStruk:Paragraph-first-child-in-div subConceptOf sekStruk:Paragraph.
sekStruk:Paragraph-tei equal sekStruk:Paragraph-docbook.
sekStruk:Paragraph-tei sekStruk2conLevel wordnet:Paragraph.
sekStruk:Paragraph-docbook sekStruk2conLevel wordnet:Paragraph.

Examples of statements within Secondary Information Structuring

The statements mainly fulfill five tasks:

  • via the predicate sekStruk2primStruk selecting properties of constructs from document grammars for a concept3 in Secondary Information Structuring. The example selects the declaration4 of a p element with a type attribute which has the value gloss, for a concept sekStruk:Paragraph. The statement is (sekStruk:Paragraph sekStruk2primStruk "element p { attribute type {'gloss'} ... }".).
  • selecting information items ([Cowan and Tobin 2004]) from instance documents via their structural properties. Such statements again make use of the predicate sekStruk2primStruk, which then takes as one of its arguments a path expression, which is introduced by the prefix pathEx:5. An example is the statement (sekStruk:Paragraph-first-child-in-div sekStruk2primStruk "pathEx:'p' isFirst up 'div'".). It selects a p element node which is the first child of a div element.
  • via the predicate subConceptOf creating a conceptual hierarchy of the concepts, by defining general and specific concepts. An example is the statement (sekStruk:Paragraph-first-child-in-div subConceptOf sekStruk:Paragraph.). Via this statement, the concept sekStruk:Paragraph-first-child-in-div which selects first-child p elements with a parent div element becomes subordinate to the concept sekStruk:Paragraph which selects p elements in general.
  • via the predicate equal (optionally) relating concepts of different document grammars to each other. For example the statement (sekStruk:Paragraph-tei equal sekStruk:Paragraph-docbook.) relates the concept sekStruk:Paragraph-tei which selects the TEI-specific p element to the concept sekStruk:Paragraph-docbook which selects the para element from DocBook. The selections of p and para have to be expressed within separate statements, i.e. with the predicate sekStruk2primStruk.
  • (optionally) via the predicate sekStruk2conLevel relating the concepts from Secondary Information Structuring to concepts or interconceptual relations6 from the conceptual level. For example, the statement (sekStruk:Paragraph-docbook sekStruk2conLevel wordnet:Paragraph.) relates the declarations selected from the DocBook document grammar to the WordNet-specific concept wordnet:Paragraph; the corresponding statement for sekStruk:Paragraph-tei in Figure 2 does the same for the TEI.
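
The five tasks can be made concrete with a minimal sketch which holds the statements of Figure 2 as subject-predicate-object triples. The Python representation and the helper names objects and selection_kind are assumptions made for illustration; they are not part of the representation syntax of Secondary Information Structuring.

```python
# Hypothetical in-memory form of the statements in Figure 2:
# each statement is a (subject, predicate, object) triple.
STATEMENTS = [
    ("sekStruk:Paragraph", "sekStruk2primStruk",
     "element p { attribute type {'gloss'} ... }"),
    ("sekStruk:Paragraph-first-child-in-div", "sekStruk2primStruk",
     "pathEx:'p' isFirst up 'div'"),
    ("sekStruk:Paragraph-first-child-in-div", "subConceptOf",
     "sekStruk:Paragraph"),
    ("sekStruk:Paragraph-tei", "equal", "sekStruk:Paragraph-docbook"),
    ("sekStruk:Paragraph-tei", "sekStruk2conLevel", "wordnet:Paragraph"),
    ("sekStruk:Paragraph-docbook", "sekStruk2conLevel", "wordnet:Paragraph"),
]


def objects(subject, predicate, statements=STATEMENTS):
    """Return all objects stated for a subject and a predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


def selection_kind(selection):
    """Classify an argument of sekStruk2primStruk by its prefix."""
    if selection.startswith("pathEx:"):
        return "path expression"
    return "grammar pattern"


for concept in ("sekStruk:Paragraph", "sekStruk:Paragraph-first-child-in-div"):
    for selection in objects(concept, "sekStruk2primStruk"):
        print(concept, "->", selection_kind(selection))
```

Keeping the statements as plain triples mirrors the fact that most conceptual resources mentioned above can be represented with RDF; any triple store could take the place of the list.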

On the basis of these statements, the architecture described in this paper allows for various automated operations on the interrelated resources. The operations are also visualized in Figure 1. First, document grammars can be conceptually validated: it can be checked whether they have the properties which are stated in Secondary Information Structuring, e.g. whether there is a declaration of a type attribute with the value gloss which is attached to the declaration of a p element. This is what is stated for the concept sekStruk:Paragraph in Figure 2. Second, instance documents can be validated as well. For example it can be stated in Secondary Information Structuring that all p elements must be instances of the concept sekStruk:Paragraph-tei. If a p element is not an instance of this concept, an error occurs. Third, instances of the concepts can be queried from instance documents. Finally, if concepts which select constructs from one document grammar have been related via the predicate equal to concepts which select constructs from another document grammar, annotations in the respective instance documents can be transformed. For example, a document which is based upon the TEI document grammar can be transformed into a document which is based upon DocBook.

What is the purpose of this architecture and of the operations on information resources? From a theoretical perspective, there are two research fields which can benefit from the approach of Secondary Information Structuring: markup semantics and semantic markup, a terminology which is taken from [Renear et al. 2002]. Markup semantics is concerned with the formal description of the meaning of document grammars and instance documents. The BECHAMEL project, initiated by [Sperberg-McQueen et al. 2000], is a prominent approach in this research area. Semantic markup is the addition of semantic information to markup, making use of abstract, conceptual resources which are represented in RDF or related formats. Such approaches are for example applied within the Semantic Web initiative.

Approaches to markup semantics often work bottom-up: Markup is given, and a (formal) description of its meaning is added. Approaches to semantic markup work top-down: From a given conceptual level, markup is generated, or conceptual information is integrated into documents. Secondary Information Structuring combines the approaches of markup semantics and semantic markup. It can be used to mediate between the conceptual level and the markup, i.e. the Primary Information Structuring. The mediation works bottom-up and top-down: The architecture proposed in this paper allows for the bi-directional interrelation of the conceptual level and Primary Information Structuring in a declarative way.

Another important aspect of Secondary Information Structuring is that it strongly emphasizes the role of document grammars; most approaches to semantic markup, in particular, concentrate only on instance documents. The main benefit of Secondary Information Structuring is that document grammars, marked-up documents and the conceptual level can be interrelated, without a need to change them. As proposed in the architecture described by [Melnik and Decker 2000], this ensures the interoperability between information resources: Existing conceptual resources can be combined with existing document grammars and instance documents.

From a practical point of view, Secondary Information Structuring can contribute to three tasks. First, the formal description of relations between large, general document grammars like TEI and DocBook is an important issue for the flexible integration of their vocabularies, see e.g. [Rahtz et al. 2004]. Second, describing the meaning of Primary Information Structuring separately from document grammars and instance documents reduces the need to create new document grammars for unforeseen structural constraints. If for example during the authoring process a certain kind of p element which is the first child of a div element becomes necessary, this constraint can be declared in and validated through Secondary Information Structuring, without changing the document grammar. For the discussion of such needs, see for example [Ramalho et al. 1999]. And third, from the perspective of the Semantic Web, the automatic annotation of semantic markup can be realized by Secondary Information Structuring. If the relation between a p element from TEI and the concept Paragraph from WordNet has been declared in Secondary Information Structuring with the respective statement, the annotation of p element nodes as instances of the WordNet Paragraph can be automatically integrated in instance documents.

The remainder of this paper is organized as follows. Section “Input to Secondary Information Structuring from Primary Information Structuring” describes the input from Primary Information Structuring into Secondary Information Structuring, i.e. what is selected in concepts: input from document grammars, from single instance documents and from multiple instance documents7. During this discussion, the application of the predicate sekStruk2primStruk which is responsible for the selection is explained in detail. Section “Operationalization of Secondary Information Structuring” describes the operationalization of Secondary Information Structuring, its formal prerequisites and operations in detail. Section “Summary of predefined constructs in Secondary Information Structuring and the syntax of their representation” summarizes the constructs for Secondary Information Structuring and introduces an RDF-based and an XML-based syntax for the representation of the statements. The ongoing implementation and application scenarios are discussed in section “Implementation and applications”, and related approaches are introduced in section “Related approaches”.

As an example from Primary Information Structuring, mainly the document grammars and instance documents in Figure 3 and Figure 4 will be used. They are fragments from XHTML 1.0 and the TEI P4 document grammar. Both fragments define lists. In XHTML, the list is specified as a definition list via the element name html:DL. In the TEI example, the specification as a definition list is realized in the instance document via a tei:list element with a type="gloss" attribute. The identity of meaning of these markup constructs can be specified via Secondary Information Structuring. In addition, they can be related to the concept from the WordNet lexical database which denotes a definition.

Figure 3
namespace html="http://example.com/defList-html"
start =
  element html:DL {
    element html:DT { text },
    element html:DD { text }
  }

namespace tei="http://example.com/defList-tei"
start =
  element tei:list {
    attribute type { xsd:Name },
    element tei:head { text },
    element tei:item { text }
  }

Example document grammars for Primary Information Structuring

Figure 4
<html:DL xmlns:html="http://example.com/defList-html">
<html:DT>V-TE</html:DT>
<html:DD>Label for the annotation of Japanese verbs in
the assimilation form</html:DD>
</html:DL>

<tei:list xmlns:tei="http://example.com/defList-tei"
       type="gloss">
<tei:head>V-TE</tei:head>
<tei:item>Label for the annotation of Japanese verbs in
the assimilation form</tei:item>
</tei:list>

Example instance documents for Primary Information Structuring
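
The correspondence between the two instance documents in Figure 4 can be sketched as a small transformation. The following is only an illustration under stated assumptions: the mapping table EQUAL stands in for the result of resolving statements with the predicate equal, the function name transform is hypothetical, and the pairing of tei:head with html:DT and of tei:item with html:DD is read off the figures rather than prescribed by the approach.

```python
import xml.etree.ElementTree as ET

TEI = "http://example.com/defList-tei"
HTML = "http://example.com/defList-html"

# Hypothetical outcome of resolving 'equal' statements between the
# TEI-specific and the XHTML-specific concepts (names in Clark notation).
EQUAL = {
    f"{{{TEI}}}list": f"{{{HTML}}}DL",
    f"{{{TEI}}}head": f"{{{HTML}}}DT",
    f"{{{TEI}}}item": f"{{{HTML}}}DD",
}


def transform(element):
    """Rebuild a TEI definition list as an XHTML definition list."""
    # Only a tei:list with type="gloss" has the meaning of html:DL.
    if element.tag == f"{{{TEI}}}list" and element.get("type") != "gloss":
        raise ValueError("tei:list is not marked as a definition list")
    target = ET.Element(EQUAL[element.tag])
    target.text = element.text
    target.tail = element.tail
    for child in element:
        target.append(transform(child))
    return target


tei_doc = ET.fromstring(
    f'<tei:list xmlns:tei="{TEI}" type="gloss">'
    "<tei:head>V-TE</tei:head>"
    "<tei:item>Label for the annotation of Japanese verbs in "
    "the assimilation form</tei:item>"
    "</tei:list>"
)
html_doc = transform(tei_doc)
print(ET.tostring(html_doc, encoding="unicode"))
```

The type attribute is consumed by the check and not copied, because in XHTML the element name html:DL alone carries the information that the list is a definition list.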

Input to Secondary Information Structuring from Primary Information Structuring

It is necessary to separate the selection mechanisms for document grammars, single instance documents and multiple instance documents. If only constructs of document grammars are selected, the selection holds for a class of documents. If information items from a single instance document are selected, the selection might fail for other instance documents. And if the selection relies on multiple instance documents, a path expression like 'p' isFirst up 'div' (see Figure 2) cannot be applied, since there is no tree structure to execute the path expressions between multiple instance documents. This leads to the notion of two kinds of Secondary Information Structuring: document grammar based Secondary Information Structuring is concerned with the selection of document grammar constructs, instance based Secondary Information Structuring is concerned with the selection of configurations of information items, found in single instance documents.

The question then arises which schema language should be used for the declaration of document grammar constructs. The approach of Secondary Information Structuring uses structural properties of markup as the central selection criterion for Primary Information Structuring. Hence, a powerful language to express structural properties like RELAX NG ([Clark and Murata 2001]), without for example any constraints on the ambiguity of content models, is the most feasible one.

Input from document grammars

The name of a concept in document grammar based Secondary Information Structuring is the name of a pattern in RELAX NG. Statements with the predicate sekStruk2primStruk select the declarations of elements, attributes etc. which are contained in the pattern. For example, the statement (sekStruk:Paragraph sekStruk2primStruk "element p { attribute type {'gloss'} ... }".) in Figure 2 denotes a concept and a pattern respectively, called sekStruk:Paragraph, which contains the declaration for the p element. In a RELAX NG document grammar, such a pattern can be written like

start = sekStruk:Paragraph
sekStruk:Paragraph = element p { attribute type {'gloss'} ...}

The concepts and patterns respectively are only created in Secondary Information Structuring, i.e. the document grammars are not changed. This allows for a specialization of declarations from the document grammar. If for example the datatype for the type attribute is { text } in the document grammar, it can be specialized to {'gloss'}8. This is what happens with the statement above.

In contrast to RELAX NG, Secondary Information Structuring also expresses the relations between the concepts (and patterns respectively) which select constructs from document grammars. This is done via a predicate componentOf. Imagine that a concept is created for each element declaration for HTML in Figure 3, i.e. a concept sekStruk:definitionList-html which selects the declaration of html:DL, a concept sekStruk:dt-html which selects the declaration of html:DT and a concept sekStruk:def-html which selects the declaration of html:DD. Then the concepts sekStruk:dt-html and sekStruk:def-html would be components of the concept sekStruk:definitionList-html. The statements which describe these relations are


sekStruk:dt-html componentOf sekStruk:definitionList-html.
sekStruk:def-html componentOf sekStruk:definitionList-html.

With the predicate componentOf it is possible to select declarations in document grammars in a context-sensitive way. For example the property of the html:DT element being inside of the content model of the html:DL element can be expressed via the statement above. If there is another declaration for html:DT in the document grammar which is not in the content model of html:DL, it would not be selected by the concept sekStruk:dt-html.
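
The effect of such a context-sensitive selection can be sketched with a toy model of a document grammar. The nested dictionaries, the function select and the additional html:DIV declaration (which is not part of Figure 3) are assumptions introduced only to show the difference between a selection by name and a selection by name and context.

```python
# Toy model of the XHTML fragment in Figure 3. The html:DIV declaration
# with a second html:DT is hypothetical and added for illustration only.
GRAMMAR = {
    "name": "start",
    "children": [
        {"name": "html:DL", "children": [
            {"name": "html:DT", "children": []},
            {"name": "html:DD", "children": []},
        ]},
        {"name": "html:DIV", "children": [
            {"name": "html:DT", "children": []},
        ]},
    ],
}


def select(decl, name, inside=None, _parent=None):
    """Yield (parent name, declaration) for every declaration of `name`.

    If `inside` is given, only declarations which occur directly in the
    content model of `inside` are kept.
    """
    if decl["name"] == name and (inside is None or _parent == inside):
        yield (_parent, decl)
    for child in decl["children"]:
        yield from select(child, name, inside, decl["name"])


all_dt = list(select(GRAMMAR, "html:DT"))
dt_in_dl = list(select(GRAMMAR, "html:DT", inside="html:DL"))
print(len(all_dt), len(dt_in_dl))
```

Only the second call corresponds to the concept sekStruk:dt-html together with its componentOf statement: the declaration of html:DT outside of html:DL is not selected.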

The componentOf predicate is motivated by and formally grounded on the XDD [XML declarative descriptions] ([Wuwongse et al. 2003]) framework. XDD represents document grammars (and instance documents) as sets of expressions. The expressions are derived from other expressions stepwise. The application of the derivation steps is exemplified in Figure 5.

Figure 5
A   start =                      A' start = 
     element html:DL {               element html:DL {
     element html:DT { text },       $sekStruk:dt-html,
     element html:DD { text }}       $sekStruk:def-html
A'' start = $sekStruk:definitionList-html
B  start =                             B' start=
    element tei:list {                  element tei:list
    attribute type { xsd:Name },        { attribute type
    element tei:head { text },          { "gloss" },
    element tei:item { text }}          $sekStruk:dt-tei, $sekStruk:def-tei }
B'' start = $sekStruk:definitionList-tei

Example of derivation steps in the XDD framework

In the first derivation step A, the whole document grammar for the definition list in XHTML is given. In the next step A', the element declarations for html:DT and html:DD are hidden in expressions, i.e. $sekStruk:dt-html and $sekStruk:def-html. The statements in Secondary Information Structuring which fulfil the same role as this derivation step have been described above, e.g. (sekStruk:dt-html sekStruk2primStruk "element html:DT { text }"). In step A'', the expression $sekStruk:definitionList-html is derived. It contains the expressions which have been defined before. In Secondary Information Structuring, the relation between the derivation steps A'' and A' is expressed via the componentOf predicate. Similar derivations are created for the document grammar which defines a definition list in accordance with the TEI.

Input from single instance documents: path expressions and logical statements

The expressive power of document grammars is not sufficient to describe many selections for concepts in Secondary Information Structuring. Consider a selection criterion that a DT element should be the second descendant node of a p element which is a first child element. Such complex selections are hard to formulate within document grammars and therefore rely on the selection of information items in instance documents. For this purpose, a path language to select the information items is necessary. The approach of Secondary Information Structuring uses a path language called caterpillar expressions. These are defined by [Brüggemann-Klein, A. and D. Wood 2000] and encompass a set of movements and tests in the tree-like structure of instance documents: up, left, right, first, last, isFirst, isLast, isLeaf, isRoot, and a test for the name of the current node. A sample caterpillar expression in the definition list in an XHTML instance document, using the DT element as the initial node of the expression, is visualized in Figure 6.

Figure 6

Example of a caterpillar expression
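
How the movements and tests listed above operate on an instance document can be sketched with a small interpreter. This is a simplified illustration and not the implementation of the approach: the function name run_caterpillar, the whitespace-separated token syntax and the decision to let the root node pass the tests isFirst and isLast are assumptions.

```python
import xml.etree.ElementTree as ET


def run_caterpillar(root, start, expression):
    """Evaluate a sequence of movements and tests, beginning at `start`.

    Returns the node which is reached, or None if a movement is not
    possible or a test fails.
    """
    # ElementTree has no parent pointers, so a parent map is built once.
    parent = {child: node for node in root.iter() for child in node}
    node = start
    for token in expression.split():
        up = parent.get(node)
        siblings = list(up) if up is not None else [node]
        index = siblings.index(node)
        if token.startswith("'"):          # test for the name, e.g. 'p'
            ok = node.tag == token.strip("'")
        elif token == "isFirst":
            ok = index == 0
        elif token == "isLast":
            ok = index == len(siblings) - 1
        elif token == "isLeaf":
            ok = len(node) == 0
        elif token == "isRoot":
            ok = up is None
        else:                              # movements
            if token == "up":
                target = up
            elif token == "left":
                target = siblings[index - 1] if index > 0 else None
            elif token == "right":
                target = siblings[index + 1] if index + 1 < len(siblings) else None
            elif token == "first":
                target = node[0] if len(node) else None
            elif token == "last":
                target = node[-1] if len(node) else None
            else:
                raise ValueError(f"unknown token: {token}")
            if target is None:
                return None
            node = target
            continue
        if not ok:
            return None
    return node


doc = ET.fromstring("<div><p>one</p><p>two</p></div>")
first_p, second_p = doc[0], doc[1]
expr = "'p' isFirst up 'div'"
print(run_caterpillar(doc, first_p, expr) is doc)      # first p is selected
print(run_caterpillar(doc, second_p, expr) is None)    # second p is not
```

The expression used here is the one from Figure 2; it succeeds only for a p element node which is the first child of a div element.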

There are two reasons for choosing this path language9. First, its expressive power can be described in terms of regular tree grammars, which are also the formal basis of RELAX NG (see [Murata et al. 2001]). This relation between the path expression and the schema language eases the task of creating instance based Secondary Information Structuring. With a more expressive language like XPath ([Clark and deRose 1999]), it would be more difficult to avoid path expressions which cannot hold for a given document grammar. And second, the test for the current node does not have to be a test for the element name. It can also be tested whether the node fulfills the selection criteria which are defined for another concept, which is declared in Secondary Information Structuring. Figure 7 gives an example of how these selection criteria are inferred.

Figure 7

Concepts, the conceptual hierarchy and interconceptual relations within path expressions.

The concept sekStruk:Paragraph-tei selects the declaration for the p element, via the predicate sekStruk2primStruk. Via the predicate subConceptOf, the concept sekStruk:Paragraph-in-div is subordinate to sekStruk:Paragraph-tei. That is, the initial node p for the caterpillar expression in sekStruk:Paragraph-in-div is inherited from sekStruk:Paragraph-tei. In addition, the caterpillar expression contains a test which refers to the concept sekStruk:div-tei. Since this concept selects the div element, it is tested whether the name of the current node is div.

Of course, the caterpillar expression up sekStruk:div-tei could also be written explicitly, e.g. 'p' up 'div', but this would have a different logical status. With the statements which are visualized in Figure 7, the inferential power of logical statements in Secondary Information Structuring is exploited. The inference of the initial node p of the caterpillar expression relies on the conceptual hierarchy, and the inference of the test for the current node div relies on interconceptual relations. That is, path expressions and logical statements can be combined. This allows the user to create selections of document grammar constructs for superordinate concepts and to operate only with path expressions and logical statements for subordinate concepts, without directly referring to names of elements, attributes, attribute-values etc. In this way it is possible to test the same structural properties for different document grammars, without changing the path expressions. Only the superordinate selections of document grammar constructs have to be modified, e.g. from p (TEI) to para (DocBook) and from div (TEI) to section (DocBook)10. However, a restriction for these inferences is that they must not lead to cyclic structures, e.g. a statement like (sekStruk:Concept-X sekStruk2primStruk "p up sekStruk:Concept-X".), in which a concept refers to itself, is not allowed.

Input from multiple instance documents: multiple annotations of the same primary data

Why would it be necessary to select information items from different instance documents, and not a single one? The reason is that there are cases, described for example by [Caton 2002], in which the creation of a single hierarchy in a single instance document is not adequate. An example is given in Figure 8.

Figure 8
<layout:paragraph
xmlns:layout="http://example.com/defList-layout">
<layout:line>V-TE</layout:line>
<layout:line>Label for the annotation of Japanese verbs in</layout:line>
<layout:line>the assimilation form</layout:line>
</layout:paragraph>

<syntax:corpus
xmlns:syntax="http://example.com/defList-syntax">
<syntax:np>V-TE</syntax:np>[...]
<syntax:np>Label</syntax:np>
[...]<syntax:pp>in the assimilation form</syntax:pp>[...]
</syntax:corpus>

Annotation of the same textual data, with respect to layout and syntax.

Here, an annotation of the same textual data is made for layout and syntax (in the linguistic sense). This leads to structures which cannot be represented within a single hierarchy. The prepositional phrase (syntax:pp) overlaps with the segmentation of the lines (layout:line); the overlapping tags are visualized in bold face. One might argue that such overlaps can be annotated for example via the milestone-mechanism which has been proposed in the TEI-Guidelines. But this would make the hierarchical structure of one of the annotations implicit, that is, hard to analyze, query or validate structurally.

To be able to deal with such cases, the approach of multiple annotations of the same primary data which has been developed by [Witt 2004] is applied. All instance documents share the same primary, textual data. In this way, the - normally implicit - absolute order of characters serves as a link between the separate annotation layers, i.e. XML instance documents. The absolute order of characters is made explicit in Figure 9.

Figure 9
L  A  B  E  L     F  O  R     T  H  E     A  N  N  O  T  A  T  I  O  N
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Example of the enumeration of characters

A Prolog-based implementation of the approach allows for the analysis of relations between several annotation layers like identity, start point identity etc., which are visualized in Figure 10. For example, the relation between the syntax:pp element and the layout:line element described above can be described as an overlap.

Figure 10

Relations between several annotation layers
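
Since the annotation layers are linked only by the absolute order of characters, a relation between two elements can be computed from their character intervals. The following sketch is written in Python and is not the Prolog-based implementation mentioned above. The relation names identity, start point identity and overlap are taken from the text; the names disjoint, end point identity and inclusion, the functions span and relation, and the joining of the lines of Figure 8 with single spaces are assumptions.

```python
# Primary data of Figure 8, with the lines joined by single spaces
# (the exact whitespace of the primary data is an assumption).
PRIMARY = ("V-TE Label for the annotation of Japanese verbs in "
           "the assimilation form")


def span(fragment, text=PRIMARY):
    """Return the half-open character interval of `fragment` in `text`."""
    start = text.index(fragment)
    return (start, start + len(fragment))


def relation(a, b):
    """Classify the relation between two character intervals."""
    (a0, a1), (b0, b1) = a, b
    if a == b:
        return "identity"
    if a1 <= b0 or b1 <= a0:
        return "disjoint"
    if a0 == b0:
        return "start point identity"
    if a1 == b1:
        return "end point identity"
    if (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1):
        return "inclusion"
    return "overlap"


# layout:line elements and syntax elements from Figure 8
line_2 = span("Label for the annotation of Japanese verbs in")
line_3 = span("the assimilation form")
pp = span("in the assimilation form")
np = span("V-TE")

print(relation(pp, line_2))
print(relation(np, span("V-TE")))
```

For the syntax:pp element and the second layout:line element the function reports an overlap, which is the case that cannot be represented within a single hierarchy.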

In Secondary Information Structuring, for each relation a respective predicate is defined. Similar to the prefix pathEx: for caterpillar expressions, these predicates are introduced by a reserved prefix, i.e. layerRel:. The relation is followed by the name of an element, e.g. layerRel:overlap syntax:pp, or of another concept. In Secondary Information Structuring, statements which encompass selections from document grammars, path expressions and such relations between multiple instance documents can be created. An example is visualized in Figure 11.

Figure 11

An example of Secondary Information Structuring which uses selections from document grammars, via path expressions and relations between several annotation layers

The path expression which is used as the selection criterion in the concept sekStruk:First-Paragraph-in-div ensures that only p elements are selected which are the first child of a div element. Only this subset of p elements is used for the selection in the concept sekStruk:Paragraph-as-Topic. In this concept the relation identity of p elements to the concept sekStruk:Topic is used as a selection criterion. Only the subset of the p elements is selected which have an identity relation to the topic element. This example shows how Secondary Information Structuring allows for a declarative description of relations between annotations of document structure and annotations of thematic structure. Such relations have been investigated empirically with the methodology of multiple annotations of the same primary data by [Bayerl et al. 2003].

Operationalization of Secondary Information Structuring

Formal characterization of Secondary Information Structuring: A terminological ontology

In section “Input to Secondary Information Structuring from Primary Information Structuring”, the information from document grammars and instance documents which serves as an input to Secondary Information Structuring has been described. The question now is how the statements which make use of this information can be operationalized. To answer this question it is necessary to discuss the formal properties of Secondary Information Structuring in more detail.

[Fischer 1998] serves as a basis for this discussion. With reference to [Sowa 1996], he defines a so-called terminological ontology. It consists of unary predicates, i.e. the concepts, and logical statements which describe the relations between concepts. Most of the predicates which are used for the creation of the statements have been described already. An important characteristic of a terminological ontology is the notion of models. Concepts belong to exactly one model. For example a concept sekStruk:Paragraph-tei belongs to a model sekStruk:Tei, and a concept sekStruk:Paragraph-docbook to a model DocBook. To express models in Secondary Information Structuring, a predefined concept sekStruk:models is used. Every concept which is directly subordinate to sekStruk:models denotes a model. For example to create a model for concepts which select constructs from the TEI document grammar, the statement (sekStruk:Tei subConceptOf sekStruk:models.) can be made.

For the operationalization of Secondary Information Structuring, the intensional and extensional interpretation of a terminological ontology is a central aspect. Intensionally speaking, the conceptual hierarchy has to be in accordance with given constraints. For example, a concept must not be subordinate to itself, i.e. a statement like (sekStruk:concept-x subConceptOf sekStruk:concept-x.) is not allowed. Extensionally speaking, subordinate concepts inherit the properties, i.e. the predicates of superordinate concepts, and their instances. In the case of Secondary Information Structuring, the instances are to be found in document grammars and instance documents.
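
Both interpretations can be illustrated with a minimal sketch. It assumes, for simplicity, that each concept has at most one directly superordinate concept; the dictionaries, the placeholder selection "element p { ... }" and the function names ancestors and inherited_selections are hypothetical.

```python
# subConceptOf statements, assuming a single superordinate concept each.
SUB_CONCEPT_OF = {
    "sekStruk:Paragraph-in-div": "sekStruk:Paragraph-tei",
    "sekStruk:Tei": "sekStruk:models",
}

# Selections stated via sekStruk2primStruk (illustrative placeholders).
SELECTIONS = {
    "sekStruk:Paragraph-tei": ["element p { ... }"],
    "sekStruk:Paragraph-in-div": ["pathEx:up sekStruk:div-tei"],
}


def ancestors(concept, hierarchy):
    """Intensional check: walk upwards and reject cyclic hierarchies."""
    seen = [concept]
    while concept in hierarchy:
        concept = hierarchy[concept]
        if concept in seen:
            raise ValueError(f"{concept} is subordinate to itself")
        seen.append(concept)
    return seen[1:]


def inherited_selections(concept, hierarchy, selections):
    """Extensional view: own selections plus those of superordinate concepts."""
    result = []
    for name in [concept] + ancestors(concept, hierarchy):
        result.extend(selections.get(name, []))
    return result


print(inherited_selections("sekStruk:Paragraph-in-div",
                           SUB_CONCEPT_OF, SELECTIONS))
```

A statement like (sekStruk:concept-x subConceptOf sekStruk:concept-x.) makes ancestors raise an error, which corresponds to the intensional constraint described above.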

The differentiation between intension and extension also contributes to the solution of an ontological question in the field of markup semantics, which has been raised by [Sperberg-McQueen et al. 2002]. The question is how it can be described that a markup construct is polysemous, i.e. has several meanings. For example a para element often is used to mark up paragraphs, but it can also be used to mark up thematic units (cf. section “Input from multiple instance documents: multiple annotations of the same primary data”). With the notion of intension and extension, this question can be answered. There are separate intensional descriptions of the para element, one for the concept sekStruk:Paragraph, and one for the concept sekStruk:ThematicUnit. sekStruk:Paragraph belongs to a model sekStruk:Tei-structural, and sekStruk:ThematicUnit belongs to another model sekStruk:Tei-thematic-units. In contrast to Secondary Information Structuring, object-oriented approaches like the type-hierarchy in XML Schema do not allow for such ambiguities. Every information item needs to have a single, non-ambiguous type.

Conceptual queries and validation of Primary Information Structuring

The main operations for Secondary Information Structuring, whose representation syntax is currently under development, are conceptual queries and validations of Primary Information Structuring. Instances of concepts from Secondary Information Structuring are to be retrieved, and it has to be validated whether the resources in Primary Information Structuring have the properties which are declared in the statements. The query procedure is rather simple: A subset of information resources in Primary Information Structuring can be interpreted as the extension of a concept if it is selected by the concept, via a statement with the predicate sekStruk2primStruk. This query approach is based on the Closed World Assumption, as it has been developed in the field of Artificial Intelligence. For example the concept sekStruk:Paragraph from Figure 2 selects the declaration of a p element in a document grammar as an instance, and the respective p element nodes in XML instance documents. The creation of URI-References to the declaration of p in the document grammar and to the set of p element information items is the output of the query procedure.

The validation procedure is more complex than the query procedure. It relies on the notion of an abstract concept versus a non-abstract, concrete concept, which are visualized in Figure 12.

Figure 12

Conceptual validation of document grammars and instance documents via the notion of abstract and non-abstract concepts.

The Figure is a slightly modified version of Figure 7. The concept sekStruk:Paragraph-tei is declared as being abstract, via the statement (sekStruk:Paragraph-tei abstract true.) Due to this statement, sekStruk:Paragraph-tei is not allowed to have direct instances: There has to be a subordinate concept with instances, i.e. sekStruk:Paragraph-in-div. In this way it is ensured that all p element nodes in instance documents are direct child elements of div element nodes. The TEI document grammar allows the user to create p elements which have other parent elements than div. These p elements would be in accordance with the general, TEI document grammar, but would not be valid with respect to the concept sekStruk:Paragraph-tei.
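
A minimal sketch of this check, assuming that the concrete concept sekStruk:Paragraph-in-div matches exactly those p elements whose parent is a div. The function name and the test documents are hypothetical, and namespaces are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def validate(xml_text):
    """Return the text of every p element that would be a direct instance
    of the abstract concept, i.e. that no concrete sub-concept selects."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    invalid = []
    for p in root.iter("p"):
        parent = parent_of.get(p)
        # Assumed matcher for sekStruk:Paragraph-in-div: parent must be a div.
        matches_concrete = parent is not None and parent.tag == "div"
        if not matches_concrete:
            invalid.append(p.text)
    return invalid
```

An empty result means that all p elements are covered by the concrete concept; any returned item is grammar-valid but conceptually invalid in the sense described above.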

The execution of the validation procedure can be described as parsing in the sense of instructions for a proof, a notion which has been described by [Sperberg-McQueen 2003]. He represents XML Schema in an abstract, Prolog-based format. In contrast to his aim, the scope of Secondary Information Structuring is not a document grammar as a whole. Whether the whole document grammar, only a part of it, or only a subset of information items in instance documents is subject to conceptual validation and query depends on the statements the user has created. In addition, one has to keep in mind the discussion from section “Formal characterization of Secondary Information Structuring: A terminological ontology” about intensional descriptions of concepts. A document grammar construct or a configuration of information items can be part of several proofs, for example proofs considering concepts for paragraphs versus proofs considering concepts for thematic units.

Interrelating document grammars

Relations between document grammars are described via the equal predicate. A sample application of the predicate, interrelating declarations for sections from the DocBook and the TEI document grammar, is visualized in Figure 13.

Figure 13

Interrelating document grammars.

For each document grammar, a separate model is declared. For DocBook, this is done via the statement (Docbook subConceptOf sekStruk:models.). In the DocBook document grammar, the nesting level of sections can be made explicit via element names, e.g. sect1 or sect2. For each nesting level, a respective concept is declared, which selects a document grammar construct. A statement for the sect1 element is (DB-sec1 sekStruk2primStruk "element sect1 { ... }".). The TEI also has elements which indicate the nesting level, but the user might choose the general div element instead, which allows for infinite nesting. To be able to interrelate div from the TEI to the various DocBook elements for sections, the nesting level is selected via caterpillar expressions in the TEI-specific concepts. For example, the concept TEI-sec1 selects only the div elements which are direct children of a body element. The statement is (TEI-sec1 sekStruk2primStruk "up 'body' ").

Statements with the equal predicate then make the relations between DocBook and the TEI explicit. For example, the concepts DB-sec1 and TEI-sec1 denote certain element nodes with the same nesting level. The respective statement is (TEI-sec1 equal DB-sec1.). An important aspect of this application of the equal predicate is the role of inheritance. Consider the statement (TEI-sections equal DB-sections.). The concept DB-sections inherits the selections of element declarations from superordinate concepts (e.g. DB-sec1 or DB-sec2), so the statement with the equal predicate holds for these element declarations (sect1, sect2 etc.) as well.

The query and validation operations described in section “Conceptual queries and validation of Primary Information Structuring” are also possible for interrelated document grammars. For example, if a query is made for instances of the TEI-sec1 concept, instances of the concept DB-sec1 can be retrieved as well. Hence, the result of the query procedure encompasses div elements which are direct children of body elements and sect1 elements. In addition, a transformation from instances of one document grammar to instances of another document grammar is possible. For example, div elements which are direct children of body elements can be transformed into sect1 elements (see note 11).
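
The following sketch shows how a query and a transformation over two interrelated models could look. The selector functions, the EQUAL list and the SELECTORS table are illustrative assumptions, and namespaces are ignored; only the concept names TEI-sec1 and DB-sec1 and the statement (TEI-sec1 equal DB-sec1.) come from the text.

```python
import copy
import xml.etree.ElementTree as ET

# Statement (TEI-sec1 equal DB-sec1.) as a pair of concept names.
EQUAL = [("TEI-sec1", "DB-sec1")]


def select_tei_sec1(root):
    """div elements that are direct children of a body element."""
    return [div for body in root.iter("body") for div in body.findall("div")]


def select_db_sec1(root):
    """All sect1 elements."""
    return list(root.iter("sect1"))


# Assumed mapping from concepts to their selections.
SELECTORS = {"TEI-sec1": select_tei_sec1, "DB-sec1": select_db_sec1}


def equivalents(concept):
    """The concept itself plus every concept related to it via equal."""
    found = {concept}
    for left, right in EQUAL:
        if left == concept:
            found.add(right)
        if right == concept:
            found.add(left)
    return found


def query(concept, roots):
    """Collect instances of the concept and of all equal concepts."""
    hits = []
    for name in sorted(equivalents(concept)):
        for root in roots:
            hits.extend(SELECTORS[name](root))
    return hits


def tei_to_docbook(root):
    """Return a copy in which top-level div elements are renamed to sect1."""
    converted = copy.deepcopy(root)
    for div in select_tei_sec1(converted):
        div.tag = "sect1"
    return converted
```

The transformation works on a copy, so the original instance document is left untouched; nested div elements are not renamed because TEI-sec1 selects only the first nesting level.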

Summary of predefined constructs in Secondary Information Structuring and the syntax of their representation

RDF-based syntax

The following list summarizes the predicates and other predefined constructs which are used for Secondary Information Structuring. They are represented as an RDF Schema (see note 12), see http://coli.lili.uni-bielefeld.de/~felix/phd/sekStruk/schemas/sekStruk.rdfs. The RDF Schema is visualized in Figure 14. The RDF representation of the example can be found at http://coli.lili.uni-bielefeld.de/~felix/phd/sekStruk/things/instance.rdf.

Figure 14

Visualization of the RDF Schema for the creation of Secondary Information Structuring

  • sekStruk: the namespace prefix for concepts in Secondary Information Structuring, for example sekStruk:Tei. Currently, just a dummy-namespace is being used: http://example.com/sekStruk. It also denotes the set of all things within Secondary Information Structuring.
  • subConceptOf: predicate for the creation of the conceptual hierarchy, for example (sekStruk:dt-tei subConceptOf sekStruk:Tei.). All concepts which denote a model are directly subordinate to the predefined concept sekStruk:models, e.g. (sekStruk:Tei subConceptOf sekStruk:models.).
  • componentOf: predicate which is used in document grammar based Secondary Information Structuring. It states that a document grammar construct, and hence the concept which selects it, is a component of another construct and of the concept which selects that construct. An example is (sekStruk:dt-html componentOf sekStruk:definitionList-html.).
  • equal: predicate for the description of relations between concepts in Secondary Information Structuring. This predicate is used only for concepts which belong to separate models.
  • sekStruk2primStruk: predicate for the selection of input from Primary Information Structuring, for example (sekStruk:def-tei sekStruk2primStruk "element tei:item { ... }".).
  • pathEx: prefix to refer to path expressions in Primary Information Structuring, for example (sekStruk:dt-html-in-first-para sekStruk2primStruk "pathEx up up 'p' isFirst".).
  • layerRel-1-n: prefix to refer to relations between several annotation layers, for example (sekStruk:def-html sekStruk2primStruk "layerRel:start_point_identity layout:line".).
  • sekStruk2conLevel: predicate for the mapping between Secondary Information Structuring and the conceptual level, for example (sekStruk:definitionList-tei sekStruk2conLevel "http://www.cogsci.princeton.edu/~wn/concept#103701336".) describes the mapping to the definition concept from the WordNet database. Here, the RDF-representation of WordNet developed by Sergey Melnik is being used, see http://www.semanticweb.org/library/.
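
To make the role of the predicates more tangible, the example statements from the list can be held as plain triples and queried. This is a sketch in ordinary Python tuples rather than an RDF toolkit; the helper names objects, superconcepts and models are hypothetical.

```python
# Example statements taken from the list above, as (subject, predicate, object).
STATEMENTS = [
    ("sekStruk:Tei", "subConceptOf", "sekStruk:models"),
    ("sekStruk:dt-tei", "subConceptOf", "sekStruk:Tei"),
    ("sekStruk:dt-html", "componentOf", "sekStruk:definitionList-html"),
    ("sekStruk:def-tei", "sekStruk2primStruk", "element tei:item { ... }"),
]


def objects(subject, predicate, statements=STATEMENTS):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


def superconcepts(concept, statements=STATEMENTS):
    """Follow subConceptOf upwards and return all ancestors, nearest first."""
    ancestors = []
    frontier = [concept]
    while frontier:
        current = frontier.pop(0)
        for parent in objects(current, "subConceptOf", statements):
            if parent not in ancestors:
                ancestors.append(parent)
                frontier.append(parent)
    return ancestors


def models(statements=STATEMENTS):
    """Concepts that are directly subordinate to sekStruk:models."""
    return sorted(
        s for s, p, o in statements
        if p == "subConceptOf" and o == "sekStruk:models"
    )
```

Under these assumptions, superconcepts("sekStruk:dt-tei") walks the conceptual hierarchy up to sekStruk:models, and models() lists the concepts which denote a model.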

Annotating RELAX NG with Secondary Information Structuring

An XML-serialization of the statements described above has been developed as well. There are two reasons for this representation. First, the implementation of the operations described above becomes easier (see Section “Implementation and applications”), because the data model of RDF has no fixed XML-serialization. Second, the XML-serialization is realized as a predefined set of annotations for RELAX NG document grammars. That is, the document grammar for the XML-serialization (see http://coli.lili.uni-bielefeld.de/~felix/phd/sekStruk/schemas/sekStruk.rng) is a superset of the document grammar for RELAX NG. In this way, every existing document grammar in the format of RELAX NG can be the starting point for the creation of Secondary Information Structuring. The patterns already defined in the document grammar can be interpreted as concepts.

Figure 15 contains the XML-representation for the concepts list-tei-general and definitionList-tei. For each concept, a define element is created. Subordination of concepts is represented by the sekStruk:subConceptOf attribute. The selection of Primary Information Structuring has a unique identifier, represented by the value of a sekStruk:sekStruk2primStruk attribute. The mapping to the conceptual level is represented by the value of the sekStruk:sekStruk2conLevel attribute.

Figure 15
<sekStruk:models
 xmlns:tei="http://example.com/defList-tei"
 xmlns="http://relaxng.org/ns/structure/1.0"
 xmlns:sekStruk="http://example.com/sekStruk">
 <sekStruk:model name="sekStruk-tei">
  <define name="list-tei-general"
    sekStruk:sekStruk2primStruk="mapping2">
   <element name="tei:list">...</element>
  </define>
  <define name="definitionList-tei"
    sekStruk:sekStruk2primStruk="mapping4"
    sekStruk:subConceptOf="list-tei-general"
    sekStruk:sekStruk2conLevel=
"http://www.cogsci.princeton.edu/~wn/concept#103701336">
   <attribute name="type">
    <value>gloss</value>
   </attribute>
  </define>
 </sekStruk:model>
</sekStruk:models>

XML-serialization of Secondary Information Structuring
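
As a rough illustration of how such a serialization could be consumed, the following sketch reads the document from Figure 15 with Python's standard ElementTree module and collects the sekStruk attributes per concept. The function name read_concepts and the dictionary layout are assumptions of this example, not part of the implementation described in this paper.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
SEK = "http://example.com/sekStruk"

# The document from Figure 15, reproduced verbatim.
FIGURE_15 = """<sekStruk:models
 xmlns:tei="http://example.com/defList-tei"
 xmlns="http://relaxng.org/ns/structure/1.0"
 xmlns:sekStruk="http://example.com/sekStruk">
 <sekStruk:model name="sekStruk-tei">
  <define name="list-tei-general"
    sekStruk:sekStruk2primStruk="mapping2">
   <element name="tei:list">...</element>
  </define>
  <define name="definitionList-tei"
    sekStruk:sekStruk2primStruk="mapping4"
    sekStruk:subConceptOf="list-tei-general"
    sekStruk:sekStruk2conLevel=
"http://www.cogsci.princeton.edu/~wn/concept#103701336">
   <attribute name="type">
    <value>gloss</value>
   </attribute>
  </define>
 </sekStruk:model>
</sekStruk:models>"""


def read_concepts(xml_text):
    """Map each define/@name to its sekStruk annotations (None if absent)."""
    root = ET.fromstring(xml_text)
    concepts = {}
    # define elements live in the RELAX NG namespace, the annotations
    # in the sekStruk namespace.
    for define in root.iter(f"{{{RNG}}}define"):
        concepts[define.get("name")] = {
            "subConceptOf": define.get(f"{{{SEK}}}subConceptOf"),
            "sekStruk2primStruk": define.get(f"{{{SEK}}}sekStruk2primStruk"),
            "sekStruk2conLevel": define.get(f"{{{SEK}}}sekStruk2conLevel"),
        }
    return concepts


concepts = read_concepts(FIGURE_15)
```

Because the annotations are ordinary namespaced attributes on RELAX NG define elements, no RDF parser is needed to recover the conceptual hierarchy from the serialization.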

Implementation and applications

Two independent implementations for the selection of information items in instance documents have been created. The implementation of caterpillar expressions described by [Sasaki and Pönninghaus 2003] is used to select information items within a single instance document. The respective, XML-based format CSD [Context Specification Document] also allows for the creation of a conceptual hierarchy. Element nodes in instance documents are interpreted as extensions of concepts. References to the nodes are extracted and assigned to concepts in the conceptual hierarchy, based upon the matching of the respective caterpillar expressions. The conceptual hierarchy in a CSD also implements the notion of abstract and concrete concepts, which is crucial for conceptual validation of information items (cf. section “Conceptual queries and validation of Primary Information Structuring”). A CSD-processor has been implemented in the Python programming language. The Prolog-based implementation described by [Witt 2004] is used to analyze relations between multiple annotations in several instance documents. Currently, the document grammar based Secondary Information Structuring is being implemented. The implementation will create URI-references to nodes in the instance documents which are in accordance with the relevant document grammar constructs.

The software described above will be used within a processing pipeline. First, all document grammar constructs defined by logical statements in the Secondary Information Structuring are analyzed. As a result, URI-references to the instances of the constructs are created. These encompass URI-references to declarations in document grammars for document grammar based Secondary Information Structuring, and URI-references to information items in instance documents. The latter are used in the subsequent steps of instance based Secondary Information Structuring. Again, information about the processing results is created as URI-references. The processing pipeline ends when all concepts in Secondary Information Structuring have been processed. The creation of URI-references can be compared with a partial PSVI [Post-Schema-Validation Infoset], i.e. the output of the validation of an instance document by a document grammar in the format of XML Schema. In contrast to the PSVI, the information is separated from the instance document, i.e. the original information set remains unchanged.
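
The two-stage character of this processing can be sketched as follows. The positional pointer scheme, the stage functions and the sample document are hypothetical; the sketch only illustrates that the results are kept as references outside the instance document.

```python
import xml.etree.ElementTree as ET

DOC = "<text><body><div><p>a</p></div><note><p>b</p></note></body></text>"


def index_nodes(root):
    """Map a positional pointer to (element, parent) for every element."""
    index = {}

    def walk(elem, parent, path):
        index[path] = (elem, parent)
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, elem, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, None, f"/{root.tag}[1]")
    return index


def grammar_stage(index, element_name):
    """Stage 1: pointers to all instances of a declared element."""
    return [ptr for ptr, (elem, _) in index.items() if elem.tag == element_name]


def instance_stage(index, pointers, parent_name):
    """Stage 2: keep only pointers whose element has the given parent."""
    return [
        ptr for ptr in pointers
        if index[ptr][1] is not None and index[ptr][1].tag == parent_name
    ]


root = ET.fromstring(DOC)
before = ET.tostring(root, encoding="unicode")
index = index_nodes(root)
all_p = grammar_stage(index, "p")             # output of the first stage
p_in_div = instance_stage(index, all_p, "div")  # second stage refines it
after = ET.tostring(root, encoding="unicode")
```

The second stage consumes only the references produced by the first one, and the serialized document is identical before and after processing, which mirrors the separation from the instance document described above.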

Two applications have been developed so far. In the field of linguistics, relations between theory-specific (in the linguistic sense) syntactic annotations of sentences, so-called treebanks, are described in a declarative manner, cf. [Sasaki et al. 2003]. This application makes use of relations between multiple annotations of the same textual data. In [Sasaki 2004], the multilingual documentation of document grammars via the WordNet database and its counterparts in other languages is described. This application makes use of caterpillar expressions within a single instance document. For further applications which combine all kinds of document grammar and instance based Secondary Information Structuring, the integrated representation format for Secondary Information Structuring described in section “Annotating RELAX NG with Secondary Information Structuring” will be used.

Related approaches

The approach of Secondary Information Structuring combines information resources which are important for markup semantics and semantic markup. The characteristics of approaches to markup semantics and semantic markup are summarized and compared to Secondary Information Structuring in the following table.

Table 1
name of the approach | general characterization of the approach | scope in Primary Information Structuring | path language for the selection of information items | non hierarchical relations | operations | changing of information resources
Secondary Information Structuring | declarative, knowledge-based | document grammars, single and multiple instance documents | caterpillar expressions | + | transformation, query, validation | -
DTDs generated from conceptual models | declarative, knowledge-based | document grammars, single instance documents | - | - | query, validation | +
XML Schemata generated from conceptual models | declarative, knowledge-based | document grammars, single instance documents | - | - | query, validation | +
Architectural forms | declarative | document grammars, single instance documents | - | - | transformation | -
XML Schema | declarative, object-oriented | document grammars, single instance documents | XPath (subset) | - | query, validation | +
CLASSIC | declarative, knowledge-based | document grammars, single instance documents | - | - | transformation, query, validation | -
Standoff-markup | declarative, knowledge-based | document grammars, single and multiple instance documents | - | + | transformation, query, validation | -
Declarative descriptions of transformations | declarative, knowledge-based | document grammars, single instance documents | - | - | transformation, query | -
BECHAMEL | declarative, procedural, object-oriented | document grammars, single instance documents | XPath | - | query, validation | -
Metaschema | declarative, knowledge-based | single instance documents | XPath | - | query | -
Semantic Network mapping | declarative, knowledge-based | document grammars, single instance documents | XPath | - | query | -
XDD-approach | declarative | document grammars, single instance documents | - | - | query, validation | -
Empirical markup semantics | based on instance documents | document grammars, single and multiple instance documents | - | + | query, validation | -

The main characteristic of Secondary Information Structuring is that it is a declarative, knowledge-based approach to the formal interpretation of markup. A knowledge-based approach encompasses conceptual resources which can be described as logical statements, true for a given domain. Such an approach can easily be represented with the formats based on RDF. It differs from object-oriented approaches like XML Schema or architectural forms in several aspects (see [Manola and Miller 2004], Section 5.3). First, concepts in knowledge-based approaches and their properties have a global scope. In the object-oriented paradigm, it is possible to describe global and / or local attributes of objects. Second, the type hierarchy in object-oriented approaches requires a proper sub-setting of attributes of objects and instances. This is not necessary for concepts in Secondary Information Structuring. Third, types in object-oriented models are based upon a type system, which does not allow for type ambiguity. This is the most important difference from Secondary Information Structuring: as has been exemplified in section “Formal characterization of Secondary Information Structuring: A terminological ontology”, a para element can denote various concepts, not only a single one.

For the comparison with approaches to semantic markup and markup semantics, several dimensions other than knowledge-based versus object-oriented are important. Some approaches focus on document grammars, others on single or multiple instance documents; various path languages can be used to select information items in instance documents; operations like the transformation of instance documents, query and validation are possible; and the interrelation might change the information resources or not.

Two approaches towards semantic markup are discussed, namely [Erdmann and Studer 1999] and [Klein et al. 2001]. Both describe a generic procedure for generating document grammars from given conceptual models. The respective instance documents can then be queried and validated on the conceptual level. Erdmann and Studer rely on XML DTDs, whereas Klein et al. use XML Schema. In addition, Klein et al. envisage a combination of the type hierarchy supplied by XML Schema with conceptual hierarchies. The main difference from Secondary Information Structuring is that these two approaches do not relate existing information resources: conceptual models are given, and markup is generated.

The other approaches discussed are in the area of markup semantics. Architectural forms, as defined by [ISO/IEC10744], describe relations between document grammars in a declarative way. The descriptions can also be used to transform instance documents. The relations can be compared to a document grammar based Secondary Information Structuring. The type system of XML Schema ([Thompson et al. 2001]) allows for a similar description of relations between document grammar constructs. In contrast to architectural forms, in XML Schema these relations become explicit in the type hierarchy, hence they can be queried with an appropriate query language like XQuery ([Boag et al 2003]). Nevertheless, this is only possible if the information set of an instance document is augmented with type information, i.e. the PSVI has to be created. This is the main difference from Secondary Information Structuring: To use XML Schema in querying, information resources have to be changed, and type assignment must not be ambiguous. It is not possible to assign two types to a node in instance documents, whereas Secondary Information Structuring allows the user to select the same node for several concepts.

XML Schema is the prototype of an object-oriented approach towards markup semantics. In contrast, the CLASSIC system described by [Welty and Ide 1999] is the prototype of a knowledge-based approach. Its notions of subsumption and classification are strongly related to Secondary Information Structuring. Nevertheless, CLASSIC is mainly concerned with document grammars in the format of DTDs. It is not possible to describe paths in the document structure. The same is true for the approach of Cristea and Butnariu ([Cristea and Butnariu 2004]). Unlike CLASSIC and all other approaches to semantic markup, they rely on standoff-markup: The textual data is in a basic XML-file, and additional annotations are in separate files which are linked to the basic file or other files via the ID / IDREF mechanism.

The approach of declarative descriptions of transformations ([Lenz et al. 2002]) is also closely related to Secondary Information Structuring. Instead of RDF Schema, Topic Maps ([Pepper and Moore 2001]) are used for the representation of transformation specifications. The automatic creation of adaptive hypertexts, making use of domain- and user-specific models, is the main application scenario for this approach, but it has also been discussed as a means to express and process relations between document grammars. Unlike CLASSIC and the Standoff-markup approach described above, this approach makes use of transformation rules which exceed the expressive power of document grammars.

Such rules are a key part of the BECHAMEL project ([Sperberg-McQueen et al. 2000], [Renear et al. 2002]). They are described as path expressions, so-called deictic expressions, i.e. pointers in the document structure. The deictic expressions are used to fill blanks in so-called skeleton sentences, which convey the semantic description of document grammars and instance documents. The approach is procedural and declarative: It uses the declarative programming language Prolog, and creates interpretations of markup in several steps.

In BECHAMEL, XPath is used as the path language. The same is true for the Metaschema described by [Simons 2003] and the approach by [Lobin 2001]. Both approaches describe a declarative mapping between conceptual resources and instance documents. In the case of Lobin, the mapping is concerned with semantic networks. The Metaschema and the approach of Lobin describe an 'atomic' mapping: A set of information items in instance documents is an instance of a concept if the respective XPath expression matches. In contrast, Secondary Information Structuring employs a more complex layer between markup and conceptual resources, which makes use of relations between mappings / concepts like subsumption and allows for the inference of information from superordinate mappings / concepts, and for the conceptual validation of Primary Information Structuring.
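
The difference between an 'atomic' mapping and a mapping with subsumption can be sketched as a lookup that falls back to superordinate concepts. The concept names are taken from the validation example earlier in the paper; the hierarchy below sekStruk:Tei, the conceptual-level URI and the function inherited are hypothetical and serve only as an illustration.

```python
# Assumed hierarchy: each concept points to its superordinate concept.
PARENT = {
    "sekStruk:Paragraph-in-div": "sekStruk:Paragraph-tei",
    "sekStruk:Paragraph-tei": "sekStruk:Tei",
}

# Only the superordinate concept carries a mapping to the conceptual level.
# The URI is a placeholder, not a real resource.
CON_LEVEL = {"sekStruk:Paragraph-tei": "http://example.com/concepts#paragraph"}


def inherited(concept, table, parents):
    """Look up a value for the concept, falling back to its ancestors."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept in table:
            return table[concept]
        seen.add(concept)  # guards against accidental cycles
        concept = parents.get(concept)
    return None
```

With an atomic mapping, sekStruk:Paragraph-in-div would have no conceptual-level information at all; with the fallback, it receives the mapping declared for sekStruk:Paragraph-tei.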

The XDD-approach has been discussed already in Section “Input from document grammars”. In contrast to all other approaches, the methodology described by [Bayerl et al. 2003] offers an empirical approach towards markup semantics. The meaning of an element in an instance document is described in terms of relations to elements in other instance documents. Similar to instance based Secondary Information Structuring, this approach relies on multiple annotations of the same primary data. It is mainly concerned with textual data, e.g. scientific articles. As has been exemplified in Figure 11, the relation of document structure to thematic or rhetorical structure can be analyzed, using multiple instance documents.

Summary and future work

This paper described the motivation and characteristics of a methodology called Secondary Information Structuring, which is used for the vertical interrelation of information resources. Declarations in document grammars and information items in - possibly multiple - instance documents are selected in a knowledge-based approach. An integrated implementation of the approach, using information from document grammars and multiple instance documents, is currently under development. Two sample applications from the field of linguistics and multilingual documentation of document grammars have been developed so far.

Besides the implementation, several research questions are on the agenda. First, the relation between document grammar constructs and path expressions in a single instance document has to be examined. It would be useful to test whether a caterpillar expression can match at all, taking a given document grammar into account, without consulting a restricted set of instance documents. Second, more ontological properties have to be addressed in Secondary Information Structuring. So far, only sub- and superordination of concepts and binary interconceptual relations are realized. More operators, e.g. to express disjointness of concepts within one model, negation / quantification etc., would be useful for a complex description of a domain. An interesting research question which was also out of the scope of this paper is how the methodology of Secondary Information Structuring relates to research which is concerned with the abstraction of various models (for relational schemata, document grammars, UML models etc.) into so-called meta models, cf. [Melnik et al. 2003]. Since the role of meta models - mapping heterogeneous information resources - is similar to Secondary Information Structuring, a detailed comparison of the approaches seems to be fruitful.

Notes

1.

The work described in this paper has been carried out in the project SEKIMO [Secondary Information Structuring and Comparative Discourse Analysis], which is part of the Research Group Text-technological Modeling of Information, see http://www.text-technology.de.

2.

The terminology Primary Information Structuring versus Secondary Information Structuring is taken from [Lobin 2000]. He defines Secondary Information Structuring in a narrower sense than this paper, i.e. as a means to describe the relations between document grammars, mainly via architectural forms ([ISO/IEC10744]).

3.

Following [Fischer 1998], a concept is defined formally as a unary predicate, i.e. by its name (see also section “Formal characterization of Secondary Information Structuring: A terminological ontology”). This basic notion of a concept is shared by the conceptual level and Secondary Information Structuring.

4.

The syntax of the example is the compact syntax of RELAX NG, for reasons which will be explained in section “Input to Secondary Information Structuring from Primary Information Structuring”.

5.

The underlying path language will be discussed in detail in section “Input from single instance documents: path expressions and logical statements”.

6.

Relating a concept to a relation is possible since both can be represented as URIs, which then become arguments of the predicate sekStruk2conLevel. For example, an XML-serialization of the RDF-representation of WordNet developed by Sergey Melnik contains unique identifiers for every relation between concepts (synsets, in the terminology of WordNet).

7.

The motivation to use multiple instance documents and the specific annotation format for these documents are described in section “Input from multiple instance documents: multiple annotations of the same primary data”.

8.

This specialization resembles the restriction of a simple type in XML Schema ([Thompson et al. 2001]). Nevertheless, the specialization of document grammar constructs via Secondary Information Structuring is not part of the document grammar itself and does not have to be in accordance with typing-constraints as in the object-oriented approach of XML Schema.

9.

A more detailed discussion of caterpillar expressions can be found in [Sasaki and Pönninghaus 2003].

10.

Since the DocBook document grammar declares various kinds of elements for sections, i.e. section, sect1, sect2 etc., there have to be selections for all of these elements. For another example which interrelates elements for the annotation of sections from DocBook to the TEI, see section “Interrelating document grammars”.

11.

The applicability of the transformation depends on several conditions, e.g. that the priority of the concepts which are interrelated via the equal predicate is unambiguous.

12.

Using RDF Schema as the representation format eases the task of relating the conceptual level and Secondary Information Structuring to each other, since many conceptual resources can easily be represented in RDF Schema or related languages.


Bibliography

[Bayerl et al. 2003] Bayerl, P. S., D. Goecke, H. Lüngen and A. Witt. Methods for the Semantic Analysis of Document Markup. Proceedings of the 3rd ACM Symposium on Document Engineering (DocEng). Roisin, C., E. Munson and C. Vanoirbeek, eds., Grenoble, 2003.

[Boag et al 2003] Boag, S., D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie and J. Simeon. XQuery 1.0: An XML Query Language. W3C Working Draft 12 November 2003. http://www.w3.org/TR/2003/WD-xquery-20031112/

[Bray et al. 2004] Bray, T., J. Paoli, C. M. Sperberg-McQueen, E. Maler and F. Yergeau. Extensible Markup Language 1.0 (Third Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/

[Brickley et al. 2004] Brickley, D. and R. V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation 10 February 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

[Brüggemann-Klein, A. and D. Wood 2000] Brüggemann-Klein, A. and D. Wood. Caterpillars: A Context Specification Technique. Markup Languages: Theory and Practice 2.1 (2000): pp. 313-331.

[Caton 2002] Markup's Current Imbalance. Markup Languages: Theory and Practice 3.1 (2002): pp. 1-13.

[Clark and deRose 1999] Clark, J. and S. deRose. XML Path Language (XPath) Version 1.0. W3C Recommendation 16 November 1999. http://www.w3.org/TR/1999/REC-xpath-19991116

[Clark and Murata 2001] Clark, J. and M. Murata. RELAX NG Specification. OASIS Committee Specification 3 December 2001. http://www.oasis-open.org/committees/relax-ng/spec-20011203.html

[Cowan and Tobin 2004] Cowan, J. and R. Tobin. XML Information Set (Second Edition). W3C Recommendation 4 February 2004. http://www.w3.org/TR/2004/REC-xml-infoset-20040204/

[Cristea and Butnariu 2004] Cristea, D. and C. Butnariu. Hierarchical XML Representation for Heavily Annotated Corpora. LREC-Workshop on XML-based richly annotated corpora. Lisbon, Portugal, 2004.

[Erdmann and Studer 1999] Erdmann, M. and R. Studer. Ontologies as Conceptual Models for XML Documents. KAW 99 - Twelfth Workshop on Knowledge Acquisition, Modeling and Management. Alberta, Canada, 1999.

[Fellbaum 1998] Fellbaum, C., ed. WordNet. An Electronic Lexical Database. MIT Press, Cambridge, Mass., 1998.

[Fischer 1998] Fischer, D. From Thesauri towards Ontologies? Structures and Relations in Knowledge Organization. Proceedings of the 5th ISKO-Conference. Maniez, J. and St. A. Pollit, eds., Ergon Verlag, Würzburg, Germany, 1998.

[ISO/IEC10744] Information Technology - Hypermedia/Time-based Structuring Language (HyTime). International Organization for Standardization, 1997.

[Klein et al. 2001] Klein, M., D. Fensel, F. v. Harmelen and I. Horrocks. The Relation between Ontologies and XML Schemas. Electronic Transactions on Artificial Intelligence 5 (2001): pp. 65-94. Special Section on Semantics for the Web.

[Klyne and Carroll 2004] Klyne, G. and J. J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation 10 February 2004. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/

[Lenz et al. 2002] Lenz, E., A. Witt and A. Storrer. Towards Declarative Descriptions of Transformations: An Approach based on Topic Maps. ALLC / ACH Conference. Tübingen, Germany, 2002.

[Lobin 2000] Lobin, H. Informationsmodellierung in XML und SGML. Springer, Berlin, 2000.

[Lobin 2001] Lobin, H. Netzwerkbasierte Modellierung der Semantik von XML-Strukturen. Proceedings der GLDV-Frühjahrstagung 2001. Lobin, H., ed., Gießen, Germany, 2001.

[Manola and Miller 2004] Manola, F. and E. Miller. RDF Primer. W3C Recommendation 10 February 2004. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/

[Melnik and Decker 2000] Melnik, S. and S. Decker. A Layered Approach to Information Modeling and Interoperability on the Web. ECDL Workshop 2000 on the Semantic Web. Lisbon, 2000.

[Melnik et al. 2003] Melnik, S., E. Rahm and P. A. Bernstein. Rondo: A Programming Platform for Generic Model Management. SIGMOD / PODS 2003 Conference. San Diego, California, 2003.

[Murata et al. 2001] Murata, M., D. Lee and M. Mani. Taxonomy of XML Schema Languages using Formal Language Theory. Proceedings of Extreme Markup Languages 2001. Montreal, Canada, 2001.

[Niles and Pease 2001] Niles, I. and A. Pease. Towards a Standard Upper Ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001). Ogunquit, Maine, 2001.

[Pepper and Moore 2001] Pepper, S. and G. Moore (eds.). XML Topic Maps (XTM) 1.0. TopicMaps.Org Specification 6 August 2001. http://www.topicmaps.org/xtm/1.0/

[Rahtz et al. 2004] Rahtz, S., N. Walsh and L. Burnard. A Unified Model for Text Markup: TEI, Docbook, and beyond. Proceedings of XML Europe 2004.

[Ramalho et al. 1999] Ramalho, J. C., J. G. Rocha, J. J. Almeida and P. Henriques. SGML Documents: Where does Quality go? Markup Languages: Theory and Practice 1.1 (1999): pp. 75-90.

[Renear et al. 2002] Renear, A., D. Dubin, C. M. Sperberg-McQueen and C. Huitfeldt. Towards a Semantics for XML Markup. Proceedings of the 2002 ACM Symposium on Document Engineering. Furuta, R., J. I. Maletic and E. Munson, eds. Virginia.

[Sasaki 2004] Sasaki, F. Combining Markup Semantics and Semantic Markup: A Secret Marriage. ALLC / ACH Conference. Göteborg, Sweden, 2004.

[Sasaki and Pönninghaus 2003] Sasaki, F. and J. Pönninghaus. Testing Structural Properties in Textual Data: Beyond Document Grammars. Literary and Linguistic Computing 18.1 (2003): pp. 89-100.

[Sasaki et al. 2003] Sasaki, F., A. Witt and D. Metzing. Declarations of Relations, Differences and Transformations between Theory-specific Treebanks: A New Methodology. Second Workshop on Treebanks and Linguistic Theories (TLT 2003). Växjö University, Sweden, 2003.

[Simons 2003] Simons, G. F. Developing Markup Metaschemas to Support Interoperation Among Resources. ALLC / ACH Conference. Athens, Georgia USA, 2003.

[Sowa 1996] Sowa, J.F. Ontologies for Knowledge Sharing. Manuscript of the invited talk at TKE 96.

[Sperberg-McQueen 2003] Sperberg-McQueen, C. M. Logic Grammars and XML Schema. Proceedings of Extreme Markup Languages 2003. Montreal, Canada, 2003.

[Sperberg-McQueen et al. 1994] Sperberg-McQueen, C. M. and L. Burnard, eds. Guidelines for Electronic Text Encoding and Interchange (TEI P3). ACH / ALLC / ACL Text Encoding Initiative, Chicago, Oxford, 1994.

[Sperberg-McQueen et al. 2000] Sperberg-McQueen, C. M., C. Huitfeldt and A. Renear. Meaning and interpretation of markup. Markup Languages: Theory and Practice 2.3 (2000): pp. 215-234.

[Sperberg-McQueen et al. 2002] Sperberg-McQueen, C. M., D. Dubin, C. Huitfeldt and A. Renear. Drawing Inferences on the Basis of Markup. Proceedings of Extreme Markup Languages 2002. Montreal, Canada, 2002.

[Thompson et al. 2001] Thompson, H. S., D. Beech, M. Maloney and N. Mendelsohn. XML Schema Part 1: Structures. W3C Recommendation 2 May 2001. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

[Walsh and Muellner 1999] Walsh, N. and L. Muellner. DocBook: The Definitive Guide. O'Reilly, Sebastopol, California, 1999.

[Welty and Ide 1999] Welty, C. and N. Ide. Using the Right Tools: Enhancing Retrieval from Marked-up Documents. Computers and the Humanities 33.1(2) (1999): pp. 59-84.

[Witt 2004] Witt, A. Multiple Hierarchies: New aspects of an old solution. Proceedings of Extreme Markup Languages 2004. Montreal, Canada, 2004.

[Wuwongse et al. 2003] Wuwongse, V., K. Akama, C. Anutariya and E. Nantajeewarawat. A Data Model for XML Databases. Journal of Intelligent Information Systems 20.1 (2003): pp. 63-80.


