Representing Discourse Models in RDF

Erik Hennum
ehennum@us.ibm.com

Abstract

The tension between syntax and model has a long tenure in XML markup. RDF and OWL provide a formal representation of data models. These languages can also provide a formal representation for discourse models. The Open World Assumption, under which models can be extended with new knowledge, is critical for discourse. By defining a formal model for discourse, we allow culturally specific XML syntax for authoring while obtaining interoperability and shared processing.

Keywords: RDF; Modeling; Interoperability

Erik Hennum

Erik Hennum works on the design and implementation of product information for the IBM Systems Technology Group. For DITA, he has helped shape the principles of domain specialization. He participates in the OASIS DITA Technical Committee as a member.

Erik Hennum [Information Architect; IBM]

Extreme Markup Languages 2006® (Montréal, Québec)

Copyright © IBM 2006. Reproduced with permission.

This paper explores a strategy for formal representation of discourse in the following sequence:

  1. Considerations for a formal model of discourse
  2. A representation of discourse instances in RDF
  3. A definition for discourse types in RDF
  4. A mapping between XML- and RDF-based representations of discourse.

Syntax and model

For many members of the markup community, thinking about markup vocabularies divides prominently between the syntax and model views. As Tim Bray distinguishes the two views [Bray2006]:

The first, which I’ll call syntax-centric, focuses on the language itself: what the tags and attributes are, and which can contain which, and what order they have to be in, and (even more important) on the human-readable prose that describes what they mean and what software ought to do with them. The second approach, which I’ll call model-centric, focuses on agreeing on a formal model of the data objects which captures as much of their semantics as possible.

The divergence of model and syntax parallels a distinct divergence between the data and discourse tendencies of markup. Data primarily consists of atomic, typed values and aggregating structures for those values. Developers often see data markup as a serialization of a data model, as with the Eclipse Modeling Framework (EMF)[EMF] (which can exchange data model declarations across XML Schema, UML, and Java class formats) and the Ontology Definition Metamodel [ODM].

By contrast, discourse primarily constitutes a flow of content. Writers tend to attach much greater importance to the syntax of the discourse markup, almost as a part of the content of the discourse flow. For instance, writers tend to have difficulty creating discourse when the ratio of markup to text gets too large or when the depth of markup structure breaks up the discourse flow. As a result, document types for discourse tend toward implicit semantics instead of rigorously manifested semantics.

Instead of compromising between model and syntax, an alternative strategy is to establish a separation of concerns [SOC] through a formalization of the distinction. By providing a machine-processable representation of the discourse model that's separate from the human-oriented view of the discourse syntax, we can provide better solutions for both concerns.

Solving this problem for discourse requires a definition of:

  • What belongs in the XML syntax.
  • What belongs in the formal model.
  • How to map between the two representations pragmatically.

One benefit is to allow cultural variants on the syntax, such as localized element names, without altering the fundamental model. Another benefit of recognizing a model for discourse is to provide a formal method for defining the commonality between XML vocabularies such as DITA, DocBook, and XHTML, enabling interoperability and shared processing for the vocabularies. Establishing commonality across variant syntax requires something more reliable than ad hoc transforms between document instances.

Related initiatives

Existing work that's relevant to these concerns includes the following:

XML Information Set (InfoSet) [Infoset], XML Document Object Model (DOM) [DOM]

The InfoSet provides an abstract representation and the DOM an object representation of the XML document contents. Both are strongly tied to the hierarchical document structure, and neither provides a formal model of the content semantics. (An RDF representation of the XML Information Set[RDF Infoset], however, remains an interesting experiment.)

Groves[Groves]

Groves has some parallels with RDF models in its graph structure. A Grove provides a complete representation of the source syntax, however, rather than a semantic model of the content.

DITA[DITA]

The DITA architecture provides an unambiguous method for associating types with markup elements in validated document instances. In DITA 1.0 and 1.1, however, each type is closely bound to one XML element rather than defined independently as part of a model.

RDFa[RDFa]

The RDFa initiative provides a way to annotate element instances from XHTML and other XML vocabularies with types from RDF models. That is, RDFa makes it possible for a writer to map an instance of an XML vocabulary to a formal model. In the RDFa approach, however, the XML vocabulary itself provides no assistance for the mapping. That is, apart from hosting the annotation, RDFa doesn't leverage the XML vocabulary for producing an instance of the formal model.

Semantic annotation

A number of initiatives [Semantic Annotation] maintain a semantic annotation separate from the annotated document. In particular, frameworks like UIMA [UIMA] and GATE [GATE] apply the techniques of information retrieval and text analysis for creation of such annotations. Such initiatives typically look only at the text of the document and could benefit from an awareness of the discourse type as captured in a document design and encoded in document markup by a writer.

Web Ontology Language (OWL) [OWL]

OWL provides a formal model, but this model is applied only to data (including data extracted from documents) and not to representing a document.

Considerations for a formal model of discourse

A few considerations can guide the definition of models for representing discourse:

  • In most cases, a complete unit of discourse is fundamentally linear. Even within hypertext webs, each linked unit of discourse has a sequence enacted through reading or hearing a flow of text, graphics, or sound. An analysis of discourse, then, can be represented as an annotation of the types and properties of the parts that make up this flow.
  • When analyzed from a perspective, the parts of a discourse unit usually conform to a hierarchy. For instance, analyzing a Shakespeare sonnet from a metrical perspective produces a hierarchy of stanza, line, and foot while analyzing the same poem from a grammatical perspective produces a hierarchy of sentence, clause, phrase, and word. When integrated, the two annotations have complex overlaps, but, seen in isolation, each provides a straightforward hierarchical view on the discourse.
  • While the discourse structure consists of the types and properties of the description (such as the list, paragraphs, and phrases), the discourse semantics usually apply to the types and properties of the described subject. For instance, when a tourist brochure has a list of attractions, the brochure text has the list structure, but Paris has the attractions. The brochure itself doesn't have attractions like a Louvre Museum or an Eiffel Tower.
  • Annotating the structure and semantics allows discourse designers, creators, and analyzers to share their understanding, enabling correlation of different knowledge in the context of the annotated discourse. Because different communities may understand different aspects of the discourse, the annotation must be extensible. Consumers must be able to use the annotations they understand and ignore the annotations they don't understand. For instance, a search engine that understands the grammatical annotations of the Shakespeare sonnet shouldn't be hampered by metrical annotations.

Benefits of using OWL and RDF for discourse models

Representing the discourse model in RDF and particularly OWL has some obvious advantages. This approach leverages the community and infrastructure that is building up around the Semantic Web initiative. For instance, as the trust layer of the Semantic Web is implemented, the trust framework can be applied to discourse. In addition, this approach prevents a fundamental schism between the data and discourse webs. For instance, a conference announcement can integrate with RDF calendar data about the conference schedule and with a SKOS [SKOS] enumeration of the subject matter of the conference.

An RDF representation requires acceptance of some key assumptions, however, including the Open World Assumption. This principle declares that, in a web of knowledge, you cannot ordinarily assume that you have the only knowledge about an object. Others may have additional knowledge, including properties that you don't understand. These properties can even qualify the object as an instance of types that you don't understand. For instance, a tourist may understand Paris as a destination with attractions while a sociologist understands it as an urban phenomenon with cultural and economic factors. As part of the Open World Assumption, a consumer of knowledge ignores any properties that aren't familiar.

As noted in the previous section, similar considerations about extensible knowledge apply to discourse. We should be able to annotate a Shakespeare sonnet with metrical analysis, grammatical analysis, or both. For another example, we could annotate a conference paper for its structures such as unordered lists, paragraphs, and long quotes, for its named entities such as Tim Bray and EMF, or for its rhetorical strategies such as appeals to ethos, logos, or pathos. (Whether the Open World Assumption might eventually apply to the composition of the discourse content itself, as with Wikis and blog talkbacks, is an interesting question but outside the scope of the typing concerns of this paper.) The requirement to represent common understanding without constraining special insight moves beyond composition of vocabularies as with HTML modularization [XHTML_MOD]. To allow extension with new annotations, the model must support merging types provided by independent designers.

In addition to the Open World Assumption, the RDF Schema and OWL notion of a type as a set can help to recognize commonality of discourse models. RDF Schema applies a kind of duck typing (so called because anything that walks like a duck and talks like a duck is considered a duck). If an RDF class is declared as the domain of a property and an object has the property, the object belongs to the class. Thus, an object can belong to many classes. OWL extends this set-oriented approach to typing by defining classes with respect to set operations (union, intersection, complement) on other classes. Discourse annotations can also be defined usefully as sets. For instance, we can recognize the content annotated with HTML <ul> and DocBook <itemizedlist> as belonging to the set of unordered lists.
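The set-oriented typing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ex: names, the DOMAINS table, and the UNIONS table are hypothetical, and the code imitates the effect of rdfs:domain and owl:unionOf rather than calling any RDF library.

```python
# Hypothetical property-to-domain declarations (the role rdfs:domain plays).
DOMAINS = {
    "ex:hasAttraction": "ex:Destination",
    "ex:hasEconomy": "ex:UrbanPhenomenon",
}

# Hypothetical union class (the role owl:unionOf plays).
UNIONS = {"ex:UnorderedList": {"html:ul", "docbook:itemizedlist"}}

# Illustrative triples: (subject, property, value).
TRIPLES = [
    ("ex:paris", "ex:hasAttraction", "ex:eiffelTower"),
    ("ex:paris", "ex:hasEconomy", "ex:tourism"),
    ("ex:paris", "ex:unknownProperty", "ignored by this consumer"),
]


def infer_types(triples, domains):
    """Return {subject: classes} implied by the properties each subject has."""
    types = {}
    for subject, prop, _value in triples:
        if prop in domains:  # unfamiliar properties are simply skipped
            types.setdefault(subject, set()).add(domains[prop])
    return types


def in_union(annotator_type, union_name, unions):
    """True when the annotator type is a member of the named union class."""
    return annotator_type in unions[union_name]


inferred = infer_types(TRIPLES, DOMAINS)
print(inferred["ex:paris"])
print(in_union("docbook:itemizedlist", "ex:UnorderedList", UNIONS))
```

One object ends up in two classes, and the property that the consumer does not recognize has no effect on the result.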

In other areas, discourse has some mismatches with RDF-based approaches. Most importantly, RDF is property-centric. That is, the RDF model emphasizes the relationship type between objects or the attribute type of values. By contrast, annotations all have the same fundamental annotation relationship to the annotated flow. Thus, a discourse model emphasizes the annotation type. Accommodating this difference, along with the application of the Open World Assumption to annotation hierarchies, requires some extension beyond typing in RDF Schema and OWL.

Representing discourse instances in an RDF equivalent to the DOM

The XML DOM provides a successful precedent for hierarchical nesting of discourse annotation. At a minimum, the discourse model in RDF must accommodate the same structure for interoperability with existing tools. Let's start with the following example of a simple XHTML document:

<html>
<body>
  <p>These are a few of my favorite things:</p>
  <ul>
  <li>Raindrops on roses.</li>
  <li>Whiskers on kittens.</li>
  </ul>
</body>
</html>

To express properties on an annotation, the model must represent the annotation as an object. Such objects can have an annotates relationship with the annotated flow. The set of such objects can be identified with an Annotator class. Specific annotation vocabulary such as html:p and html:ul can be considered subclasses of Annotator. The rdf:List construct can represent the sequence of leaf text objects and other atomic content as well as any nested annotators within the annotated portion of the flow.

Applied to the simple XHTML example, this approach results in the following RDF representation. (For brevity and to avoid confusion between RDF and XML, subsequent examples will show only the N3 serialization of RDF graphs.)

Table 1: RDF/XML and N3 serializations

RDF/XML:

<html:html rdf:about="#">
 <dowl:annotates rdf:parseType="Collection">
  <html:body>
   <dowl:annotates rdf:parseType="Collection">
    <html:p>
     <dowl:annotates rdf:parseType="Collection">
      <dowl:Text>
       <dowl:hasText>These are a few of my favorite things:</dowl:hasText>
      </dowl:Text>
     </dowl:annotates>
    </html:p>
    <html:ul>
     <dowl:annotates rdf:parseType="Collection">
      <html:li>
       <dowl:annotates rdf:parseType="Collection">
        <dowl:Text>
         <dowl:hasText>Raindrops on roses.</dowl:hasText>
        </dowl:Text>
       </dowl:annotates>
      </html:li>
      <html:li>
       <dowl:annotates rdf:parseType="Collection">
        <dowl:Text>
         <dowl:hasText>Whiskers on kittens.</dowl:hasText>
        </dowl:Text>
       </dowl:annotates>
      </html:li>
     </dowl:annotates>
    </html:ul>
   </dowl:annotates>
  </html:body>
 </dowl:annotates>
</html:html>

N3:

<> a html:html; dowl:annotates (
  [ a html:body; dowl:annotates (
    [ a html:p; dowl:annotates (
      [ a dowl:Text; dowl:hasText "These are a few of my favorite things:" ]
    )]
    [ a html:ul; dowl:annotates (
      [ a html:li; dowl:annotates (
        [ a dowl:Text; dowl:hasText "Raindrops on roses." ]
      )]
      [ a html:li; dowl:annotates (
        [ a dowl:Text; dowl:hasText "Whiskers on kittens." ]
      )]
    )]
  )]
) .

Figure 1: Graph for a simple XHTML instance
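The step from the XHTML instance to its N3 form is mechanical enough to sketch. The following Python uses only the standard library's xml.etree.ElementTree. The function name to_n3 is hypothetical, the root is written as an anonymous node rather than <>, and whitespace-only text is dropped; these are assumptions of this sketch, not rules taken from the model.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
<body>
  <p>These are a few of my favorite things:</p>
  <ul>
  <li>Raindrops on roses.</li>
  <li>Whiskers on kittens.</li>
  </ul>
</body>
</html>"""


def to_n3(element, indent=0):
    """Return N3 lines for one element as an anonymous html: annotator."""
    pad = "  " * indent
    lines = [f"{pad}[ a html:{element.tag}; dowl:annotates ("]
    # Interleave leading text, child elements, and tail text in document order.
    chunks = [element.text]
    for child in element:
        chunks.extend((child, child.tail))
    for chunk in chunks:
        if isinstance(chunk, str):
            if chunk.strip():  # assumption: whitespace-only text is dropped
                text = (chunk.replace("\\", "\\\\")
                             .replace('"', '\\"')
                             .replace("\n", "\\n"))
                lines.append(f'{pad}  [ a dowl:Text; dowl:hasText "{text}" ]')
        elif chunk is not None:
            lines.extend(to_n3(chunk, indent + 1))
    lines.append(f"{pad})]")
    return lines


n3_lines = to_n3(ET.fromstring(XHTML))
n3_text = "\n".join(n3_lines) + " ."
print(n3_text)
```

Each element becomes one annotator with one dowl:annotates list, so the nesting of the output mirrors the nesting of the DOM.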

An alternative representation of the text leaves might be as pointers into a range. That is, the text nodes could index into an immutable string.
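A minimal sketch of that alternative, assuming half-open character offsets into one immutable Python string. The TextRange class and its field names are hypothetical and are not part of the dowl vocabulary.

```python
from dataclasses import dataclass

# The immutable discourse flow that all ranges index into.
FLOW = "These are a few of my favorite things:"


@dataclass(frozen=True)
class TextRange:
    """A text leaf expressed as offsets rather than as a copied string."""
    start: int  # inclusive offset into the flow
    end: int    # exclusive offset into the flow

    def text(self, flow: str) -> str:
        return flow[self.start:self.end]


leading = TextRange(0, 16)
phrase = TextRange(16, 37)
colon = TextRange(37, 38)

for leaf in (leading, phrase, colon):
    print(repr(leaf.text(FLOW)))
```

Because the flow never changes, any number of annotations can point at the same offsets without copying or coordinating edits.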

Regardless of the representation of the text nodes, note that the annotation and text nodes in the RDF graph can have relationships outside the hierarchy. This capability contrasts with the DOM, where the cross-hierarchy relationships are not part of the model. For instance, where reuse in XML is expressed by a layer on top of the DOM with mechanisms like an XInclude, a DITA conref, or an XLink actuated on load, a reused Annotator in RDF can simply appear in multiple lists within the graph.

Merging annotations

To assemble different kinds of knowledge, it must be possible to merge the annotations that express understanding about discourse. The following example shows the result of combining the structural annotation of the text with linguistic annotation about the part of speech:

:p1 a html:p; dowl:annotates (
  [ a dowl:Text; dowl:hasText "These are a few " ]
  [ a pos:prepositional; dowl:annotates (
    [ a dowl:Text; dowl:hasText "of my favorite things" ]
  )]
  [ a dowl:Text; dowl:hasText ":" ]
) .

Figure 2: Graph of merged annotations

This treatment of merged annotations differs from the treatment of unknown vocabularies in an XML schema. For instance, using an unextended schema for XHTML, a validating XML parser would report an error for the following XML equivalent to the RDF graph:

<p id="p1">These are a few <pos:prepositional>of my favorite things</pos:prepositional>:</p>

Such validation is often important when receiving content from an XML input source. When interpreting a model in conformance with the Open World Assumption, however, an application should skip over annotations that it doesn't understand and examine the annotated content. In the example, an application that understands only the HTML annotations isn't impaired by the part-of-speech annotation that has enriched the understanding of the discourse.

Carrying this principle to the logical extreme, an application that understands none of the annotations sees only the leaf text, which can be entirely appropriate for applications such as simple search engines:

These are a few of my favorite things:
Raindrops on roses.
Whiskers on kittens.
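The skipping behavior can be sketched as follows. The tree encoding (a tuple of annotator type and child list, with plain strings as text leaves) and the function names are assumptions of this sketch, not part of the model.

```python
# Merged annotations on the paragraph: (annotator type, children); str = text leaf.
DOC = ("html:p", [
    "These are a few ",
    ("pos:prepositional", ["of my favorite things"]),
    ":",
])


def leaf_text(node):
    """What an application sees when it understands no annotations at all."""
    if isinstance(node, str):
        return node
    _kind, children = node
    return "".join(leaf_text(child) for child in children)


def merge_text(pieces):
    """Join adjacent text leaves left behind when an annotator is skipped."""
    merged = []
    for piece in pieces:
        if isinstance(piece, str) and merged and isinstance(merged[-1], str):
            merged[-1] += piece
        else:
            merged.append(piece)
    return merged


def visible(node, understood):
    """Keep understood annotators; splice the content of the others upward."""
    if isinstance(node, str):
        return [node]
    kind, children = node
    kept = [piece for child in children for piece in visible(child, understood)]
    return [(kind, merge_text(kept))] if kind in understood else kept


print(visible(DOC, {"html:p"}))
print(leaf_text(DOC))
```

An application that knows only html:p receives a paragraph holding a single run of text, and an application that knows nothing receives the text alone.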

The things described by the annotated discourse

In many or most cases, a unit of discourse describes something. A brochure describes a tourist destination, a white paper describes a new technology, and so on. By identifying description relationships between discourse and the things defined by an ontology, the model can represent the semantics of the discourse and bridge the concerns of data and textual processing.

Parts of a discourse flow can also describe things. In the earlier example, the brochure as a whole described Paris while an item in the list of attractions described the Eiffel Tower. Thus, any annotation can potentially have a describes relationship. The described object can express values extracted from the text. Processors that understand the value can make use of the value.

For instance, in the following example, the timeAnnotation object describes a calendar date, which can be represented with a CalendarClockDescription object from the Time Entry ontology and populated with values for the year, month, and day:

@prefix time-entry:   <http://www.isi.edu/~pan/damltime/time-entry.owl#> .

...
:p54 a html:p;
  dowl:annotates (
    [ a dowl:Text; dowl:hasText "Remember, the deadline is " ]
    [ a calendar:timeAnnotation;
      dowl:describes
        [ a time-entry:CalendarClockDescription;
          time-entry:unitType time-entry:unitDay;
          time-entry:year     "2006";
          time-entry:month    "4";
          time-entry:day      "7"];
      dowl:annotates (
        [ a dowl:Text; dowl:hasText "April 7, 2006" ]
      )]
    [ a dowl:Text; dowl:hasText "." ]
  ) .

Figure 3: Graph of annotation describing an object with parsed values

One of the most valuable contributions of an annotation is this ability to mediate between discourse and described things. For instance, an application that understands the CalendarClockDescription class but not the timeAnnotation annotator class can still make use of the values without having to understand the annotated text. Other structures within the RDF graph can have relationships to the CalendarClockDescription object without awareness of the annotated text. The CalendarClockDescription object can be referenced indirectly by way of the annotation or given an identifier so it can be referenced directly.
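As a sketch of that mediation, the following Python walks a dictionary encoding of the graph and collects described objects of a class it knows, without inspecting the annotator types or the annotated text. The dictionary keys and the function name described are hypothetical conveniences for this example.

```python
import datetime

# Hypothetical dictionary encoding of the deadline paragraph.
GRAPH = {
    "type": "html:p",
    "annotates": [
        {"type": "dowl:Text", "text": "Remember, the deadline is "},
        {"type": "calendar:timeAnnotation",
         "describes": {"type": "time-entry:CalendarClockDescription",
                       "year": "2006", "month": "4", "day": "7"},
         "annotates": [{"type": "dowl:Text", "text": "April 7, 2006"}]},
        {"type": "dowl:Text", "text": "."},
    ],
}


def described(node, wanted_type):
    """Yield described objects of wanted_type found anywhere under node."""
    target = node.get("describes")
    if target is not None and target["type"] == wanted_type:
        yield target
    for child in node.get("annotates", []):
        yield from described(child, wanted_type)


dates = [
    datetime.date(int(d["year"]), int(d["month"]), int(d["day"]))
    for d in described(GRAPH, "time-entry:CalendarClockDescription")
]
print(dates)
```

The traversal never tests for calendar:timeAnnotation, so the consumer depends only on the described class.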

In addition, the describes relationship can provide a basis for inference based on the discourse. For instance, by defining a describedTime subproperty with a domain of timeAnnotation and a range of CalendarClockDescription, we can make it possible to infer the types of the annotation or described thing.

Alluding to or naming a thing within discourse can be considered a special kind of description. At the other end of the descriptive spectrum, some texts such as encyclopedia articles provide an unambiguous definition. These two special kinds of description are sufficiently common that it may be useful to provide standard subproperties for them.

Perspectives on discourse

Frequently, a piece of discourse can provide alternatives for a sequence. The RDF graph can express this by providing an annotation from different perspectives. For instance, locale perspectives can provide translations of the same piece of discourse:

:p1 a html:p;
  dowl:hasPerspective [ a nls:LocalePerspective;
    xml:lang "en";
    dowl:annotates (
      [ a dowl:Text; dowl:hasText "These are a few of my favorite things:" ]
  )];
  dowl:hasPerspective [ a nls:LocalePerspective;
    xml:lang "es";
    dowl:annotates (
      [ a dowl:Text; dowl:hasText "Estas son algunas de mis cosas preferidas:" ]
  )] .

Conditional text provides another common use case. Product information often provides content fragments that should appear in the discourse only for a specific operating system, such as Linux or Windows, or for some other factor. These conditions can be modeled as perspectives on the discourse.
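A small sketch of conditional text as perspectives follows. The cond:PlatformPerspective type, the platform key, and the sample sentences are all hypothetical; the paper defines only the general dowl:hasPerspective mechanism.

```python
# Hypothetical paragraph with one perspective per operating system.
STEP = {
    "type": "html:p",
    "perspectives": [
        {"type": "cond:PlatformPerspective", "platform": "linux",
         "annotates": ["Run the installer from a shell prompt."]},
        {"type": "cond:PlatformPerspective", "platform": "windows",
         "annotates": ["Run the installer from the Start menu."]},
    ],
}


def select_text(node, platform):
    """Return the text of the perspective for platform, or None if absent."""
    for perspective in node["perspectives"]:
        if perspective.get("platform") == platform:
            return "".join(perspective["annotates"])
    return None


print(select_text(STEP, "linux"))
```

Selecting a locale perspective by its xml:lang value would follow the same pattern.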

Perspectives can also be used for overlapping annotations. For instance, lyrics can be annotated from the linguistic and musical perspectives:

:p1 a html:p;
  dowl:hasPerspective [ a pos:LinguisticPerspective;
    dowl:annotates (
      :t1
      :t2
      [ a pos:prepositional; dowl:annotates (
        :t3
        :t4
        :t5
      )]
      :t6
    )];
  dowl:hasPerspective [ a music:MusicalPerspective;
    dowl:annotates (
      [ a music:bar; dowl:annotates ( :t1     )]
      [ a music:bar; dowl:annotates ( :t2 :t3 )]
      [ a music:bar; dowl:annotates ( :t4     )]
      [ a music:bar; dowl:annotates ( :t5 :t6 )]
    )] .
:t1 a dowl:Text; dowl:hasText "These are a " .
:t2 a dowl:Text; dowl:hasText "few " .
:t3 a dowl:Text; dowl:hasText "of my " .
:t4 a dowl:Text; dowl:hasText "favorite " .
:t5 a dowl:Text; dowl:hasText "things" .
:t6 a dowl:Text; dowl:hasText ":" .

Figure 4: Graph of two perspectives on the same text nodes

The RDF graph is particularly useful for annotating different, overlapping spans of the same sequence of text nodes. That is, because the text nodes are referenceable within the graph, a textual leaf can appear within more than one annotated list, removing the single parent limitation of DOM. For instance, an RDF API can backtrack from a text node to the annotations on the text from different perspectives.
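The backtracking step can be sketched without an RDF API by building an inverse index over the two perspectives. The encoding and the function name index_annotations are assumptions of this sketch; an RDF store would answer the same question by querying for lists that include the text node.

```python
# The two perspectives on the same text nodes: str = text node id,
# tuple = (annotator type, children).
PERSPECTIVES = {
    "pos:LinguisticPerspective": [
        "t1", "t2",
        ("pos:prepositional", ["t3", "t4", "t5"]),
        "t6",
    ],
    "music:MusicalPerspective": [
        ("music:bar", ["t1"]),
        ("music:bar", ["t2", "t3"]),
        ("music:bar", ["t4"]),
        ("music:bar", ["t5", "t6"]),
    ],
}


def index_annotations(perspectives):
    """Map each text node id to [(perspective, enclosing annotator types)]."""
    index = {}

    def walk(perspective, items, path):
        for item in items:
            if isinstance(item, str):
                index.setdefault(item, []).append((perspective, path))
            else:
                kind, children = item
                walk(perspective, children, path + (kind,))

    for name, items in perspectives.items():
        walk(name, items, ())
    return index


INDEX = index_annotations(PERSPECTIVES)
print(INDEX["t3"])
```

The entry for :t3 shows the overlap directly: the node sits inside the prepositional phrase in one perspective and inside the second bar in the other.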

For illustration, here is the same overlap serialized as LMNL [LMNL]:

[p [id}p1{]}[music:bar}These are a {music:bar]
[music:bar}few [pos:prepositional}of my {music:bar]
[music:bar}favorite {music:bar]
[music:bar}things{pos:prepositional]:{music:bar]{p]

Overlaps are also important for text analysis, for instance, to represent a UIMA Common Analysis Structure (CAS). That is, supporting overlap is necessary for representing the results of text analysis as annotation on the discourse.

Deriving classes for annotating discourse

By declaring types for an RDF model, we can identify the semantics of the objects in the model, check the internal consistency of the model, and warrant inference on the objects of the model. The following classes can provide a base for declaring types for discourse models:

Class                      Definition of an object in the set
dowl:Annotator             Annotates part or all of a sequence of discourse.
dowl:Content               Represents a leaf node of non-textual content (such as a graphic) within a sequence.
dowl:Perspective           Identifies a variation in the sequence or annotation for the annotated discourse flow.
dowl:RestrictedAnnotator   Declares a restriction on annotated content.
dowl:Text                  Represents a textual node within a sequence.
dowl:Vocabulary            Assembles a set of annotators for checking the consistency of containment relationships.

The following example declares the html:p and html:em classes to be discourse annotators.

html:p a owl:Class;
  rdfs:subClassOf dowl:Annotator .

html:em a owl:Class;
  rdfs:subClassOf dowl:Annotator .

Declaring containment relationships between Annotators

An Annotator object can contain other annotations within the annotated span of discourse. For known vocabularies, those containment relationships can be declared. For instance, within the XHTML vocabulary, html:p can be declared to have a containment relationship with html:em.

html:p       dowl:contains      html:em;
             dowl:contains      dowl:Text .

html:em      dowl:contains      dowl:Text .

Because the containment of instances is expressed through dowl:annotates lists, potentially with intermediary Annotators, containment relationships cannot be declared as OWL or RDF Schema properties that represent a direct relationship between two objects. Thus, like the rdfs:subClassOf relationship, the contains relationship is a typing property that governs the relationship between instances of the specified Annotator classes.

Because of the potential for merging Annotators from other vocabularies, the contains relationship can be satisfied even if intermediaries appear in the containment hierarchy. The contains relationship doesn't require direct containment as with XML content models. For example, both of the following XML instances are valid for a containment relationship between html:p and html:em:

<p>These are a few of my <em>favorite</em> things:</p>

<p>These are a few <pos:prepositional>of my <em>favorite</em> things</pos:prepositional>:</p>

The validity of the containment relationships for pos:prepositional with respect to html:p or html:em is unknown. In fact, in the absence of additional information, the validity of html:em containing html:p is also unknown.
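The difference between a declared relationship and an unknown one can be sketched as follows. The tree encoding and the function name ancestor_pairs are assumptions of this sketch. Note that no pair is ever reported as invalid, which is what the Open World Assumption implies when only containment declarations are available.

```python
# Declared containment relationships for the small XHTML vocabulary.
CONTAINS = {
    ("html:p", "html:em"),
    ("html:p", "dowl:Text"),
    ("html:em", "dowl:Text"),
}

# The second instance: (annotator type, children); str = text leaf.
INSTANCE = ("html:p", [
    "These are a few ",
    ("pos:prepositional", ["of my ", ("html:em", ["favorite"]), " things"]),
    ":",
])


def ancestor_pairs(node, ancestors=()):
    """Yield (ancestor type, descendant type) for annotators nested at any depth."""
    if isinstance(node, str):
        return
    kind, children = node
    for ancestor in ancestors:
        yield (ancestor, kind)
    for child in children:
        yield from ancestor_pairs(child, ancestors + (kind,))


status = {
    pair: "declared" if pair in CONTAINS else "unknown"
    for pair in ancestor_pairs(INSTANCE)
}
for pair, verdict in status.items():
    print(pair, verdict)
```

The html:p and html:em pair is satisfied through the intermediary, while both pairs that involve pos:prepositional remain unknown.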

Checking the consistency of containment with respect to a vocabulary

Because unknown relationships cannot be assumed to be invalid under the Open World Assumption, identifying containment relationships isn't sufficient for consistency checking. Instead, the model has to identify the invalid relationships.

Enumerating the invalid relationships would be an onerous task. A better approach is to identify a scope in which the valid relationships have been enumerated. All other relationships within the scope can be assumed to be invalid.

To define a scope for consistency checking, annotators can be specified as members of a vocabulary. Two annotators that belong to the same vocabulary but don't have containment relationships cannot have instances in the same branch of an annotation hierarchy except through an intermediary belonging to the same vocabulary.

For example, the following declarations indicate that an html:body or html:ul cannot contain an html:em except through an intermediary html:p or html:li:

:v1 a dowl:Vocabulary .

html:body    dowl:inVocabulary  :v1;
             dowl:contains      html:p;
             dowl:contains      html:ul .

html:p       dowl:inVocabulary  :v1;
             dowl:contains      html:ul;
             dowl:contains      html:em;
             dowl:contains      dowl:Text .

html:ul      dowl:inVocabulary  :v1;
             dowl:contains      html:li .

html:li      dowl:inVocabulary  :v1;
             dowl:contains      html:p;
             dowl:contains      html:em;
             dowl:contains      dowl:Text .

html:em      dowl:inVocabulary  :v1;
             dowl:contains      dowl:Text .

A processor traversing the dowl:annotates structure could detect inconsistent relationships for the annotators in this vocabulary. The following XML instance provides one example of inconsistency with respect to the vocabulary:

<em><p>These are a few of my favorite things:</p></em>

Figure 5: Graph with instances that are inconsistent for the vocabulary

Text can be considered an implicit member of every vocabulary so that consistency checking can apply to text. In the example, an html:body or html:ul cannot contain text directly because they do not have explicit containment relationships with text.
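One way a processor might perform this check is sketched below. It reads the rule as follows: each annotator in the vocabulary, and each text leaf, must have a declared containment relationship with its nearest enclosing annotator from the same vocabulary. That reading, the tree encoding, and the function name inconsistencies are assumptions of this sketch.

```python
TEXT = "dowl:Text"

# The first vocabulary, written out as members and declared containment pairs.
V1 = {
    "members": {"html:body", "html:p", "html:ul", "html:li", "html:em"},
    "contains": {
        ("html:body", "html:p"), ("html:body", "html:ul"),
        ("html:p", "html:ul"), ("html:p", "html:em"), ("html:p", TEXT),
        ("html:ul", "html:li"),
        ("html:li", "html:p"), ("html:li", "html:em"), ("html:li", TEXT),
        ("html:em", TEXT),
    },
}


def inconsistencies(node, vocabulary, nearest=None):
    """Return (container, contained) pairs that violate the vocabulary.

    nearest is the closest enclosing annotator that belongs to the vocabulary.
    Text leaves are treated as implicit members of every vocabulary.
    """
    kind = TEXT if isinstance(node, str) else node[0]
    in_scope = kind == TEXT or kind in vocabulary["members"]
    found = []
    if in_scope and nearest is not None and (nearest, kind) not in vocabulary["contains"]:
        found.append((nearest, kind))
    if not isinstance(node, str):
        inner = kind if in_scope else nearest  # unknown annotators are transparent
        for child in node[1]:
            found.extend(inconsistencies(child, vocabulary, inner))
    return found


BAD = ("html:em", [("html:p", ["These are a few of my favorite things:"])])
GOOD = ("html:p", [
    "These are a few ",
    ("pos:prepositional", ["of my ", ("html:em", ["favorite"]), " things"]),
    ":",
])
print(inconsistencies(BAD, V1))
print(inconsistencies(GOOD, V1))
```

Annotators outside the vocabulary are passed through without a verdict, so merged annotations from other vocabularies do not cause failures.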

An annotator can belong to more than one vocabulary. The following example defines a vocabulary containing html:p and pos:prepositional:

:v2 a dowl:Vocabulary .

html:p             dowl:inVocabulary  :v2;
                   dowl:contains      pos:prepositional;
                   dowl:contains      dowl:Text .

pos:prepositional  dowl:inVocabulary  :v2;
                   dowl:contains      dowl:Text .

Because pos:prepositional has a defined relationship with respect to html:p but not html:em, both of the following XML instances are consistent with respect to both the first and second vocabularies:

<p>These are a few <pos:prepositional>of my <em>favorite</em> things</pos:prepositional>:</p>

<p>These are <em>a few <pos:prepositional>of my favorite things</pos:prepositional></em>:</p>

The ambiguity is entirely appropriate. While the prepositional phrase has a nesting relationship with respect to a paragraph, it has no relationship with respect to styled emphasis. Within an instance, either annotation can contain the other without changing the meaning of the annotation.

Finally, two vocabularies can be unified by identifying equivalent annotators with owl:equivalentClass. The following example asserts that html:p and pos:paragraph have the same meaning. As a result, the containment relationships defined for html:p within the first vocabulary and for pos:paragraph within another vocabulary apply to instances of either annotator:

html:p owl:equivalentClass pos:paragraph .

Defining annotators with OWL set operations

While containment relationships can be defined between pairs of annotators, a more convenient mechanism similar to XML Schema groups would make it easier to declare and understand models. In addition, some containment relationships might be seen as central to the definition of an annotator.

These requirements can be met with the OWL set operations. The following example provides an alternative to the prior definition for the XHTML annotators and their containment relationships. First, the example declares some union equivalents to Schema groups such as html:Block and html:Inline. Then, annotators are defined in terms of intersection with those unions. For instance, html:body is defined as an annotator that contains a block annotation. Similarly, html:p is defined as an annotator that contains a flow annotation other than a paragraph (thus preventing self nesting).

:v1 a dowl:Vocabulary .

html:Block   owl:unionOf (
               html:p
               html:ul
             ) .

html:Inline  owl:unionOf (
               html:em
               html:strong
             ) .

html:Flow    owl:unionOf (
               html:Block
               html:Inline
               dowl:Text
             ) .

html:body    dowl:inVocabulary  :v1;
             owl:intersectionOf (
               [ rdfs:subClassOf    dowl:Annotator ]
               [ dowl:contains      html:Block ]
             ) .

html:p       dowl:inVocabulary  :v1;
             owl:intersectionOf (
               [ rdfs:subClassOf    dowl:Annotator ]
               [ dowl:contains      html:Flow ]
               [ owl:complementOf (
                 [ dowl:contains      html:p  ]
               )]
             ) .

html:ul      dowl:inVocabulary  :v1;
             owl:intersectionOf (
               [ rdfs:subClassOf    dowl:Annotator ]
               [ dowl:contains      html:li ]
             ) .

html:li      dowl:inVocabulary  :v1;
             owl:intersectionOf (
               [ rdfs:subClassOf    dowl:Annotator ]
               [ dowl:contains      html:Flow ]
               [ owl:complementOf (
                 [ dowl:contains      html:ul  ]
               )]
             ) .

html:em      dowl:inVocabulary  :v1;
             owl:intersectionOf (
               [ rdfs:subClassOf    dowl:Annotator ]
               [ dowl:contains      html:Inline ]
               [ owl:complementOf (
                 [ dowl:contains      html:em  ]
               )]
             ) .

html:strong  dowl:inVocabulary  :v1;
             owl:intersectionOf (
               [ rdfs:subClassOf    dowl:Annotator ]
               [ dowl:contains      html:Inline ]
               [ owl:complementOf (
                 [ dowl:contains      html:strong  ]
               )]
             ) .

The unions are not themselves defined as annotators and thus cannot be instantiated to annotate instances of XHTML discourse.
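The containment rules declared above can be exercised with a small check. The following is a minimal sketch, assuming a hypothetical in-memory rendering of the vocabulary: the `UNIONS` and `ANNOTATORS` tables and the `expand` and `may_contain` functions are illustrative names, not part of the dowl vocabulary. Each annotator pairs an allowed content class with an optional excluded class, mirroring the intersection with a complement.

```python
# Hypothetical in-memory rendering of the vocabulary above; the table layout
# and function names are illustrative only.
UNIONS = {
    "html:Block": {"html:p", "html:ul"},
    "html:Inline": {"html:em", "html:strong"},
    "html:Flow": {"html:Block", "html:Inline", "dowl:Text"},
}

# annotator -> (allowed content class, excluded content class or None)
ANNOTATORS = {
    "html:body": ("html:Block", None),
    "html:p": ("html:Flow", "html:p"),
    "html:ul": ("html:li", None),
    "html:li": ("html:Flow", "html:ul"),
    "html:em": ("html:Inline", "html:em"),
    "html:strong": ("html:Inline", "html:strong"),
}


def expand(name):
    """Flatten a union into the set of concrete classes it covers."""
    if name not in UNIONS:
        return {name}
    members = set()
    for member in UNIONS[name]:
        members |= expand(member)
    return members


def may_contain(parent, child):
    """True if the declared intersection lets `parent` contain `child`."""
    allowed, excluded = ANNOTATORS[parent]
    if excluded is not None and child in expand(excluded):
        return False
    return child in expand(allowed)
```

Because the unions appear only in `UNIONS` and never in `ANNOTATORS`, the sketch also reflects the point that a union cannot itself be instantiated as an annotator.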

Declaring restrictions on content of annotators

For many annotators, there's no restriction on the cardinality of contained annotation objects. For instance, html:body can contain any number of html:Block objects, html:ul can contain any number of html:li objects, and so on.

For some annotators, however, cardinality is important. For instance, an html:table should contain no more than one html:caption. To handle these cases, the content of an annotator can be constrained in a way similar to an owl:Restriction on a property.

The following example uses OWL set operations to declare html:table as an Annotator with a restriction on the containment of an html:caption. Consistency checking on the vocabulary could then identify instances where an html:table contains multiple html:captions.

html:table dowl:inVocabulary   :v1;
           owl:intersectionOf (
             [ rdfs:subClassOf      dowl:Annotator ]
             [ owl:unionOf (
               [ a dowl:RestrictedAnnotator;
                 dowl:minCardinality  "0";
                 dowl:maxCardinality  "1";
                 dowl:contains      html:caption ]
               [ dowl:contains      html:tr ]
             )]
           ) .

The dowl:RestrictedAnnotator class cannot derive directly from owl:Restriction because OWL restrictions apply to properties of objects, whereas containment restrictions apply to the annotation hierarchy. Instead, the dowl:RestrictedAnnotator class can use an approach similar to owl:Restriction and subclass dowl:Annotator. As in the example, this approach allows dowl:RestrictedAnnotator to appear in set operations that define annotators. A designer can apply owl:Restriction to the properties of an annotation class in parallel with applying dowl:RestrictedAnnotator to the annotation contents.

Note that the type declaration has no requirement to specify a sequence for the annotated discourse. Where an annotated part has a cardinality of 1, sequence is irrelevant. Where an annotated part has a cardinality greater than 1, the annotated instance provides the sequence. To specify cardinality on part of a flow, the model should provide an annotation for that content and express cardinality on the annotator.
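The consistency check on cardinality described in this section might look like the following. This is a minimal sketch under stated assumptions: the `RESTRICTIONS` table is a hypothetical encoding of the html:table declaration, with each contained class mapped to a (minimum, maximum) pair, and `cardinality_violations` is an illustrative name rather than an existing API. Note that the check counts children without regard to their order, consistent with the observation that the type declaration does not specify sequence.

```python
from collections import Counter

# Hypothetical encoding of the html:table declaration: each contained class
# maps to (min, max); None means no upper bound.
RESTRICTIONS = {
    "html:table": {
        "html:caption": (0, 1),
        "html:tr": (0, None),
    },
}


def cardinality_violations(annotator, children):
    """Return one message per contained class that breaks its cardinality."""
    counts = Counter(children)
    problems = []
    for child_class, (low, high) in RESTRICTIONS[annotator].items():
        found = counts[child_class]
        too_many = high is not None and found > high
        if found < low or too_many:
            upper = "*" if high is None else high
            problems.append(
                f"{annotator}: found {found} {child_class}, expected {low}..{upper}"
            )
    return problems
```

An html:table instance with two captions would then yield a single message naming html:caption, while any number of html:tr children passes.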

Extending annotators and vocabularies

The combination of subclassing and set operations provides tremendous flexibility for defining new annotators and vocabularies. Annotators can be constrained or specialized when declaring new vocabularies.

An existing annotator can be restricted to a subset of its full content within a new vocabulary. In the following example, instances of html:p within the simpleV vocabulary cannot contain nested blocks:

:simpleV a dowl:Vocabulary .

:pSimple  dowl:inVocabulary  :simpleV;
          owl:intersectionOf (
            html:p
            [ owl:complementOf (
              [ dowl:contains      html:Block  ]
            )]
          ) .

New annotators can subclass existing annotators, restricting the content to subclasses of the base content. The following example implements part of the DITA task model by subclassing topic:ol as task:steps and by restricting the content to the task:step subclass of topic:li:

:taskV a dowl:Vocabulary .

task:steps  dowl:inVocabulary  :taskV;
            owl:intersectionOf (
              [ rdfs:subClassOf    topic:ol ]
              [ dowl:contains      task:step ]
            ) .

task:step   dowl:inVocabulary  :taskV;
            owl:intersectionOf (
              [ rdfs:subClassOf    topic:li ]
              [ dowl:contains      task:cmd ]
            ) .

task:cmd    dowl:inVocabulary  :taskV;
            rdfs:subClassOf  topic:ph .

In addition, new vocabularies can be defined by means of set operations on existing vocabularies:

  • For extension by addition, take the union of an existing vocabulary with new annotators and define the relationships of the new annotators in the new vocabulary.
  • For extension by restriction, apply restrictions to existing annotators as part of the new vocabulary.
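The two kinds of extension can be illustrated with ordinary set operations. The following is a simplified sketch, assuming a vocabulary is represented as a hypothetical mapping from annotator name to the set of classes it may contain. Unlike the :pSimple example, which declares a new annotator, this sketch restricts the annotator under its existing name inside a copy of the vocabulary. The function names are illustrative.

```python
# Hypothetical representation: annotator name -> set of permitted content.
BASE = {
    "html:p": {"html:ul", "html:em", "html:strong", "dowl:Text"},
    "html:ul": {"html:li"},
}

BLOCK = {"html:p", "html:ul"}


def extend_by_addition(vocabulary, new_annotators):
    """Union of an existing vocabulary with newly declared annotators."""
    overlap = vocabulary.keys() & new_annotators.keys()
    if overlap:
        raise ValueError(f"already declared: {sorted(overlap)}")
    return {**vocabulary, **new_annotators}


def extend_by_restriction(vocabulary, removed_content):
    """Copy a vocabulary, subtracting excluded content from chosen annotators."""
    return {
        name: content - removed_content.get(name, set())
        for name, content in vocabulary.items()
    }


# A restricted vocabulary in which paragraphs cannot contain nested blocks.
simple_v = extend_by_restriction(BASE, {"html:p": BLOCK})

# An extended vocabulary that adds a table annotator.
extended_v = extend_by_addition(BASE, {"html:table": {"html:caption", "html:tr"}})
```

Both functions return new mappings and leave the base vocabulary untouched, so several derived vocabularies can coexist.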

Inference based on annotations

Where two annotators describe specific things, the containment relationship between the two annotators can be used to infer relationships between the described things. That is, the structure of strongly typed discourse can be leveraged to increase the processable knowledge.

In the following example, a brochure contains annotations for a location and a list of items. The location annotation is known to describe a tourist destination and the list items are known to describe tourist attractions. The DescribedProperty declaration permits inference of a hasAttraction property between the tourist destination and tourist attractions described by the brochure:

:hasAttraction a owl:ObjectProperty;
                rdfs:domain          :TouristDestination;
                rdfs:range           :TouristAttraction .

:brochure     owl:intersectionOf (
                [ rdfs:subClassOf    dowl:Annotator ]
                [ owl:unionOf (
                  [ dowl:contains      :location ]
                  [ dowl:contains      :ul ]
                )]
              ) .

:location     owl:intersectionOf (
                [ rdfs:subClassOf    dowl:Annotator ]
                [ dowl:contains      dowl:Text ]
                [ a owl:Restriction;
                  owl:onProperty    dowl:describes;
                  owl:allValuesFrom :TouristDestination ]
              ) .

:ul           owl:intersectionOf (
                [ rdfs:subClassOf    dowl:Annotator ]
                [ dowl:contains      :li ]
              ) .

:li           owl:intersectionOf (
                [ rdfs:subClassOf    dowl:Annotator ]
                [ dowl:contains      dowl:Text ]
                [ a owl:Restriction;
                  owl:onProperty    dowl:describes;
                  owl:allValuesFrom :TouristAttraction ]
              ) .

:annotatedAttraction a dowl:DescribedProperty;
                dowl:forProperty     :hasAttraction;
                rdfs:domain          :location;
                rdfs:range           :li .

Similarly, when the ontology is more complete, it should be possible to use the descriptive associations for inference about the annotation.
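The inference in the brochure example can be sketched procedurally. The paper does not spell out the exact semantics of dowl:DescribedProperty, so this sketch assumes one plausible reading: when annotations of the domain and range annotators occur within the same containing annotation, the things they describe are related by the declared property. The instance data, identifiers such as `:someCity`, and the function names are all hypothetical.

```python
# Hypothetical instance data: each annotation has an annotator class, the
# thing it describes (if any), and its contained annotations.
ANNOTATIONS = {
    "a1": {"type": ":brochure", "describes": None, "contains": ["a2", "a3"]},
    "a2": {"type": ":location", "describes": ":someCity", "contains": []},
    "a3": {"type": ":ul", "describes": None, "contains": ["a4", "a5"]},
    "a4": {"type": ":li", "describes": ":someMuseum", "contains": []},
    "a5": {"type": ":li", "describes": ":somePark", "contains": []},
}

# Mirrors :annotatedAttraction as (domain annotator, range annotator, property).
DESCRIBED_PROPERTIES = [(":location", ":li", ":hasAttraction")]


def descendants(annotation_id):
    """Yield every annotation nested inside the given one, depth-first."""
    for child in ANNOTATIONS[annotation_id]["contains"]:
        yield child
        yield from descendants(child)


def infer(root):
    """Relate the things described by annotations that share the container."""
    scope = [root, *descendants(root)]
    triples = set()
    for domain_type, range_type, prop in DESCRIBED_PROPERTIES:
        subjects = [a for a in scope if ANNOTATIONS[a]["type"] == domain_type]
        objects = [a for a in scope if ANNOTATIONS[a]["type"] == range_type]
        for s in subjects:
            for o in objects:
                triples.add(
                    (ANNOTATIONS[s]["describes"], prop, ANNOTATIONS[o]["describes"])
                )
    return triples
```

Calling `infer("a1")` on this data yields two :hasAttraction statements, one for each list item in the brochure.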

Mapping between the definitions of the model and syntax

A well-developed model doesn't do away with the need for a syntactic view of discourse. Far from it. Communities of writers care passionately about the syntax of their XML vocabularies. Depending on whether the writer values conciseness or clarity, the best markup for a paragraph may be a <p> or a <para> element. Moreover, it is possible that a markup with compact, implicit semantics provides a better authoring medium for discourse than a markup with explicit semantics [Mazzocchi2004]. Finally, discourse creation can take advantage of the tools available for XML creation and validation instead of depending on a separate set of tools for an RDF-based representation of discourse.

The first step in creating a mapping is to select the type vocabularies to represent in the XML syntax. The XML syntax defines a closed world that can be validated.

For each class in the model (whether an Annotator or a general OWL data class) and its properties, the mapping declares the following:

Abstract or concrete form

A base class can be omitted if the XML vocabulary uses its subclasses. In addition, an annotator can be virtual with respect to an XML vocabulary if the model instances of the type can be inferred from the XML instances. For example, a definition list entry can be inferred from the HTML <dt> and <dd> boundaries.

Element or attribute form

An Annotator or property that has a cardinality of one and no substructure can be mapped to an attribute.

Name

The same model type can map to different names in different XML vocabularies. For instance, the paragraph model type can map to <p> and <para> elements in different XML vocabularies. Such aliasing can also include or omit namespaces or localize element names to improve the usability of the XML vocabulary without impairing the interoperability based on the model.

Sequence

Even if the order of distinct classes is unimportant in the model, the XML vocabulary can impose a defined sequence to make the content easier to write and validate. The containment relationships and cardinality can be determined from the model.

The mapping can be used to make a binding between a model and an XML vocabulary. The declaration for the mapping might resemble the following example:

:v1 dowl:hasMapping
  [ a dowl:VocabularyMapping;
    dowl:mappedClass  html:p;
    dowl:mappedForm   dowl:Element;
    dowl:mappedItem   <http://www.w3.org/1999/xhtml#p>] .

Instead of maintaining a binding, a more efficient approach is to maintain additional information with either the model or the syntax and to generate the other representation. In particular, from a detailed mapping, it is possible to generate XML Schema, RelaxNG, and other syntactic validations for a vocabulary, thus enabling a broad spectrum of users. A dowl:hasChoice or dowl:hasSequence list can contain a dowl:ContentModel, which can in turn provide its own choice or list.

:v1 dowl:hasMapping
  [ a dowl:VocabularyMapping;
    dowl:mappedClass  html:p;
    dowl:mappedForm   dowl:Element;
    dowl:mappedItem   <http://www.w3.org/1999/xhtml#p>;
    dowl:mappedContent
      [ a dowl:ContentModel;
        dowl:hasChoice (
          html:Inline
          dowl:Text
        )]] .
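Generating a syntactic validation from such a mapping could proceed along these lines. The following is a minimal sketch, assuming a hypothetical flattened form of the mapping above and assuming that the choice repeats without limit. It emits a single definition in RELAX NG compact syntax; `em` and `strong` in the output are references to patterns that would have to be generated separately. The `MAPPING` and `UNION_ELEMENTS` tables and the function name are illustrative.

```python
# Hypothetical flattened form of the dowl:VocabularyMapping above.
MAPPING = {
    "class": "html:p",
    "form": "element",
    "name": "p",
    "choice": ["html:Inline", "dowl:Text"],
}

# Hypothetical element names for the members of the html:Inline union.
UNION_ELEMENTS = {"html:Inline": ["em", "strong"]}


def to_relaxng_compact(mapping):
    """Render a repeatable choice as a RELAX NG compact-syntax definition."""
    alternatives = []
    for item in mapping["choice"]:
        if item == "dowl:Text":
            alternatives.append("text")
        else:
            alternatives.extend(UNION_ELEMENTS[item])
    name = mapping["name"]
    body = " | ".join(alternatives)
    return f"{name} = element {name} {{ ({body})* }}"
```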

If the form and mapped item are omitted, the default mapping can use the same name in the XML vocabulary and map an annotator with only textual content to an attribute and any other annotator to an element. The default content model can consist of a sequence of elements with fixed cardinality followed by a choice of elements with unlimited cardinality. This default encourages a canonical XML representation of a model.
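The default rule for form and name can be sketched as a small function. This is an illustration under assumptions: `dowl:Attribute` is a hypothetical counterpart to the dowl:Element value shown earlier, "the same name" is taken to mean the local part of the prefixed class name, and `html:title` is used only as an example of an annotator with purely textual content.

```python
def default_mapping(annotator, content):
    """Apply the default form and name when a mapping omits them."""
    local_name = annotator.split(":", 1)[-1]
    text_only = set(content) == {"dowl:Text"}
    return {
        "mappedClass": annotator,
        # dowl:Attribute is an assumed name, not taken from the paper.
        "mappedForm": "dowl:Attribute" if text_only else "dowl:Element",
        "mappedItem": local_name,
    }
```

For instance, `default_mapping("html:title", ["dowl:Text"])` selects the attribute form, while an annotator whose content includes html:Inline maps to an element.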

The mapping can support other uses in addition to XML validation. For instance, a bundle of style or configuration properties associated with a model class can be attached to any element generated from the model via the mapping. Or, a model class can take a mustUnderstand property for a type of application.

Mappings can also be made to non-XML formats such as Wiki source, troff, LMNL, and so on. Perspectives with overlaps are, of course, mappable only to formats that support overlap, such as LMNL.

Processing instances of the model

The rationale for mapping class declarations to XML syntax also applies to mapping RDF and XML instances.

Processing RDF discourse instances with XML tools

For example, applying XSLT to the hierarchical annotation of the discourse enables processing of the largest subset of model instances with existing tooling. This bridge can be implemented through a thin adapter that traverses the containment hierarchy with an RDF API and exposes a SAX or StAX interface so the adapter can fill the role of an XML parser. Thereafter, XSLT rules can fire on the annotator classes.
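A thin adapter of this kind might be sketched as follows. To stay self-contained, the sketch assumes a hypothetical `MODEL` dictionary standing in for the results of an RDF API query: each annotation has a mapped element name and ordered content, tagged as text or as a nested annotation. The adapter fires standard SAX events on whatever content handler it is given; here the standard library's `XMLGenerator` serializes them, but any SAX consumer could take its place.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Hypothetical stand-in for an RDF API: annotation id -> (element name,
# ordered content). Content items are ("text", str) or ("node", id).
MODEL = {
    "n1": ("p", [("text", "Hello "), ("node", "n2"), ("text", "!")]),
    "n2": ("em", [("text", "world")]),
}


def emit(node_id, handler):
    """Walk the containment hierarchy depth-first, firing SAX events."""
    name, content = MODEL[node_id]
    handler.startElement(name, AttributesImpl({}))
    for kind, value in content:
        if kind == "text":
            handler.characters(value)
        else:
            emit(value, handler)
    handler.endElement(name)


def serialize(root_id):
    """Drive the adapter with a handler that writes the events as XML."""
    out = io.StringIO()
    handler = XMLGenerator(out, encoding="utf-8")
    handler.startDocument()
    emit(root_id, handler)
    handler.endDocument()
    return out.getvalue()
```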

Processing XML discourse instances in RDF applications

The more typical case, however, starts with an XML instance. Using the declared mapping, the instance can be converted to the model. A SAX or StAX application can act as a reader for an RDF API, constructing the model for RDF-based applications.
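The conversion from an XML instance to the model could be sketched with a SAX content handler. This sketch makes several assumptions: `ELEMENT_TO_CLASS` is a hypothetical table derived from the declared mapping, instance-level containment is recorded with a `dowl:contains` triple, and blank-node labels are generated from a simple counter. Text content is ignored for brevity, and plain tuples stand in for the statements an RDF API would create.

```python
import xml.sax

# Hypothetical table derived from the declared mapping.
ELEMENT_TO_CLASS = {"p": "html:p", "em": "html:em"}


class ModelBuilder(xml.sax.ContentHandler):
    """Collect triples describing the annotation hierarchy of an XML instance."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.stack = []
        self.counter = 0

    def startElement(self, name, attrs):
        # Create an annotation node typed by the mapped annotator class.
        self.counter += 1
        node = f"_:a{self.counter}"
        self.triples.append((node, "rdf:type", ELEMENT_TO_CLASS[name]))
        # Record containment by the enclosing annotation, if there is one.
        if self.stack:
            self.triples.append((self.stack[-1], "dowl:contains", node))
        self.stack.append(node)

    def endElement(self, name):
        self.stack.pop()


def xml_to_triples(document):
    """Parse an XML string and return the collected triples in event order."""
    builder = ModelBuilder()
    xml.sax.parseString(document.encode("utf-8"), builder)
    return builder.triples
```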

Processing XML instances based on RDF model typing

A highly desirable processing strategy is to avoid all conversion and interpret the XML instance with the RDF types. This approach makes it possible to share processing across different XML vocabularies. By associating RDF annotators with XML elements, the mapping provides the foundation for this processing strategy. An API (for instance, an XSLT function library) can look up the RDF annotator for an element instance so that rules can be fired based on the class rather than on the element. For a pragmatic solution, the mapping can be serialized in an alternative format for efficient lookup, and the document instance can be associated with its class declarations with an attribute similar to the XML Schema binding or namespace attributes.
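Rule firing on the annotator class rather than on the element name can be illustrated without XSLT. The following is a minimal sketch, assuming two hypothetical vocabularies that map different element names to the same annotator classes. The `MAPPINGS` and `RULES` tables, the vocabulary labels, and the `render` function are illustrative, and the rules simply produce plain text.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings for two XML vocabularies that share one model.
MAPPINGS = {
    "xhtml": {"p": "html:p", "em": "html:em"},
    "docbook-like": {"para": "html:p", "emphasis": "html:em"},
}

# Rules keyed on the annotator class rather than on the element name.
RULES = {
    "html:p": lambda text: text + "\n",
    "html:em": lambda text: f"*{text}*",
}


def render(element, mapping):
    """Fire the rule for each element's annotator class, depth-first."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, mapping))
        parts.append(child.tail or "")
    annotator = mapping[element.tag]
    return RULES[annotator]("".join(parts))
```

The same `RULES` table then processes both `<p>Hello <em>world</em>!</p>` under the first mapping and `<para>Hello <emphasis>world</emphasis>!</para>` under the second, which is the shared processing the strategy aims for.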

Summary

This paper explored a potential strategy for use of RDF and OWL to provide a formal model for discourse, including:

  • Indefinite nesting of annotations for a unit of discourse with alternative perspectives as needed.
  • Support for the Open World Assumption through merged annotation types.
  • Consistency checking for a vocabulary through declared containment relationships.
  • Mapping from models to syntax to leverage existing XML authoring and processing tools as well as understanding.

A formal representation can make it easier to share understanding of discourse. Potential applications include representing the results of text analysis as well as enabling interoperable exchange and processing across XML vocabularies. Finally, integrating data ontologies with semantically analyzed discourse can increase the value of both kinds of knowledge asset.


Acknowledgments

The strategy for separating model and syntax proposed here has its roots in existing DITA practice. As a result, this paper owes a great debt to the OASIS DITA Technical Committee and to DITA leads and contributors at IBM and across the community.


Bibliography

[Bray2006] On XML Language Design, Tim Bray, 9 January 2006, http://www.tbray.org/ongoing/When/200x/2006/01/09/On-XML-Language-Design#p-5

[DITA] OASIS DITA Architectural Specification, Michael Priestley, editor, OASIS Committee Draft 01 First Edition, 17 February 2005, http://xml.coverpages.org/DITA-CD11428-ArchSpec.pdf

[DOM] Document Object Model (DOM) Level 3 Core Specification, Arnaud Le Hors et al, editors, W3C Recommendation, 7 April 2004, http://www.w3.org/TR/DOM-Level-3-Core/

[EMF] The Eclipse Modeling Framework (EMF) Overview, June 16, 2005, http://www.eclipse.org/emf/docs.php?doc=references/overview/EMF.html

[GATE] Developing Language Processing Components with GATE Version 3 (a User Guide), Hamish Cunningham et al, March 2006, http://gate.ac.uk/sale/tao/index.html

[Groves] Addressing the Enterprise: Why the Web needs Groves, Paul Prescod, July 1999, http://www.prescod.net/groves/shorttut/

[Infoset] XML Information Set (Second Edition), John Cowan and Richard Tobin, Editors, W3C Recommendation, 4 February 2004, http://www.w3.org/TR/xml-infoset/

[LMNL] LMNL Syntax, Jeni Tennison, Gavin Thomas Nicol, and Wendell Piez, editors, 11 October 2002, http://www.lmnl.net/prose/syntax/index.html

[Mazzocchi2004] A No-nonsense Guide to Semantic Web Specs for XML People, Explicit vs. Implicit Semantics, Stefano Mazzocchi, 5 November 2004, http://www.betaversion.org/~stefano/linotype/news/78/

[ODM] Ontology Definition Metamodel, OMG Ontology Working Group, Fourth Revised Submission, 14 November 2005, http://www.omg.org/docs/ad/05-04-13.pdf

[OWL] OWL Web Ontology Language Overview, Deborah L. McGuinness and Frank van Harmelen, editors, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/

[RDF Infoset] An RDF Schema for the XML Information Set, Richard Tobin, Editor, W3C Note, 6 April 2001, http://www.w3.org/TR/xml-infoset-rdfs

[RDFa] RDFa Primer 1.0, Ben Adida and Mark Birbeck, editors, W3C Working Draft, 16 May 2006, http://www.w3.org/TR/xhtml-rdfa-primer/

[Semantic Annotation] Semantic Annotation for Knowledge Management, Victoria Uren et al, Journal of Web Semantics, Volume 4, Issue 1, 2005, http://www.websemanticsjournal.org/ps/pub/2005-34

[SKOS] SKOS Core Guide, Alistair Miles and Dan Brickley, Editors, W3C Working Draft, 2 November 2005, http://www.w3.org/TR/swbp-skos-core-guide/

[SOC] Korpilla et al, Separation of concerns, Wikipedia, 2006, http://en.wikipedia.org/wiki/Separation_of_concerns

[UIMA] Unstructured Information Management Architecture (UIMA) SDK User's Guide and Reference, David Ferrucci et al, February 2006, http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference.pdf

[XHTML_MOD] XHTML Modularization 1.1, Daniel Austin et al, W3C Proposed Recommendation, 13 February 2006, http://www.w3.org/TR/xhtml-modularization/


