Some Semantics for Structured Documents, Topic Maps and Topic Map Queries

Ann Wrightson

Abstract

This paper presents an abstract model for meaningful communication between parts of a distributed system using structured documents. It is a first step towards a semantics of SGML/XML mediated inter-system communication based on situation semantics and the logic of information flow. After presenting the main account, with respect to simple XML documents, and to queries on topic maps, two directions are outlined for further work. One of these is user-centred, relating structure to user experience by means of the concept of genre. The other extends this initial work into a fuller formal model.

Keywords: Semantics; Modeling; Querying; Topic Maps

Ann Wrightson

Initially trained in Philosophy (specializing in logic); 14 years total dedicated to computer-based publishing, with a few years in between spent as an academic; most recently, employed by the legal publishers Sweet and Maxwell Ltd as a technical architect, then, after a short period as a consultant with Ontopia AS, joined alphaXML, a new UK-based XML technology consultancy, as a Principal Consultant. Participated in SGML-related standards development since mid-1980s. Co-ordinating emerging industry-led research developing standards-based support for knowledge management on the Web (KnoW network - founding board member). Recent experience in academic research and advanced university teaching, with a multidisciplinary profile including XML technology (developed the first postgraduate course in this area in the UK), formal methods and high-integrity systems, requirements engineering, knowledge representation, systems analysis, computer supported cooperative work, and multimedia applications design. Member of programme board for KT2001.

Some Semantics for Structured Documents, Topic Maps and Topic Map Queries

Ann Wrightson [alphaXML Ltd]

Extreme Markup Languages 2001® (Montréal, Québec)

Copyright © 2001 Ann Wrightson. Reproduced with permission.

Introduction

Describing the semantics of markup and structured documents is not new, indeed the words semantic(s) and markup have been seen together many times, not least in the papers of this conference last year, which saw critical analyses of markup complexity [SHR00], and of the traditional semantic distinction between procedural and descriptive markup [Ren00]. This paper presents a different approach to the semantics of structured documents, based on situation semantics [Dev91], [Bar89] and the related logic of information flow [BS97].

For the benefit of readers whi have not met situation semantics before, here is a brief introduction to the background concepts of regularity and information (as used in this context - please note that this is quite different, both in content and in motivation, from the formal concept of information as used in information theory, eg in signal processing), followed by some examples.

Regularities and information flow

The term regularity refers to an observed pattern in the world. Situation theory is based on a philosophical viewpoint which acknowledges explicitly that each individual has a different view of the world, called a scheme of individuation, or SOI. Within these SOI, regularities occur - and these regularities are a formalization of a common human experience, of recognizing things that you perceive. This recognisiton is not so much of objects, as of situations, for example, when recognize a rugby match (even though I don't know all the rules), or a situation where I can cross a road safely. This approach has its origins in work on natural language semantics, concentrating on how it conveyed information rather than on grammatical analysis, which is well explained in [Dev91]. For the purposes of this paper, the main point to grasp is that in this framework, information flows by virtue of regularities in the world-as-perceived, together with the constraints perceived within and between these regularities. An item of information in this framework is callen an infon. Here is an example, taken from the author's earlier work applying situation semantics to communications from computers to humans[Wri95].

The information conveyed by an alarm clock

Consider an alarm clock, which has a face displaying the time, and can be set to ring a bell when the hands show some particular time, say 6-30am. When I wake and look at the clock, and see that it is 6am, then the clock has successfully conveyed to me the information that it is 6am. When the bell rings, then the clock has successfully conveyed to me the information that it is time to get up, or more precisely that the time which I decided last night was the time I should get up this morning, has arrived.

This meaning for the time on the clock face is there because of a link (a constraint) between two types of situations: S0, where the clock shows a time of 6am, and S1, it being actually 6am. Because I am aware of that constraint (I have learned to tell the time) the clock can tell me it is 6am. (The clock face cannot communicate successfully to my young son, since he has not yet learned to tell the time - even though he can describe to me exactly where the hands are on a clock.) Similarly, the meaning for the alarm is there because of a link between a situation of type S2 where the alarm goes off, and a situation of type S3, of its being time to get up. (Interestingly, although the meaning of the alarm could be seen as more variable, since it will vary according to whether I have an early start that day, it has more chance of communicating to my son, since it has a clear and uniform connotation that I am about to get up.)

What I want to examine here is the constraint which links S1 and S0. It has some connexion with how the clock works, since if the clock is not working correctly, eg is running fast, then the information that it is 6am will be misinformation - perhaps it is actually 5-45am. Yet the constraint is not dependent on the detailed working of that actual clock, for I could replace it with another clock with a different kind of mechanism (eg electric instead of clockwork), and the communication I am talking about would not be significantly different. Indeed, any of a number of quite different kinds of clock would do, with the form of the communication varying: a dial; a numeric display; a synthesized voice saying `it is 6am'. All these could be part of situations of type S0, and all can have the required link with S1.

Perhaps more can be seen about this constraint by seeing how the communication can fail. One example has already been mentioned---the clock may be wrong. It is still telling the time, but not doing it right. For example, the clock may be running fast, it may have been set ten minutes fast, or it may show Tokyo time when I am in England. Or I may be interpreting as a clock something which is not a clock at all, eg reading a dial as a clock when it is a barometer, or reading a numeric display as time of day when it is showing day of month. In the case of a clock being set wrong, or showing Tokyo time, I can still use it to provide me with the same information if I change the constraint so that eg a clock showing 6am, which I know to be 15mins fast, conveys to me that it is 5-45am. Similarly, if I pass a clock showing 6-30am Tokyo time, and it has a sign by it saying `Tokyo', then it can at least convey that it is half-past some hour, and if I can remember how many hours difference there is, I can work out the full local time. The other cases are hopeless. (A stopped clock belongs to this last group - a clock stopped at 6am is not actually telling me the time at all, even if it happens to be 6am when I look at it.)

Ascribing meaning to document structure

The previous section emphasized the central role of regularities - similarities and predictable relationships between situations - in gaining meaning from perceptions. A deeper analysis of these regularities and their role in determining available channels for information flow is found in [BS97]. Their motivating example (in their Introduction) concerns the use of a simple device (a flashing torch) to communicate a complex situation (someone needs rescuing) from one human to another; the key regularities which enable the flow of information by means of the flashing torch in their analysis of that example are very similar to those in the alarm clock example discussed above.

Following the line of argument in [BS97], the existence of an information channel depends on there being a classification, consisting of two parts: a set of objects to be classified, called tokens; another set of objects used to classify the tokens, called types; and a binary relation that tells which tokens are classified as being of which types. This classification is a formal analogue of the informally recognized regularities in human to human communication, and translates into SGML/XML terminology very naturally as the classification of document instances by document type, as defined by a DTD, or the classification of elements by element type, as defined in an XML Schema.

Furthermore, there is a concept of constraints supported by a classification, together with combination of constraints. This corresponds naturally to the partial specification of document structure found in XML Schema, and to using a repertoire of schema-defined element types within a range of well-formed XML documents. This is developed in the rest of this paper in the context of XML messages within a distributed or federated information system. First, however, there is a brief description of a system scenario for the semantic analysis which follows.

Meaningful communication using structured documents

XML documents are becoming widely used for data integration in heterogeneous systems. Typically, information held in one system component is represented in XML, sent to another system component, then unpicked by the receiver (using a DTD or schema to understand its structure) then mapped into the receiver's own information space. It is becoming widely recognized that this works on two levels; the mapping of the XML can be mechanical, but the information is only meaningful if there is a full common understanding of the significance of the information represented by the XML on the part of the sender and receiver. One way of achieving this common understanding within a large organization is to have an agreed catalogue of structured information items - a data dictionary or a data standards catalogue.

Consider the following scenario (based on the architecture of a real project). A data dictionary is used as a central resource by a number of departments who pass data both to each other, and to and from human users via several portals. The data dictionary itself is a structured document; its main function is to provide human readable information about the content and significance of each data type. This document is held in XML; each entry is an element validated with respect to an XML schema. Each entry also contains one or more examples of the data item, and these are validated using the same schema as is used for live data. Each entry in the data dictionary also serves as a subject identity for a corresponding topic in a topic map. In addition to the data items required for the actual system communications, the data dictionary includes an upper ontology, and a complementary "occurs-as" hierarchy recording where simpler datatypes occur as components of more complex structures (eg as when a positive integer occurs as a day value in a date). Finally, each data exchange between departments, and between departments and portals, uses one of a set of predefined messages; these messages are designed with reference to the data dictionary, typically as a sequence of elements of types defined in the dictionary. These message definitions are stored as XML Schemas, and each one of the message schemas is indexed by the data dictionary topic map as being an occurrence of each entry it uses from the data dictionary. To keep the indexing manageable, only the uppermost level of complex data type within a message is indexed in this way; the relationship between components within the data dictionary is held in the occurs-in index within the dictionary. The last part of this interlocking structure of structures is that the element type declarations within the data dictionary schema document are indexed by the topic map as occurrences of the corresponding data dictionary entries.

Modelling information flow using structured documents

This section is based on Lecture 2 of [BS97], using examples from the scenario above to show how this model works with structured documents.

The two underlying principles of this approach to explaining information flow are Principle 1: information flow results from regularities of distributed systems; and Principle 2: information flow crucially involves both types and their particulars. (These are motivated by a desire to unify and disambiguate sundry philosophical approaches to information flow; this is explained in Lecture 1 of [BS97].) In XML and Internet terms, it is often not clear where to put the boundary between types and particulars - for example, a webpage which is HTML document is a particular, with the HTML DTD its type; yet because of the way browsers work, everyone who looks at that HTML document is making a copy, which is in turn a particular with the originating webpage as its type. What counts as types and particulars is going to depend on the viewpoint being taken on a system under discussion. This is not a problem - the whole theory is grounded in this kind of relativity of viewpoints, as discused briefly above.

The terminology used is types and tokens, where token is a general term for instances or particulars which can be classified by type (there is no built-in notion of type hierarchy or other type theory). The two key questions about information flow are: What information flows through the system? and: Why does it flow?. The answer to the first question is characterized as a local logic, the second as an information channel. The second of these notions is explained first.

Information channel

The first step in defining an information channel is a precise concept of classification:

[Link to open this graphic in a separate page]

Within our scenario, relevant classifications include the classification of system messages according to their message type, and the classification of message components (XML document elements) according to their element type. These types are reified by XML Schemas. (Note to topic map enthusiasts: I am here claiming back a generic meaning of "reify", as applying to any well founded representation of an abstraction within the real world.)

In order to form part of an information channel, classifications need to be structurally related. The precise notion used is that of an infomorphism (for computer scientists, this is a Chu transformation):

[Link to open this graphic in a separate page]

Much is known about these structures; of particular interest here is that they can be added together by means of a colimit construction:

[Link to open this graphic in a separate page]

In the scenario above, this addition corresponds to building a message out of two or more components. For example, a message may be sent from a portal P through an access gateway, which is made up of two messages intended for different departments, Q and R, wrapped in an envelope which is understood by the gateway, and thus enables the gateway to authenticate the message, and send its two internal parts onward to their intended destinations. The significance of the colimit result is that it represents formally the fact that given an appropriate set of schemas describing this message and its parts, and given a schema-validated message, there is only one way in which the gateway will understand the message it receives, i.e. the information will get through as intended, and will not be chopped up in an unpredictable way.

The next step takes us to the full notion of an information channel:

[Link to open this graphic in a separate page]

This significance of this structure is summed up in Principle 3: It is by virtue of regularities among connections that information about some components of a distributed system carries information about other components. In the case of the XML mediated information flow we are discussing here, these connections are the correspondences in structure and content between (all or part of) XML documents held in different parts of the system. This corresponds with experience in XML-mediated data integration; all parts of the system need to have sufficient knowledge of the XML document types or element types being used, if the information flow using XML is to work as intended.

Local logic

Now, we need to account for what the information is that flows using such an information channel. Let's go back to the example of the multipart message passing through a gateway. Say that the content of the information flow from the portal P to department Q is being conveyed by the first of the two messages inside the envelope. Also, that this kind of message may be sent by the portal in combination with quite a few other messages to other departments. Looking at all these instances of information flow from P to Q, the key thing that determines whether the information flow works, is the relationship between P's use of that sub-message, and Q's use of it. For example, it could be a message giving a change of address, with the old address followed within the message by the new address. Both P and Q must understand that the first address is indeed the old address, and the second is indeed the new address - i.e. both must use the message properly - or the information will not flow as intended. In formal terms, key aspects of P's theory about the message structure must correspond to Q's theory about the message structure. A local logic is a formal model of this notion of there being just enough common knowledge at each end of the information flow for it to work properly. The local logic is a part of the full theory of the portal P's classification of all the messages it sends; it is also a part of the full theory of the department Q's classification of the messages it receives. (More accurately, there is a faithful translation between these two part-theories.)

The explicit representation of a local logic in XML technology can be done many ways; it often takes different forms at each end of the information flow, with the mapping between them depending on human to human communication. In the simplest case, the natural language meaning of the tags used in the XML serves the purpose; these are understood by a programmer, then this knowledge is used to write a simple processor for the message (I know what to do with that: it's a postcode). However, the shortcomings of this simple approach are well known, hence the development of additional mechanisms such as schema adjuncts [BS97], which can be understood as capturing more of this shared theory, so as to make the success of the information flow less dependent on other forms of shared knowledge.

One final aspect of this information flow theory is particularly intriguing. Because of the dependency of information channels and local logics not just on the characteristics of the individual parts of a system, but also on their classification for a particular purpose, and on their interrelationships, the emergent properties of the system may be able to be modelled and verified within them (emergent properteis are properties of a system as a whole which are not manifested in its parts). In particular, in the scenario discussed, a characterization of the information channel and its relationships to the local logics in the portal and in the departments, could potentially be used to verify the data protection policy of the gateway, whereby particular kinds of information are only sent to departments authorized to receive them.

Topic Maps and Information Flow

The account of information flow by means of structured documents in the previous section applies to topic maps, as a kind of structured document. It can also be applied to topic maps as an abstract structure in their own right (taking a different view of what consitutes the information channel). Analogous to the multipart XML message serving as an information channel as discussed above, a merged topic map is able to serve as an information channel for conveying information from either of the original topic maps - and the characteristics of topic map merging, amongast other things, provide constraints on the information which is able to be conveyed by means of such a channel.

In addition to merges, it will soon be possible to perform queries on topic maps. It is worth investigating the relationship between a topic map and the result of a query on that topic map, to see how the result is capable of carrying information about the original topic map, as intended. An analysis of intended query scenarios, along the lines of the theory outlined above, is strongly recommended to the TMQL editors as a validation exercise on their detailed query concepts - especialy in relation to the intended effect of scope within both the base topic map and the query result.

Even without addressing the more difficult subjects of scope and merging, semantic clarity is required on more basic properties of topic map queries, and hopefully this brief discussion will contribute to that.

Properties for elementary query operations

The information held in a topic map before and after the execution of a query operation are the key to gaining a clear understanding of the basic operations of read/query, add, and remove. Preconditions and postconditions for these operations are assertions about the information held in the topic map. The practical utility of this approach has been demonstrated by the implementation of Prolog-like querying [Gar01], but as discussed in that paper, the formulation of queries needs further investigation. (An important aspect of this is that the translation of query formulations, especially complex, nested query expressions, into effective searches on a regular information base, is computationally very sensitive to the exact expressiveness and clausal structure of the queries. This has been intensively investigated in computational logic studies, where Description Logics [CGLN99]have gained considerable support in query modelling and ontolgy analysis.)

It is assumed in what follows that the result of these operations is itself a topic map, and that the additional information within a query beyond the operator itself takes the form of a topic map.

Read/query

The information held in a topicmap is not changed by a read-query. Because of that, a read query can convey information, since what is true of the topic map after the query is known to be true before the query. (This is a simplification with respect to real environments, of course, where transaction management is required to maintain this assumption over known intervals.) Running a read-query of this type is a means of finding out about the information in a topic map; it amounts to testing or verifying an assertion about the topic map, and is called below verifying an assertion of the topic map.

Querying using a completely specified structure leads to a yes/no answer from a topic map about the information it holds. If the answer is yes - and this is especially important if the same query gets the answer yes from a number of different remote topic maps - then the content of the query serves as a local logic within which information flow between those topic maps has known properties. In contrast, getting unexpected information out of a topic map depends on querying using incompletely specified structures, and getting a collection of eligible results from each topic map queried. This is related to inferences where the conclusion is a set not a singleton, and these have different computational properties from single-result inferences. However, it is still possible to regard the query as a local logic across an information channel generated from the result set.

Add information

This is envisaged as an operation which takes a topic map before the add operation, tm1, and has as its result a different topic map, tm2. tm2 verifies all assertions verified by tm1. In addition, tm2 verifies some assertion which has an appropriate relationship to the information just added. Add should be the same or very closely related to topic map merging. Removing what has just been added will not in general get back to the original tm1 (because of the possible effect of topic identity conditions on calculating tm2). However, there should be only one possible result of the addition, that is, it should have the colimit properties discussed above in teh context of XML messages. It is worth noting that it is only of it does have those properties that a merged topic map is capable of acting as an information channel in the sense discussed in this paper.

Remove information

This is envisaged as an operation which takes a topic map tm1; and has as its result a different topic map, tm2. tm1 verifies all assertions verified by tm2. In addition, tm1 verifies some assertion, not verified by tm2, which has an appropriate relationship to the information just removed. Adding what has just been removed gets back to the original tm1.

Removing information from a topic map is notionally done by computing a partition of a topic map tm1 into two new topic maps, tm2 and tm3, which when merged yield the original topic map tm1. In addition, tm3 does, and tm2 does not, contain the information removed, and is in some sense the smallest topic map for which this is the case. tm2 is the result of the removal operation. A potentially problematic further condition is that no other information is lost. In general, it might be necessary to duplicate topics between tm2 and tm3, allocating some correponding topic characteristics from tm1 to both, and some to one but not the other.

Removal is a problematic concept, but is not to be confused with negating information which is obtained by a human or machine agent using the topic map. An additional algorithm on the topic map would be needed for that, providing inferencing using the content of topic map subjects and occurrences in addition to the structural operations on the topic map itself which are discussed in this paper.

Future directions

There are two promising further lines of work, apart from the ongoing work on clarifying the nature and semantics of topic map queries. One is concerned with the human side of the human computer interaction discussed above; the other extends the formal modelling.

Structure and genre

Humans using information depend heavily on implied social context, even when using computers [NST94]; in addition, the author's experience teaching study skills suggests that many of the problems faced by individuals using unfamiliar web based information sources are due to not being able to recognize how to use what they see. These concepts are closely related to the concept of a genre, as phenomenon and as social construct - when you meet a detective story, you probably know how to use it, what to expect, even how to act within it (eg in roleplaying games or at a murder mystery dinner party). In relation to the perspective developed in this paper, a genre can be understood as a local logic which supports effective use of a type of information resource. Further investigation of this area with analyses of human machine communication using situation theory look promising, with strong links emerging between the multiple structures within documents evident in much textual analysis, and expressed in SGML though TEI, and new analyses of genre emerging in the field of human computer interaction.

Modelling information flow as mediated by structured information

The work reported here has done no more than scratch the surface of what is a very promising approach to the semantics of structured documents as used in real systems. One promising line of enquiry being pursued by the author (as time permits!) is to use Labelled Deductive Systems [Gab96]. The comprehensive analysis and integrated presentation of different kinds of logics provided in Gabbay's work shows great promise for integrating the different viewpoints outlined above into a coherent understanding of information representation and retrieval using structured documents in distributed systems. The author would be happy to find academic collaborators for this work.


Acknowledgments

To my collaborator Dr Janet Finlay regarding our work in progress on genre and structure; to my colleagues on SC34 and in the topic map community for many constructive disagreements which have increased my understanding.


Bibliography

[Bar89] Barwise, The Situation in Logic, CSLI publications, Stanford University, 1989

[BS97] Barwise and Seligman, The Logic of Information Flow, Cambridge tracts in theoretical computer science 44, Cambridge UP 1997

[CGLN99] Calvanese, Giacomo, Lenzerini and Nardi, Reasoning in Expressive Description Logics, in ed Robinson and Vornokov, Handbook of Automated Reasoning, Elsevier Science, 1999

[Dev91] Devlin, Logic and Information, Cambridge UP 1991

[Gab96] Gabbay, Labelled Deductive Systems, Oxford Logic Guides 43, OUP 1996

[Gar01] Garshol, tolog, a Topic Map Query Language, XML Europe 2001, GCA

[NST94] Nass, Stever and Tauber, Computers are Social Actors, Proc. CHI '94

[Ren00] Renear, The Descriptive/Procedural Distinction is Flawed, Extreme Markup Languages 2000

[SHR00] Sperberg-McQueen, Huitfeldt and Renear, Meaning and Interpretation of Markup, Extreme Markup Languages 2000

[VRB00] Vorthmann, Robie and Buck, Schema Adjunct Framework, http://www.extensibility.com/resources/saf.htm

[Wri95] Wrightson, Situation Semantics, tutorial presented at BCS-FACS Xmas workshop 1995 (available from the author)



Some Semantics for Structured Documents, Topic Maps and Topic Map Queries

Ann Wrightson [alphaXML Ltd]