The gap between structured and unstructured information needs a bridge.

Michel Biezunski
mb@coolheads.com

Abstract

A new bridge is under construction: the one that connects structured information with unstructured information. The unavoidable need for finding whatever information is available on subjects of interest leads to the requirement to simultaneously access information regardless of the way it has been created and is maintained. Since Topic Maps claim to be able to address these two universes, it is no coincidence that the debates that are currently agitating this community fall on both sides of the rift. The purpose of this paper is to establish that the best way to avoid falling over the precipice is to recognize that it exists. Building a bridge might very well be the answer.

Keywords: Topic Maps; Information Architecture

Michel Biezunski

Michel Biezunski works as a consultant for Coolheads Consulting. Together with Steven R. Newcomb, he initiated the Topic Maps paradigm. He is the co-editor of ISO/IEC 13250 Topic Maps and was a founding Chair of TopicMaps.Org, the host of the XTM (XML Topic Maps) Specification. He is working to merge knowledge-based approaches with information management systems, both by designing custom applications and by fostering the development of new standards for the Web.

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 Michel Biezunski. Reproduced with permission.

Introduction

A new bridge is under construction: one that connects the universe of structured information with the universe of unstructured information. The unavoidable need for finding whatever information is available on subjects of interest leads to a requirement to simultaneously access information regardless of the way it has been created and is maintained.

Since Topic Maps claim to be able to address both universes, it is no coincidence that the debates that are currently agitating this community fall on both sides of the rift. The purpose of this paper is to establish that the best way to avoid falling into the precipice is to recognize that it exists. Building a bridge might very well be the answer.

1. Topic Maps is a paradigm that is built on a structured markup (SGML/XML) way to organize information. This has nothing to do with the fact that a topic map is interchanged as a marked up document, but with the fact that each topic map can (optionally) define its own vocabulary, including topic types, association types, occurrence types, doctrines of scope, etc. These application templates affect what topic map applications will be able to process. There are advantages in defining ways to manipulate certain regular patterns (such as subclass-superclass associations). The application designers necessarily have to decide how to implement these aspects of topic maps. Some of them are generic enough to give rise to some application standardization, around the layer of the core standard. The current projects to develop a constraint language as well as a query language specific to topic maps clearly illustrate a desire to go further in this direction.

2. There are other ways to build and maintain topic maps than manipulating structure. Searches based on natural language techniques also provide ways to build networks of semantically interconnected information items. Such systems contain algorithms that are able to turn language into meaningful, interconnected topics. The way information is processed internally can be quite complex (and proprietary), and standardizing it might be trickier than in the former case.

Topic Maps have been designed to provide an application-neutral snapshot of the state of the connective tissue within and between information repositories.

Before SGML was invented and XML became widely adopted, document-based and database-based applications were considered two different universes. That they are now viewed as two faces of the same coin results from the work done to enhance schema and validation capabilities, work that took place in a very heated atmosphere. The attempts to overcome the "great divide" between the structured universe and the unstructured universe are provoking similarly passionate discussions, which will probably last for some time.

The Divide in Knowledge Management

There are two very distinct approaches to the problem of knowledge management:

(1) One approach requires the construction of a "closed universe" in which the design controls all semantic aspects of all of the information to be managed. All information is forced to conform to some semantic (or "logical") model. Anything that doesn't conform to the model is either lost, or it is converted in some way so that it can conform. The designs of all information assets are subservient to the specific purposes for which they are being maintained.

Without careful information design and semantic consistency within instances of that design, it is impossible to support logical operations. Without a design, there is no basis for logic, and knowledge that is not amenable to logical operations is not usually very useful. Many database queries, for example, would be insupportable in the absence of an underlying set of precise semantic expectations regarding the significance of each structural component.

Such a semantic universe is "closed" because it does not provide for the existence of any other universe. Other models for the support of logical operations are irrelevant. Information that conforms to other models cannot be accommodated.

(2) The other approach (here called an "open universe of semantic universes") provides a substrate on which instances of knowledge assets can directly participate in one another, to whatever limited extents their various semantic models may permit.

The Reference Model in preparation for Topic Maps captures what's going on behind the scenes: assertions connect nodes (which are mere binding points), and the "properties" of each node are attached through assertions. Naming, for example, is an assertion. The assignment of a subject to a topic is another assertion. Typing a topic is an assertion. Giving a topic an occurrence is an assertion. Connecting two or more topics is an assertion. To summarize, everything gets connected to everything else through a simple, unique, well-defined, completely generic mechanism.
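As a rough illustration of the idea that every property is attached through an assertion, the following Python sketch treats nodes as bare binding points and records naming, typing, and occurrences through one uniform mechanism. The class names, role labels, and assertion kinds are hypothetical; they are not taken from the Reference Model draft.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)


@dataclass(frozen=True)
class Node:
    """A mere binding point: it carries an identity and nothing else."""
    ident: int = field(default_factory=lambda: next(_ids))


@dataclass(frozen=True)
class Assertion:
    """One connection among nodes, each node playing a labelled role."""
    kind: str        # hypothetical label such as "naming" or "typing"
    players: tuple   # tuple of (role, Node) pairs


class Graph:
    """Holds assertions only; nodes have no properties of their own."""

    def __init__(self):
        self.assertions = []

    def assert_(self, kind, **players):
        assertion = Assertion(kind, tuple(players.items()))
        self.assertions.append(assertion)
        return assertion

    def about(self, node):
        """Every assertion in which the given node plays some role."""
        return [a for a in self.assertions
                if any(n == node for _, n in a.players)]


g = Graph()
opera, tosca, name, page = Node(), Node(), Node(), Node()
g.assert_("naming", topic=tosca, name=name)
g.assert_("typing", instance=tosca, type=opera)
g.assert_("occurrence", topic=tosca, resource=page)

# Everything known about the "tosca" node is reached through assertions.
kinds = sorted(a.kind for a in g.about(tosca))
```

In this sketch a name, a type, and an occurrence are not fields of a topic object; each is discovered by asking which assertions a node participates in.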

The two approaches are entirely compatible with each other, and they are both needed.

However, in our discussions about Topic Maps, greater awareness of the distinctions between the two approaches is needed. Communications about Topic Maps have suffered from misunderstandings that have arisen from incorrect assumptions about which of the two perspectives was the context within which the discussion was taking place.

Specifically, in design discussions about the Standard Application, it is important to establish whether:

(1) the discussion is taking place within the closed semantic universe that much of the Standard Application Model of Topic Maps is intended to define, or

(2) the discussion is taking place within the open, Application-neutral semantic universe defined by the Reference Model, or

(3) the discussion is taking place within the hybrid semantic universe that is implicit in the XTM and HyTM interchange syntaxes. (These are both "hybrids" because their implicit model is partly closed [names, occurrences] and partly open [user-defined associations]. In fact, their universe is the Reference Model's open universe, and it contains one closed universe: the Standard Application Model's pre-defined semantics.)

Changing Boundaries

Communities of interest define their own vocabularies, ontologies, and taxonomies, which help their members to understand what they are talking about. Being a member of such a community basically amounts to understanding the common lore shared by its members. A corporation is an example of such a community. This model works fine as long as the boundaries are well established. What happens if members from different communities need to exchange information, in different languages, using different vocabularies, etc.? The world in which we live is characterized by ever-changing boundaries between communities, countries, and people, and it is necessary to adjust to that reality to be able to access any vocabulary or ontology.

In Syntax Land, a parallel can be drawn with the ability of XML to be used by parsers which have no validating power and are basically limited to separating structure from content using the markup. In Topic Maps, the mere ability to access any given semantic through navigation is a strong point. Non-validating XML cannot be used to determine whether a particular document instance "makes sense" in the context of a particular application. Similarly, "open" topic maps can be used to provide access to information without providing the ability to process this information according to rules that have been defined for given, well-defined, controlled environments.
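The parallel can be made concrete with a small Python sketch that uses the standard library's `xml.etree.ElementTree`, which parses well-formed XML without validating it. The sample document, the `ALLOWED_IN_MEMO` rule, and the function names are invented for illustration; the "makes sense" check stands in for whatever rules a controlled application might define.

```python
import xml.etree.ElementTree as ET

# Invented sample: well-formed, but <mood> is unknown to the memo application.
DOC = "<memo><to>Ann</to><body>Lunch?</body><mood>hungry</mood></memo>"


def structure_and_content(xml_text):
    """Non-validating pass: list (element path, text) for every element."""
    root = ET.fromstring(xml_text)
    out = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        out.append((here, (elem.text or "").strip()))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return out


# Hypothetical application rule: a memo may contain only these children.
ALLOWED_IN_MEMO = {"to", "body"}


def makes_sense_as_memo(xml_text):
    """Application-level check, separate from well-formedness parsing."""
    root = ET.fromstring(xml_text)
    return root.tag == "memo" and all(c.tag in ALLOWED_IN_MEMO for c in root)


pairs = structure_and_content(DOC)   # access succeeds for every element
valid = makes_sense_as_memo(DOC)     # the application rule rejects <mood>
```

The first function gives access to everything in the document, including the element the application does not know about; only the second function can say whether the document fits a particular controlled environment.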

Structure versus Discovery

Automatic inferencing, resulting from a predefined set of rules, is a nice tool to have. But it basically returns something that we already know, i.e., something that somebody else has been kind enough to provide for us, either directly or indirectly.
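A minimal sketch of this kind of rule-based inferencing, assuming a single predefined rule (the subclass-superclass relation is transitive) and invented class names:

```python
def superclasses(direct, cls):
    """All superclasses of cls implied by transitivity of the direct pairs."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for sub, sup in direct:
            if sub == current and sup not in found:
                found.add(sup)
                stack.append(sup)
    return found


# Hypothetical (subclass, superclass) pairs supplied by some author.
direct = {
    ("opera", "musical work"),
    ("musical work", "work"),
    ("work", "thing"),
}

inferred = superclasses(direct, "opera")
```

Every member of `inferred` was already entailed by the pairs that somebody supplied together with the rule; the inference makes it explicit but does not add anything new.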

In fact, too much structure might prevent discovery from happening, because only what is known to conform to a given structure will be retrieved. While it is possible to use and design Topic Map systems to perform this kind of task, it is also possible to use plain XML or databases (or a combination thereof) to perform the same operations. From this perspective, Topic Maps merely provide a common platform to facilitate interchange, but interchange could have been done in another way. Topic Maps is a vitamin, not an exclusive painkiller.

Discovering information which was not known until now is different. Topic Maps (as well as RDF, although differently) enable discovery of information by enabling connections from one information node to another by means of assertions. At each node of the topic map graph, it is possible to access the neighboring nodes and dynamically get to know them by collecting all nodes associated with each of them. This decentralized navigation system has at least two advantages.

First, it is scalable (virtually infinitely). Second, it can connect a variety of diverse ontologies that do not necessarily have to have been designed knowing about each other and that can just coexist without even knowing that the others exist.
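The node-by-node navigation described above can be sketched in Python as follows. The two small maps, the subject names, and the function names are invented, and assertions are simplified to tuples of node labels. Treating equal strings as the same node is an assumption made for brevity; it stands in for whatever merging on subject identity a real system would perform.

```python
from collections import defaultdict, deque


def build_index(assertions):
    """Map each node to the set of nodes it shares an assertion with."""
    index = defaultdict(set)
    for members in assertions:
        for node in members:
            index[node].update(m for m in members if m != node)
    return index


def discover(index, start, max_hops):
    """Breadth-first walk: {node: hops} for nodes within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in index.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


# Two maps authored independently; they only happen to share "Rome".
music_map = [("Puccini", "Tosca"), ("Tosca", "Rome")]
travel_map = [("Rome", "Italy"), ("Rome", "Colosseum")]

index = build_index(music_map + travel_map)
reach = discover(index, "Puccini", max_hops=3)
```

Starting from a node in the music map, the walk reaches nodes that exist only in the travel map, although neither map was designed with the other in mind. Each step looks only at the immediate neighbors of one node, so no global schema is consulted.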

Topic Maps actually show that there is no contradiction between the "structured" approach and the "unstructured" approach, because they eventually resolve to the same underlying representation. Therefore it looks likely that Topic Maps will have a role to play in filling the gap between the structured world and the unstructured world, provided they are not strictly interpreted as belonging exclusively to the structured world, which is one of the underlying subjects of the current discussions.

The more closed topic maps are, the more powerful and useful they are -- in a given environment -- and the more familiar they look to the XML community. The more open topic maps are, the "stranger" they look -- in comparison with other existing technologies -- but the more scalable they are, and the more useful they can be for discovery purposes. Natural language processing techniques can be used in combination with structured data manipulation, to lead to unheard-of, strange-looking kinds of applications. Such topic maps necessitate a different implementation design than traditional XML-based applications. The fact that they can also be used just as easily within a strictly well-defined, well-understood XML environment is a plus, because it will facilitate the transition.

