Making Elements from Arbitrary Sections: A practical application of XML Topic Maps

Paul Caton
Morris Hirsch

Abstract

Recent markup-related standardised technologies--XLink, XPointer, DOM Level 2, and XSL/XSLT--allow us to locate, retrieve, and transform an arbitrary section of content from an XML-encoded text and present this content to the user as a new, well-formed XML document. This ability will be especially useful for exploiting the full potential of XML topic maps.

Keywords: Topic Maps; XPointer; DOM; SAX; XSLT

Paul Caton

Paul Caton is a Project Analyst at the Scholarly Technology Group. He works on projects involving XSL/XSLT, CSS, or DynaWeb. He is also the Electronic Publications Editor for the Women Writers Project.

Morris Hirsch

Morris Hirsch is Lead Research Programmer/Analyst at the Scholarly Technology Group. He provides support for STG projects in the areas of web interaction and programming.

Paul Caton [Brown University, Scholarly Technology Group]
Morris Hirsch [Brown University, Scholarly Technology Group]

Extreme Markup Languages 2001® (Montréal, Québec)

Copyright © 2001 Paul Caton and Morris Hirsch. Reproduced with permission.

Introduction

The widespread acceptance of XML as a standard for the structural encoding of electronic information has generated a slew of related standards and proposed standards for ways of exploiting XML-encoded resources: linking to, pointing to, navigating, searching, transforming, displaying them, and so on. These enabling markup technologies operate within a common conceptual paradigm: the node tree of well-formed XML. Electronic information that is wholly or partially outside that paradigm becomes relatively impoverished because it is less easily exploitable. There is therefore a strong incentive to find ways of bringing this "disenfranchised" information into the XML node tree paradigm. Here we describe a method for taking an arbitrary section (that crosses element boundaries) of an XML-encoded resource and transforming the character data it contains into a node of a new XML document. This application generalises to many contexts, but we believe it particularly appropriate and valuable to exploiting XML topic maps. We first describe the basic problem and our solution; we then situate our work in the context of online research and explain the synergy between what our application does and what topic maps offer as information resources.

Creating alternate node hierarchies from arbitrary sections

XML topic maps can link a topic to an arbitrary point in, or section of, a well-formed XML resource. They can do this because the <resourceRef> element has the attribute "xlink:href=". The value of "xlink:href=" is a URI which can end in a fragment identifier in XPointer format. The fragment can start from a point inside one node and end at a point inside a different node. It can thus include whole and partial nodes of the resource's original node-tree [Note 1].
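As a minimal sketch of how such a reference breaks down, the following Java fragment splits an "xlink:href=" value into its resource part and its XPointer fragment. The class name HrefParts and the sample pointer (which assumes the resource carries the IDs "a" and "b") are hypothetical and serve only as illustration; resolving the XPointer expression itself is not attempted here.

```java
/** Sketch: split an xlink:href value into resource and XPointer fragment. */
final class HrefParts {
    final String resource;   // e.g. "foo.xml"
    final String fragment;   // e.g. "xpointer(...)", or null when no '#' is present

    HrefParts(String resource, String fragment) {
        this.resource = resource;
        this.fragment = fragment;
    }

    /** Everything before the first '#' names the resource; the rest is the fragment. */
    static HrefParts parse(String href) {
        int hash = href.indexOf('#');
        if (hash < 0) {
            return new HrefParts(href, null);
        }
        return new HrefParts(href.substring(0, hash), href.substring(hash + 1));
    }

    public static void main(String[] args) {
        // Hypothetical href: a range from the element with ID "a" to the one with ID "b".
        HrefParts p = parse("foo.xml#xpointer(id('a')/range-to(id('b')))");
        System.out.println(p.resource);
        System.out.println(p.fragment);
    }
}
```

The resource part tells an application which document to fetch; the fragment part is what an XPointer-aware processor would then have to interpret as a range.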

While such a fragment does not exist as an element in the original encoded resource, it could be an element in an alternate structural view of that resource. Suppose we have a resource called foo.xml encoded as a <document> with three <p> children (Figure 1).

Figure 1: Document with three children (foo.xml)

Now suppose a user is researching topic X, and finds in a topic map occurrences of X which are identified as ranges A-B and C-D in the resource foo.xml (Figure 2).

Figure 2: Occurrences of X in foo.xml

Because the user's interest lies in instances of the topic rather than in the resource per se, she would like a view of foo.xml that foregrounds the occurrences of X. We could generate this new view using XSL/XSLT, but only if we pass in well-formed XML. Suppose, however, the ranges in question arbitrarily cut across the element structure as encoded in foo.xml. To give the user the view she wants we need to re-encode the text of foo.xml so that the new hierarchy consists at least of elements created from the specified ranges (Figure 3), and possibly also of other "unrelated" elements, to the extent that well-formedness is maintained.

Figure 3: New structure

Then we can process this new structure with XSL/XSLT stylesheets and create the new view (Figure 4).
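A minimal sketch of this last step, using the XSLT support bundled with the JDK (javax.xml.transform), might look as follows. The element name <occurrence> for the newly created nodes, the <view> and <item> output elements, and the stylesheet itself are assumptions made for illustration; the figures do not prescribe any particular names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Sketch: render a view that foregrounds the elements made from the ranges. */
final class OccurrenceView {
    // Hypothetical stylesheet: list every <occurrence> as an <item>, drop everything else.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<view><xsl:for-each select='//occurrence'>"
      + "<item><xsl:value-of select='.'/></item>"
      + "</xsl:for-each></view>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the re-encoded (well-formed) resource. */
    static String render(String restructuredXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(restructuredXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String restructured = "<document><p>aaa </p><occurrence>X one</occurrence>"
            + "<p> bbb </p><occurrence>X two</occurrence></document>";
        System.out.println(render(restructured));
    }
}
```

The point of the sketch is only that, once the ranges exist as elements, an ordinary stylesheet can select them by name like any other node.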

Figure 4: New view

Implementation

Possible implementations of a method for creating nodes from arbitrary sections range from the relatively simple to the very complicated depending upon the degree of relationship preserved between the section(s) and their original encoded context. The simplest case involves extracting the string-value of a range, making it the content of an element, and then adding an XML declaration and any necessary containing element to produce a well-formed resource. This represents zero-degree of preserved relationship. The highest degree would be reproducing the original resource with only the minimum changes necessary to accommodate the newly-created node(s). This could involve complex checking of inherited properties, for example, or encoding defaults set in a metadata section such as the TEI [Text Encoding Initiative] encoding scheme's <teiHeader>.
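The simplest case can be sketched in a few lines of Java using only the DOM and serialisation classes bundled with the JDK. Two assumptions are made for illustration: the range is taken to be already resolved to character offsets into the resource's string-value, and the names SimplestCase, rangeText, and wrap, along with the element names passed in, are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Sketch of the zero-degree case: range text in, small well-formed document out. */
final class SimplestCase {
    /** Characters [start, end) of the resource's string-value, ignoring all markup. */
    static String rangeText(String xml, int start, int end) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document src = b.parse(new InputSource(new StringReader(xml)));
            return src.getDocumentElement().getTextContent().substring(start, end);
        } catch (Exception e) {
            throw new IllegalStateException("could not read resource", e);
        }
    }

    /** Wraps the text in elementName inside rootName and serialises with an XML declaration. */
    static String wrap(String text, String rootName, String elementName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement(rootName);
            doc.appendChild(root);
            Element section = doc.createElement(elementName);
            section.appendChild(doc.createTextNode(text)); // the serialiser escapes & and <
            root.appendChild(section);
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build document", e);
        }
    }

    public static void main(String[] args) {
        String src = "<document><p>one two</p><p>three four</p><p>five</p></document>";
        // Offsets 4..12 run from inside the first <p> into the second.
        System.out.println(wrap(rangeText(src, 4, 12), "view", "occurrence"));
    }
}
```

Nothing of the original encoding survives in the result, which is what makes this the zero-degree case.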

We created a "proof-of-concept" implementation using dom4j (www.dom4j.org), an open source Java library for working with XML. Both the DOM (tree) and SAX (stream) models are supported; we use the DOM (tree) model. We present the document, as a file or string, to the dom4j parser, which returns a DOM object that offers a variety of navigation methods by which we locate the required text and, for a more complex implementation, any included tags.

For example, given the following structured resource in which a range has been specified:

Document
|-Abstract
|-Body
|--Chapter 1
|---ChapterIntro
|--Chapter 2
|---ChapterIntro
|---Section 2.1
|----SectionIntro
|---Section 2.2
|----SectionIntro
|----Div 2.2.1
|----Div 2.2.2 [contains start point]
|----Div 2.2.3
|---Section 2.3
|----SectionIntro
|--Chapter 3
|--Chapter 4
|---ChapterIntro
|---Section 4.1
|----SectionIntro
|----Div 4.1.1
|----Div 4.1.2 [contains end point]
|----Div 4.1.3
|---Section 4.2
|---Section 4.3
|--Chapter 5
|---ChapterIntro
|-Conclusion

the application follows these rules:
  • locate the nodes containing the start and end points, here Div 2.2.2 and Div 4.1.2.
  • locate their nearest common ancestor, here Body.
  • recursively explore the subtree starting at the common ancestor, here the Body and all of the Chapters below it, and their contents in turn, in the same order they are shown above, subject to the rules that follow.
  • until a visited node is an ancestor of the starting node, discard any contents, and do not recursively explore the subtree starting from it. (Here Chapter 1 and Section 2.1 would not be explored.)
  • when the node is an ancestor of the starting node, keep the node's start tag and attributes, but discard any contents.
  • until the node containing the starting point is visited, discard any contents. (Everything through Div 2.2.1 is discarded.)
  • when the node containing the starting point is visited, discard any contents before the starting point, and keep the rest. (Part of Div 2.2.2 is kept.)
  • between the starting node and the ending node, keep any contents. (Everything through Div 4.1.1 is kept, possibly including sections and divs not shown in this figure.)
  • when the node containing the ending point is visited, keep any contents up to the ending point, discard the rest, and quit without any further exploration of the tree. (Part of Div 4.1.2 is kept.)
  • provide any missing end tags.
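
The rules above can be sketched roughly as follows. This is an illustrative reconstruction using the W3C DOM classes bundled with the JDK rather than dom4j, and is not the proof-of-concept code itself. It assumes that the start and end points have already been resolved to character offsets into the resource's string-value, that the points fall inside text nodes, and that the start precedes the end; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Sketch of the tree-walk rules using the JDK's W3C DOM (not dom4j). */
final class RangeWalker {
    private final Text startNode, endNode;
    private final int startOffset, endOffset;
    private final Document out;
    private boolean started, finished;

    private RangeWalker(Text s, int so, Text e, int eo, Document out) {
        startNode = s; startOffset = so; endNode = e; endOffset = eo; this.out = out;
    }

    /** Range given as offsets [start, end) into the string-value (an assumption). */
    static String extract(String xml, int start, int end) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            Document src = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            int[] so = {start}, eo = {end};
            Text s = locate(src.getDocumentElement(), so);
            Text e = locate(src.getDocumentElement(), eo);
            Document out = f.newDocumentBuilder().newDocument();
            RangeWalker w = new RangeWalker(s, so[0], e, eo[0], out);
            out.appendChild(w.copy(commonAncestor(s, e)));
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            t.transform(new DOMSource(out), new StreamResult(sw));
            return sw.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("extraction failed", ex);
        }
    }

    /** Depth-first search for the text node holding the offset; left[0] becomes the local offset. */
    private static Text locate(Node n, int[] left) {
        if (n instanceof Text) {
            int len = ((Text) n).getLength();
            if (left[0] <= len) return (Text) n;
            left[0] -= len;
            return null;
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            Text t = locate(c, left);
            if (t != null) return t;
        }
        return null;
    }

    private static boolean isAncestor(Node anc, Node n) {
        for (Node p = n.getParentNode(); p != null; p = p.getParentNode())
            if (p == anc) return true;
        return false;
    }

    /** Walks up from a until reaching a node that also contains b. */
    private static Node commonAncestor(Node a, Node b) {
        for (Node x = a.getParentNode(); x != null; x = x.getParentNode())
            if (isAncestor(x, b)) return x;
        throw new IllegalArgumentException("no common ancestor");
    }

    /** Returns the kept copy of n, or null when n is discarded. */
    private Node copy(Node n) {
        if (finished) return null;                      // past the end point: keep nothing more
        if (n instanceof Text) {
            String s = ((Text) n).getData();
            int from = 0, to = s.length();
            if (n == startNode) { started = true; from = startOffset; }
            else if (!started) return null;             // before the start point: discard
            if (n == endNode) { finished = true; to = endOffset; }
            return out.createTextNode(s.substring(from, to));
        }
        if (!(n instanceof Element)) return started ? out.importNode(n, true) : null;
        if (!started && !isAncestor(n, startNode)) return null;  // do not explore this subtree
        Node shell = out.importNode(n, false);          // start tag and attributes only
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node kept = copy(c);
            if (kept != null) shell.appendChild(kept);
        }
        return shell;                                   // the DOM supplies the end tag
    }
}
```

For instance, given <body><ch n="1"><p>skip</p></ch><ch n="2"><div>alpha</div><div>bravo</div></ch><ch n="3"><p>charlie</p></ch><ch n="4"><div>delta</div><div>echo</div></ch></body> and the offsets 11 and 23 (inside "bravo" and "delta" respectively), the first chapter and the first div of the second are dropped, the third chapter is kept whole, and the walk stops inside the fourth. Because the result is built as a tree rather than as a string, the "missing end tags" of the last rule are supplied by the serialiser.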

We make no claim for this being an optimum implementation. There is, for example, an inherent inefficiency in extensive tree-walking, and working with an event-stream API might be more efficient, if there were a standardised means for the API to interpret XPointer expressions in terms of events [Note 2]. In this area of XML-related development the relevant standards are either very new or still in progress, and the same is true of the tools that implement them.
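To suggest what an event-stream version of the simplest case might look like, the following sketch uses the SAX parser bundled with the JDK to collect the string-value of a range without building a tree. It sidesteps the missing piece noted above by assuming the XPointer expression has already been reduced to character offsets into the string-value; that reduction, and the class name StreamingRange, are assumptions for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: collect characters [start, end) of the string-value from SAX events. */
final class StreamingRange extends DefaultHandler {
    private final int start, end;
    private int seen = 0;                              // characters reported so far
    private final StringBuilder kept = new StringBuilder();

    StreamingRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public void characters(char[] ch, int offset, int length) {
        // This chunk covers global positions [seen, seen + length); keep the overlap.
        int from = Math.max(start - seen, 0);
        int to = Math.min(end - seen, length);
        if (from < to) {
            kept.append(ch, offset + from, to - from);
        }
        seen += length;
    }

    static String extract(String xml, int start, int end) {
        try {
            StreamingRange handler = new StreamingRange(start, end);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.kept.toString();
        } catch (Exception e) {
            throw new IllegalStateException("streaming extraction failed", e);
        }
    }
}
```

The handler never holds more than the kept text in memory, which is the efficiency argument for the event-stream approach; retaining the enclosing markup as well would require tracking the open-element stack in startElement and endElement.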

The W3C's XML Fragment Interchange specification [XML Fragment Interchange, 2001], currently a Candidate Recommendation, states in Section 2 "[it is] explicitly noted that this Recommendation does not consider interchange of information that is not well-formed XML." It is therefore not relevant to the simplest level of our application, where our concern is chiefly to extract the relevant character string and make it the content of a new element. However, as we noted above, more complex cases would involve retaining as much as possible of the original encoding within (and possibly outside) the section, so that the transformation of the arbitrary sections would disrupt only those original elements that get bisected by the new elements [Note 3]. Integrating the XML Fragment Interchange fragment context specification notation with any implementation seems obviously desirable, once that specification becomes a standard.

Topic maps and intellectual appropriacy

Topic maps have generated interest in the markup community, though it is perhaps too early to say whether their future is assured [Note 4]. It might be argued that if the web does indeed become increasingly "semantic", full of structured self-describing information, then search engines will be sufficient to find, sort, and present links to exactly and only the information a user requests. However, this is only to say search engines should produce result sets that look very like ... topic maps. We believe topic maps important because they offer an intellectually appropriate architecture for mediated information. They represent a timely and natural development in a long tradition of knowledge "middleware" that includes library classification systems, encyclopaedias, journal articles, and annotated bibliographies.

In principle, topic maps do three things very well. Firstly, with respect to a topic they separate relevant from irrelevant information. Secondly, they sort and classify the relevant information so that users can go straight to a particular aspect of a topic. Thirdly--most significantly for our purposes--they can point very precisely to occurrences of the topic or sub-topic.

The importance of this last point cannot be overstated. Both in the abstract and the material, a topic as an information entity does not correspond to other information entities, including those we refer to by the terms "file," "document," "page," "book," "work," and "text." Typically the occurrences that materially instantiate a topic occur in more than one of these other entities, and even within a single entity they typically are not coextensive with that entity but only a part of it. Engaging with a topic as a whole means engaging with fragments of other information entities. The great value of topic maps lies precisely in the way they give formal and intellectual coherence to something which most current search engine result sets can only present as a list of instances with keyword-based rankings--not always an accurate indicator of the instance's relevance to the user's research needs.

The paradigm problem

An unfortunate consequence of this ontological integrity, however, is a paradigm boundary which the topic map's limited functionality does not bridge. Topic maps point the user to exactly the kinds of occurrences the user is interested in, but once the user leaves the topic map to visit the pointed-to resource, the relationship between topic map and occurrence ends. Now the user has moved into a different information entity paradigm, but the motivation for being there lies back in the topic paradigm. Ideally, then, the topic paradigm should penetrate that of the other information entity, allowing the user to engage the information in terms of the topic.

Occurrences can take many forms, such as photographs, text, music, animation, etc. For our purposes we are interested in textual occurrences; that is, character strings. Commonly, before anything else users want to see and read the occurrence(s). They may want to see the occurrence(s) plus a degree of context, or even see the whole entity. Additionally, they may want to do something with the occurrence(s): store it, add markup to it, annotate it, etc., all the time working within the paradigm of their research into the topic. However, the paradigm mismatch we noted earlier creates a problem. As many have pointed out, the structural concept of text that dominates contemporary text encoding commonly means that within a single resource only a relatively few kinds of information get encoded, and in the case of legacy texts the encoding has a strong tendency to follow the original's visual organisation [Note 5]. Rarely do we find encoded inline all the elements that represent multiple, perhaps orthogonal "analytic perspectives" (to borrow Renear, Mylonas, and Durand's term [Renear, Mylonas, and Durand 1996]); even more rarely are completely heterogeneous pieces of information encoded, especially if they span the deterministic visual boundaries.

Having the ability to access, transform, and display in a customised way any arbitrary sections that an XML topic map points to circumvents the paradigm and encoding mismatches and allows users to see heterogeneous information in a controlled, optimised manner within the XML topic map paradigm rather than that of the source information entity. It brings the topic-relevant information to the user, as should happen first in topic-based research. Users always have the option of (figuratively speaking) going out to the pointed-to resource if they wish.

Conclusion

Topic maps are not just switching stations taking users from one primary resource to another. As Steve Pepper points out, they are "information assets in their own right, irrespective of whether they are actually connected to any information resources or not," [Pepper 1998]. There is an urgent need to develop applications that interface users, topic maps, and occurrence resources so that users can easily access and exploit the knowledge that topic maps contain. Such work has already begun, for example with Benedicte Le Grand and Michel Soto's 3D visualization tool for topic maps [Le Grand and Soto 2000], Eric Freese's SemanText application for building semantic networks from topic maps [Freese 2000], and G. Ken Holman's experiments with XSLT and topic maps [Holman 2000]. We believe the "enfranchising" of arbitrary sections has a useful place among these efforts.

Notes

1.

See the XTM specification, section 3.9.2 [XTM 1.0, 2001]; the XLink specification, section 5.4 [XLink 1.0, 2000]; the XPointer specification, section 5 [XPointer 1.0, 2001]; and the DOM Level 2 Traversal and Range specification [DOM2 T-R, 2000].

2.

For a summary of and links to some discussion of this on the XML-DEV mailing list, see [Dodds 2001].

3.

Note that even in the simplest case, we might want the program to retrieve and present not only the two instances of X but also some indication of the textual and/or encoding context. If the resource were an author's manuscript, for example, it might be important for the user to know that the first instance of X occurred within an ancestor <deleted revision="2">. I am indebted to Steve DeRose for this example.

4.

The interest can be judged by the number of papers and sessions focussing on topic maps at this and similar conferences, such as XML Europe 2001, Knowledge Technologies 2001, Extreme Markup Languages 2000. Anecdotally, some questioning occurs. For example, one of the authors has heard two editors of XML-related standards describe topic maps in conversation as "less than meets the eye" and "a crock".

5.

See, for example, [Huitfeldt 1993], [Huitfeldt 1995] and [Caton 2000].


Bibliography

[Caton 2000] Caton, Paul. 2000. Markup's current imbalance. Paper presented at Extreme Markup Languages 2000. 15-18 August, at Hotel Wyndham Montréal, Montréal, Canada. Conference proceedings 37-44.

[Dodds 2001] Dodds, Leigh. 2001. "Toward an XPath API." XML-Deviant: Weekly news from the mailing lists. 7 March, 2001. XML.com. Online at http://www.xml.com/pub/a/2001/03/07/xpathapi.html.

[DOM2 T-R, 2000] Document Object Model (DOM) Level 2 Traversal and Range Specification, version 1.0. World Wide Web Consortium. 13 November, 2000.

[Freese 2000] Freese, Eric. 2000. "Using Topic Maps for the representation, management, and discovery of knowledge." Paper presented at XML Europe 2000. 12-16 June, at Palais De Congrès De Paris, Paris, France. Online at http://www.gca.org/papers/xmleurope2000/s22-01.html.

[Holman 2000] Holman, G. Ken. 2000. "Experiments Using XSLT with Topic Maps." Paper presented at XML 2000. 3-8 December, at Marriott Wardman Park Hotel, Washington D.C. Online at http://www.cranesoftwrights.com/resources/index.htm

[Huitfeldt 1993] Huitfeldt, Claus. 1993. MECS-A Multi-Element Code System. Paper presented at ACH-ALLC '93. The 1993 Joint International Conference of The Association for Computers and the Humanities and The Association for Literary and Linguistic Computing, 16-19 June, at Georgetown University, Washington, DC. Conference abstracts 91-4.

[Huitfeldt 1995] Huitfeldt, Claus. 1995. "Multi-Dimensional Texts in a One-Dimensional Medium." Computers and the Humanities 28 (4-5): 235-41.

[Le Grand and Soto 2000] Le Grand, Benedicte and Michel Soto. 2000. "Information management - Topic Maps visualization." Paper presented at XML Europe 2000. 12-16 June, at Palais De Congrès De Paris, Paris, France. Online at http://www.gca.org/papers/xmleurope2000/s29-03.html.

[Pepper 1998] Pepper, Steve. 1998. "Euler, Topic Maps, and Revolution." Paper presented at XML Europe 99. GCA conference, Granada, Spain. Online at http://www.infoloom.com/tmsample/pep4.htm

[Renear, Mylonas, and Durand 1996] Renear, Allen, Elli Mylonas, and David Durand. 1996. "Refining Our Notion of What Text Really Is." In Nancy Ide and Susan Hockey (eds.), Research in Humanities Computing 4. Oxford: Clarendon Press.

[XLink 1.0, 2000] XML Linking Language (XLink) specification version 1.0. World Wide Web Consortium. 20 December, 2000.

[XML Fragment Interchange, 2001] XML Fragment Interchange, Candidate Recommendation. World Wide Web Consortium. 12 February, 2001.

[XPointer 1.0, 2001] XML Pointer Language (XPointer) specification version 1.0. World Wide Web Consortium. 8 January, 2001.

[XTM 1.0, 2001] XML Topic Maps (XTM) specification version 1.0, revision 1.12. TopicMaps.Org. 2 March, 2001.


