XSL and Hyperdocuments: Applying XSL to arbitrary groves and hyperdocuments

W. Eliot Kimber
eliot@isogen.com
Mark Anderson
mark@amati.petesbox.net
Brandon Jockman
brandonj@datachannel.com

Abstract

Describes how the authors applied an XSLT engine (4Suite's Python XSL package) to the processing of arbitrary groves and abstract hyperdocuments managed by a generic link management system developed by DataChannel. This implementation experience demonstrates that it is both easy and useful to bind XSL processing not just to XML DOMs but also to groves of any sort, as well as to more abstract business objects, in this case abstract hyperdocuments.

Discusses the grove- and hyperlinking-specific XSL and XPath extensions created, how XPath expressions were bound to groves and hyperdocuments, and the details of how the implementation was accomplished. Discusses possible future directions and potentials. Provides samples of working hyperdocuments, style sheets, and the resulting output.

Keywords: XSLT; XPath; Python; DOM; Processing

W. Eliot Kimber

W. Eliot Kimber is Lead Brain for DataChannel. Eliot is a founding member of the XML Working Group, Co-editor of ISO/IEC 10744:1997 (HyTime), and Co-Editor of ISO/IEC 10743, Standard Music Description Language. Eliot is also involved in the STEP and SGML Harmonization effort, which led to a deeper appreciation of the power and utility of formal data modeling as a design and analysis tool. Eliot writes and speaks frequently on the subject of SGML, XML, hyperlinking, and related topics. When not trying to wrestle chaotic data into orderly structures, Eliot enjoys swimming and biking and guitar playing. Eliot is a devoted husband and dog owner.

Mark Anderson

Mark is a Senior Software Engineer for DataChannel, currently working on the Bonnell content and link management system. Mark has been implementing hyperlinking-based systems for almost 10 years, first for TechnoTeacher (HyBrowse) and then for ISOGEN and DataChannel. Mark is also an accomplished musician. When not attempting to translate obscure visions into working code, Mark is a devoted husband and father.

Brandon Jockman

Brandon is a Software Engineer for DataChannel, currently working on the Bonnell content and link management system. Brandon is new to the world of standards-based hypermedia information systems but is surviving his trial by fire. When not attempting to grok the details of intricate information standards and technologies, Brandon is a devoted husband. Brandon enjoys in-line skating, performance automobiles, and beating the rest of the team at Quake.

W. Eliot Kimber [DataChannel, Inc.]
Mark Anderson [DataChannel, Inc.]
Brandon Jockman [DataChannel, Inc.]

Extreme Markup Languages 2001® (Montréal, Québec)

Copyright © 2001 W. Eliot Kimber, Mark Anderson, and Brandon Jockman. Reproduced with permission.

Introduction

The XSL recommendation was specifically designed to enable the transformation of XML documents through the use of largely declarative “style sheets”. However, there is no reason in theory why XSL need be limited strictly to the processing of XML documents. Because XSL templates and XPath expressions operate on nodes with properties, it should be possible to apply XSL transformations to any data objects that can be interpreted as nodes with properties. In particular, it should be possible to apply XSL processing to arbitrary groves as defined in ISO/IEC 10744:1997, HyTime, and ISO/IEC 10879, DSSSL. Given an XSL implementation that is not too tightly bound to an underlying DOM implementation, it should be relatively easy to rebind the XSL processor to a grove implementation or other object models. Our experience is that it is in fact easy.

DataChannel is currently developing a grove-based hyperlink management system, code named Bonnell. One of the requirements for this system is to provide some form of built-in transformation and page composition technology. Given the choices available, the only standardized technologies are DSSSL and XSL. DSSSL suffers from a syntax that, while powerful, is difficult for many people to learn and use effectively. It also suffers from lack of commercial support (although there is at least one commercial DSSSL-based system). XSL has the advantage that it is easier for people to learn and use and has a wide range of support, both commercial and open source. With the release of the XSL-FO implementation from the Apache product (part of the Xerces package), there is reasonable page composition functionality at a reasonable price. Thus, it was clear that the best approach would be to integrate an XSL processor with the larger grove-based link management system.

The Bonnell system is inherently grove based. The grove mechanism can be thought of as a more generic DOM: it provides a standard for representing data of any sort as a collection of nodes and properties. Because the Bonnell system is not in any way XML-specific, we had to provide more than just DOM-based processing, thus our use of the grove specification. By using the GroveMinder system from Epremis Corp., we have ready access to an industrial-strength grove implementation that makes it practical for us to build the rest of the system. However, the use of the GroveMinder product is not a prerequisite for this approach. Any grove implementation would serve and, as is shown later, the technique can be applied to objects of any type, not just groves.

We knew that it would be possible to bind XSL to groves—the DOM can be thought of as a specific kind of grove. However, we were not sure how easy it would be. We thought that it might require significant effort to rebind an XSL processing engine from a DOM-based process to a grove-based process. However, as it happened, the binding turned out to be much easier than we expected. This may partly be a side effect of our implementation language (Python) and the architecture of the XSL engine we chose (4XSLT, part of the 4Suite package from Fourthought, Inc.). In particular, we discovered that we did not need to change XPath syntax in order to access properties of arbitrary grove nodes—it was sufficient to treat properties of nodes as attributes using the “@name” syntax for selecting attributes and their values.

Because the Bonnell system is a hyperdocument management system, it is not sufficient to apply XSLT style sheets to single documents or groves. It must be possible to process documents in the context of a larger hyperdocument. In particular, it must be possible for the style sheet to style nodes based on their use as anchors of hyperlinks. This is needed to implement both transclusion (for example, to render compound documents composed of parts from many other documents) and navigational hyperlinking. Thus we had to somehow extend XSL to give access to the link-related properties of nodes. This required providing XSL extensions and XPath functions that provide access to the hyperlink information provided by the Bonnell system. This also turned out to be easier than we expected.

Once the binding of XSLT processing to groves and hyperdocuments has been achieved, adding in support for things like XSL-FO is simply a matter of applying the output of the XSLT process to the XSL-FO processor. At the time of writing we have not created any specific XSL-FO extensions for working with hyperdocuments in the FO domain.

The following subsections present two examples of using the XSL-to-grove binding: the first renders a Word document to HTML through its grove representation, the second adds hyperlinking to show how a document can be presented in the context of a complex hyperdocument. These examples are not explained in detail here; the mechanisms used and their implementation are covered in detail in the implementation sections.

Grove Styling Example: Word Document Grove

The GroveMinder product comes with a very simple grove constructor for Word documents. It produces a grove in which a Word document is represented as a sequence of “Para” nodes. Each Para node has a “Text” property whose value is the text content of the paragraph. Obviously this grove does not provide a complete representation of the information content of Word documents, but it is sufficient to demonstrate grove construction from non-XML data. A production-quality Word grove would reflect most or all of the information in a Word document, including styling information, bookmarks, metadata, and so on. Making such a grove constructor is literally a simple matter of programming.

In this example, a simple style sheet translates the Word document to HTML through its grove representation. Because there is so little data in the grove, there's not much for the style sheet to do. The Word source document is shown in Figure 1. The Word document grove is shown in Figure 2.

Figure 1: Sample Word document source
Figure 2: Sample Word grove

The style sheet for rendering this grove is shown in Figure 3. It has two templates: one for the root node and one for the Para nodes. The mapping of XSL to groves treats node classes as though they were element types. Thus the template match on “Para” is matching the node class “Para”. Node properties are treated as though they were element attributes. Thus, the match “@Text” in the value-of statement selects the node property named “Text”.

Figure 3: Stylesheet (XSL): Word document grove
<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ext="http://datachannel.com/Bonnell/Transform"
  extension-element-prefixes="ext"
  version="1.0"
>
  <xsl:template match="/">
    <html>
    <head><title>Word Document</title>
    </head>
    <body>
     <h1>Word Document</h1>
     <xsl:apply-templates/>
    </body>
    </html>
  </xsl:template>
  <xsl:template match="Para">
    <p>
     <xsl:value-of select="@Text"/>
    </p>
  </xsl:template>
</xsl:stylesheet>

The rendered result is shown in Figure 4. There is nothing particularly interesting about it; it is presented here to contrast with the rendered result in the next example, where hyperlinks have been brought into play.

Figure 4: Rendered Word grove

The point of this example is not that HTML has been generated from Word but that normal XSL processing has been applied to a grove that is not an XML document. The most exciting implications of this are in the realm of hyperlinking, where the Word grove can be combined with any other grove-based data from any source using a consistent set of simple functions and extension elements. The XSL processing, as the top layer of a fairly deep system of information processing components, affords the style sheet creator tremendous leverage at a fairly low cost of entry. A full-featured XSL processor provides a wide array of useful functions for accessing and organizing structured data. Being able to apply those functions to arbitrary data from any source provides an immensely powerful system that is, as much as possible, completely standards based.

There is nothing in these examples that cannot be done in other ways. However, the things done in these examples have never been done with this degree of ease using tools with as wide a user base or as deep a support community. XSL has proved itself to be both tremendously useful and sufficiently easy to learn and use that people can become proficient with it. It reflects decades of experience in the design of transformation and formatting languages. The goal of this paper is to demonstrate both the utility of applying XSL beyond the domain of XML-based data (and without first literally transforming existing data to XML or pretending that it is XML while it's being processed) and the ease with which such applications can be built.

Hyperdocument Styling Example: Word, Excel, And XML Annotations

This example takes the previous example to the next step: hyperlinking. In this scenario, four documents are involved: the previous Word document, an Excel spreadsheet, an XML document that annotates the Word document, and an XML document that establishes extended links between nodes in the Word document and nodes in the spreadsheet.

In addition to demonstrating the raw linking functionality available, this example also demonstrates an important separation of concerns between the style sheet and the hyperdocuments and hyperdocument processor. Because a style sheet language like XSL can directly address nodes in any source document, it is possible to implement the sort of linking semantics demonstrated here entirely in the style sheet itself, including defining the link instances purely as style rules applied to documents. However, this sort of “do it all in the style sheet” approach violates the principle of separation of concerns by binding both information presentation semantics and information representation semantics in a single object. By keeping the presentation semantics separate from both the declarative definition of the links (the linking elements in the XML documents) and the data processing needed to understand those links, each part of the system becomes independent of the other, thus protecting those components from changes in other parts.

This separation of concerns also helps keep each part as simple as possible, with the greatest complexity concentrated in the system component most hidden from end users, the hyperdocument processor. Because the XSL processing is entirely in terms of a generic API for hyperdocuments, it is independent of the details of the underlying data and processors, including the use or non-use of HyTime or of a particular hyperdocument manager implementation. We feel that this separation of concerns is of critical importance, especially as systems get larger and are applied to wider scopes of information sets and maintained for longer periods of time. Because our business focus is on building large-scale systems to manage long-lived complex document sets, part of our engineering focus is always on what best protects the investment of the system user in configuring the system, of which style sheets are a key part.

(In fact, these sample style sheets could probably be made significantly simpler by adding a few more built-in functions or extension elements—we expect to do that as we gain more practical experience with this technology.)

Rendered Result

The rendered result of the Word document in the context of the hyperdocument is shown in Figure 5.

Figure 5: Word document with imposed annotations and spreadsheet transclusions

This is exactly the same Word document, but now it has been augmented with a variety of links, including node-to-node transclusions, annotations imposed from another document, and node-to-node navigation links, all imposed through extended links onto an otherwise unmodified Word document.

The highlighted rows are paragraphs that are linked in some way. In the first paragraph, there is an “annotation” link between the paragraph and the comment (shown in the right-hand column). The asterisk in brackets is the link to the comment itself. The “More info...” is the link to the more info (nodes in the Excel spreadsheet) and demonstrates a typical presentation style of putting links in a column next to the data they are associated with.

The second highlighted row reflects a transclusion from the Word paragraph (which was an empty paragraph) to a cell in the spreadsheet. The right-hand column reflects the anchor role of the transclusion target (“used-node”) and the node class and data content of the target node (whose data content is also reflected where the original Word paragraph was).

Style Sheet

The style sheet for this example is shown in Figure 6. All of the “when” checks in the “Para” template produce the appropriate presentation results for nodes that participate in hyperlinks (the extension functions and elements in this example are explained later in this paper). If a node participates in any links (is-anchored-object()), the template does whatever is appropriate for a particular type of link or anchor. It is not sufficient to just blindly convert links to HTML A elements: each link type and anchor role has a unique semantic that, in this case, requires different presentation results.

The style sheet is also complicated by the need to present the linking details in addition to producing the normal presentation. The linking details are provided for information for the purposes of this example—it's not necessarily a typical presentation choice.

Figure 6: Stylesheet: Word-with-links
<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ext="http://datachannel.com/Bonnell/Transform"
  extension-element-prefixes="ext"
  version="1.0"
>

  <xsl:template match="/">
    <html>
    <head><title>Word Document</title>
    </head>
    <body>
     <h1>Word Document</h1>
        <table width="100%">
          <tr bgcolor="yellow">
           <td width="60%"><b>Paragraphs in Word Document</b></td>
           <td width="40%"><b>Traversal Target Details</b></td>
          </tr>
          <xsl:apply-templates/>
        </table>
    </body>
    </html>
  </xsl:template>

  <xsl:template match="Para">
   <tr>
     <!-- This first choose handles the presentation of the base paras.
          It checks for any node-to-node transclusions and resolves them.
       -->
     <xsl:variable name="para-node" select="."/>
     <xsl:choose>
      <xsl:when test="ext:is-anchored-object()">
       <td width="60%" valign="top" bgcolor="yellow">
       <p>
       <!-- First see if the node is transcluded and if so, get the transcluded value: -->
       <xsl:choose>
        <xsl:when test="ext:has-target-of-role('used-node')">
          <xsl:for-each select="ext:get-traversal-targets('transclusion','used-node')[1]">
           <xsl:apply-templates select="ext:get-property-value('traversals')"/>
          </xsl:for-each>
        </xsl:when>
        <xsl:otherwise>
           <xsl:value-of select="ext:get-object-property($para-node, 'Text')"/>
        </xsl:otherwise>
       </xsl:choose>
       <!-- Now handle any non-transclusion links: -->
       <xsl:for-each select="ext:get-traversal-targets()">
        <xsl:variable name="rolename"
            select="ext:get-object-property(ext:get-property-value('anchor'),
                                            'getRoleName')"/>
        <xsl:choose>
         <xsl:when test="$rolename = 'used-node'">
           <!-- Already handled above -->
         </xsl:when>
         <xsl:when test="$rolename = 'comment'">
           <xsl:text>[</xsl:text>
           <xsl:element name="a">
            <xsl:attribute name="href">
              <xsl:value-of select="ext:get-fragmentid-for-node()"/>
            </xsl:attribute>
            <xsl:text>*</xsl:text>
           </xsl:element>
           <xsl:text>]</xsl:text>
         </xsl:when>
        </xsl:choose>
       </xsl:for-each>
       </p>
       </td>
      </xsl:when>
      <xsl:otherwise>
       <td width="60%" valign="top">
        <p><xsl:value-of select="@Text"/></p>
       </td>
      </xsl:otherwise>
     </xsl:choose>
   <!-- This third choose populates the second column of our table, which reflects
        every traversal target regardless of presentation semantic.
     -->
   <xsl:choose>
     <xsl:when test="ext:is-anchored-object()">
      <td width="40%" bgcolor="yellow" valign="top">
       <xsl:for-each select="ext:get-traversal-targets()">
          <xsl:variable name="rolename"
                 select="ext:get-object-property(ext:get-property-value('anchor'),
                                                'getRoleName')"/>
          <xsl:choose>
           <xsl:when test="$rolename = 'more-info'">
            <xsl:for-each select="ext:get-property-value('traversals')[1]">
            <xsl:element name="a">
             <xsl:attribute name="href">
              <ext:traversaltargetnode
                   outputdir="./website"
                   outputsuffix=".html"
                   style="link"/>
             </xsl:attribute>
             <xsl:text>More info...</xsl:text><br/>
            </xsl:element>
           </xsl:for-each>
          </xsl:when>
          <xsl:otherwise>
            <xsl:text>[</xsl:text><xsl:value-of select="$rolename"/><xsl:text>] </xsl:text>
            <xsl:for-each select="ext:get-property-value('traversals')">
              <xsl:value-of select="@ClassName"/><xsl:text>: </xsl:text>

Source Documents

The sample hyperdocument consists of four documents: the Word document, a trivial Excel spreadsheet, an XML document containing “annotation” links, and an XML “link set” document that serves as the hub document and contains “transclusion” and “more-info” links among nodes in the Word and Excel documents.

The Excel document is shown in Figure 7. It contains just a few cells to demonstrate the ability to construct groves from spreadsheets. The property set used here for Excel documents is what you would expect: Spreadsheet contains Row nodes. Each Row node has a “Cells” property whose value is a list of “Cell” nodes. Each Cell node has a string property containing its text.

Figure 7: Source document: Excel spreadsheet

The “link set” document is shown in Figure 8. It is a HyTime hyperdocument. It declares as unparsed entities the other documents in the hyperdocument, thus establishing the “bounded object set” of the hyperdocument. The “linkset” element contains two links. The “transclusion” link relates a node in the Word document (the 9th paragraph) to a node in the Excel spreadsheet (the first cell of the second row). The “more-info” link relates the 6th node in the Word document to another node in the Excel spreadsheet. The intent of this link is to enable navigation from the base information (where the reader is) to information that provides more detail on the subject of the base information.

Figure 8: Source document: link set
<?xml version="1.0" ?>
<!DOCTYPE linkset PUBLIC "urn:datachannel:samples:linkset DTD" [
 <!ENTITY ottaviano
    PUBLIC "urn:datachannel:samples:non-sgml:ottaviano.doc"
    NDATA Word
 >
 <!ENTITY excel001
    PUBLIC "urn:datachannel:samples:non-sgml:excel001.xls"
    NDATA Excel
 >
 <!ENTITY myAnnotations
    PUBLIC "urn:datachannel:samples:non-sgml:myannotations.xml"
    NDATA sgml
 >
]>
<linkset>
<transclusion using-node="1 9" using-doc="ottaviano"
              used-node="1 2 1" used-node-doc="excel001"/>
<more-info base-info="1 6" base-doc="ottaviano"
           more-info="1 1 2" more-info-doc="excel001"/>
</linkset>
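The node addresses in the link set above (“1 9”, “1 2 1”, “1 1 2”) can be read as paths of child positions. The following Python sketch shows one way such an address could be resolved. It assumes that the first number selects the root and that each following number is a 1-based index into the children of the current node, which is consistent with the description of “1 2 1” as the first cell of the second row. The Node class and resolve_path function are hypothetical names introduced for illustration and are not part of Bonnell or GroveMinder.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a label plus an ordered list of children."""
    label: str
    children: list = field(default_factory=list)


def resolve_path(root, address):
    """Resolve an address such as "1 2 1" against a tree.

    Assumption: the first number selects the root (and must be 1); each
    following number is a 1-based index into the current node's children.
    """
    steps = [int(s) for s in address.split()]
    if not steps or steps[0] != 1:
        raise ValueError("address must start at the root (1): %r" % address)
    node = root
    for step in steps[1:]:
        if not 1 <= step <= len(node.children):
            raise IndexError("no child %d under %s" % (step, node.label))
        node = node.children[step - 1]
    return node


# A toy spreadsheet tree: two rows of two cells each.
sheet = Node("Spreadsheet", [
    Node("Row1", [Node("A1"), Node("B1")]),
    Node("Row2", [Node("A2"), Node("B2")]),
])

used = resolve_path(sheet, "1 2 1")       # first cell of the second row
more_info = resolve_path(sheet, "1 1 2")  # second cell of the first row
```

Because the address depends only on position, the same resolver would apply to the Word grove (“1 9” being the ninth child of the root) without any knowledge of node classes.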

The final document in the set, “myAnnotations.xml”, contains a set of “annotation” links that have the semantic of applying comments to nodes. This is the sort of thing online reviewers might do, for example. The document is shown in Figure 9. This document is also a HyTime hyperdocument and also declares as an unparsed entity the document it is linking to. It contains one “annotation” link, which binds the 6th paragraph in the Word document to the “comment” element.

Note that both this link and the more-info link point to the 6th paragraph of the Word document. This is an example of how the use of extended linking can make a node a member of multiple links. This is one reason that the style sheet in this case is as complex as it is: it has to account for all the possible link types and anchor roles that might be associated with a node. In most cases the style sheet writer knows what link types are available to document authors—link types are normally designed and managed with the same care as base document types. In this case we have written the style sheet to provide default behaviors for any unaccounted for link types or anchor roles (the “otherwise” clause in the last choice group in the style sheet).

Figure 9: Source document: myAnnotations.xml
<?xml version="1.0" ?>
<!DOCTYPE annotations PUBLIC "urn:datachannel:samples:annotation hytime DTD" [
 <!ENTITY ottaviano
    PUBLIC "urn:datachannel:samples:non-sgml:ottaviano.doc"
    NDATA Word
 >
]>
<annotations>
<annotation target-doc="ottaviano"
            target="1 6">
<comment>
<p>Steve is one of the editors of the HyTime standard and is also
a co-chair of this conference.
</p>
</comment>
</annotation>
</annotations>
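Because one node can be an anchor of several links, a presentation layer has to look up every traversal that starts at a node and then branch on the anchor role, with a fallback for roles it does not recognize. The Python sketch below illustrates that idea only. It is not the Bonnell API: the TRAVERSALS table, its target values, and the render_para function are hypothetical, and the output is flattened into a single string rather than the two-column table produced by the sample style sheet.

```python
from collections import defaultdict

# Each traversal: (anchor address in the Word document, link type,
# role of the target anchor, target content). Values are illustrative only.
TRAVERSALS = [
    ("1 9", "transclusion", "used-node", "cell text"),
    ("1 6", "more-info", "more-info", "excel001.html"),
    ("1 6", "annotation", "comment", "#comment-1"),
]

# Index traversals by the node they start from, so one node can carry many.
by_anchor = defaultdict(list)
for anchor, link_type, role, target in TRAVERSALS:
    by_anchor[anchor].append((link_type, role, target))


def render_para(address, text):
    """Return a presentation string for one paragraph."""
    parts = [text]
    for link_type, role, target in by_anchor.get(address, []):
        if role == "used-node":
            parts[0] = target  # a transclusion replaces the paragraph text
        elif role == "comment":
            parts.append('[<a href="%s">*</a>]' % target)
        elif role == "more-info":
            parts.append('<a href="%s">More info...</a>' % target)
        else:
            parts.append("[%s]" % role)  # default for unaccounted-for roles
    return " ".join(parts)
```

Here the paragraph at “1 6” picks up both a “More info...” link and a bracketed comment link, while the empty paragraph at “1 9” is replaced by its transclusion target.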

Hyperlinking Example Summary

This example has shown a number of interesting things: applying XSL style sheets to non-XML data, using XSL style sheets to condition presentation based on the linking status of nodes, and the use of extended links to impose a variety of relationships onto data that is otherwise largely incapable of doing sophisticated linking (and is, in any case, incapable of doing it in non-proprietary ways). These abilities have an almost limitless scope of applicability and enable the satisfaction of a number of important information management and presentation requirements that have, to date, been prohibitively expensive to satisfy for all but a highly motivated few.

Binding XSL To Groves

The challenge of using XSL in the context of generalized link management has two parts: First you must bind XSL processing to groves (because groves are the standard by which all data content is represented for the purposes of linking and addressing in the link management system). Second you must map XSL to hyperdocuments so that individual documents can be processed with respect to the links among the components of the documents. This section describes how the binding of groves to XSL was accomplished. The next section describes the binding of XSL to hyperdocuments.

What Are Groves?

The basic data model for groves is quite simple: a grove consists of one or more nodes. One key difference between groves and the DOM is that the DOM has a fixed set of node types reflecting the data model for XML documents. The grove mechanism is more general, allowing the definition of arbitrary node types, which allows a given grove to represent data of any type. Using this general mechanism, the HyTime standard defines a specific grove type for SGML (and by extension, XML) documents, the “SGML property set”. Groves that conform to the SGML property set are directly analogous to XML DOMs and have essentially the same information content. Of course, the SGML property set, which predates XML, does not directly reflect some XML-specific concepts, such as name spaces.

The grove mechanism was defined in order to provide a common data model for use in both linking applications (e.g., HyTime) and transformation and styling applications (e.g., DSSSL). The grove mechanism had to be generic over all possible data types so that hyperdocuments could involve data of any type, not just XML data (and without requiring that non-XML data first be converted to XML). When using groves, all data, regardless of its specific semantics, is represented using a consistent underlying data model. This enables generic addressing using common semantics and syntaxes. By contrast, in the current Web world, every data type must define its own addressing semantics and syntax.

A grove is a directed graph of nodes. Each node has a specific class and a set of properties, as defined in the grove's “property set” (the schema for the grove). The properties of a node may be primitive data types (string, integer, Boolean, etc.), lists of primitives, singleton nodes, lists of nodes, or dictionaries of nodes (“named node lists”). Unlike DOMs, which are strict trees, groves can represent arbitrary directed graphs because any node's properties may point to any other nodes in the same grove. In addition, groves can be connected to each other when a property of a node in one grove points to a node or nodes in another grove (an “unrestricted reference”).
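As a rough illustration of this data model, the following Python sketch represents a node as a class name plus a dictionary of named properties whose values may be primitives, nodes, lists of nodes, or dictionaries of nodes. The GroveNode class and the property names “Rows”, “Title”, “ByName”, “Row”, and “SeeAlso” are hypothetical; only “Cells” and “Text” come from the examples in this paper, and no real grove API is being reproduced.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(eq=False)  # compare nodes by identity; the graph may contain cycles
class GroveNode:
    class_name: str
    properties: Dict[str, Any] = field(default_factory=dict)


# One grove: a spreadsheet with a single row and a single cell.
cell = GroveNode("Cell", {"Text": "42"})            # primitive (string) property
row = GroveNode("Row", {"Cells": [cell]})           # list-of-nodes property
sheet = GroveNode("Spreadsheet", {
    "Rows": [row],
    "Title": "Budget",                              # primitive property
    "ByName": {"first-row": row},                   # named node list
})

# A property may point to any other node in the same grove, so the
# structure is a directed graph rather than a strict tree.
cell.properties["Row"] = row

# A second grove whose node refers to a node in the first grove
# (the kind of cross-grove reference described above).
para = GroveNode("Para", {"Text": "See the budget."})
para.properties["SeeAlso"] = cell
```

Following properties from the paragraph leads into the spreadsheet grove and back around the cell-row cycle, which is something a strict tree could not express.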

Groves can be viewed in such a way that they are strict trees. This design was specifically geared toward the needs of representing SGML documents and thus provides convenience features that make working with things like XML documents as easy and intuitive as possible. When an XML document grove is viewed as a strict tree, the structural organization of the grove is essentially the same as an XML DOM.

The grove mechanism provides several convenience features, the first being the ability to view the grove as a strict tree.

Two other key conveniences are “content” properties and “data” properties. Content properties are properties that have been defined in the grove's property set as containing the content of the node. For example, in an XML element node, the property named “children”, which contains the nodes constructed from the syntactic content of the element, is declared to be the “content” property of the node. By contrast, the property named “attributes”, which contains the attribute nodes for the element, is not content of the node. The “data” property is that property that has been defined as holding the “data” for the node. Thus, for any node you can ask for its “content” or its “data” and will get back the value of whatever property has been identified as the data or content property, if any (there is no requirement that a node have a data or content property). The content and data properties correspond to the “childNodes” and “value” properties in DOM nodes.
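The content and data conveniences can be pictured as a lookup through the property set: the property set records which property, if any, plays each role for a node class, and the node forwards the request. The Python sketch below is a minimal illustration under that reading. The PROPERTY_SET table and GroveNode class are hypothetical, and treating “Text” as the data property of a Para node is an assumption made for the example.

```python
# Hypothetical property-set table: for each node class, which property (if
# any) is designated as its content property and which as its data property.
PROPERTY_SET = {
    "element": {"content": "children", "data": None},
    "Para": {"content": None, "data": "Text"},  # assumption for illustration
}


class GroveNode:
    def __init__(self, class_name, **properties):
        self.class_name = class_name
        self.properties = properties

    def _designated(self, kind):
        # Look up the designated property name, then return its value.
        name = PROPERTY_SET.get(self.class_name, {}).get(kind)
        return None if name is None else self.properties.get(name)

    @property
    def content(self):
        return self._designated("content")

    @property
    def data(self):
        return self._designated("data")


text = GroveNode("Para", Text="Hello")
elem = GroveNode("element", children=[text], attributes=[])
```

Asking elem for its content returns the value of “children” and ignores “attributes”, while asking text for its data returns “Hello”. A node class with no designated property simply yields nothing.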

It should be stressed that there is no magic to groves. They are simply an application of the general concept of nodes with properties that has been specialized to meet the specific requirements reflected in the HyTime and DSSSL standards, that is, the requirements of generalized hyperlinking and transformation systems. Hyperlinking and styling require a consistent fundamental data model that their processing semantics can be defined in terms of. Groves provide such a model but are not the only possible model. It is likely that the same forces that led to the development of XSL as a refinement of DSSSL and other style languages will lead to a refinement of the grove concept once the community at large realizes the need for a data model that is more general than the XML DOM.

Mapping XSL to Groves

The mapping of XSL to groves is straightforward. First, we treat SGML document groves essentially as though they were DOM trees, so that all the normal XSL operations that are bound to XML constructs (elements, attributes, data characters) work in exactly the same way for SGML document groves. Note that SGML document groves can be constructed from XML documents because XML is SGML. The only potential problem is the use of name spaces—the SGML property set predates the XML name space specification and therefore has no specific support for it. However, this is the same problem as for DOM 1, and can be solved simply by enhancing the grove-to-XSL binding to add the necessary name space support, if required. For a given XML document, an XSL style sheet should produce identical results for both a DOM-based and grove-based implementation.

For all other grove types, the “apply templates” and “for each” operations of XSL iterate over node lists of grove nodes. All nodes are treated as though they were Element nodes in a DOM. Nodes can be selected by class name (instead of element type name). Node properties are selected as though they were attributes. All primitive values are converted to strings.
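The mapping rules in the preceding paragraph can be sketched in a few lines of Python. This is a toy model, not the 4XSLT code: the GroveNode class and the matches, select_attribute, and apply_templates functions are hypothetical, and string conversion is simplified (it does not reproduce every XPath number-formatting rule).

```python
def string_value(value):
    """Convert a primitive property value to a string (simplified rules)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


class GroveNode:
    def __init__(self, class_name, children=(), **properties):
        self.class_name = class_name
        self.children = list(children)
        self.properties = properties


def matches(node, pattern):
    """A name test is compared with the node's class name."""
    return pattern == "*" or node.class_name == pattern


def select_attribute(node, expr):
    """Evaluate an '@Name' step by looking up the node property 'Name'."""
    if not expr.startswith("@"):
        raise ValueError("only '@Name' steps are supported: %r" % expr)
    value = node.properties.get(expr[1:])
    return "" if value is None else string_value(value)


def apply_templates(node, templates):
    """Apply the first matching template to each child, in order."""
    out = []
    for child in node.children:
        for pattern, template in templates:
            if matches(child, pattern):
                out.append(template(child))
                break
    return "".join(out)


templates = [("Para", lambda n: "<p>%s</p>" % select_attribute(n, "@Text"))]
doc = GroveNode("Document", children=[
    GroveNode("Para", Text="One"),
    GroveNode("Para", Text=2),      # a non-string primitive
])
result = apply_templates(doc, templates)
```

The template keyed on “Para” fires for both children, and the integer property value is rendered through the same string conversion as the text one.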

General Implementation Approach

The Bonnell system is implemented primarily in Python. Thus we selected a Python-based XSLT engine. The 4Suite package from Fourthought, Inc. was the most complete package we could find and had no licensing constraints on its use. The GroveMinder system also provides a Python API, so there was no barrier to combining the two tools.

Because Python is a dynamically typed, interpreted language, modifying the XSLT engine to act on grove nodes instead of DOM nodes was much easier than it would be in a statically typed language. The advantage of Python in this case is that XSLT implementation methods that take or return nodes or node lists are perfectly happy to directly accept or return grove nodes (or in fact, any object). In a Java or C++ implementation, we would be forced to either modify the class hierarchy of the XSLT implementation to accommodate grove nodes in addition to DOM nodes or wrap grove nodes in classes that implement the DOM API.

Implementing the XSL-to-grove binding required the following modifications to the 4Suite tools:

  • Refactor the core node processing to select between processing DOM nodes and processing grove nodes.
  • Refactor those places where XSL processing is applied to nodes (apply templates, value-of-element, attribute value template, etc.) to select between DOM processing and grove node processing.
  • Extend the XPath processing to operate on grove nodes in addition to DOM nodes.

This required us to make modifications in 9 separate Python modules in the 4Suite package (out of scores of modules) and to add a grove-specific utility module that provides the functions needed to implement XSL semantics, such as node match, for groves. A team of two programmers working as a pair spent about 4 days getting this initial binding implemented and tested, including finding and fixing bugs in the 4Suite code itself. The total amount of existing 4Suite code modified or added is about 100 lines. The utility module is about 300 lines (of which 100 is a brute-force parser for formal system identifiers that we already had lying about and that didn't warrant the attention needed to make it smaller).

The use of a different grove implementation would simply require reworking the grove utility module to use the new implementation's API (there is no official standard API for groves, although the GroveMinder API serves somewhat as a de-facto standard).

A comparable implementation in other languages should be of roughly the same magnitude.

A C++ implementation would be complicated by the need either to rework the base implementation's class hierarchy to provide a common superclass over DOM nodes and grove nodes or to wrap grove nodes in a DOM API.

A Java implementation would require a Java grove implementation because GroveMinder does not have a Java binding. A Java grove implementation would not be difficult to develop (Alex Milowski demonstrated one at SGML '97 but subsequently lost the rights to it through corporate acquisition) but we are not aware of any available Java grove implementations that are not themselves DOM based. Given a grove implementation, a Java XSL-to-grove binding would have the same additional complications as a C++ implementation.
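The wrapping alternative mentioned for C++ and Java can be sketched as an adapter. It is shown here in Python for brevity and is not something the implementation described in this paper needed. DomFacade and SketchGroveNode are hypothetical names, and only a small subset of the DOM interface is covered.

```python
class SketchGroveNode:
    """Hypothetical grove-like node with a class name, properties and children."""

    def __init__(self, class_name, properties=None, children=()):
        self.ClassName = class_name
        self.properties = dict(properties or {})
        self._children = list(children)

    def children(self):
        return self._children


class DomFacade:
    """Adapter exposing a few DOM-style members over a grove-like node."""

    ELEMENT_NODE = 1  # same numeric value as the DOM's Node.ELEMENT_NODE

    def __init__(self, grove_node):
        self._node = grove_node

    @property
    def nodeType(self):
        return self.ELEMENT_NODE        # every grove node is treated as an element

    @property
    def nodeName(self):
        return self._node.ClassName     # class name plays the element type name

    @property
    def childNodes(self):
        return [DomFacade(child) for child in self._node.children()]

    def getAttribute(self, name):
        # DOM convention: a missing attribute yields the empty string
        return str(self._node.properties.get(name, ""))


facade = DomFacade(
    SketchGroveNode("Section", {"level": 2}, [SketchGroveNode("Para")]))
```

A statically typed implementation would declare such a facade against the DOM interfaces so that the existing XSLT code could remain unchanged.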

Implementation Details

To implement the mapping of XSL and XPath to groves, we had to modify the base 4Suite XSLT and XPath code to select a DOM or grove processing path based on the type of node being processed. We also had to implement the grove-specific processing logic needed to provide XSL and XPath semantics for grove nodes.

This section reflects our modifications made to the 0.11.1a1 version of the 4Suite package. Our modifications were initially written against an earlier version of the 4Suite code and then were migrated to the latest released version (at the time of writing) when we moved from Python 1.5.2 to 2.1.

The XSLT and XPath implementations comprise about 80 separate Python modules. Of these 80 we had to modify 9: 6 for XSLT and 3 for XPath.

XSLT Modifications

To hook grove-specific processing into the XSLT processor, we had to initially modify the following modules:

  • ApplyTemplatesElement.py
  • AttributeValueTemplate.py
  • Processor.py
  • Stylesheet.py
  • ValueOfElement.py
  • XPatterns.py

Most of the modifications are, not surprisingly, in Processor.py, which implements the main processing loop that iterates over the input document tree. The modifications to the other modules are minor, being checks to see if the processing is being applied to a DOM node or a grove node.

Note: Only the methods that have been changed from the base 4Suite code are presented here. We presume that interested readers can get a copy of the 4Suite code for reference if needed.

Processor.py

The Processor.py module defines the Processor class, which provides the API for applying XSLT processing to documents.

The main addition to Processor is the runGroveNode() method, shown in Figure 10. This method parallels the normal runNode() method but applies it to groves. It sets up the grove for processing by first using the maximum “grove plan”, which ensures in this case that any processing instruction nodes in the document are visible (by default, processing instruction nodes are not exposed in an SGML grove). If there are processing instructions, it looks for any style sheet PIs. The call to the grove-specific grove_execute() method is a workaround for a problem in the latest version of the 4Suite code that is exposed by hyperdocument-specific processing.

The __checkGroveStylesheetPis() method called from runGroveNode() simply implements the same business logic as the normal __checkStyleSheetPis() but against SGML document groves.

Figure 10: runGroveNode() method of Processor class

    def runGroveNode(self, node, ignorePis=0, topLevelParams=None, writer=None,
               baseUri='', outputStream=None, startAndEndDocument = 1 ):
        """
        Run the stylesheet processor against the given grove node with the
        stylesheets that have been registered.
        If writer is None, use the TextWriter, otherwise, use the supplied writer.
        """

        topLevelParams = topLevelParams or {}

        rootNode = node.GroveRoot
        if ignorePis  == 0:
            rootNode = GroveUtility.assignMaxGrovePlan( "SGML", rootNode )
            if ( GroveUtility.hasPIs(rootNode) ):
                self.__checkGroveStylesheetPis( rootNode, baseUri )

        result = self.grove_execute(node, ignorePis, topLevelParams, writer,
                                    baseUri, outputStream, startAndEndDocument)
        return result

Figure 11 shows the applyBuiltins() and _applyGroveBuiltins() methods. applyBuiltins() is the original method. It has been modified to add a check to see if the node being processed is a grove node. If it is, processing is redirected to the _applyGroveBuiltins() method. The _applyGroveBuiltins() method differs from applyBuiltins() in the way that the node content is accessed and in that it omits the check for attribute nodes.

Figure 11: applyBuiltins() and _applyGroveBuiltins() methods of Processor class
    def applyBuiltins(self, context, mode):
        if GroveUtility.isGroveNode( context.node ):
            self._applyGroveBuiltins( context, mode)
            return

        if context.node.nodeType == Node.TEXT_NODE:
            self.writers[-1].text(context.node.data)
        elif context.node.nodeType in [Node.ELEMENT_NODE, Node.DOCUMENT_NODE]:
            origState = context.copyNodePosSize()
            node_set = context.node.childNodes
            size = len(node_set)
            pos = 1
            for node in node_set:
                context.setNodePosSize((node,pos,size))
                self.applyTemplates(context, mode)
                pos = pos + 1
            context.setNodePosSize(origState)
        elif context.node.nodeType == Node.ATTRIBUTE_NODE:
            self.writers[-1].text(context.node.value)
        return

    def _applyGroveBuiltins(self, context, mode):
        # applyBuiltins -- calls this function if it determines
        #                  we are in grove land instead of the DOM.
        nodeContent = GroveUtility.getNodeContent( context.node )
        if type( nodeContent ) == types.StringType:
            self.writers[-1].text(nodeContent)
        elif nodeContent:
            origState = context.copyNodePosSize()
            size = len(nodeContent)
            pos = 1
            for node in nodeContent:
                context.setNodePosSize((node,pos,size))
                self.applyTemplates(context, mode)
                pos = pos + 1
            context.setNodePosSize(origState)
        #elif context.node.nodeType == Node.ATTRIBUTE_NODE:
        #    pass
        return

Figure 12 shows the __checkGroveStylesheetPis() method called by runGroveNode(). It simply does the same processing as the DOM-based __checkStyleSheetPis() method against PIs as represented in SGML document groves. The _lookupUrlForPath() method translates the system paths GroveMinder maintains for grove source documents into the URLs expected by the XSLT processor.

Figure 12: __checkGroveStylesheetPis() method of Processor class
    def __checkGroveStylesheetPis(self, node, baseUri):
        #
        # Note: A Stylesheet PI can only be in the prolog, acc to the NOTE
        #
        # http://www.w3.org/TR/xml-stylesheet/
        #    NOTE: If the xml-stylesheet processing instruction occurs in the
        #          external DTD subset or in a parameter entity, it is possible
        #          that it may not be processed by a non-validating XML processor
        #
        pis_found = 0

        piSet = GroveUtility.getPIs( node )
        for child in piSet:
            if string.find( child.SystemData, 'xml-stylesheet') != -1:
                data = child.SystemData
                if data[-1] == '?':
                    data = data[:-1]
                data = string.splitfields(data,' ')
                sty_info = {}
                for d in data:
                    seg = string.splitfields(d, '=')
                    if len(seg) == 2:
                        sty_info[seg[0]] = seg[1][1:-1]

                if sty_info.has_key('href'):
                    if not sty_info.has_key('type') or sty_info['type'] in XSLT_IMT:
                        path = self._lookupUrlForPath(node, sty_info['href'])
                        if path:
                            self.appendStylesheetUri(path)
                            pis_found = 1
                        else:
                            print "Unable to load stylesheet: %s" % sty_info['href']

        return pis_found

    def _lookupUrlForPath( self, rootGroveNode, path):
        """
        """
        if os.path.isfile( path ):
            return GroveUtility.filename2url(os.path.abspath(path))
        uriResolver = Ft.Lib.Uri.BaseUriResolver()
        try:
            uriResolver.resolve(path)
            return path
        except:
            pass

        source = GroveUtility.soi( rootGroveNode.groveDefinition().sourceData() )
        absPath = os.path.join(os.path.dirname(source),path)
        if os.path.isfile(absPath):
            return GroveUtility.filename2url(os.path.abspath(absPath))
        return None
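The PI handling in Figure 12 splits the PI data on single spaces, so a pseudo-attribute whose value contains a space, or that has white space around the equals sign, is not handled. A more tolerant parser can be sketched with a regular expression. The function name parse_pseudo_attributes() is hypothetical, and the sketch does not attempt to resolve character or entity references.

```python
import re

# name="value" or name='value', with optional white space around "="
_PSEUDO_ATTR = re.compile(
    r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pseudo_attributes(pi_data):
    """Return a dict of the pseudo-attributes found in a PI's data string."""
    result = {}
    for name, double_quoted, single_quoted in _PSEUDO_ATTR.findall(pi_data):
        # Exactly one of the two quoted groups can be non-empty per match.
        result[name] = double_quoted or single_quoted
    return result


info = parse_pseudo_attributes('href="my style.xsl" type = "text/xsl"')
```

Any leading PI target in the data is ignored because it is not followed by an equals sign.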

The grove_execute() method of Processor, shown in Figure 13, applies XSLT processing to grove nodes instead of to DOM nodes. The main difference here is the addition of a hyperdocument object to the constructed context node. The hyperdocument object is the direct binding of the base XSLT processor to the Bonnell hyperdocument management system. This binding is explained in “Binding XSL To Abstract Hyperdocuments”.

Figure 13: grove_execute() method of Processor class
    def grove_execute(self, node, ignorePis=0, topLevelParams=None, writer=None,
                baseUri='', outputStream=None, startAndEndDocument = 1 ):
        """
        Run the stylesheet processor against the given grove node with the
        stylesheets that have been registered.
        If writer is None, use the TextWriter, otherwise, use the supplied writer.
        """
        if len(self._stylesheets) == 0:
            raise XsltException(Error.NO_STYLESHEET)

        self._outputParams = self._stylesheets[0].outputParams

        if writer:
            self.writers.append(writer)
        else:
            self.addHandler(self._outputParams, outputStream, 0)

        self._namedTemplates = {}
        tlp = topLevelParams.copy()
        for sty in self._stylesheets:
            sty.processImports(node, self, tlp)
            named = sty.getNamedTemplates()
            for name,template_info in named.items():
                if not self._namedTemplates.has_key(name):
                    self._namedTemplates[name] = template_info

        for sty in self._stylesheets:
            tlp = sty.prime(node, self, tlp)

        #Run the document through the style sheets
        if startAndEndDocument == 1:
            self.writers[-1].startDocument()

        if node.ClassName == "SgmlDocument":
            node = node.DocumentElement

        context = XsltContext.XsltContext(node, 1, 1, None,
                                          hyperdoc=self._getHyperDocument())
        self.applyTemplates(context, None)

        if startAndEndDocument == 1:
            self.writers[-1].endDocument()

        Util.FreeDocumentIndex(node)
        result = self.writers[-1].getResult()
        if startAndEndDocument == 1:
            self._reset()
            context.release()
        return result

Stylesheet.py

Stylesheet.py defines the top-level classes for style sheets, in particular, StyleSheetElement. To accommodate groves, we modified the prime() method of StyleSheetElement to get the grove root rather than the owner document as for DOM processing.

Figure 14: prime() method of StyleSheetElement class
    def prime(self, contextNode, processor, topLevelParams):
        #######################################################
        # Grove impl -- changes
        #
        if GroveUtility.isGroveNode( contextNode ):
            primingContext = contextNode.GroveRoot
        else:
            primingContext = contextNode.ownerDocument or contextNode
        self._primedContext = context =\
                XsltContext.XsltContext(primingContext,
                                        1,
                                        1,
                                        processorNss=self.namespaces,
                                        stylesheet=self,
                                        processor=processor)

(rest of method is unchanged)

XPatterns.py

The XPatterns.py module defines classes that implement the processing of patterns used in template matches. The only modification needed here was to the match() method of the DocumentNodeTest class to redirect match processing for grove nodes to the groveNodeMatch() function in the grove utility module.

Figure 15: match() method of DocumentNodeTest class
    def match(self, context, node, principalType):
        if GroveUtility.isGroveNode( node ):
            return GroveUtility.groveNodeMatch(context, node, self, principalType)
        else:
            return node.nodeType == Node.DOCUMENT_NODE

AttributeValueTemplate.py

The AttributeValueTemplate.py module defines the AttributeValueTemplate class. The only modification required here was to modify the evaluate() method to handle the processing of attribute values from grove nodes.

Figure 16: evaluate() method of AttributeValueTemplate class
    def evaluate(self, context):
        if GroveUtility.isGroveNode( context.node ):
            expansions = []
            for pPart in self._parsedParts:
                returnValue = GroveUtility.getStringValue( pPart.evaluate( context ))
                expansions.append( returnValue )
        else:
            expansions = map(
                lambda x, c=context: Conversions.StringValue(x.evaluate(c)),
                self._parsedParts
                )

        (rest of method is unchanged)

ApplyTemplatesElement.py

The ApplyTemplatesElement.py module defines the ApplyTemplatesElement class. The only modification required here was to modify the instantiate() method to get the child nodes from the context grove node instead of from a DOM node.

Figure 17: instantiate() method of ApplyTemplatesElement class
    def instantiate(self, context, processor):
        origState = context.copy()
        context.setNamespaces(self._nss)

        params = {}

        mode = self._instantiateMode(context)

        for param in self._params:
            (name, value) = param.instantiate(context, processor)[1]
            params[name] = value

        if self._expr:
            node_set = self._expr.evaluate(context)
        else:
            #############################################################
            # Grove impl -- changes
            if GroveUtility.isGroveNode( context.node ):
                node_set = GroveUtility.getChildren( context.node )
            else:
                node_set = context.node.childNodes

        (rest of method is unchanged)

ValueOfElement.py

The ValueOfElement.py module defines the ValueOfElement class. The only modification required here was to the instantiate() method. The ValueOf element returns the text value of the context node. The grove-specific modification simply implements the logic needed to get the text value of different node types.

Figure 18: instantiate() method of ValueOfElement class
    def instantiate(self, context, processor):
        original = context.copy()
        context.processorNss = self._nss
        text = ""

        result = self._expr.evaluate(context)
        if GroveUtility.isGroveNode( context.node ):
            if type( result ) == types.StringType:
                text = result
            elif ((type(result) == types.ListType) or \
                  (type(result) == types.TupleType)) and \
                  len(result) > 0:
                for singleReturn in result:
                    if type(singleReturn) == types.StringType:
                        text = text + GroveUtility.getStringValue(context.node,
                                                                  singleReturn)
                    else:
                        text = text + GroveUtility.getStringValue(singleReturn)
        else:
            text = Conversions.StringValue(result)

         (rest of method is unchanged)

XPath Modifications

In order to apply XPath to groves, we had to modify three modules in the XPath implementation: ParsedNodeTest.py, ParsedAbsoluteLocationPath.py, and ParsedAxisSpecifier.py. As with the XSL processing, these modifications either redirect to grove-specific processing logic or directly implement the XPath semantics for grove nodes.

ParsedNodeTest.py

ParsedNodeTest.py required modifications to the match() methods of two classes: NodeTest and NodeNameTest, both shown in Figure 19. In both cases, the match() method checks whether the node is a grove node and, if it is, redirects to the groveNodeMatch() function in the grove utility module.

Figure 19: NodeTest and NodeNameTest classes
class NodeNameTest(NodeTestBase):

    def match(self, context, node, principalType=Node.ELEMENT_NODE):
        if GroveUtility.isGroveNode(node):
            return GroveUtility.groveNodeMatch( context, node, self, principalType )
        if node.nodeType == principalType:
            return node.nodeName == self._nodeName
        return 0

ParsedAxisSpecifier.py

The ParsedAxisSpecifier.py module required modification to the select() methods of the ParsedAncestorOrSelfAxisSpecifier, ParsedAttributeAxisSpecifier, ParsedChildAxisSpecifier, and ParsedPrecedingSiblingAxisSpecifier classes. In each case, the only differences between the DOM and grove paths are those caused by detail differences in the DOM and GroveMinder APIs. Otherwise, the processing logic is identical. Note that the GroveMinder API predates the DOM by at least two years. However, there's no reason a grove implementation couldn't emulate the DOM API. (Note: at the time of writing the modifications had not been exhaustively tested over all the XPath axes, so it is likely that a few more modules or methods will need to be modified to account for the remaining axes. Because we follow the Extreme Programming principle of use-case-driven implementation and because we had not had time to develop an extensive set of test cases, we had only tested those axes actually used in our test style sheets.)

Figure 20: select() methods for axis specifier classes
class ParsedAncestorOrSelfAxisSpecifier(AxisSpecifier):
    def select(self, context, nodeTest):
        """Select all of the ancestors including ourselves through the root"""
        node = context.node
        if nodeTest(context, node, self.principalType):
            nodeSet = [node]
        else:
            nodeSet = []

        #################################
        # Grove impl
        if GroveUtility.isGroveNode( node ):
            parent = node.Parent
            while parent:
                if nodeTest(context, parent, self.principalType):
                    nodeSet.append(parent)
                parent = parent.Parent
        else:
            parent = ((node.nodeType == Node.ATTRIBUTE_NODE) and
                      node.ownerElement or node.parentNode)

            while parent:
                if nodeTest(context, parent, self.principalType):
                    nodeSet.append(parent)
                parent = parent.parentNode
        nodeSet.reverse()
        return (nodeSet, 1)

class ParsedAttributeAxisSpecifier(AxisSpecifier):

    principalType = Node.ATTRIBUTE_NODE

    def select(self, context, nodeTest):
        """Select all of the attributes from the context node"""

        ###################################
        # Grove impl
        if GroveUtility.isGroveNode(context.node):

            attrs = GroveUtility.getAttributes( context.node )
            rt = filter(lambda attr, test=nodeTest,
                        context=context, pt=self.principalType:
                        test(context, attr, pt),
                        attrs or [])

        else:
            attrs = context.node.attributes
            rt = filter(lambda attr, test=nodeTest,
                        context=context, pt=self.principalType:
                        test(context, attr, pt),
                        attrs and attrs.values() or [])
        return (rt, 0)

class ParsedPrecedingSiblingAxisSpecifier(AxisSpecifier):
    def select(self, context, nodeTest):
        """Select all of the siblings that precede the context node"""
        result = []

        ###################################
        # Grove impl

        if GroveUtility.isGroveNode(context.node):
            parent = context.node.Parent
            if parent:
                siblings = GroveUtility.getChildren( parent )
                for sibling in siblings:
                    if context.node == sibling:
                        break
                    if nodeTest(context, sibling, self.principalType):
                        result.append(sibling)

        else:
            sibling = context.node.previousSibling
            while sibling:
                if nodeTest(context, sibling, self.principalType):
                    result.append(sibling)
                sibling = sibling.previousSibling
            # Put the list in document order
            result.reverse()

        return (result, 1)
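The observation that the DOM and grove paths differ only in API detail can be made concrete. In the sketch below the ancestor-or-self selection is written once and the API difference is isolated in a single accessor. parent_of() and SketchGroveNode are hypothetical; this is not the refactoring applied to 4Suite, and the DOM attribute-node case (ownerElement) is deliberately left out.

```python
from xml.dom import minidom


class SketchGroveNode:
    """Hypothetical grove-like node exposing Parent, as in the listings above."""

    def __init__(self, class_name, parent=None):
        self.ClassName = class_name
        self.Parent = parent


def parent_of(node):
    """The only place that knows which API a node uses."""
    if hasattr(node, "ClassName"):   # assumed marker of a grove node
        return node.Parent
    return node.parentNode


def ancestor_or_self(node, node_test):
    """Select the node and its ancestors that pass node_test, in document order."""
    selected = []
    current = node
    while current is not None:
        if node_test(current):
            selected.append(current)
        current = parent_of(current)
    selected.reverse()               # outermost ancestor first
    return selected


doc = minidom.parseString("<a><b><c/></b></a>")
c_element = doc.getElementsByTagName("c")[0]
root = SketchGroveNode("Doc")
para = SketchGroveNode("Para", SketchGroveNode("Sec", root))
```

The trade-off is one extra function call per step in exchange for a single copy of the axis logic.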

Grove-Specific Utilities

The GroveUtil.py module implements the functions used by the preceding grove-specific code. These utilities primarily serve to provide methods analogous to the generic DOM methods such as childNodes().

The module prolog imports the base requirements and the grove implementation, in this case, GroveMinder. It also defines some constants. The __groveNodeTypes list is used to do type comparisons on GroveMinder grove nodes.

import sys, os, string
import types
import GroveMinder

__groveNodeTypes = [
    "<type 'GroveNode'>",
    "<type 'GroveNodeList'>",
    "<type 'GroveNodeNamedNodeList'>",
    "<type 'GroveStringNamedNodeList'>"
    ]

The assignMaxGrovePlan() function is a convenience method that ensures that the grove node is being viewed with respect to all the properties the grove implementation can make available. One feature of groves is the ability to hide selected classes or properties of a grove. In the case of SGML, there are many properties, such as the original markup, that many applications do not care about and that are hidden by default.

def assignMaxGrovePlan( builderType, node):
    """
    Assigns the system maximum grove plan to the given node.
    """
    groveBuilder = GroveMinder.makeGroveBuilder( builderType )
    grovePlan = groveBuilder.systemMaximumGrovePlan()
    return node.withGrovePlan( grovePlan )

The isGroveType() and isGroveNode() functions perform basic checks needed to delegate to the appropriate type of processor.

def isGroveType( node ):
    """
    Returns 1 if the given node's type is one of the valid grove node types.
    Returns 0 otherwise.
    """
    # we could ask:
    #     hasattr( node, "ClassName" )
    #         -- which is an intrinsic property... or ?
    nodeType = str( type( node ) )
    if nodeType in __groveNodeTypes:
        return 1
    return 0

def isGroveNode( node ):
    """
    Returns 1 if the given node's type is GroveNode.
    Returns 0 otherwise.
    """
    if str( type( node ) ) == "<type 'GroveNode'>":
        return 1
    return 0
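The comment in isGroveType() raises the alternative of testing for the ClassName property instead of comparing type names as strings. A minimal sketch of that duck-typed check follows. It assumes that every grove node exposes ClassName as a string-valued attribute, which has not been verified against GroveMinder; SketchGroveNode is a hypothetical stand-in.

```python
def looks_like_grove_node(obj):
    """Duck-typed check: any object with a string ClassName counts as a grove node."""
    return isinstance(getattr(obj, "ClassName", None), str)


class SketchGroveNode:
    """Hypothetical stand-in for a grove node."""

    def __init__(self, class_name):
        self.ClassName = class_name


element = SketchGroveNode("Element")
```

Such a check does not depend on the printed form of type objects, which differs between Python versions.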

The groveNodeMatch() function is the heart of the grove-specific XPath processing, implementing the various match semantics of XPath. The __normalizeName() function handles the fact that in SGML names in markup may or may not be case sensitive depending on the specific syntax rules in effect for a given document. In SGML groves, the root node may have a RefSyntax property which contains a Syntax node. The Syntax node has properties indicating the case substitutions in effect: general names (elements, attributes, IDs, notations) or entities. In groveNodeMatch() only general names are matched on, so the code only looks at the general name substitution setting. In XML, all names are case sensitive, so this issue doesn't arise. As a side note, one subtlety in processing XML documents with SGML tools is the use of SGML declarations that turn off case sensitivity—for example, with GroveMinder, an XML document processed with such an SGML declaration will have all of its names normalized, which will often lead to unexpected results either within the SGML tool context (such as matches of names that have different case in the source) or unexpected results when pure XML processing is used (such as matches failing that were not failing in the SGML context because the case was being normalized). To maintain sanity, SGML environments should turn case sensitivity on.

def groveNodeMatch( context, node, pattern, principalType="Element"):
    """
    This method is used by Stylesheet.applyTemplates() to perform a grove node match.
    """
    if ( node == node.GroveRoot ):
        return 1
    normcase = 0  # XML default. May result in false negatives in some cases
    if hasattr(node.GroveRoot, "RefSyntax"):
        syn = node.GroveRoot.RefSyntax
        if hasattr(syn, "SubstGeneralNames"):
            normcase = node.GroveRoot.RefSyntax.SubstGeneralNames
    if pattern and ( pattern.__class__.__name__ == "DocumentNodeTest" ):
        if node.ClassName == "Element" and\
           __normalizeName( node.Gi, normcase ) == \
            __normalizeName( node.PrincipalTreeRoot.Gi, normcase ):
            return 1

        if ( node.ClassName != "Element" ):
            pTreeRoot = node.PrincipalTreeRoot
            if pTreeRoot and ( node.ClassName == pTreeRoot.ClassName ):
                return 1
            if ( node.ClassName == node.GroveRoot.ClassName ):
                return 1

    if pattern and ( pattern.__class__.__name__ == "NodeNameTest" ):
        if node.ClassName == "Element" and\
           __normalizeName( pattern._nodeName, normcase ) == \
             __normalizeName( node.Gi, normcase ):
            return 1
        elif node.ClassName == "AttributeAssignment" and\
             __normalizeName( node.Name, normcase ) == \
                __normalizeName( pattern._nodeName, normcase ):
            return 1
        elif type( node ) == types.StringType and \
            hasattr( context.node, node ):
            return 1
        if pattern._nodeName == node.ClassName:
            return 1

    return 0

def __normalizeName( name, normcase = 0 ):
    """
    Returns name in lowercase if normcase is 1; otherwise returns name unchanged.
    """
    if normcase == 1:
        return string.lower( name )
    return name
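The effect of the name-case setting on matching can be shown in isolation. names_match() is a hypothetical helper that mirrors the comparison made in groveNodeMatch(); the fold_case flag stands in for the general-name substitution setting described above.

```python
def names_match(pattern_name, node_name, fold_case):
    """Compare a pattern name with a node name, optionally ignoring case."""
    if fold_case:
        return pattern_name.lower() == node_name.lower()
    return pattern_name == node_name


# With case substitution on, "Para" and "PARA" are the same name.
sgml_style = names_match("Para", "PARA", fold_case=True)
# With XML rules, the same pair does not match.
xml_style = names_match("Para", "PARA", fold_case=False)
```

This is the source of the differing results described above when the same document is processed under different case rules.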

The getStringValue() function returns the string value of a node. In SGML groves, character data content is not held as a string as it is in the DOM but as a sequence of character nodes. Likewise, tokenized attribute values are held as a sequence of token nodes, not as a character string (but character data attribute values are held as strings). The data() method of grove nodes returns the string value of the “data” of the node, if any. In a grove's property set, you can designate one property to be the “content” of the node. Given such a property, if you ask the node for its data, you will get the concatenation of the data values of all of the nodes in the content property (if the property is nodal) or simply the value of the data property (if it is a primitive value). Likewise, if you ask the node for its children, you will get the value of the content property if it is nodal (if it is not nodal, then the node has no children). This bit of indirection allows different property sets to use whatever name they want for properties that are semantically the content of the node, instead of requiring them to be named “content” or “children”. Also, the grove specification is defined only in terms of named properties, not methods (such as childNodes(), as in the DOM). However, the fact that a property set can designate properties as content properties implies the need for the methods data() and children(), which GroveMinder provides.

def getStringValue( node, attributeName = None ):
    """
    Returns the string value data of the given node.
    """
    text = ''
    if type( node ) == types.StringType:
        text = node
    elif node and attributeName:
        text = getattr( node, attributeName )
    elif type( node ) == types.ListType:
        for obj in node:
            text = text + getStringValue( obj, attributeName )
    elif node and ( ( not attributeName ) or ( attributeName == '' ) ):
        text = node.data()
    return text
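The content-property indirection described above can be sketched as follows. Each node class designates one property as its content, and data() and children() are defined in terms of that designation. The class and property names are hypothetical and do not reproduce the SGML property set or the GroveMinder API.

```python
class SketchNode:
    """Hypothetical node whose class designates one property as its content."""

    content_property = None  # name of the designated content property, if any

    def __init__(self, **properties):
        self.properties = properties

    def _content(self):
        if self.content_property is None:
            return None
        return self.properties.get(self.content_property)

    def children(self):
        """Nodal content property: its nodes. Otherwise: no children."""
        content = self._content()
        return content if isinstance(content, list) else []

    def data(self):
        """Concatenated data of nodal content, or the primitive value itself."""
        content = self._content()
        if content is None:
            return ""
        if isinstance(content, list):
            return "".join(child.data() for child in content)
        return str(content)


class Char(SketchNode):
    content_property = "char"     # primitive content


class Para(SketchNode):
    content_property = "chars"    # nodal content


para = Para(chars=[Char(char=c) for c in "grove"], id="p1")
```

Callers never need to know that one class calls its content "chars" and another calls it "char".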

The getNodeContent() function returns the appropriate “content” value based on the node type. Here grove node processing is complicated by the fact that the DOM API distinguishes between node lists and strings where the SGML grove does not. In particular, character data content in the DOM is represented by text nodes, whereas in groves, character data content is represented as node lists of character data nodes. The GroveMinder API provides an optimization class, CharData (as opposed to the standard-defined DataChar node, which represents a single character), that acts more like DOM text nodes, that is, a single object that holds a single string. Note that the SGML grove design decision to use individual nodes for each character was driven by the need to address individual nodes for linking and styling purposes. The GroveMinder approach appears to be a reasonable compromise that does not eliminate the ability to treat strings as node lists of characters but provides the convenience that the DOM provides and that is all that most processing applications need. In any case, a grove implementation need not literally produce nodes for characters until they are specifically asked for.

The special case for the PseudoElement node class, which is used in SGML groves, optimizes the default processing that would otherwise occur. PseudoElement nodes are directly analogous to DOM Text nodes in that they always represent contiguous strings of data characters in element content, but unlike Text nodes, their content is a node list of DataChar nodes. However, PseudoElement nodes are specific to the SGML property set and are not a generic node type. In this case we can shortcut the processing of each individual node by calling the data() method directly, providing a significant performance improvement over the default behavior.

def getNodeContent( node ):
    """
    Returns the children or data for the input node.
    """

    if node == node.GroveRoot and node.PrincipalTreeRoot:
        node = node.PrincipalTreeRoot

    # NOTE: We have to special-case PseudoElement because the default behavior
    #       would cause us to iterate over a bunch of datachar nodes when we can
    #       return the data() directly.
    if node.ClassName == "PseudoElement":
        return node.data()

    if node.DataPropertyName:
        return node.data()

    if node.ChildrenPropertyName:
        return node.children()

    return ""
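The point that a grove implementation can hold character data as one string and still hand out per-character nodes on request can be illustrated with a minimal sketch. The class names LazyCharData and CharNode are hypothetical and are not part of the GroveMinder API; the sketch only shows the lazy-materialization idea.

```python
class CharNode:
    """Hypothetical stand-in for a single-character (DataChar-like) node."""

    def __init__(self, parent, index, char):
        self.parent = parent
        self.index = index
        self.char = char


class LazyCharData:
    """Holds one string; creates per-character nodes only when indexed."""

    def __init__(self, text):
        self._text = text
        self._nodes = {}  # cache of the CharNode objects created so far

    def data(self):
        # Fast path: return the whole string without creating any nodes.
        return self._text

    def __len__(self):
        return len(self._text)

    def __getitem__(self, index):
        # Slow path: materialize (and cache) a node for one character so
        # that it can be addressed individually, e.g. as an anchor member.
        if index < 0:
            index += len(self._text)
        if not 0 <= index < len(self._text):
            raise IndexError(index)
        if index not in self._nodes:
            self._nodes[index] = CharNode(self, index, self._text[index])
        return self._nodes[index]


chars = LazyCharData("grove")
whole = chars.data()   # no CharNode objects exist yet
third = chars[2]       # exactly one CharNode is created here
```

Most processing only ever calls data(); the per-character path is paid for only by code that actually needs to address a single character.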

The hasPIs() and getPIs() functions are helpers: the first determines whether a node might have processing instructions and the second gathers them up for easy processing.

def hasPIs( node ):
    """
    Returns 1 if the node is an SgmlDocument, or 0 if not.
    """
    if node and node.ClassName == 'SgmlDocument':
        return 1
    return 0

def getPIs( node ):
    """
    Returns the list of PI's for the given node.
    Returns an empty list if there are none.
    """
    piList = []
    if node:
        #groveBuilder = GroveMinder.makeGroveBuilder( "SGML" )
        #grovePlan = groveBuilder.systemMaximumGrovePlan()
        #node.withGrovePlan( grovePlan )
        for child in node.Prolog:
            if child.ClassName == "Pi":
                piList.append(child)
    return piList

The hasAttributes() and getAttributes() functions simply determine whether a node has attributes and, if so, return them.

def hasAttributes( node ):
    """
    Returns 1 if the given node has an Attributes attribute, or 0 otherwise.
    """
    if hasattr( node, "Attributes" ):
        return 1
    return 0

def getAttributes( node ):
    """
    Returns the list of the given node's attributes.
    Returns an empty list if there are no attributes.
    """
    attList = []
    if hasattr( node, "Attributes" ):
        attList = node.Attributes
    return attList

The getChildren() function emulates the childNodes() DOM method. If the ChildrenPropertyName property evaluates to true (the value is actually the name of the property) then in the GroveMinder API there will be a children() property that returns the value of the property designated in the property set as the content property.

def getChildren( node ):
    """
    Returns a list of the children of the given node.
    Returns an empty list if there are no children.
    """

    if node.ClassName == "SgmlDocument" and node.DocumentElement:
        node = node.DocumentElement

    if node.ChildrenPropertyName:
        return node.children()
    else:
        return []

The soi() function (storage object identifier) is a brute-force parser for turning formal system identifiers into normal file path strings. James Clark's SP parser, on which GroveMinder is based, normalizes all system identifiers into formal system identifiers. Formal system identifiers are part of the SGML Extended Facilities, defined in Annex A of ISO/IEC 10744:1997. There's probably a more efficient or compact way to write this parser but we haven't had a need to optimize it—this code was just lying about so we used it as is.

def soi(fsistr):
    """Given an FSI string in SP/GroveMinder form, return the SOI (filename) part"""
    ncro = None  # Numeric character reference open. Will be set in start tag.
    inncr = 0
    intag = 0
    ingi = 0
    inattspec = 0
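    inattval = 0
    inattname = 0  # initialized defensively; set when an attribute name starts
    lit = ""       # delimiter that opened the current attribute value literal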
    gi = ""
    attname = ""
    attval = ""
    ncr = ""
    rsoi = ""

    for i in range(0, len(fsistr)):
        c = fsistr[i]
        if c == "<":
            intag = 1
            ingi = 1
        elif c == ">":
            intag = 0
            ingi = 0
        elif c == ncro:
            if not intag:
                inncr = 1
                ncr = ""
            else:
                rsoi = rsoi + c
        elif c == "'":
          if intag:
             if inattspec:
                if not inattval:
                   inattval = 1
                   lit = c
                else: # Must be in attval
                   if lit == c:
                      # Closing delimiter: the attribute value is complete.
                      inattval = 0
                      inattspec = 0
                      if attname == "SMCRD":
                          ncro = attval
                   else:
                      attval = attval + c
             else:
               errmsg(__name__, "W",
                      "Lita (') found where not allowed in tag in FSI at character %d in FSI '%s'" %
                      (i, fsistr))
          else:
            rsoi = rsoi + c
        elif c == '"':  # Lit (double quote, ASCII 34)
          if intag:
             if inattspec:
                if not inattval:
                   inattval = 1
                   lit = c
                else: # Must be in attval
                   if lit == c:
                      inattval = 0
                      inattspec = 0
                   else:
                      attval = attval + c
             else:
               debug("Lit %s found where not expected in tag in FSI at character %d in FSI '%s'" %
                     ('"', i, fsistr))
          else:
            rsoi = rsoi + c
        elif c == "=":
          if intag:
            if inattval:
               attval = attval + c
            else:
               if inattspec:
                  inattname = 0
               else:
                  debug("Value indicator (=) found where not expected in tag in FSI at character %d in FSI '%s'" %
                        (i, fsistr))
          else:
            rsoi = rsoi + c
        elif c == " ":
          if intag:
             ingi = 0
             if inattspec:
               if inattname:
                  inattname = 0
               if inattval:
                  attval = attval + c
          else:
            rsoi = rsoi + c
        else:
            if inncr:
                if c == ";":
                    c = chr(int(ncr))
                    ncr = ""
                    inncr = 0
                else:
                    ncr = ncr + c
                    c = ""
            if intag:
                if inattspec:
                    if inattval:
                        attval = attval + c
                    if not inattval and not inattname:
                        inattval = 1
                        attval = c
                    if inattname:
                        attname = attname + c
                else: # must not be in attspec
                    if ingi:
                        gi = gi + c
                    if (not ingi) and (not inattspec):
                        inattspec = 1
                        inattname = 1
                        attname = c
            else: # Must be in rsoi
                rsoi = rsoi + c
    return rsoi
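As a hedged illustration of the “more compact way” mentioned above, the same extraction can be sketched with regular expressions. This is not the code we use; the name soi_compact is hypothetical, and the sketch assumes that attribute values never contain “>”, that numeric character references are decimal and end with a semicolon (as in the loop above), and that the reference delimiter is given by an SMCRD attribute.

```python
import re

# Any storage manager tag, e.g. <OSFILE SMCRD='^'>. Assumes no ">" in values.
_TAG = re.compile(r"<[^>]*>")
# The SMCRD attribute with either kind of literal delimiter.
_SMCRD = re.compile(r'''\bSMCRD\s*=\s*(["'])(.*?)\1''', re.IGNORECASE)


def soi_compact(fsistr):
    """Return the storage object identifier (filename) part of an FSI."""
    match = _SMCRD.search(fsistr)
    delim = match.group(2) if match else None

    # Drop the tags; what remains is the storage object identifier text.
    body = _TAG.sub("", fsistr)

    if delim:
        # Decode numeric character references of the form <delim>NNN;
        ncr = re.compile(re.escape(delim) + r"(\d+);")
        body = ncr.sub(lambda m: chr(int(m.group(1))), body)
    return body


plain = soi_compact("<OSFILE>C:\\docs\\manual.sgm")
decoded = soi_compact("<OSFILE SMCRD='^'>a^60;b.sgm")
```

The character-by-character version tolerates more malformed input and reports it; the regular-expression version trades that for brevity.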

The filename2url() function simply turns a filename (such as one returned by the soi() function) into a URL, as required by certain parts of the XSLT processor.

def filename2url(filename):
    """
    Given a filename to a local file, returns the equivalent
    'file:' URL.

    This function ensures that filenames are consistently
    converted to URLs as there seems to be some inconsistency
    in how the different URL-related packages do this.
    """
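    # Guard against one-character names before testing for a drive letter.
    if len(filename) > 1 and filename[1] == ":":
        filename = "%s|%s" % (filename[0], filename[2:])
    # string.find() returns -1 when nothing is found, so test >= 0 in order
    # to catch a backslash in the first position as well.
    if string.find(filename, "\\") >= 0:
        filename = string.replace(filename, "\\", "/")
    return "file:/" + filename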

Summary of XSL to Grove Binding

It should not be surprising that the conceptual mapping of XSL and XPath to groves is fairly straightforward. The DOM and groves are both based on the same basic idea of nodes with properties, and, as one would expect, the DOM and SGML groves have very similar data structures. The challenges are mostly in the handling of strings, which are pre-optimized in the DOM but left to implementations to optimize in groves. In addition, the greater generality of the grove approach, coupled with SGML's larger set of choices for things like case normalization, adds some complexity to the mapping, but not much.

Even with these obvious similarities, we expected the implementation task to be more difficult than it was. Our initial assumption was that we would have to wrap a DOM API around the grove API in order to plug our objects into the existing DOM-based processing framework. However, we discovered that Python's dynamic typing plus a few well-placed redirections allowed us to use grove nodes directly. The GroveUtility module provides as much DOM API mapping as we needed, being nothing more than convenience functions that concentrate the details of accessing grove-based data in a DOM-like way.
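The redirection technique can be sketched in a few lines. The node classes below are hypothetical stand-ins (they are neither the 4Suite DOM classes nor GroveMinder nodes); the point is only that processing code written against a small utility function works with either kind of object without a wrapper layer.

```python
class DomStyleNode:
    """Hypothetical DOM-like node: children exposed as an attribute."""

    def __init__(self, name, children=None):
        self.nodeName = name
        self.childNodes = children or []


class GroveStyleNode:
    """Hypothetical grove-like node: children exposed through a method."""

    def __init__(self, gi, content=None):
        self.Gi = gi
        self._content = content or []

    def children(self):
        return list(self._content)


def get_children(node):
    """The redirection point: hides which API a given node offers."""
    if hasattr(node, "childNodes"):
        return node.childNodes
    if hasattr(node, "children"):
        return node.children()
    return []


def count_descendants(node):
    """Processing code written once; it never checks the node's class."""
    total = 0
    for child in get_children(node):
        total += 1 + count_descendants(child)
    return total


dom_tree = DomStyleNode("doc", [DomStyleNode("p"), DomStyleNode("p")])
grove_tree = GroveStyleNode("doc", [GroveStyleNode("p", [GroveStyleNode("em")])])
mixed_tree = DomStyleNode("doc", [GroveStyleNode("p")])
```

Because nothing in count_descendants() names a class, the two node flavours can even be mixed within one tree.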

Implementing this same functionality in a statically typed language such as Java would require more work to map the grove API to the DOM API or significant reworking of the XSL implementation's class hierarchy to allow grove nodes to be used with DOM nodes. As our performance requirements increase, we will likely be forced to move to a different implementation language, probably C++. Another alternative would be to implement our own grove system in Java (GroveMinder has no Java binding), at which point we could have the grove implementation emulate the DOM API directly.

The implementation of the XSL-to-grove mapping gave us the first part of what we needed: the ability to apply XSL style sheets to arbitrary groves regardless of their data type (SGML, XML, Word, etc.). However, we still needed to be able to apply XSL style sheets to entire hyperdocuments, not just single documents. That capability would, for example, enable style sheets that act on compound documents composed of elements (or other node types) used by reference (transcluded) from many individual documents.

Binding XSL To Abstract Hyperdocuments

Having bound XSL processing to generic groves, the next challenge was to expose the hyperlinking information provided by the hyperdocument manager component of the Bonnell system so that style sheets can act on it. In particular, the following actions needed to be enabled:

  • Determine that a particular node is a member of one or more hyperlink anchors.
  • Access the anchor role names of the anchors of which a node is a member.
  • Access the link types of the links in which a node participates.
  • Get access to the nodes to which traversal is possible from a given node.
  • Make it easy and convenient to process node transclusions (“value references” in HyTime terminology).
  • Access arbitrary properties of hyperlink objects.

Given this set of functions it would be possible to produce whatever output result is desired based on the linking properties of nodes.

The Bonnell Hyperdocument Data Model

The hyperdocument provided by the Bonnell hyperdocument manager is an abstract hyperdocument that conforms to the data model shown in Figure 21. This data model is an abstraction over most reasonable ways to express hyperdocuments, including XLink, HyTime, HTML, proprietary or purpose-built hyperdocuments, and information systems that can be viewed as hyperdocuments (e.g., Microsoft Project). This data model is exposed through an API that provides a number of convenience functions for interrogating the hyperdocument, such as “getAllTraversalsFromNode()”. The intent with this API is to provide generic hyperdocument access to business logic such that the business logic is protected from the details of hyperdocument storage, syntactic representation, and management. This API also enables the direct programmatic creation of hyperdocuments.

Figure 21 is a UML diagram that reflects the fundamental hyperdocument data model. Each box represents a type or class. The lines represent relations between the types. The numbers and stars indicate the repeatability of the values. Reading from the upper left, a hyperdocument consists of zero or more hyperlinks (the black diamond represents containment or ownership). A hyperlink has two or more anchors as well as exactly one link type. An anchor has a role, which is defined within the context of a link type (and must be unique within the link type). An anchor aggregates (open diamond) zero or more members, which are “information objects”. At this level of abstraction, an information object can be anything you can point to (i.e., a “resource” in XLink terms). In the Bonnell implementation, information objects are grove nodes, following the HyTime model.
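The relationships just described can be written down as a handful of classes. The following is a minimal sketch of the data model only; the class and attribute names are illustrative assumptions and are not the Bonnell hyperdocument API. Plain strings stand in for grove nodes.

```python
class LinkType:
    """Names a kind of link and the anchor roles it defines."""

    def __init__(self, name, roles):
        if len(set(roles)) != len(roles):
            raise ValueError("anchor roles must be unique within a link type")
        self.name = name
        self.roles = tuple(roles)


class Anchor:
    """A role plus zero or more member information objects."""

    def __init__(self, role, members=()):
        self.role = role
        self.members = list(members)


class Hyperlink:
    """Exactly one link type and two or more anchors."""

    def __init__(self, link_type, anchors):
        anchors = list(anchors)
        if len(anchors) < 2:
            raise ValueError("a hyperlink requires two or more anchors")
        for anchor in anchors:
            if anchor.role not in link_type.roles:
                raise ValueError("role %r is not defined by link type %r"
                                 % (anchor.role, link_type.name))
        self.link_type = link_type
        self.anchors = anchors


class Hyperdocument:
    """Owns zero or more hyperlinks."""

    def __init__(self):
        self.hyperlinks = []

    def add(self, hyperlink):
        self.hyperlinks.append(hyperlink)
        return hyperlink


xref = LinkType("cross-reference", ["reference", "target"])
doc = Hyperdocument()
link = doc.add(Hyperlink(xref, [Anchor("reference", ["xref-element"]),
                                Anchor("target", ["section-2", "section-3"])]))
```

Note that an anchor with no members is legal in this model, while a hyperlink with fewer than two anchors is rejected.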

A hyperdocument is constructed from the data in a set of one or more documents, called the “bounded object set”, emphasizing the fact that the set of documents involved is finite. The use of a bounded set of source documents is a prerequisite for processing unconstrained extended links. This is for the simple reason that you cannot completely enable traversal from a node until you know what anchors it is a member of. Thus, in the general case, you must be able to fully resolve all the links before fully functional use of the hyperdocument can be made. The links can be fully resolved in a useful amount of time only if the set of documents involved is finite and invariant over the course of the processing.

Note that this is a fundamentally different model from that used for the Web, where it is taken as a given that the set of documents involved is essentially unbounded. Thus, HTML links are one-way simple links and there is no general expectation that one can navigate from the target of an HTML link to the start of the link without having first navigated from the link to the target. By contrast, the hyperlinking model used by Bonnell reflects the requirements of more-or-less closed information systems, such as technical documentation authoring and delivery systems, where the value of extended linking outweighs the cost of the systems that enable it. In particular, there are problems of managing hyperlinks among components of versioned documents that cannot be solved satisfactorily without this type of closed-system linking.

Figure 21: Abstract hyperdocument data model

Note the subtle shift in terminology of links and anchors from that normally used in discussing HTML. Because a given anchor may contain many nodes it is not sufficient for “anchors” to mean “elements that are anchors of hyperlinks”. Rather, a given element (or arbitrary node) may be a member of any number of anchors. Because both HyTime and XLink support extended linking, it may be possible for any node to be an anchor member regardless of whether or not it was originally intended for linking. This also means that there is not always a direct relation between the syntactic representation of a hyperlink and the node-based representation of it.

For example, an XLink simple link is a single element that defines a two-anchor link. When translated into the abstract hyperdocument data model, the simple link element becomes a node that is the only member of one anchor of the simple link. The target resources of the “href” attribute are the members of the other anchor. When an XPointer is resolved it may end up addressing multiple nodes at the grove or DOM level. Because the simple link element defines a hyperlink, it also results in the creation of a Hyperlink object in the abstract hyperdocument, which maintains a pointer back to the original simple link element as its “data source”.

Hyperlinks Augment the Description of the Base Markup

One of the motivating reasons for using XML is that XML markup is descriptive: tags and attributes add descriptive information to otherwise unstructured and undifferentiated text. This descriptive markup enables declarative style sheets, such as those written in XSL, to produce a variety of output forms from a single input source. This is XML motherhood.

Hyperlinks as provided by the above hyperlink model also provide descriptive information that can be used by style sheets to further qualify presentation, in addition to the value of simply having different information components connected together. That is, hyperlinks can do much more than simply enable navigation or aggregation: they can add a whole other layer of semantic characterization to information sets.

Most importantly, hyperlinks can impose this additional layer of descriptive information onto existing information unilaterally, leaving the original data untouched. Existing bodies of data can be annotated using hyperlinks to whatever additional degree of detail is desired. In extreme cases, links can even be used to impose structure onto data to which markup cannot be added directly, so-called “standoff markup”.

Hyperlinks provide two main ways to further classify information components: anchor roles and link types. Hyperlink anchors have named anchor roles that are unique within a given hyperlink. Hyperlinks have hyperlink types. The anchor role names serve to further classify their member nodes and provide an important selector for qualifying presentation style. For example, a given element type might have different presentation characteristics depending on what anchor role it is a member of or even whether or not it is a member of an anchor. Anchor roles can play the same role as element types. For example, an otherwise generic element type like “paragraph” might have different presentation styles based on what anchors it is a member of, as if it were a different element type.

Link types provide additional qualification information that may affect presentation style or that may need to be reflected in the rendered output (for example, in a dialog or intermediate page that enables traversal to multiple targets from a single anchor member).

Thus, anchor roles and link types provide two additional possible dimensions of qualification in addition to the normal element in context.
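One way to picture these extra dimensions is as a style lookup keyed on more than the element type. The sketch below is purely illustrative: the function select_style(), the style names, and the role and link type names are all hypothetical, and in practice this selection is expressed in the style sheet rather than in Python.

```python
def select_style(styles, element_type, anchor_role=None, link_type=None):
    """Return the most specific style registered for a node.

    Keys are (element type, anchor role, link type); None means "any".
    """
    candidates = [
        (element_type, anchor_role, link_type),  # fully qualified
        (element_type, anchor_role, None),       # element type plus role
        (element_type, None, None),              # element type only
    ]
    for key in candidates:
        if key in styles:
            return styles[key]
    return "default"


STYLES = {
    ("paragraph", None, None): "body-text",
    ("paragraph", "definition", None): "indented-italic",
    ("paragraph", "definition", "glossary"): "glossary-entry",
}

plain = select_style(STYLES, "paragraph")
by_role = select_style(STYLES, "paragraph", "definition")
by_role_and_type = select_style(STYLES, "paragraph", "definition", "glossary")
```

The same generic “paragraph” element thus receives three different presentations depending only on its linking context.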

XSL to Hyperdocument Implementation Approach

The implementation consists of XSLT extension elements and a set of XPath extension functions that operate directly on hyperdocuments, organized into the following Python modules:

  • SpidermanXPathExtensions.py (Spiderman is the code name for the link management component of the Bonnell system). Provides the following functions:
    • IsAnchoredObject()
    • IsValueRef()
    • GetTraversalTargets()
    • GetPropertyValueFromObject()
    • GetPropertyValue()

    This module also defines the AnchoredObject class, which represents the binding of a single Anchor object to the nodes that can be traversed to from that anchor. This object enables iteration using for-each within the style sheet over the list of anchors to which traversal is allowed from a node.
  • LinkExtension.py. Superclass extension element implementation. Used by the TraversalTargetNode and ValueRef extension elements.
  • TraversalTargetNodeExtension.py. Implements the “traversaltargetnode” extension element, which provides built-in style choices for formatting nodes that are potential traversal targets.
  • ElementValueRefExtension.py. Implements the “elementvalueref” extension element, which provides built-in style choices for formatting element nodes that transclude other elements.

These four packages represent about 400 total lines of active code, much of which is simply implementing the 4Suite-defined API for extension elements. The actual business logic reflected in this code is pretty simple. All of the hard work of interacting with and getting access to the information in the hyperdocument is done by the underlying hyperdocument management system exposed through the Bonnell hyperdocument API. Thus these extensions represent a very thin layer integrating the underlying information management system to the XSLT processor.

In addition to these new modules, we also extended the built-in 4Suite “context” object to take a hyperdocument object so that our extensions can get access to the hyperdocument. This addition can be seen in Figure 13. This addition of the Bonnell hyperdocument object to the XSLT context object is the integration point that the rest of the hyperlinking extensions depend on. This form of extension could probably be further generalized in the 4Suite implementation but we did not pursue any such generalizations. (Although in the writing of this paper we started experimenting with a “helper” object that we attach to the Processor object at run time, with an eye towards refactoring away from direct modification of the Processor constructor.)

The techniques used here could be similarly applied to other information systems with similar ease (assuming a comparable API is provided by the underlying system).

Note that the hyperdocument provided by the hyperdocument manager is not a grove. One implementation approach would be to wrap a grove API around the hyperdocument nodes. However, this turned out to be unnecessary. The 4Suite implementation can process any Python object. All we had to do was extend the “getProperty” processing to interrogate the properties of any Python object (including methods that take no arguments). Given this very simple extension, it becomes possible to apply XSL style sheets to any set of Python objects, not just grove nodes. Thus, a style sheet can apply templates and XPath expressions to the abstract hyperdocument objects as though they were grove nodes. Given this, all things are possible through the style sheet. The only other requirement may be to provide XPath extension functions that perform complex queries (e.g., GetTraversalTargets) or additional parameterless object methods that can be interrogated through XPath property checks.

SpidermanXPathExtensions.py

This package provides the following five XPath extension functions:

  • IsAnchoredObject(context). Returns True if the node is a member of one or more anchors.
  • IsValueRef(context). Returns True if the node transcludes other nodes.
  • GetTraversalTargets(context). Returns a list of “anchoredobject” nodes that represent a mapping from anchor objects to the members of those anchors to which traversal is allowed. Provides the set of nodes to which traversal is allowed from the input node.
  • GetPropertyValue(context, propertyName, forceListReturn). Returns the value of the specified property from the input node. Enables arbitrary navigation through any graph of objects.
  • GetPropertyValueFromObject(context, object, propertyName, forceListReturn). Gets the value of the specified property from the specified object. The context parameter is required by the 4Suite implementation but is ignored. This function enables interrogating arbitrary objects, not just the current context object.

This list of functions could easily be extended to add more convenience functions or to implement complex queries specific to particular hyperdocument designs or input document types. XSL was designed with an extension mechanism. We did not have any difficulty adding extensions using the 4Suite implementation.

IsAnchoredObject() extension function

The IsAnchoredObject() function is a straight delegation to the corresponding method of the Bonnell hyperdocument API. The primary job of the hyperdocument manager is to know about all the links among all the nodes and therefore whether or not a given node is a member of any anchors (remembering that a node may be a member of more than one anchor).

When processing documents in the context of a hyperdocument, this question must be asked of every node that might be an anchor member (which in the general case is every node).

Figure 22: IsAnchoredObject() extension function
def IsAnchoredObject(context):
    """
    Returns 1 if the given context's node is an anchor member of its hyperdocument.
    Returns 0 otherwise.

    Available in XPath as is-anchored-object().
    """
    if GroveUtility.isGroveNode( context.node ):
        hyperDoc = context.hyperdocument
        if hyperDoc and hyperDoc.isAnchorMember(context.node):
            return 1
    return 0
IsValueRef() extension function

Like IsAnchoredObject(), IsValueRef() delegates to the corresponding method of the hyperdocument management API.

Figure 23: IsValueRef() extension function
def IsValueRef(context):
    """
    Returns 1 if the given context's node is a value reference of its hyperdocument.
    Returns 0 otherwise.
    """
    if GroveUtility.isGroveNode( context.node ):
        hyperDoc = context.hyperdocument
        if hyperDoc and hyperDoc.isValueRef( context.node ):
            return 1
    return 0
GetTraversalTargets() extension function

This is also a straight delegation, but the result returned from the hyperdocument management API must be converted into a form usable by the XSLT processor. The getAllTraversalsFromNode() API method returns a dictionary that maps hyperlink anchors to node lists. The anchors are the anchors to which traversal is allowed from the context node (not the anchors of which the node is itself a member). This is further complicated by the fact that the anchor a node is a member of may have other members, each of which is also a potential traversal target for the context node (the context node is excluded from the traversal lists for anchors of which it is itself a member). The HyTime standard distinguishes these two forms of traversal as “link traversal” and “list traversal”. However, for the purpose of simply knowing where you can go from a given node, these distinctions can be ignored.

For example, if the context node is the sole member of exactly one anchor, then the traversal targets dictionary will have exactly one member whose key is the other anchor of the link and whose value is the list of nodes in that anchor (if any). If the context node is the sole member of two anchors, each of a different two-anchor link, then the traversal targets dictionary will have two members, one for each of the other anchors of each link.

Note that the same node may occur in multiple traversal target anchors.

The current hyperdocument management implementation ignores traversal constraint specifications such as XLink arcs. If taken into account, these constraints would further filter the list of nodes to only those to which traversal is actually allowed according to the traversal constraints.
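The shape of the traversal dictionary described above can be made concrete with a small sketch. This is our reading of the behavior expressed in a few lines, not the Bonnell implementation: the function all_traversals_from_node() and the Anchor class are hypothetical, each hyperlink is represented simply as a list of anchors, and strings stand in for grove nodes. Like the current implementation, the sketch ignores traversal constraints.

```python
class Anchor:
    """Hypothetical anchor: a role name and a list of member nodes."""

    def __init__(self, role, members):
        self.role = role
        self.members = list(members)


def all_traversals_from_node(links, node):
    """Map each anchor reachable from node to the nodes reachable in it."""
    targets = {}
    for anchors in links:
        if not any(node in anchor.members for anchor in anchors):
            continue  # the node does not participate in this link
        for anchor in anchors:
            # The context node is never its own traversal target.
            reachable = [m for m in anchor.members if m != node]
            is_own_anchor = node in anchor.members
            # Other anchors are always listed (even if empty); the node's
            # own anchor is listed only if it has other members.
            if reachable or not is_own_anchor:
                targets[anchor] = reachable
    return targets


link1 = [Anchor("term", ["n1"]), Anchor("definition", ["n2", "n3"])]
link2 = [Anchor("see", ["n1", "n4"]), Anchor("also", ["n2"])]
result = all_traversals_from_node([link1, link2], "n1")
roles = sorted(anchor.role for anchor in result)
```

Here n1 is the sole member of the “term” anchor, so that anchor is omitted, while its fellow member n4 in the “see” anchor is reachable; n2 appears in two different target anchors.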

Figure 24: GetTraversalTargets() extension function and AnchoredObject Class
def GetTraversalTargets(context):
    """
    Returns a list of all traversal targets for a given context's node.
    Returns an empty list otherwise.
    """
    node_set = []
    if GroveUtility.isGroveNode( context.node ):
        hyperDoc = context.hyperdocument
        if hyperDoc:
            traversalTargets = hyperDoc.getAllTraversalsFromNode( context.node)
            if not traversalTargets:
                return node_set
            else:
                for anchor, traversals in traversalTargets.items():
                    node_set.append(AnchoredObject(anchor, traversals))

    return node_set

class AnchoredObject:
    """
    Single object that binds an anchor to the allowed traversal
    members of that anchor. Used to pass back the results from
    GetTraversalTargets(). Emulates the AnchoredObject node in the
    HyTime semantic grove.
    """

    def __init__(self, anchor, traversals):
        self.anchor = anchor
        self.traversals = traversals
GetPropertyValue() and GetPropertyValueFromObject() extension methods

These two methods enable access to any properties of any grove node or Python object through the style sheet. Through these functions, a style sheet can navigate anywhere within a grove or other collection of objects.

GetPropertyValue() always operates on the context node. It simply delegates to the more general GetPropertyValueFromObject(). Note that GetPropertyValueFromObject() attempts to call callable properties of objects. This enables reference in the style sheet to object methods as though they were static properties, as long as those methods take no parameters.

def GetPropertyValue(context, propertyName, forcelistreturn = 1):
    """
    Returns the property value for the given context's node and property name.
    Returns an empty list otherwise.
    """

    return GetPropertyValueFromObject(None, context.node, propertyName,
                                      forcelistreturn)

def GetPropertyValueFromObject(context, object, propertyName,
                               forcelistreturn = 1):
    """
    Returns the property value for the given object and property name.
    Returns an empty list otherwise.

    Because some property values will return a list,
    always return a list to provide a consistent interface
    in the XSL stylesheets. (This also provides an easy mechanism
    for collecting information from the returned objects.)
    """
    results = []
    returnValue = []
    if ( type( object ) == types.ListType ) and ( len( object ) == 1 ):
        object = object[ 0 ]
    if object and GroveUtility.isGroveNode( object ) and\
        hasattr( object, propertyName):
        returnValue = getattr( object, propertyName )
    else:
        if hasattr(object, propertyName):
            returnValue = getattr(object, propertyName)
        else:
            returnValue = "{property %s not exhibited by node}" % propertyName

    if callable(returnValue):
        returnValue = returnValue()

    if type( returnValue ) == types.ListType:
        results = returnValue
    elif returnValue and ( forcelistreturn > 0 ):
        results.append( returnValue )
    elif returnValue and ( forcelistreturn < 1 ):
        results = returnValue
    return results
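The core idea, treating attributes and parameterless methods uniformly and always handing a list back to the style sheet, can be shown in isolation. The following is a simplified sketch rather than the function above: property_values() and SampleAnchor are hypothetical names, and unlike the original the sketch returns an empty list (not a diagnostic string) when the property is missing.

```python
def property_values(obj, name):
    """Return the named property of obj as a list, calling it if callable."""
    if not hasattr(obj, name):
        return []
    value = getattr(obj, name)
    if callable(value):
        value = value()  # a parameterless method is treated as a property
    if isinstance(value, list):
        return value
    if value is None:
        return []
    return [value]  # wrap single values so callers can always iterate


class SampleAnchor:
    """Hypothetical business object with plain and computed properties."""

    def __init__(self, role, members):
        self.role = role
        self.members = members

    def memberCount(self):
        return len(self.members)


anchor = SampleAnchor("target", ["n1", "n2"])
role = property_values(anchor, "role")
members = property_values(anchor, "members")
count = property_values(anchor, "memberCount")
missing = property_values(anchor, "no-such-property")
```

Because every result is a list, a style sheet can iterate over the value of any property without first testing what kind of value it received.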

Mapping of Python Functions to XPath Function Names

The following Python dictionary reflects 4Suite's method for mapping the XPath extension names to their Python implementations.

"""The ExtFunctions dictionary is required by the 4Suite implementation of XPath extension functions."""
ExtFunctions = {
    ('http://datachannel.com/Bonnell/Transform', 'is-anchored-object'): IsAnchoredObject,
    ('http://datachannel.com/Bonnell/Transform', 'is-value-reference'): IsValueRef,
    ('http://datachannel.com/Bonnell/Transform', 'get-traveral-targets'): GetTraversalTargets,
    ('http://datachannel.com/Bonnell/Transform', 'get-traversal-targets'): GetTraversalTargets,
    ('http://datachannel.com/Bonnell/Transform', 'get-property-value-from-object'): GetPropertyValueFromObject,
    ('http://datachannel.com/Bonnell/Transform', 'get-property-value'): GetPropertyValue
}
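To make the role of the tuple keys concrete, the sketch below shows how a processor could resolve a prefixed function name used in a style sheet to one of these entries. It illustrates the keying scheme only and is not 4Suite's dispatch code; resolve_extension(), the “bx” prefix, and the placeholder function are assumptions made for the example.

```python
BONNELL_NS = "http://datachannel.com/Bonnell/Transform"


def is_anchored_object(context):
    """Placeholder standing in for the real extension function."""
    return 1


registry = {
    (BONNELL_NS, "is-anchored-object"): is_anchored_object,
}


def resolve_extension(registry, namespaces, qname):
    """Find the Python function for a prefixed name such as 'bx:name'.

    namespaces maps the prefixes in scope to namespace URIs, so the lookup
    depends on the URI and the local name, never on the prefix itself.
    """
    prefix, local_name = qname.split(":", 1)
    uri = namespaces[prefix]
    return registry[(uri, local_name)]


in_scope = {"bx": BONNELL_NS}
function = resolve_extension(registry, in_scope, "bx:is-anchored-object")
```

Keying on the namespace URI rather than the prefix means that a style sheet is free to bind any prefix it likes to the extension namespace.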

XSLT Extension Element Implementations

The hyperdocument support provides two extension elements that simplify the task of formatting anchor members and value references (transclusions): traversaltargetnode and elementvalueref. These two elements are both derived from the common superclass LinkElement in the underlying Python implementation.

LinkExtension.py

The LinkElement class, defined in LinkExtension.py, implements the functionality and attributes common to the traversaltargetnode and elementvalueref elements. This reflects the fact that value references are just a special case of linking with different default behavior but otherwise identical semantics and options. The LinkElement class could be further specialized to handle more specific cases, such as link types defined in corporate document types.

The LinkElement class implements most of the business logic for processing links as it is common to both the traversal target node and element valueref cases.

########################################################################
#
# Bonnell XSLT - Link Extension Superclass for XSLT.
#
# Copyright (c) 2001, DataChannel, Inc. All rights reserved.
#
#
# This is a superclass for Traversal Target Node and Element Value Reference
# XSLT Extensions.
#
#-----------------------------------------------------------------------

import os
import string, types

import xml.dom.ext  # needed by setup(), which calls xml.dom.ext.GetAllNs()

import xml.xslt
from xml.xslt import XsltElement, XsltException, Error
from xml.xslt.AttributeValueTemplate import AttributeValueTemplate

from spiderman_xsl.utils import GroveUtility

class LinkElement( XsltElement ):
    """
    This is intended to be a super class to provide a common
    set of attributes for link type extension elements.
    """
    def __init__(self, doc, uri, localName, prefix, baseUri=''):
        XsltElement.__init__(self, doc, uri, localName, prefix, baseUri)
#        self.extensionNss = []
        return

    def instantiate(self, context, processor):
        """
        Common instantiate processing that all subtypes must do.
        """
        if not hasattr(processor, "SpidermanXSLHyperdocHelper"):
            from SpidermanXSLHyperdocHelper import SpidermanXSLHyperdocHelper
            processor.SpidermanXSLHyperdocHelper = SpidermanXSLHyperdocHelper()
        context.setNamespaces(self._nss)


    def setup(self):
        """
        This method is required by the 4Suite Implementation of XSLT Extensions. It
        builds up a dictionary mapping attribute to attribute value template. An attribute
        value template holds the namespace and the value for the attribute.

        The supported link styles are (transclude | link ).

        """
        self.__dict__['_nss'] = xml.dom.ext.GetAllNs(self)
        self.__dict__['_style'] = AttributeValueTemplate(self.getAttributeNS('',
                                                                             'style'))

        # attempt to collect additional attribute information for 'link' creation.
        self.__dict__['_linkref'] = AttributeValueTemplate(self.getAttributeNS('',
                                                                               'linkref'))
        self.__dict__['_outputdir'] = AttributeValueTemplate(self.getAttributeNS('',
                                                                                 'outputdir'))
        self.__dict__['_outputsuffix'] = AttributeValueTemplate(self.getAttributeNS('',
                                                                                    'outputsuffix'))
        return

    def _createLinkToNode( self, context, processor, ref ):
        """
        Writes the URL of the given target node (ref) to the current output.
        If the document containing the node has not yet been processed, this
        method first has the processor create an output stream for it and
        process the node's grove root to that stream.
        """
        outputSuffix = str( self._outputsuffix )
        sourceDataLocation = GroveUtility.soi( ref.groveDefinition().sourceData() )

        outputAddress = processor.getOutputLocation( sourceDataLocation )

        if not outputAddress:
            print "Processing %s for output" % sourceDataLocation
            (outStream, outputAddress) = processor.createOutputStream( str( self._outputdir ),
                                                                       sourceDataLocation,
                                                                           outputSuffix )
            print "Output location is '%s'" % outputAddress
            processor.registerOutputSource( sourceDataLocation, outputAddress )

            processNode = ref.GroveRoot
            proc = processor.createProcessorCallback( context, processor, processNode )
            proc.runGroveNode( processNode, 0, None, outputStream=outStream )

        fragmentId = ""
        if ref != ref.GroveRoot:
            if hasattr(ref, "PrincipalTreeRoot") and ref == ref.PrincipalTreeRoot:
                pass
            else:
                fragmentId = "#%s" % processor.SpidermanXSLHyperdocHelper.getFragmentIdForNode(ref)
        targetURL = "%s%s" % (outputAddress, fragmentId)
        processor.writers[-1].text(targetURL)

    def _processTranscludeStyle( self, context, processor, node ):
        """
        Transcludes the given node's text into the current document's output.
        """
        writer = processor.writers[ -1 ]

        # context.node is the transclusion destination and the variable 'node' is our source.
        processor.runGroveNode( node, 0, None, writer, startAndEndDocument = 0 )

    def _reportErrorMessage( self, message ):
        """
        Error reporting mechanism. Currently only prints the message.
        """
        print message
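
The bookkeeping performed by _createLinkToNode can be summarized in a few lines of standalone Python. This is a sketch under stated assumptions, not the Bonnell processor: the function names are hypothetical, the registry is a plain dictionary, and the rule for deriving the output file name (base name of the source plus the suffix) is an assumption made for illustration. The real method also processes the target grove to the new output location the first time it is seen.

```python
import os


def output_address_for(registry, source_location, output_dir, suffix):
    """Return the output address for a source document.

    The address is created and registered on first use, so every later
    link to the same source document resolves to the same output file.
    """
    address = registry.get(source_location)
    if not address:
        base = os.path.splitext(os.path.basename(source_location))[0]
        address = "%s/%s%s" % (output_dir, base, suffix)  # assumed naming rule
        registry[source_location] = address
    return address


def target_url(address, fragment_id=None):
    """Append a fragment identifier unless the whole document is the target."""
    if fragment_id:
        return "%s#%s" % (address, fragment_id)
    return address


registry = {}
address = output_address_for(registry, "books/jude.xml", "./website", ".html")
print(target_url(address))          # ./website/jude.html
print(target_url(address, "ch2"))   # ./website/jude.html#ch2
```

Passing no fragment identifier corresponds to the case in which the target is the grove root or the principal tree root, where the listing leaves the fragment empty.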

TraversalTargetNodeExtension.py

########################################################################
#
# Bonnell XSLT - Traversal Target Extension for XSLT.
#
# Copyright (c) 2001, DataChannel, Inc. All rights reserved.
#
# Manages the processing of traversal target nodes. In particular,
# makes sure that the target nodes are in a generated output file.
# Returns the URL to use to refer to the traversal target.
#
#-----------------------------------------------------------------------

from spiderman_xsl.utils import GroveUtility
from spiderman_xsl.xslt import LinkExtension

class TraversalTargetNodeElement( LinkExtension.LinkElement ):
    """
    Manages the processing of traversal target nodes. In particular,
    makes sure that the target nodes are in a generated output file.
    Returns the URL to use to refer to the traversal target.

    Attributes:

    style = { "link" | "transclude" }
      Determines the output result for the value reference node.

      link:       The node's grove is processed to its own output file and the
                  URL to the node in that document is returned.
      transclude: The node is processed in place

    outputdir = pathspec
      Defines the output location for the result data when style="link". Relative
      to the current input document's location. Default is "." (current directory)

    outputsuffix = extensionspec
      Defines the extension to be used for output files when style="link", e.g. ".html".
      Default is ".html"

    Example:
    <xsl:template match="mylink">
     <p>
      <xsl:for-each select="ext:get-traversal-targets()">
        <br/><xsl:value-of select="ext:get-property-value-from-object(
              ext:get-property-value-from-object(ext:get-property-value( 'anchor', 0),
                                                 'hyperlink', 0),
              'getLinkTypeName',
              0)"/><xsl:text>: </xsl:text>
          <xsl:for-each select="@traversals">
            <xsl:element name="a">
              <xsl:attribute name="href"
                ><ext:traversaltargetnode outputdir="./website" outputsuffix=".html" style="link"
              /></xsl:attribute>
              <xsl:text>Click here, </xsl:text>
            </xsl:element>
          </xsl:for-each>
      </xsl:for-each>
     </p>
    </xsl:template>
    """

    def __init__(self, doc, uri='http://datachannel.com/Bonnell/Transform', localName='traversaltargetnode',
                 prefix='ext', baseUri=''):
        LinkExtension.LinkElement.__init__(self, doc, uri, localName, prefix, baseUri)
        self.extensionNss = []

    def instantiate(self, context, processor):
        """
        """
        LinkExtension.LinkElement.instantiate(self, context, processor)
        origState = context.copy()

        self._processTarget( context, processor, context.node)

        context.set(origState)
        return (context,)

    def _processTarget( self, context, processor, node ):
        """
        Processes the traversal target node. Either transcludes or creates a link based on style.
        Uses _processTranscludeStyle for transclusion and _createLinkToNode for link creation.
        """
        # print "in _processTarget, context.node=%s" % repr(context.node)
        style = str( self._style )
        if style and  style == "transclude":
            self._processTranscludeStyle( context, processor, node )
        else:
            return self._createLinkToNode( context, processor, node )

"""The ExtElements dictionary is required by the 4Suite implementation of XSLT extensions."""
ExtElements = {
    ('http://datachannel.com/Bonnell/Transform', 'traversaltargetnode'): TraversalTargetNodeElement
    }

ElementValueRefExtension.py

########################################################################
#
# Bonnell XSLT - Element Value Reference Extension for XSLT.
#
# Copyright (c) 2001, DataChannel, Inc. All rights reserved.
#
# This extension element handles "#ELEMENT" value references, that is,
# elements that transclude exactly one target element. This element
# is a convenience for the common, but special case, of value reference
# where the target is exactly one node and the referencing element
# is not (normally) reflected in the output.
#
# Attributes:
#
# style = { "link" | "transclude" }
#   Determines the output result for the value reference node.
#
#   link:       The node's grove is processed to its own output file and the
#               URL to that document is returned.
#   transclude: The node is processed in place
#
# outputdir = pathspec
#    Defines the output location for the result data when style="link". Relative
#    to the current input document's location. Default is "." (current directory)
#
# outputsuffix = extensionspec
#    Defines the extension to be used for output files when style="link", e.g. ".html".
#    Default is ".html"
#
# Example:
#  <xsl:template match="book-ubr">
#    <p>
#   <xsl:element name="a">
#            <xsl:attribute name="href"
#              ><ext:elementvalueref outputdir="./website" outputsuffix=".html" style="link"
#            /></xsl:attribute>
#            <xsl:value-of select="@book"/>
#        </xsl:element>
#    </p>
#  </xsl:template>

from spiderman_xsl.utils import GroveUtility
from spiderman_xsl.xslt import LinkExtension

class ElementValueRefElement( LinkExtension.LinkElement ):
    def __init__(self, doc, uri='http://datachannel.com/Bonnell/Transform', localName='elementvalueref',
                 prefix='ext', baseUri=''):
        LinkExtension.LinkElement.__init__(self, doc, uri, localName, prefix, baseUri)
        return

    def instantiate(self, context, processor):
        """ This method is required by the 4Suite Implementation of XSLT Extensions. """
        LinkExtension.LinkElement.instantiate(self, context, processor)
        origState = context.copy()

        hyperDoc  = context.hyperdocument
        if hyperDoc and hyperDoc.isValueRef( context.node ):
            try:
                resolvedRefs = hyperDoc.resolveValueRefs( context.node, 0 )
            except Exception, e:
                import traceback
                traceback.print_exc()
                resolvedRefs = []
                self._reportErrorMessage( "Value reference warning: %s" % e )
            if len(resolvedRefs) > 1:
                self._reportErrorMessage( "Attempt to process more than one node with ElementValueRef, "\
                                          "will process first node in result list.")
            if len( resolvedRefs ) > 0:
                self._processValueRef( context, processor, resolvedRefs[0] )

        context.set(origState)
        return (context,)

    def _processValueRef( self, context, processor, resolvedReference ):
        """
        Depending on the 'style', this method calls to either transclude or create links.

        ResolvedReference is the node to which the value reference resolved.
        """

        style = str( self._style )
        if (not style) or (style and  style == "transclude"):
            self._processTranscludeStyle( context, processor, resolvedReference )
        elif style and style == "link":
            self._createLinkToNode( context, processor, resolvedReference )

"""The ExtElements dictionary is required by the 4Suite implementation of XSLT extensions."""
ExtElements = {
    ('http://datachannel.com/Bonnell/Transform', 'elementvalueref'): ElementValueRefElement
    }

Summary of Hyperdocument Implementation

The mapping of XSLT to groves is useful but does not fundamentally change the types of processing one can do with XSL. By contrast, the hyperdocument processing extensions open up XSLT processing to a whole new world of potential data sources, including the hyperdocuments managed by the Bonnell system as well as almost any other type of data that is exposed through an integration API to which the XSLT processor can be bound.

Our implementation takes full advantage of the underlying hyperdocument management system to make access to otherwise complex processing, such as determining traversal targets for complex links or resolving transclusions, as easy as it could be for style sheet authors. It also takes advantage of Python's flexibility with regard to object processing to easily enable the processing of arbitrary Python objects with an XSL style sheet.

All of the actual style-specific business logic is concentrated in the two extension elements, which either create HTML links to target nodes or process the target nodes at the point of reference and inject their output into the main output stream.

We found that the primary challenge in this task was understanding the details of the 4Suite extension mechanism, which is not particularly hard to understand or use. In fact, the entire integration process was remarkably easy. Some of this ease was a side effect of using Python as an implementation language because, as an interpreted language, Python imposes very little development overhead. Some of it was because we were already very familiar with the hyperdocument processing domain and therefore knew exactly what we wanted to do. But some of it was because the actual scope of the work to be done was quite small. In fact, keeping our modifications to the 4Suite code up to date as that code base has evolved has probably cost us almost as much effort as the initial implementation.

Hyperdocument Styling Examples

This section presents several examples of hyperlink-based XSL style sheets.

Hyperdoc Styling Example 1: Compound Document Transclusion

This example uses transclusion to create a single output document from multiple, otherwise independent component documents. In this case, the links are represented using the HyTime “value reference” markup, which is a special case of hyperlink that has the explicit semantic of use-by-reference. However, because the presentation effect of the use-by-reference is implemented in the style sheet (and not, for example, in a processing step before the XSL processing step), the style sheet can choose whether or not to present the nodes at the point of reference. Likewise, more general links can also be given a transclusion style.

For this example, we have broken the New Testament document from the Jon Bosak world religions set into a number of distinct documents, one for each “book”. We have then created a new hub or “master” document that uses valueref links to compose the full New Testament as a compound document. This example demonstrates the ease with which such compound documents can be constructed and processed using relatively simple XSL style sheets. Being able to construct and process these types of compound documents is a prerequisite for enabling true re-use of document components, a common requirement in technical documentation.

Source documents

The source documents for this example consist of various books of the New Testament and a "master" document that uses transclusion (value reference) links to define the order of inclusion of the books. The component books are modified from the original Jon Bosak distribution only as needed to turn them into independent XML documents. The markup and content has not otherwise been changed in any way. A typical example is this version of the Epistle of Jude:

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE book SYSTEM "urn:magic:bosak:religion:common:testament DTD">
<?xml-stylesheet href="gospels-to-html_stylelink.xsl" type="text/xml"?>

<book id="jude">
<bktlong>The General Epistle of JUDE.</bktlong>
<bktshort>Jude</bktshort>
<chapter>
<chtitle>Chapter 1</chtitle>
<v>Jude, the servant of Jesus Christ, and brother of James, to them that are
sanctified by God the Father, and preserved in Jesus Christ, and called: </v>
<v>Mercy unto you, and peace, and love, be multiplied. </v>
...

All that has been added is the DOCTYPE declaration, the XML declaration and the style sheet reference.

The master document augments the original markup by making the document a HyTime document (a document that conforms to the HyTime architecture) and by adding the book-ubr and book-ubr-by-id element types. These two element types create the transclusion ("use by reference") links. We also included only a few books simply to shorten processing time for testing purposes. The document instance markup is:

<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet href="./gospels-hubdoc_stylelink.xsl" type="text/xml"?>

<!DOCTYPE tstmt SYSTEM "../common/tstmt-hytime.dtd" [

<!ENTITY jude SYSTEM "jude.xml" NDATA sgml>
<!ENTITY revelations SYSTEM "revelations.xml" NDATA sgml>
<!ENTITY timothy-1 SYSTEM "timothy-1.xml" NDATA sgml>
<!ENTITY timothy-2 SYSTEM "timothy-2.xml" NDATA sgml>
<!ENTITY nephi PUBLIC "mormon-Nephi" NDATA sgml>
]>

<tstmt id="tstmthub">
<coverpg>
<title>The New Testament</title>
<title2>One of a group of four religious works marked up for electronic publication
from publicly available sources</title2>
[...]
<bookcoll>
<book-ubr book="jude"/>
<book-ubr-by-id book="n-timothy-1-c-2"/>
[...]
</bookcoll>
<nmsploc id="n-jude" namespc="elements" locsrc="jude">jude</nmsploc>
<nmsploc id="n-timothy-1-c-2" namespc="elements" locsrc="timothy-1">tim1c2</nmsploc>
[...]
</tstmt>

The entity declarations declare the documents that will be referenced for transclusion by the book-ubr/book-ubr-by-id elements. Note that these are declared as unparsed entities, not parsed entities.

Both the "book-ubr" and "book-ubr-by-id" elements establish a "value reference" relationship between the element making the reference and the element pointed to. The difference between the two element types is the form of address used. The "book-ubr" element type uses a direct entity reference, which, by HyTime's defaulting rules, is an implicit reference to the document element of the target document (if it's an SGML document). If the entity is not interpreted as an SGML document and therefore results in a different type of grove, an entity reference is an implicit reference to the "principal tree root" of that grove. The principal tree root of a grove is the node designated in the grove's property set as such. If a principal tree root is not defined, the grove root is the principal tree root. Thus, the book-ubr element in this example is, through the entity reference, a pointer to the document element node of the SGML grove constructed from the document "jude.xml".
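
The defaulting rule for entity references is short enough to state as code. The following is a minimal sketch, assuming a hypothetical `Grove` class with two attributes; it does not reflect the actual grove API of the Bonnell system.

```python
class Grove:
    """Hypothetical grove: a root node plus an optional principal tree root."""

    def __init__(self, grove_root, principal_tree_root=None):
        self.grove_root = grove_root
        self.principal_tree_root = principal_tree_root


def entity_reference_target(grove):
    """Return the node that a bare entity reference addresses by default."""
    if grove.principal_tree_root is not None:
        return grove.principal_tree_root
    # No principal tree root is defined, so the grove root takes its place.
    return grove.grove_root


sgml_grove = Grove(grove_root="SgmlDocument", principal_tree_root="book element")
other_grove = Grove(grove_root="root node")
print(entity_reference_target(sgml_grove))   # book element
print(entity_reference_target(other_grove))  # root node
```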

The "book-ubr-by-id" element uses an ID reference to address a chapter element within the document "timothy-1" (this is an artificial example in this case, used simply to demonstrate the ability to re-use elements that are not document elements). The attribute "book" of "book-ubr-by-id" is declared as type IDREF. Therefore it must point to an element with an ID in the same document. In this case, a "name space location address" element is used to map the local ID "n-timothy-1-c-2" to the target ID "tim1c2" in the document "timothy-1". This bit of indirection is equivalent to using an XPointer with a document() function followed by an ID reference (e.g., document("timothy-1.xml").id(tim1c2)).
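
The two-step indirection can be illustrated with two lookup tables. This is only a sketch: the dictionaries and the `resolve_idref` function are hypothetical, and the values are copied from the entity declarations and nmsploc elements of the master document shown above. A real HyTime engine resolves these addresses against groves, not strings.

```python
# Local ID -> (entity name of the location source, ID within that document),
# as given by the nmsploc elements in the master document.
NAME_SPACE_LOCATIONS = {
    "n-jude": ("jude", "jude"),
    "n-timothy-1-c-2": ("timothy-1", "tim1c2"),
}

# Entity name -> system identifier, as given by the entity declarations.
ENTITIES = {
    "jude": "jude.xml",
    "timothy-1": "timothy-1.xml",
}


def resolve_idref(local_id):
    """Map a local IDREF value to (target document, target ID)."""
    entity_name, target_id = NAME_SPACE_LOCATIONS[local_id]
    return ENTITIES[entity_name], target_id


print(resolve_idref("n-timothy-1-c-2"))  # ('timothy-1.xml', 'tim1c2')
```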

The semantics of both of the book-ubr element types is the same: a transclusion relationship from one element to exactly one other element, what the HyTime standard refers to as "#ELEMENT" value references, meaning that the effective value of the element making the reference is the element referenced (HyTime value reference can also be used to get attribute values and element content by reference, thus the name "value reference").

Note that although the semantic of these links is transclusion, there is no requirement that the presentation effect be transclusion. By the same token, generic links can also be given a transclusion semantic. The HyTime standard defines the value reference mechanism primarily to make this typical case of constructing compound documents through element-to-element references easy and obvious. It also allows for a pre-processing step that resolves the transclusions into a new result grove before additional processing, such as styling, is performed. The HyTime element value reference mechanism is roughly equivalent to the XInclude mechanism currently being defined by the W3C.

Style Sheets

There are two style sheets for this document set: one that renders the value references as traversable hyperlinks and one that transcludes the target elements. The only interesting template in these two style sheets is the template for the "book-ubr" and "book-ubr-by-id" elements. The valueref-as-link template is:

<xsl:template match="book-ubr | book-ubr-by-id">
    <p>
        <xsl:element name="a">
            <xsl:attribute name="href"
              ><ext:elementvalueref outputdir="./website" outputsuffix=".html" style="link"
            /></xsl:attribute>
            <xsl:for-each select="ext:get-resolved-value-refs()">
                <xsl:choose>
                    <xsl:when test="self::chapter">
                        <xsl:value-of select="ancestor::book/bktlong"/>
                        <xsl:text>, </xsl:text>
                        <xsl:value-of select="chtitle"/>
                    </xsl:when>
                    <xsl:when test="self::book">
                        <xsl:value-of select="bktlong"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:value-of select="@book"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each>
        </xsl:element>
    </p>
</xsl:template>

The extension element "elementvalueref" handles the processing of the target node, either returning the URL of the target node in its separately-presented context (style="link") or processing the node at the point of reference (style="transclude"). In this case, the link style is used in order to construct an HTML “A” element that links to the target node, which will be presented in a separate HTML document. The for-each section uses the extension function "get-resolved-value-refs()", which returns the list of nodes the value reference resolves to. In the case of #ELEMENT value references, this should always be a list of one node, but for attribute or content value references, it could be a longer node list. Because in this example there are references to both "book" elements and "chapter" elements, the “choose” statement serves to generate the appropriate text for the generated A element, either the book title or the chapter title.

The transclusion form of the template is:

<xsl:template match="book-ubr">
    <div>
        <ext:elementvalueref style="transclude"/>
    </div>
  </xsl:template>

Here, we use the transclude style of the "elementvalueref" extension element. Because the transcluded nodes are processed in place, nothing else is required. The "elementvalueref" extension element with the transclude style is a convenience feature. It is equivalent to this template:

<xsl:template match="book-ubr">
    <div>
        <xsl:apply-templates select="ext:get-resolved-value-refs()"/>
    </div>
  </xsl:template>

Rendered examples

The version rendered with the transclusions as links is shown in Figure 25.

Figure 25: Compound document with valuerefs rendered as links

The rendered transcluded example is indistinguishable from the rendering of the normal single-instance version of the document.

Hyperdoc Styling Example 2: Extended Links Between Two Documents

In this example, extended links are used to establish correlations between two language versions of the Gospel of John, simulating a simple but useful example of scholarly analysis. Like all the examples used in this paper, the hyperlinking used here conforms to the HyTime standard. However, there is no particular magic to the use of HyTime: everything shown in these examples could be done with XLink and XPointers or with some non-standard linking scheme. The Bonnell system is purposely designed to be independent of any particular linking approach—the style sheet mechanisms demonstrated here are not dependent on the use of HyTime (or any other standard), but only on the abstract hyperdocument API exposed by the Bonnell system. It just happens that we had a HyTime implementation ready to hand that was easily adapted to the Bonnell hyperdocument API. (In fact, we have at least two other bindings to this API: one that implements a very small subset of HyTime functionality in order to maximize performance for a specific use case and another that is implemented entirely in Java and uses XML DOMs for grove representation, implemented by Eric Lawson in our Dallas office).

Source documents

The source hyperdocument consists of three documents: the original Greek text of the gospel, the New King James version, and a hub document that contains a set of extended links establishing the correspondences between the original and the translation. The King James version is taken from Jon Bosak's world religions data set and is not modified in any way except to make individual documents out of each book in the New and Old Testaments (as for the transclusion example). The Greek version is taken from a text originally marked up in HTML and converted by us to the Bosak testament markup. The hub document is a HyTime document.33

Figure 26 shows the first part of the hub document (johngospel-translations.xml). The two entities JohnGreek and gospel-st-john declare those documents as external unparsed entities, as required by normal HyTime location addressing. The notation “sgml” indicates that these are SGML documents (that is, documents from which SGML groves will be constructed). A grove-based system uses the notation of an unparsed entity to determine which property set the resulting grove should be in. That is, the grove system implementation will provide some form of mapping from source entity notations to property sets and grove construction processes. The entity declarations also serve to define the HyTime “bounded object set” of entities that make up the hyperdocument. (By using entities to define BOS membership, it is possible to determine the members of a hyperdocument by processing only the prologs of the document instances, not the instances themselves.)

The “metadata” element simply contains some introductory text and two “citation” links, one to each of the component documents. In this case, those links serve to enable direct navigation to each of the two language versions as a convenience; there is no requirement that those links be there in order for the verse-to-verse links to work. The citation links use a simple entity reference to address the root nodes of the documents cited. The HyTime rule for entity addresses is that, by default, a reference to an entity is a reference to the “principal tree root” of the grove constructed from the entity. The principal tree root is designated in the grove's property set. For example, in the SGML property set, the principal tree root is the document element node. If there is no principal tree root specified, the grove root is addressed. These links are equivalent to XLink simple links with URLs that address the entire document resource.

The “correspondences” element contains a set of “translation” links. Each translation link connects a single verse in the Greek version to the corresponding verse in the King James version. The addressing here uses HyTime tree locations in the “original” and “translation” attributes, which address the members of the original and translation anchors, respectively. The tree locations simply count child nodes down the document tree from the document element. For this example, it was easy to auto-generate the tree locations because the structure of the documents is regular and consistent between the two translations. The “original-doc” and “translation-doc” attributes address the location sources for the corresponding tree locations. This is equivalent to using “document()” functions in XPointers to establish the location source of the following term. In this example, each anchor addresses exactly one node, but either or both anchors could address multiple nodes, for example to indicate that one verse in the Greek became two in the King James. It happens that for this example, tree locations were sufficient and were the easiest form of address to generate automatically because of the nature of the data.
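
To make the tree location form of address concrete, here is a minimal sketch that resolves a tree location such as "1 3 2" against an XML document using Python's standard xml.etree.ElementTree module. It rests on two assumptions that should be checked against the HyTime standard and the grove in use: the first number selects the root of the tree itself, and each following number is the 1-based position of a child, counting element children only. A real SGML grove may count other node types as well. The function name and the tiny sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def resolve_treeloc(root, treeloc):
    """Walk down from root, taking the n-th child at each step (1-based)."""
    steps = [int(token) for token in treeloc.split()]
    if not steps or steps[0] != 1:
        raise ValueError("a tree location must start at the root (1)")
    node = root
    for position in steps[1:]:
        node = list(node)[position - 1]  # element children only (assumption)
    return node


doc = ET.fromstring(
    "<book><bktlong>T</bktlong><bktshort>S</bktshort>"
    "<chapter><chtitle>Chapter 1</chtitle><v>first</v><v>second</v></chapter>"
    "</book>"
)
print(resolve_treeloc(doc, "1 3 2").text)  # first
```

Under these assumptions, "1 3 2" reads as: the root, then its third child (the chapter), then that chapter's second child (the first verse after the chapter title).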

Knowing that “original” and “translation” are the anchor addressing attributes, one can infer the link type definition: the link type “translation” has two anchors, “original” and “translation”. The link type is explicitly defined in the element type declaration for the “translation” element.

Figure 26: Source document: Translation hub document
<!-- A link between the original and translated version of some text -->
<!ELEMENT translation
  (p)*
>
<!ATTLIST translation
  linktype
    NAME
    #FIXED "translation-linktype"
  original
    CDATA
    #REQUIRED
  translation
    CDATA
    #REQUIRED
  original-doc
    ENTITY
    #REQUIRED
  translation-doc
    ENTITY
    #REQUIRED
  loctype
    CDATA
    #FIXED "original TREELOC
            translation TREELOC
            original-doc ENTLOC
            translation-doc ENTLOC"
  rflocsrc
    CDATA
    #FIXED "original original-doc
            translation translation-doc"
  anchrole
    CDATA
    #FIXED "original #LIST translation #LIST"
  HyTime
    CDATA
    #FIXED "hylink"
>

The linktype name is actually “translation-linktype”; this is a somewhat artificial name used to aid in testing and debugging of the Spiderman system. The “anchrole” attribute formally defines the anchor role names “original” and “translation”. The key word “#LIST” indicates that the corresponding anchor allows node lists. The “HyTime” attribute indicates that this element type is a HyTime “hylink” (hyperlink). It is this attribute that signals a HyTime-aware processor that this element is in fact a hyperlink. All the other attributes define the details of the addressing used to point to the anchor members. (HyTime provides this level of explicitness in order to support the HyTime requirement for generality and flexibility. But note that the HyTime standard makes it possible to use addressing syntaxes such as XPointer, although the HyTime implementation used here does not currently support that option.)
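
The fixed attribute values in the declaration above all have the same surface shape: a whitespace-separated sequence of name and value pairs. The sketch below parses exactly that shape and nothing more. The `parse_pairs` function is hypothetical, and the full HyTime syntax for these attributes is richer than simple pairs, so this should be read as an illustration of the values used in this example only.

```python
def parse_pairs(value):
    """Split 'name value name value ...' into an ordered name -> value mapping."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens: %r" % value)
    return dict(zip(tokens[0::2], tokens[1::2]))


anchrole = parse_pairs("original #LIST translation #LIST")
loctype = parse_pairs("""original TREELOC
                         translation TREELOC
                         original-doc ENTLOC
                         translation-doc ENTLOC""")
rflocsrc = parse_pairs("""original original-doc
                          translation translation-doc""")

# The anchor role names, in declaration order.
print(list(anchrole))        # ['original', 'translation']
# Each anchor attribute is a tree location relative to its location source.
print(loctype["original"], rflocsrc["original"])  # TREELOC original-doc
```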

Figure 27: Correlation of Greek and King James version of Gospel of John
<?xml version="1.0" ?>
<?xml-stylesheet href="johngospel-side-by-side.xsl" type="text/xml"?>
<!DOCTYPE analysis PUBLIC "urn:datachannel:samples:nt-analysis:analsis hytime DTD" [
 <!ENTITY JohnGreek
    PUBLIC "urn:datachannel:samples:GreekGospels:Gospel of John"
    NDATA sgml
 >
 <!ENTITY gospel-st-john
    PUBLIC "urn:datachannel:samples:religion 2.00 xml:Gospel of John"
    NDATA sgml
 >
]>
<analysis>
<metadata>
<title>Correlations Between Greek and King James Version of Gospel of John</title>
<p>This document contains links that establish the correspondence between the verses
in the Greek and English (King James) versions of the Gospel of John.</p
>
<p>The Greek version:
<citation document="JohnGreek">Gospel of John, Original Greek</citation></p>
<p>The King James version:
<citation document="gospel-st-john">Gospel of John, King James</citation></p>
</metadata>
<correspondences>
<translation     original="1 4  2"        translation="1 3  2"
             original-doc="JohnGreek" translation-doc="gospel-st-john"/>
<translation     original="1 4  3"        translation="1 3  3"
             original-doc="JohnGreek" translation-doc="gospel-st-john"/>
<translation     original="1 4  4"        translation="1 3  4"
             original-doc="JohnGreek" translation-doc="gospel-st-john"/>
<translation     original="1 4  5"        translation="1 3  5"
             original-doc="JohnGreek" translation-doc="gospel-st-john"/>
[...]

Style sheets

This example uses two style sheets: one for the hub document (johngospel-side-by-side.xsl) and one for the gospels themselves (gospels-to-html_translations.xsl). The johngospel-side-by-side style sheet handles the hub document, which is the first document of the set to be processed by the XSL processor (Figure 29). This style sheet uses direct processing of the Hyperdoc object exposed through the XSL-to-Bonnell integration in order to generate a side-by-side presentation of the Greek and King James versions. It uses the same sort of processing to report statistics on the hyperdocument. The interesting templates are:

analysis

The “analysis” element is the top-level element. Following the main apply-templates is the generation of the side-by-side translation table. This is done by interrogating the hyperdocument object to get its list of hyperlink objects. The get-object-property() extension function gets the “getHyperlinks” property of the hyperdocument object, which is itself accessed by the get-hyperdoc() extension function (this property is actually a parameter-less method of the Hyperdocument object). To generate the correspondence table, the template iterates over each link, selecting first the “original” anchor and then the “translation” anchor by anchor role name. Note that apply-templates is applied to the members of the anchors, so the anchor members appear within the table but are formatted as they are in their own documents.
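The iteration performed by this template can be sketched outside XSLT. The following Python is a minimal illustration only: the Anchor, Hyperlink, and Hyperdoc classes, their attributes, and the sample member labels are hypothetical stand-ins, not the Bonnell API. It shows the pattern of walking each link and picking anchors by role name to build one table row per link.

```python
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """Hypothetical stand-in for an anchor: a role name plus member nodes."""
    role: str
    members: list = field(default_factory=list)


@dataclass
class Hyperlink:
    """Hypothetical stand-in for a hyperlink: a link type plus its anchors."""
    link_type: str
    anchors: list = field(default_factory=list)

    def get_anchor(self, role):
        # Return the first anchor with the given role name, or None.
        for anchor in self.anchors:
            if anchor.role == role:
                return anchor
        return None


@dataclass
class Hyperdoc:
    """Hypothetical stand-in for the hyperdocument object."""
    hyperlinks: list = field(default_factory=list)


def correspondence_rows(hyperdoc, left_role="original", right_role="translation"):
    """Build one (left members, right members) row per hyperlink."""
    rows = []
    for link in hyperdoc.hyperlinks:
        left = link.get_anchor(left_role)
        right = link.get_anchor(right_role)
        rows.append((
            left.members if left else [],
            right.members if right else [],
        ))
    return rows


# Placeholder member labels; real members would be verse nodes.
doc = Hyperdoc([
    Hyperlink("translation", [Anchor("original", ["greek-verse-a"]),
                              Anchor("translation", ["english-verse-a"])]),
    Hyperlink("translation", [Anchor("original", ["greek-verse-b"]),
                              Anchor("translation", ["english-verse-b"])]),
])
rows = correspondence_rows(doc)
for left, right in rows:
    print(left, "|", right)
```

In the style sheet the members of each row are then handed to apply-templates, which is why they render as they would in their own documents.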

The hyperdocument statistics report repeats this theme—in this case reporting the lengths of various lists and getting the details of the link types in the hyperdocument.
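As a rough illustration of the statistics report, the sketch below computes the same kinds of figures from plain Python data. The dictionary layout, the “citation” role names, and the sample values are assumptions made for this example; the real report obtains these lists from the hyperdocument object.

```python
def hyperdoc_statistics(bos, hyperlinks):
    """Summarize a hyperdocument held as plain lists and dicts (hypothetical layout)."""
    link_types = {}
    for link in hyperlinks:
        roles = link_types.setdefault(link["type"], [])
        # Iterating a dict yields its keys, here the anchor role names.
        for role in link["anchors"]:
            if role not in roles:
                roles.append(role)
    return {
        "bos_size": len(bos),            # members of the bounded object set
        "link_count": len(hyperlinks),   # number of hyperlinks
        "link_types": link_types,        # link type name -> anchor role names
    }


# Placeholder data: three documents and three links.
bos = ["hub", "JohnGreek", "gospel-st-john"]
hyperlinks = [
    {"type": "translation", "anchors": {"original": ["g1"], "translation": ["e1"]}},
    {"type": "translation", "anchors": {"original": ["g2"], "translation": ["e2"]}},
    {"type": "citation", "anchors": {"source": ["c1"], "target": ["JohnGreek"]}},
]
stats = hyperdoc_statistics(bos, hyperlinks)
print(stats)
```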

citation

The “citation” element is a simple link (in the XLink sense). Here the style sheet formats it by wrapping an HTML “A” element around the content of the citation element. The get-traversal-targets() extension function returns the list of AnchoredObject nodes for the link. The result is always a list, even if the element is linked to exactly one other node. AnchoredObject nodes have two properties: the anchor object and the member nodes of that anchor. Note also that this style sheet processing would be the same even if the “citation” element were not itself a linking element; that is, what's important is that the “citation” element is a member of a link anchor, not that it also happens, in this case, to define the link itself. Because the traversal targets are always returned as a list, the normal pattern is to iterate over the list with a for-each element.

For each AnchoredObject node returned by get-traversal-targets() the style sheet gets the “traversals” property, which is the node list of the members of that anchor. Again, for-each is used to iterate over this list. In this case the style sheet expects there to be only one anchor and one anchor member (otherwise the style sheet would generate nonsensical HTML markup). This nested iteration could have been done more safely by selecting only the first member of each list. However, this example demonstrates the more general pattern for processing traversal targets.
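The difference between the general nested iteration and the safer first-member selection can be sketched in Python. This is an illustration under stated assumptions: anchored objects are modeled as dictionaries with a "traversals" list, and url_for() is a hypothetical helper that maps a target node to a URL.

```python
def all_target_urls(anchored_objects, url_for):
    """General pattern: visit every member of every traversal target."""
    urls = []
    for anchored in anchored_objects:
        for node in anchored["traversals"]:
            urls.append(url_for(node))
    return urls


def first_target_url(anchored_objects, url_for):
    """Safer pattern for a simple link: use only the first member found."""
    for anchored in anchored_objects:
        for node in anchored["traversals"]:
            return url_for(node)
    return None


def render_citation(text, href):
    """Wrap the citation text in an A element only if a target exists."""
    if href is None:
        return text
    return '<a href="%s">%s</a>' % (href, text)


def url_for(node):
    # Hypothetical URL scheme: one HTML file per target document.
    return "./website/%s.html" % node


targets = [{"anchor": "cited-document", "traversals": ["JohnGreek"]}]
html = render_citation("Gospel of John, Original Greek",
                       first_target_url(targets, url_for))
print(html)
```

With exactly one target both functions agree; with several targets the general pattern would set the href more than once, which is the nonsensical markup the text warns about.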

This style sheet also ignores potential problems, such as the “citation” element also being a member of other hyperlinks, which could happen if an extended link happened to point to the citation element. In this case, we have total control over the documents so we can ensure this doesn't happen, but more general style sheets for less constrained document sets would need to be more careful, for example, filtering anchors by anchor role name or selecting only the first member of multi-member lists.
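A more defensive style sheet would filter before iterating. The sketch below shows one way such filtering could look; the dictionary keys ("link_type", "role", "traversals") and the “commentary” link are hypothetical and chosen only for this example.

```python
def filter_targets(anchored_objects, link_type=None, role=None, first_only=False):
    """Keep only targets matching a link type and/or anchor role name.

    With first_only=True, each multi-member list is cut to its first member.
    """
    selected = []
    for anchored in anchored_objects:
        if link_type is not None and anchored["link_type"] != link_type:
            continue
        if role is not None and anchored["role"] != role:
            continue
        members = list(anchored["traversals"])
        if first_only:
            members = members[:1]
        # Copy the dict so the caller's data is left unchanged.
        selected.append({**anchored, "traversals": members})
    return selected


# A citation that is also, hypothetically, the target of an extended link.
anchored_objects = [
    {"link_type": "citation", "role": "target", "traversals": ["JohnGreek"]},
    {"link_type": "commentary", "role": "note", "traversals": ["note-1", "note-2"]},
]
only_citation = filter_targets(anchored_objects, link_type="citation")
first_members = filter_targets(anchored_objects, first_only=True)
print(only_citation)
print(first_members)
```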

Another potential challenge is how to represent in HTML nodes that are members of multiple anchors of multiple links—HTML does not provide a direct way to represent this and there is no well-established convention for handling one-to-many links. This would be a place where facilities such as JavaScript or Java applets could be used to good effect to provide more sophisticated user interface options.

The “traversaltargetnode” element is a convenience extension for processing traversal targets. The “style” attribute provides two choices: “link” and “transclude”. The “link” style calculates the URL of the target node (the node being processed in the for-each context) and returns it. The “transclude” style presents the target nodes at the point of reference. In both cases, the “traversaltargetnode” element processes the document containing the traversal target if it has not already been processed. In this case, the “link” style returns the URL of the target node (which will be the document element node of the document referenced by the citation).
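The behavior described for the two styles can be sketched as follows. This is not the extension element's implementation: the class name, the node dictionary layout, and the URL scheme (output directory, document name, suffix, optional fragment) are assumptions made for illustration.

```python
class TraversalTargetResolver:
    """Hypothetical model of the "link" and "transclude" styles."""

    def __init__(self, outputdir, outputsuffix, process_document):
        self.outputdir = outputdir
        self.outputsuffix = outputsuffix
        self.process_document = process_document  # callback run once per document
        self.processed = set()

    def resolve(self, node, style):
        doc = node["document"]
        # Process the containing document only if it has not been seen yet.
        if doc not in self.processed:
            self.processed.add(doc)
            self.process_document(doc)
        if style == "link":
            url = "%s/%s%s" % (self.outputdir, doc, self.outputsuffix)
            if node.get("fragment"):
                url += "#" + node["fragment"]
            return url
        if style == "transclude":
            # Present the target's own content at the point of reference.
            return node["content"]
        raise ValueError("unknown style: %r" % style)


processed_log = []
resolver = TraversalTargetResolver("./website", ".html", processed_log.append)
verse = {"document": "gospel-st-john", "fragment": "n1-3-2",
         "content": "<p>verse text</p>"}
link = resolver.resolve(verse, "link")
inline = resolver.resolve(verse, "transclude")
print(link, inline, processed_log)
```

Tracking the processed documents in a set is what lets a single run over the hub document produce every linked document exactly once.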

The gospel documents are formatted by the gospels-to-html_translations.xsl style sheet (Figure 28). The main task for this style sheet is to enable navigation between verses linked by translation links. The only interesting template is for the “v” (verse) element.

In these documents, any verse may or may not be a member of an anchor of a translation link. To enable the creation of URLs to verses, the style sheet uses the get-fragmentid-for-node() extension function. This function returns a unique ID for the node, which is then used as the NAME attribute of an HTML “A” element. The style sheet could generate the ID only if the node is in fact an anchor member, but there's no harm in doing it blindly as is done here.
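One way to derive a unique, repeatable ID for every node is to encode its position in the tree. The sketch below does this with Python's standard xml.etree.ElementTree module; the ID format ("n" followed by child positions) is an assumption for illustration and is not necessarily what get-fragmentid-for-node() produces.

```python
import xml.etree.ElementTree as ET


def fragment_ids(root):
    """Map each element to an ID built from its child-position path."""
    ids = {}

    def walk(elem, path):
        ids[elem] = "n" + "-".join(str(i) for i in path)
        for index, child in enumerate(elem, start=1):
            walk(child, path + [index])

    walk(root, [1])
    return ids


# Tiny placeholder document using the element names from the gospel files.
book = ET.fromstring("<book><chapter><v>a</v><v>b</v></chapter></book>")
ids = fragment_ids(book)
verses = book.findall("./chapter/v")
anchors = ['<a name="%s"></a>' % ids[v] for v in verses]
print(anchors)
```

Because the ID depends only on the node's position, the document that links to a verse and the document that contains it can compute the same fragment identifier independently.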

The xsl:if test using the is-anchored-object() extension function checks whether the node is a member of any anchor. If it is, the style sheet uses the same pattern as shown above for the “citation” element: iterate over the traversal targets (AnchoredObject nodes) and then, for each of those, select only the anchors of “translation” links. (At the time of writing, this filtering was not yet working in the underlying hyperdocument implementation for lack of time, so the style sheet calls get-traversal-targets() without a filter.) Again the traversaltargetnode extension element is used to generate the URL of the target node. In this case, there could easily be multiple translations of a given verse, so the style sheet will produce a list of “A” elements, one for each target translation, if necessary.

Because the for-each makes the traversal target node the current context, the style sheet can use normal XPath processing to get data from the target node, in this case the short title of the book that contains the target verse. It happens that the book short title makes an ideal link button for the translation navigation links.
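The output the verse template aims for can be sketched in Python. The target dictionaries, their keys, and the sample URL and title are hypothetical; in the style sheet the URL comes from the traversaltargetnode element and the label from an XPath expression evaluated against the target node.

```python
from html import escape


def translation_links(targets):
    """Render the bracketed list of translation links for one verse.

    Each target is a dict with 'url' and 'book_short_title' keys
    (a hypothetical layout). Returns '' for a verse with no targets.
    """
    if not targets:
        return ""
    links = [
        '<a href="%s">%s</a>' % (escape(t["url"], quote=True),
                                 escape(t["book_short_title"]))
        for t in targets
    ]
    return "[Translations: " + " ".join(links) + "]"


# Placeholder target: one translation of the current verse.
targets = [{"url": "./website/JohnGreek.html#n1-4-2", "book_short_title": "John"}]
print(translation_links(targets))
print(repr(translation_links([])))
```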

Figure 28: Stylesheet: gospels-to-html_translations.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:ext="http://datachannel.com/Bonnell/Transform"
     extension-element-prefixes="ext">

  <xsl:template match="book">
    <html>
      <head><title></title></head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="bktlong">
    <h1><xsl:apply-templates/></h1>
  </xsl:template>

  <xsl:template match="bktshort">
    <!-- short title suppressed -->
  </xsl:template>

  <xsl:template match="bksum">
    <div>
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <xsl:template match="chapter">
      <div>
      <!-- <xsl:param name='chapLang' select='ext:get-property-value-from-object( $chapLangAtt, "data", 0)'/>
      -->
      <xsl:choose>
          <xsl:when test='@LANG = "gr"'>
              <font face="symbol">
                  <xsl:apply-templates/>
              </font>
          </xsl:when>
          <xsl:otherwise>
              <xsl:apply-templates/>
          </xsl:otherwise>
      </xsl:choose>
      </div>
  </xsl:template>

  <xsl:template match="chtitle">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>

  <xsl:template match="v">
    <p><xsl:element name="a">
        <xsl:attribute name="name"><xsl:value-of select="ext:get-fragmentid-for-node()"/>
        </xsl:attribute>
       </xsl:element>
        <b><xsl:number count="v" format="1 "/></b>
        <xsl:apply-templates/>
        <xsl:if test="ext:is-anchored-object()=1">
      <!-- First process the translations:
               FIXME: filtering with get-traversal-targets() isn't working.
            -->
          <xsl:text>[Translations: </xsl:text>
      <!-- <xsl:for-each select="ext:get-traversal-targets('translation')"> -->
      <xsl:for-each select="ext:get-traversal-targets()">
        <xsl:for-each select="ext:get-property-value('traversals')">
             <xsl:element name="a">
               <xsl:attribute name="href">
                <ext:traversaltargetnode
            outputdir="./website"
            outputsuffix=".html"
            style="link"/>
               </xsl:attribute>
               <xsl:value-of select="/bktshort"/>
               <xsl:text> </xsl:text>
             </xsl:element>
        </xsl:for-each>
      </xsl:for-each>
          <xsl:text>]</xsl:text>
    </xsl:if>
    </p>
  </xsl:template>

  <xsl:template match="i">
    <i><xsl:apply-templates/></i>
  </xsl:template>
  <xsl:template match="reading_start">
    <xsl:text disable-output-escaping="yes">&lt;font color="red"&gt;</xsl:text>
  </xsl:template>
  <xsl:template match="reading_end">
    <xsl:text disable-output-escaping="yes">&lt;/font&gt;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
Figure 29: Stylesheet: johngospel-side-by-side.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:ext="http://datachannel.com/Bonnell/Transform"
     extension-element-prefixes="ext">

  <!-- This version of the style sheet produces a side-by-side rendering of the two translations.
    -->
  <xsl:template match="analysis">
    <html>
     <head>
      <title><xsl:value-of select="metadata/title"/> (Side By Side)</title>
     </head>
     <body>
      <div>
       <xsl:apply-templates/>
       <div><h2>Verse-To-Verse Correspondences</h2>
        <!-- FIXME: This should work <xsl:apply-templates -->
        <table border="1" width="100%">
        <tr bgcolor="yellow">
          <td>Original</td>
          <td>Translation</td>
        </tr>
        <xsl:for-each
           select="ext:get-object-property(ext:get-hyperdoc(),
                                           'getHyperlinks')">
         <tr valign="top">
          <td>
           <!-- FIXME: should be able to use modes to good effect here -->
           <xsl:apply-templates select="ext:get-object-property(ext:get-anchor('original'),
                                                                'getMembers')"/>
          </td>
          <td>
           <xsl:apply-templates select="ext:get-object-property(ext:get-anchor('translation'),
                                                                'getMembers')"/>
          </td>
         </tr>
        </xsl:for-each>
        </table>
       </div>
      </div>
      <div>
       <h2>Hyperdocument Statistics</h2>
        <table border="1" width="50%">
        <tr>
         <td align="center" bgcolor="yellow" colspan="3">Bounded Object Set</td>
        </tr>
        <tr>
         <td colspan="3">Bounded object set has
      <xsl:value-of
 select="string(count(ext:get-property-value-from-object(ext:get-hyperdoc(),
                                'getBos')))"
          /> members.
         </td>
        </tr>
        <tr>
         <td align="center" bgcolor="yellow" colspan="3">Hyperlinks</td>
        </tr>
    <tr>
         <td colspan="3">Number of links:
               <xsl:value-of select="string(count(ext:get-property-value-from-object(ext:get-hyperdoc(),
                                'getHyperlinks')))"/>
         </td>
        </tr>
        <tr bgcolor="pink"><td>Link Type</td><td>Anchor Roles</td></tr>
        <xsl:variable name="hubdoc" select="ext:get-hyperdoc()"/>
         <xsl:for-each select="ext:get-object-property($hubdoc, 'getLinkTypes')">
          <tr valign="top">
           <td><xsl:value-of select="ext:get-property-value('getName')"/></td>
           <td>
            <xsl:for-each select="ext:get-property-value('getAnchorRoles')">
             <xsl:value-of select="ext:get-property-value('getName')"/><br/>
            </xsl:for-each>
           </td>
          </tr>
         </xsl:for-each>
        </table>
      </div>
          </body>
      </html>
  </xsl:template>

  <xsl:template match="Hyperlink">
    <p>A hyperlink</p>
  </xsl:template>

  <xsl:template match="title">
    <h1><xsl:apply-templates/> (Side By Side)</h1>
  </xsl:template>

  <xsl:template match="citation">
    <!-- This is an example of a simple link with exactly one traversal target.

         FIXME: Should create an extension element to handle this common case
      -->
    <xsl:element name="a">
      <xsl:for-each select="ext:get-traversal-targets()">
          <!-- get-traversal-targets() returns a list of AnchoredObject nodes.
               Each anchored object has "traversals" property that is the list
               of nodes in that anchor to which traversal is allowed.

               In this case, we expect citation to point to exactly one node.

               We use for-each because the traversals property is always a node
               list.
           -->
          <!-- FIXME: this should work: <xsl:for-each select="@traversals"> -->
          <xsl:for-each select="ext:get-property-value('traversals', 1)">
              <xsl:attribute name="href"
                ><ext:traversaltargetnode outputdir="./website" outputsuffix=".html" style="link"
              /></xsl:attribute>
          </xsl:for-each>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="correspondences">
      <!-- no need to process these children -->
  </xsl:template>

  <xsl:template match="metadata">
      <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
Rendered examples

Figure 30 shows the rendered result of processing the translation document with the side-by-side style sheet. Figure 31 shows the rendered result of the King James version of the Gospel of John. Note the links from each verse to the corresponding verse in the Greek version. Both of these renderings were generated from a single processing run over the hyperdocument. Navigating the citation link from the translation document to the King James Gospel of John takes you to the rendition shown in Figure 31.

Figure 30: Rendered analysis document with side-by-side presentation
Figure 31: Gospel of John, King James version, with verse-to-verse links

Conclusions and Further Activity

The primary conclusion from this activity is that extending an XSL processor to a much wider scope than XML document processing was much easier than anticipated, even allowing for the obvious similarities between the DOM and groves. We are also pleased with the utility of the end result: the ease with which one can produce a variety of presentations from complex hyperdocuments is very gratifying.

Further activity includes the following:

  • Refining our implementation so that it is completely tested against all XSL and XPath features.
  • Refining the set of extension functions and elements to provide the best user interface with the minimum of complexity and maintenance cost.
  • Re-implementing these extensions using a non-Python XSL implementation (which may require developing a Java-based grove implementation if we choose Java as our implementation language).
  • Seeing what, if any, of this is appropriate for further standardization, perhaps as a W3C note.
  • Implementing XLink-based hyperdocuments for use within the Bonnell framework.
  • Adding Java or JavaScript to generated HTML documents to provide better user interface for interrogating and traversing hyperlinks.

We will also be putting this system into production at a couple of clients (including internal DataChannel use), which will give us valuable and much-needed practical experience with this system. It still remains to develop patterns of template construction and presentation style to go with typical patterns of hyperlink usage.

Notes

1.

The Bonnell system was initially developed with the intent of making it a for-sale product. However, DataChannel made the business decision not to pursue the product path, but has continued development of the system for use within other DataChannel products and in the context of integration projects. The system is a direct evolution from the types of information systems ISOGEN has historically built for its customers. The system as a whole is designed as a set of modular components integrated through public APIs. One of our key design goals is to build a system in which all the components are independently replaceable, limiting as much as possible the proprietary lock-in any given component can impose. The use of established standards is another key design goal that also helps reduce proprietary lock-in of components. While the code developed by DataChannel is proprietary, it has never been our intent that our proprietary code be anything other than an implementation instance of the public APIs we have defined for the system as a whole. We are trying to define a system that maximizes value for its owners while establishing a solid framework within which component developers and integrators can compete exclusively on value.

Except for the GroveMinder product, all the third-party components of the Bonnell system are open source. We used GroveMinder because it saved us from having to build our own grove implementation, a non-trivial engineering task, thus allowing us to meet our original first-release delivery deadlines. Unfortunately, no one has yet developed an open-source grove implementation that matches GroveMinder for completeness, speed, and scalability.

As an engineering team we would prefer to make some form of what we've built available as open source. Current business realities do not allow this. However, it is still a personal goal of the Bonnell engineering team to eventually produce or commission open-source versions of the Bonnell system.

If we had an open-source grove implementation and we were allowed to make the components developed directly by DataChannel available as open source, the world would have a completely standards-based system using well-established and accepted tools and technologies (XML, XSL, SGML, HyTime, etc.) to easily do things that have, up until now, been seen as either outright impossible or so difficult to implement as to be effectively impossible. We find it extremely frustrating that it is not presently within our personal power to realize this vision. However, we've been working on this vision for at least the last 10 years; another year or two won't make that much difference. We are also hopeful that, by demonstrating that it can be done, we will motivate others who are in a better position to produce the necessary open source to realize this vision. We would also like to see a variety of implementations that focus on different implementation details, such as using XLink instead of HyTime.

2.

For example, the XPointer specification is only defined for addressing XML documents. This means that the use of XPointer (and therefore XLink) to link to components of non-XML data, such as structured graphics, is undefined in the general case. Some standards, such as the ISO Computer Graphics Metafile (CGM) standard, have provided their own addressing syntax, but it is burdensome for each different non-XML data type to have to define such an addressing syntax and for users to have to learn each new syntax simply to do addressing. A standard underlying data model enables a single addressing syntax that can address anything. Specialized addressing syntaxes can still be defined, but they can themselves be defined in terms of the generic data model, making them testable (and possibly implementable) in terms of a common standard.

3.

The Gospel of John was chosen for this exercise because it is, according to my colleague Don Smith, the gospel about which there is the most controversy and uncertainty about its source texts. Thus it is a subject of much textual analysis, of which simply correlating the translations is the first step. Don is a PhD in religious studies with a focus on the Christian gospels. Don told me about his idea of a hypertext-based system for capturing and presenting textual analysis over dinner one evening last fall (we were both in Dayton teaching an XML for Programmers class). That night I found the Greek text, marked it up, and created the hub document with the links. It was not until I wrote this paper, however, that I was able to create the type of visual presentation that Don had in mind.

