Internationalized Back-of-the-Book Indexes for XSL Formatting Objects

W. Eliot Kimber
eliot@isogen.com
Joshua Reynolds
joshuar@isogen.com

Abstract

The XSL recommendation and its commercial implementations have finally reached a point of maturity that allows for their use in generating production-quality printed technical documents. However, the XSL specification does not provide any built-in features specific to the production of back-of-the-book indexes. In addition, the current commercial tools do not provide any proprietary features for indexing comparable to those that can be found in existing page composition systems such as Arbortext's Epic Publisher and XyEnterprise XPP. Thus, if XSL-FO is to be used to produce documents that have back-of-the-book indexes, the style sheet author must implement all the processing necessary to create the index. This task is challenging enough for Latin languages. For indexes in other languages, the task is further complicated by the need to support locale-specific collation schemes and index groupings. Defining collation and grouping rules for languages such as Traditional and Simplified Chinese, which use tens of thousands of unique characters, further complicates the task.

This paper describes in detail the system developed by the authors to satisfy the challenge of producing back-of-the-book indexes using XSL-FO for a number of Asian and Middle-Eastern languages, including Thai, Japanese, Korean, Arabic, Hebrew, and Chinese (Traditional and Simplified). The solution developed includes the following key components:

  • XSL business logic for processing the index entries in order to generate the index pages themselves
  • Java libraries that support index item grouping and sorting for arbitrary national languages and locales, configured through an XML document by which the index groups and sorting rules are easily specified
  • a PDF post-processing step that removes duplicate page numbers from the index pages produced by the XSL-FO process.

The solution developed is completely generic and can be easily adapted to any typical technical documentation document type that uses embedded index entry markup (e.g., any Docbook-like document type). The configuration of the indexes is through relatively simple XML documents that define the index groups, sorting and grouping strategies to use and, if necessary, locale-specific collation orders for index entry and group sorting. The paper describes how the system takes full advantage of Java's built-in internationalization support to simplify both the initial development task and the definition of custom index configurations.

The paper specifically discusses the indexing challenges posed by the national languages involved and how the XSL and Java code developed meets those challenges. It also discusses the general issues involved in generating indexes with XSL-FO, including issues of locale-specific sorting, grouping using the Muenchian Method, and use of Java's built-in support for locale-specific collation.

Keywords: XSL-FO; Unicode

W. Eliot Kimber

W. Eliot Kimber is a Consultant at ISOGEN. Eliot is a founding member of the XML Working Group, Co-Editor of ISO/IEC 10744:1997 (HyTime), and Co-Editor of ISO/IEC 10743, Standard Music Description Language. Eliot is also involved in the STEP and SGML Harmonization effort, which led to a deeper appreciation of the power and utility of formal data modeling as a design and analysis tool. Eliot writes and speaks frequently on the subject of SGML, XML, hyperlinking, and related topics. When not trying to wrestle chaotic data into orderly structures, Eliot enjoys swimming, biking and guitar playing. Eliot is a devoted husband and dog owner.

Joshua Reynolds

Joshua is a Software Engineering Consultant for ISOGEN. With a background in hardcore mathematics and an eye for versioning, Josh is a vital member of the Bonnell content and link management system team. Josh enjoys coffee shops, staying fit, flying, and providing witty insight to keep team spirits high.

Internationalized Back-of-the-Book Indexes for XSL Formatting Objects

W. Eliot Kimber [ISOGEN International, LLC]
Joshua Reynolds [ISOGEN International, LLC]

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 W. Eliot Kimber and Joshua Reynolds. Reproduced with permission.

Introduction

The Extensible Stylesheet Language (XSL) recommendation [XSL] defines a robust facility for creating composed pages and online representations from XML [Extensible Markup Language] documents. Published as a recommendation in October of 2001, XSL is already supported by at least two production-quality commercial tools and one useful-but-not-quite-there open source implementation1. The commercial tools are suitable for any print production scenario where the formatting requirements are within the scope of the composition features provided by XSL and the implementing tools. While XSL and its commercial implementations cannot fully duplicate the functionality of venerable tools such as XyEnterprise XPP [XPP] or Advent 3B2 [3B2], they are competitive with products that provide a relatively low-cost publishing solution, such as Epic Publisher [EPIC].

The XSL recommendation (not to be confused with the XSL Transformations (XSLT) recommendation [XSLT]) defines a document type that represents an abstract formatted document as a set of “formatting objects”. Thus, the XSL recommendation is often referred to as “XSL-FO” to distinguish it from XSLT. XSL formatting-object documents are normally the output of an XSLT transform and are the input to XSL-FO processors that create composed pages or online screens from the FO document. (Obviously, just like HTML, XSL-FO documents can be produced by any sort of process and from any input, not just XML documents.)

The formatting objects defined in XSL 1.0 allow the creation of relatively simple page structures; XSL 1.0 does not provide for arbitrarily complex page layouts, such as might be used for newspapers or magazines. The page layout features provided by XSL 1.0 (and implemented by the commercial implementations) are sufficient for most technical documentation presentation styles.

Our experience suggests that there is no particular difficulty in implementing production-quality print output using XSL-FO, with one key exception: back-of-the-book indexes. For back-of-the-book indexes there are two challenges:

  1. Internationalizing the definition and production of index groups and the collation rules needed to group and sort index entries appropriately; this requires locale-specific index group definitions and collation rules.
  2. Reducing sequences of duplicated page numbers in the paginated index to unique numbers.

This paper discusses both of these challenges and presents the solution we developed for them. The solution developed is completely generic and should be applicable to any document type for which indexes are required. The solution has been implemented for the following “difficult”2 national languages:

  • Arabic
  • Chinese, both simplified and traditional
  • Hebrew
  • Japanese
  • Korean
  • Thai

All non-Western languages present the challenge of simply configuring workstations with the correct fonts. For example, within ISOGEN, we were only ever able to get about a quarter of the computers used on the project configured to correctly display the various non-Western languages, and we could never figure out what the variable was—all the machines were Windows 2000 workstations with all the regional options and fonts installed. Even on correctly-configured machines not all applications used the fonts correctly.

Arabic and Hebrew are difficult primarily because they are right-to-left languages. Editing and presenting right-to-left data, especially in combination with left-to-right data, is a challenge and many tools have either not stepped up to right-to-left presentation or have not implemented it completely. Arabic and Hebrew are also challenging because of the way that some letter forms vary based on their positions within words and sentences. Also, Arabic has very complex collation rules that depend on word roots rather than simply word spelling, significantly complicating the collation algorithm for Arabic [Note that for the project described in this paper we were not aware of this complication and have not yet stepped up to implementing it.].

Chinese and Japanese are difficult for a number of reasons. As ideographic languages they do not provide a direct route to collation—the order in which the ideographs are listed in the Unicode character set has no necessary relation to any particular way that the characters might be sorted. The sheer size of the character sets—tens of thousands of individual characters—makes working with these languages difficult, especially the task of configuring per-character operations. Japanese is further complicated by using both a phonetic alphabet and ideographic characters adopted from Chinese. In Japanese usage the same character or set of characters may represent several homonyms, such that the same character sequence will have a different Japanese pronunciation based on usage. Because Japanese is collated based on phonetic pronunciation, it is impossible to generically collate Japanese words that include ideographs without knowing the intended reading.

Thai is difficult for two primary reasons. First, Thai has no well-defined notion of a “word”; a stream of Thai text cannot be automatically flowed into lines by normal composition software, so word breaks must be added to the Thai data, either by the authors or by an automatic process. Second, Thai presents some font and presentation challenges because of the way that various diacritical marks are positioned relative to other glyph components. To untrained eyes the differences between many Thai characters are very subtle, equivalent to the distinction between grave and acute accents, but less obvious. This can make it easy to mistake one character for another, such as when setting up configurations or checking output.

Korean does not present any particular difficulties beyond those of any non-Western language. It is a phonetic alphabet and its characters are easy enough to distinguish once the basic glyph components are learned. The only real problem we ran into was that some tools that were able to display all the other Asian language characters would not display Korean, presumably due to some sort of font configuration problem that we never tracked down.

Given the current state of national language support in Windows 2000 we did not find any of the above challenges, except those dealing with collation and Thai word breaking, to be much of a burden. As explained below, we were able to assemble tool sets that let us do what we needed to do.

Overview of Character Sets, Encodings, and Fonts

In order to understand how collation works in general and how it works with XSL and XML in particular, it is necessary to understand the basics of character sets.

The character data in an XML document is exactly that: characters. In XML, a “character” is a sequence of bytes that corresponds to a character in a “character set.” A character set is nothing more than a table of byte sequences and the logical or semantic characters they correspond to. For example, in the ASCII character set, the byte 0x41 corresponds to the logical character “A”. Over the years, a number of different character sets have been developed to support the needs of different national languages.

The ASCII character set, the one made ubiquitous by its use in Unix systems and personal computers, originally used only 7 bits for the character codes, meaning that it could represent at most 128 distinct characters. This was sufficient for representing English and other languages that use essentially the 26 letters of the English alphabet. To handle most other European languages, and to provide a number of graphic characters, ASCII was extended to 8 bits (one byte) for the character codes, as in the Latin-1 character set, allowing up to 256 distinct characters.

This had the result that most software expected characters to only be one byte long. If you wanted to handle languages that required more than 255 characters, such as ideographic languages, you had to either define new character sets with multi-byte characters or have many single-byte character sets and some way to indicate which character set a particular character was to be drawn from. Needless to say this complicated things quite a bit.

To solve these problems, the Unicode character set was developed [UNICODE]. Unicode is a single character set that aims to provide, ultimately, all the characters needed by all the world's national languages, past, present, and future. Unicode provides up to 4 bytes for each character, more than enough for all the known languages and any languages yet to be defined or discovered. Unicode includes the ASCII Latin-1 character set, so that all the characters in the ASCII Latin-1 character set have the same character codes in Unicode. The core of Unicode is a two-byte character set, the Basic Multilingual Plane, which provides the characters for all modern languages.

Because Unicode provides all characters in a single character set there is no need for Unicode-based processing to do any character set switching or escaping. This simplifies that aspect of document processing.

However, because Unicode is a “multi-byte” character set it does present the problem of how to encode the characters in a particular data stream. For ASCII it's relatively easy: data is simply a sequence of 8–bit bytes and each byte corresponds to exactly one character. For two-byte character sets, such as some of the Japanese character sets, the data is stored as a sequence of pairs of 8–bit bytes where each pair equals one character.

But Unicode characters can be represented by one to four bytes, depending on a character's position in the character set. It would be highly inefficient to store all characters as four-byte sequences, especially when the vast majority of data would use at most two bytes and most Western-language data would only use one byte. Thus the Unicode specification provides for a number of different encoding schemes, allowing data to be stored in the way that is most efficient for a given data set.

This means that it is not sufficient to simply talk about “Unicode data.” Rather, you must distinguish between Unicode data as stored as byte sequences and Unicode data as held in memory by processing programs.

The two most common Unicode encodings are UTF-8 and UTF-16. UTF-8 uses variable-length sequences of 8–bit bytes to encode characters. In UTF-8 all ASCII characters are exactly the same as in an ASCII document. In UTF-16, each character is represented by a two-byte sequence or a pair of two-byte sequences. Thus, the same sequence of Unicode characters may be stored on disk in at least two different, equivalent ways.
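The practical difference is easy to observe in Java (a small illustration of our own, assuming nothing beyond the standard class library; it is not part of the project code described later):

import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String latin = "Apple";      // five ASCII characters
        String han = "\u4e2d\u6587"; // two BMP ideographs

        // UTF-8: ASCII characters take one byte each; these ideographs take three.
        System.out.println(latin.getBytes("UTF-8").length);    // 5
        System.out.println(han.getBytes("UTF-8").length);      // 6

        // UTF-16: every BMP character takes two bytes ("UTF-16BE" omits the
        // byte order mark that the plain "UTF-16" encoding name would add).
        System.out.println(latin.getBytes("UTF-16BE").length); // 10
        System.out.println(han.getBytes("UTF-16BE").length);   // 4
    }
}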

Collation is the act of sorting character strings into some specified order. For languages like English, collation is entirely a function of the spelling of words and the order of letters in the alphabet: A, B, C, D, etc., such that “baboon” collates after “ape” and before “chimpanzee.” In most other languages, collation is defined by more complex rules, such as stroke count in Traditional Chinese or word roots in Arabic. Ideographic languages have no natural ordering in the way that alphabetic languages do. The characters in any character set are inherently ordered, i.e., in increasing numeric order of the character codes themselves. However, the order of characters in a character set does not necessarily correspond to the collation order for those characters, and in most languages, will not correspond to any particular desired collation order.

Unicode largely solves the problem of character representation for national languages, which makes it much easier to solve problems in collation, which requires the ability to both define collation orders in terms of character codes and do comparisons of strings of characters. However, Unicode does not completely solve problems of character display.

A character set defines characters, which are logical things (the idea of the letter “A”). To display a given character it must be mapped to a glyph or set of glyphs in a particular font. Fortunately this task is usually handled transparently by either the underlying operating system or by rendition software, such as Web browsers and print composition tools. Most rendering tools provide the appropriate character-to-glyph-in-font mappings out of the box, although many can be configured if necessary, for example, to accommodate a national language the tool supplier didn't provide for.

Note that both the Windows operating system and the Java programming language are natively Unicode based. For example, in Java, all strings are represented internally as Unicode strings.

There are two canonical ways to refer to Unicode characters: by character name and by character code, the latter written using the syntax “\u0000” or “U+0000”, where the digits are a four-digit (two-byte) hexadecimal number. Each character in the Unicode character set has a unique descriptive name, e.g., "GURMUKHI LETTER CHA" or "LATIN CAPITAL LETTER A." In XML, Unicode characters can be referenced directly using numeric character references of the form “&#x0000;”.
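In Java, for example, the escape form and the code point value denote the same character (a trivial sketch of our own):

public class CharRefDemo {
    public static void main(String[] args) {
        char byEscape = '\u0041';       // the Java escape syntax
        char byCodePoint = (char) 0x41; // the U+0041 code point value
        // Both are LATIN CAPITAL LETTER A; in an XML document the same
        // character could be written as the reference &#x41;.
        System.out.println(byEscape == byCodePoint); // true
        System.out.println(byEscape);                // A
    }
}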

Index Generation Challenges

All commercial print production systems in current use provide some form of support for producing back-of-the-book indexes. The most complete tools support a full range of national languages. When using these systems it is usually sufficient to simply mark up index entries in the source document. Some tools, such as Epic Publisher or FrameMaker, may require some custom configurations to be defined, but that customization task is not particularly difficult (it does not normally require writing any executable code, for example).

Producing a back-of-the-book index requires the following information for a given national language:

  • The groups into which the index entries will be sorted, e.g., “Numeric/Symbols”, “A”, “B”, “C” in English.
  • The set of characters or character-equivalents that sort into a particular group. For example, in Hungarian, the character pair “dz” is treated as a single character for sorting purposes, as is the triple “dzs”, and both sort into the “D” group in a Hungarian index.
  • The collation order for the characters or character-equivalents used by the language. For example, in Hungarian, “dz” sorts after “d” but before “dzs”, so that “dzz” sorts after “daa” but before “dzs”.
  • For non-Latin languages, whether Latin groups are presented before or after non-Latin groups.

In addition, the groups used for a given language are not necessarily set in stone, especially in ideographic languages where there are several possible schemes for organizing and sorting characters. There may be, for example, subtle regional differences in how characters are sorted or simply different standard practices in different regions or in different disciplines.

While the above information can be hard-coded into an index processor, good engineering practice suggests that a more flexible configuration mechanism is most appropriate.

Given the above information, we can sort index entries into the appropriate groups and sort the entries appropriately within their groups.

However, one additional challenge remains: eliminating duplicate page numbers.

For each leaf index item in the index there may be any number of actual entries in the document and some of these may occur on the same page. If the index processor blindly turned each source index marker into a page number in the generated index there could be entries of the form:

apple
   macintosh 4, 8, 12, 12, 12

Commercial composition systems handle this situation by reducing the resulting page number lists to eliminate duplicate page numbers. However, because of the nature of the XSL production process, there is no way for an XSL processor to do this using only XSL-defined mechanisms. Thus we had to find a way to post-process the composed documents to eliminate duplicate page numbers in the generated index.

Note that the better solution to this problem would be for the XSL-FO specification to provide a way to indicate that number sequences should be reduced to unique numbers or, lacking that, for individual XSL-FO implementations to provide that feature.

Basic Index Generation Using XSL-FO

The production process for creating composed pages from XML documents using XSL-FO consists of two steps:

  1. An XSLT transform is used to generate an XSL-FO document from the input XML document. This transform is exactly analogous to using XSLT to generate HTML [hypertext markup language] from an XML document, the only difference being the target document type. It is during this transformation that all index generation processing occurs.
  2. An XSL-FO implementation is used to generate composed pages from the XSL-FO document generated in step 1. Specific implementations may provide proprietary extensions or provide extension points for use during this step. For example, both the XSL Formatter [XSF] and XEP [XEP] products provide extension elements for creating PDF [PDF]-specific structures that are not directly provided for in XSL. Unless there are proprietary extensions specific to index processing, this step does nothing specific for index generation.

This section describes the general process for index generation, ignoring internationalization issues. The next section then discusses the additional considerations that must be addressed for non-Roman national languages. Note that this paper, and this section in particular, reflects XSLT 1.0, which provides limited features for doing grouping. XSLT 2.0 will provide additional grouping features. At the time of writing we had not had time to evaluate the new grouping and sorting features of XSLT 2.0 and compare them against the requirements of index generation.

To generate an index the index entries must be grouped into the main index groups. Within each group, the primary entries must be sorted, secondary entries grouped with each primary entry and then sorted. Tertiary items, if used, must be grouped and sorted as for secondary entries.

All of this processing can be done using the standard XSLT key and sort features.

The key feature is used to group index entries, using a technique known as the Muenchian Method [Muench] (named after Steve Muench, who first publicly proposed the technique). The sort feature is, of course, used to sort the entries within their groups (index entry collation).

The style sheet excerpt in Figure 1 reflects the processing needed to handle the typical index markup pattern of an “index item” element that contains a primary entry, optionally followed by a secondary entry or a see or see-also, e.g.:

<para>An apple a day keeps the doctor away
<index.item>
  <primary>apple</primary>
  <secondary>as medicine</secondary>
</index.item>
</para>

The template in Figure 1 has been adapted from the Docbook XSLT style sheets developed by Norm Walsh and others [Docbook].

The use of XSLT keys is fairly straightforward. A “key” in XSLT is nothing more than a lookup table or dictionary where each entry in the key table is a node list of zero or more nodes. Each entry has a string key. Entries are created for each node in the input document that matches the match= specification for each key table.
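Conceptually, each key table behaves like the following Java structure (a sketch of the analogy only, not of how any particular XSLT processor is actually implemented):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.w3c.dom.Node;

// Each xsl:key declaration conceptually builds one of these tables: every
// node matching the match= pattern is filed under the string produced by
// evaluating its use= expression.
public class KeyTable {
    private final Map<String, List<Node>> entries = new HashMap<String, List<Node>>();

    public void file(String useValue, Node matchedNode) {
        List<Node> nodeList = entries.get(useValue);
        if (nodeList == null) {
            nodeList = new ArrayList<Node>();
            entries.put(useValue, nodeList);
        }
        nodeList.add(matchedNode);
    }

    // key('name', 'string') returns the node list filed under the string key.
    public List<Node> lookup(String keyString) {
        List<Node> nodeList = entries.get(keyString);
        return nodeList != null ? nodeList : new ArrayList<Node>();
    }
}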

For an index with primary, secondary, see, and see-also entries, seven key tables are required:
groups

Represents the top-level index groups. Each entry represents an index group. The entry members are the index marker nodes that sort into that group. The key string for each group is the presentation label for the group.

primary

Represents the full set of primary index entries. Each entry collects all the index markers for a given primary entry. The key string is the text content of the primary entry.

secondary

Represents the full set of secondary entries. Each entry collects all the index markers for a given secondary entry within a given primary entry. The key string is the concatenation of the primary entry content and the secondary entry content.

primary-marker

Represents the full set of primary entries with no secondary or see/see-also entries. The same as the “primary” key table except that it only includes index markers that are just primary entries.

secondary-marker

Represents the full set of secondary entries with no tertiary or see/see-also entries. The same as the “secondary” key table except that it only includes index markers that go no deeper than a secondary entry.

see

Represents the full set of “see” entries. Each entry collects the index markers that contain “see” entries. The string key is the concatenation of the primary entry content and the secondary entry content (if there is any).

see-also

Represents the full set of “see also” entries. Each entry collects the index markers that contain “see also” entries. The string key is the concatenation of the primary entry content and the secondary entry content (if there is any).

Producing the index is simply a matter of iterating over the set of groups, and, within each group, iterating over the primary, secondary, see, and see-also entries from the tables. The code fragment in Figure 1 shows the basic approach. This code sample is simplified. For a complete example, see the Docbook XSL-FO style sheet [Docbook], from which this code was adapted.

Figure 1: Generic Index Generation XSLT Template
<!DOCTYPE xsl:stylesheet [

<!ENTITY lowercase "'abcdefghijklmnopqrstuvwxyz'">
<!ENTITY uppercase "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'">

<!ENTITY primary   'concat(primary/@sortas,
                           primary[not(@sortas)])'>
<!ENTITY secondary 'concat(secondary/@sortas,
                           secondary[not(@sortas)])'>

<!-- Separator string used when concatenating key parts; any string
     that cannot occur in entry text will do. -->
<!ENTITY sep "'::'">
]>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format">
 ...

<!-- NOTE: key declarations are global to the style sheet -->
<xsl:key name="group"
         match="index.marker"
         use=”translate(substring(&primary;, 1,1),
                        &lowercase;,
                        &uppercase;)“/>
<xsl:key name="primary"
         match="index.marker"
         use="&primary;"/>

<xsl:key name="secondary"
         match="index.marker"
         use="concat(&primary;, &sep;, &secondary;)"/>

<xsl:key name="primary-marker"
         match="index.marker[not(secondary) and not(see)]"
         use="concat(&primary;, &sep;, generate-id())"/>

<xsl:key name="secondary-marker"
         match="index.marker[not(tertiary) and not(see)]"
         use="concat(&primary;, 
                     &sep;, &secondary;,
                     &sep;, generate-id())"/>

<xsl:key name="see-also"
         match="index.marker[see.also]"
         use="concat(&primary;, 
                     &sep;, &secondary;, 
                     &sep;, see.also)"/>

<xsl:key name="see"
         match="index.marker[see]"
         use="concat(&primary;, 
                     &sep;, &secondary;, 
                     &sep;, see)"/>

<xsl:template name="generate-index">
  <xsl:variable name="terms"
       select="//indexterm[count(.|key('letter',
                  translate(substring(&primary;, 1, 1),
                    &lowercase;,&uppercase;))[1]) = 1]"/>
  <xsl:variable name="alphabetical"
     select="$terms[contains(concat(&lowercase;,
                                    &uppercase;),
                      substring(&primary;, 1, 1))]"/>
  <xsl:variable name="others"
        select="$terms[not(contains(concat(&lowercase;,
                                           &uppercase;),
                          substring(&primary;, 1, 1)))]"/>
  <fo:block>
    <xsl:if test="$others">
      <fo:block font-size="16pt"
                font-weight="bold"
                keep-with-next.within-column="always"
                space-before="1em">
        <hptse:textBefore key="ixpregrp*"/>
      </fo:block>
      <fo:block>
        <xsl:apply-templates
                select="$others[count(.|key('primary',
                                     &primary;)[1]) = 1]"
                             mode="index-primary">
          <xsl:sort select="&primary;"/>
        </xsl:apply-templates>
      </fo:block>
    </xsl:if>
    <xsl:apply-templates
           select="$alphabetical[count(.|key('letter',
                   translate(substring(&primary;,1,1),
                   &lowercase;,&uppercase;))[1]) = 1]"
                         mode="index-div">
      <xsl:sort select="&primary;"/>
    </xsl:apply-templates>
  </fo:block>
</xsl:template>

Note how the select within the apply-templates works: the select is effectively iterating over every node, constructing its key string, selecting that entry from the specified key table, and then returning the first member of the entry's node list. This has the effect of passing to the mode-specific template the first member in each group. This technique is then applied for each primary entry in each group.

Note also the xsl:sort within each xsl:apply-templates. It is this sort that puts the groups in their appropriate order. The same pattern is used to process the primary and secondary entries, applying a sort at each level.

To adapt the above code to different document types, simply change the definitions of the &primary; and &secondary; text entities (and add the necessary code for tertiary entries, if needed). The main “generate-index” template is completely generic (at least for languages that use the Latin alphabet).

However, the code shown above accounts for neither locale-specific groupings nor locale-specific sorting.

Internationalizing The XSLT Index Processing

Internationalizing the XSLT index processing requires creating XSLT extension functions that provide access to the locale-specific index configuration information as well as providing locale-specific collators for sorting index entries. It also requires refinements to how the index entries are selected for processing when a single document may contain index entries in multiple languages.

Locale-Specific Groupings

Most non-Latin languages do not use the Latin alphabet to organize index entries3. In addition, many languages do not have a simple mapping from characters to index groups. For example, in Traditional Chinese, characters are indexed according to their stroke count. Thus the simple mechanism used in the Docbook style sheet to get a group key from a primary entry will not work in the general case. A general, locale-specific mechanism is required to provide a mapping from index entries to their groups.

Our analysis of the different languages we had to support led to the realization that there are two strategies for defining index group membership; these two strategies appear to be sufficient to handle all cases except those requiring grammatical or contextual processing of words.4

Each group has a required “group key,” an optional “sort key,” and an optional “group label.” The group key is the key used in the “groups” XSLT key table. The sort key is the key used to sort the groups into collation order for presentation in the generated index. It defaults to the group key. The group label is the text used as the title for the group in the generated index. It also defaults to the group key.

For languages with limited alphabets, such as European languages, it is easiest and clearest to explicitly define the membership of each index group. This approach is similar to that used, for example, by the Epic Composer product. In this approach, each group definition includes an explicit enumeration of the characters or character sequences that sort within that group.

For languages with large alphabets, such as most Asian languages, explicit group definition would be tedious and error prone. For these languages, it is easier to use the group keys as collation targets such that all the characters that sort between a given group's key and the next group's key are in the given group. For these languages the group key is usually different from the group label. For example, in Traditional Chinese, the group keys are the characters that start each stroke-count group but the group labels are the characters that say “One-Stroke Characters”, “Two-Stroke Characters”, etc. (see Figure 2; a sketch of this between-keys lookup follows the figure).

Figure 2: Index group definitions for Simplified Chinese (zh-CN)
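A minimal sketch of the between-keys lookup follows; the method name and the assumption that the group keys arrive as an array already in collation order are ours, not the actual library code:

import java.text.Collator;

public class BetweenKeysLookup {
    /**
     * Returns the index of the group that owns the entry: the last group
     * whose key collates at or before the entry. Assumes groupKeys is
     * already in collation order.
     */
    public static int findGroup(String entry, String[] groupKeys, Collator collator) {
        int group = 0;
        for (int i = 0; i < groupKeys.length; i++) {
            if (collator.compare(groupKeys[i], entry) <= 0) {
                group = i;
            } else {
                break; // keys are ordered, so no later key can match either
            }
        }
        return group;
    }
}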

To use this mechanism, we provide the extension functions “getGroupKey()” and “getGroupLabel()” that take as input a primary index entry and the target output language and return the key or label for that group.

Locale-Specific Collators

The XSLT recommendation provides a “sort” element. While the sort element can take a “lang” attribute to indicate which national language the items being sorted are in, the recommendation leaves it up to XSLT implementations to provide appropriate collators for particular languages. While languages like Java provide some built-in locale-specific collation support, that support is not comprehensive, and in any case, there may be application-specific differences in how collation is to be done for a given national language's index. That is, for many languages there is not a single standard for collation.

Thus, XSLT processors must provide a way to integrate new collators for use with the sort element if they are to be completely flexible. For this project we chose the Saxon XSLT processor, which is Java-based. We initially standardized on Saxon for no better reason than it was the one that most of the people on the team were using before we started to implement the internationalization support. However, it happens that Saxon provides an explicit extension point for integrating custom collators while Xalan, the other main Java-based XSLT processor, does not. We did not evaluate non-Java XSLT engines.

The Java language is particularly well suited to internationalization: it is completely Unicode based and it provides both locale-specific and generic collator objects that can easily be used to implement locale- and application-specific collation.

Java Locale objects represent specific named locales and are used to create locale-specific objects or as parameters to locale-specific functions. A locale is usually identified using an ISO [International Organization for Standardization] language code and, optionally, country code. For example, the locale “ja” identifies the Japanese language generally, while “ja-JP” identifies Japanese as spoken in Japan.

Java Collator objects encapsulate the rules for doing character sorting for one or more locales. Sun provides a number of built-in locale-specific collators with the Java run-time. However, this set of collators is not (and could not be) comprehensive. Thus you can construct your own Collators if necessary. Collator objects expose a “compare()” method that takes as input two Unicode strings and returns a value indicating whether the second string sorts before, after, or with the first string. The compare() method is controlled by the Collator's collation rule specification.

A collation rule specification is nothing more than a list of Unicode characters and their comparison rules: “< a < b < c”. The complete collation rule syntax provides for specifying character equivalence, secondary sort orders, and other details for collation. But conceptually it is just a list as shown.

Thus, creating a locale-specific or application-specific collator requires nothing more than creating the appropriate collation rule specification and constructing a collator with it.
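For example (a sketch using only the standard java.text classes), the Hungarian “dz” behavior described earlier can be expressed directly as a rule list, and built-in collators can be obtained by locale:

import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) throws ParseException {
        // A built-in locale-specific collator, where the Java run-time
        // provides one:
        Collator japanese = Collator.getInstance(new Locale("ja", "JP"));
        System.out.println(japanese.compare("\u3042", "\u3044") < 0); // true

        // A custom rule-based collator: "dz" and "dzs" are single logical
        // characters that sort after "d" (the Hungarian example from above).
        RuleBasedCollator hungarian =
            new RuleBasedCollator("< a < b < c < d < dz < dzs < e < z");
        System.out.println(hungarian.compare("daa", "dzz") < 0); // true
        System.out.println(hungarian.compare("dzz", "dzs") < 0); // true
    }
}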

For languages such as Chinese, with over 25,000 distinct characters in the Unicode character set, this would seem to be a daunting task. Fortunately there are a number of resources to simplify the task. First, you can get the collation rules from any of Java's built-in collators. Second, there are a number of Web sites that provide databases of characters in different sort orders. Most of these are mappings from language-specific code pages and character sets to Unicode. Given these mappings it is relatively easy to use a text editor to construct a collator rule specification. We found the Textpad [Textpad] and Unipad [Unipad] products invaluable for this task: Textpad for its regular expression search-and-replace and sorting features, Unipad for its support for Unicode editing and encoding conversion.
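The mechanical part of that editing job can also be automated; the sketch below (our own illustration, not the project's actual tooling) turns an ordered character list into a rule string:

import java.text.ParseException;
import java.text.RuleBasedCollator;

public class RuleBuilder {
    public static void main(String[] args) throws ParseException {
        // Hypothetical input: characters already sorted into the desired
        // collation order, e.g. extracted from a code-page-to-Unicode
        // mapping table.
        String[] ordered = { "\u4e00", "\u4e01", "\u4e03", "\u4e07" };

        StringBuilder rules = new StringBuilder();
        for (int i = 0; i < ordered.length; i++) {
            rules.append("< ").append(ordered[i]).append(' ');
        }
        RuleBasedCollator collator = new RuleBasedCollator(rules.toString().trim());
        System.out.println(collator.compare("\u4e00", "\u4e07") < 0); // true
    }
}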

Index Configuration Document

For each national language to be indexed, the groups must be defined and, if necessary, the collation rule specification must be provided. To satisfy this requirement we defined an XML document type, shown in Figure 3.

Figure 3: Index Configuration Document Type Definition
<!--=============================================================
    Index Generation Configuration Document Type

    Copyright (c) 2002, ISOGEN International, LLC


    This DTD governs documents that configure back-of-the-book
    index generation for different national languages.

    For a given national language index you need the following information:

    - The collation sequence for the characters in the language
    - The headings for the index sections (e.g., "A", "B", "C" in
      English)
    - The mapping of characters in the language to specific headings.
      For example, several different characters might all sort
      under "C" in a particular language.
    - For each sort grouping, the sort order within that grouping.
    =============================================================-->

<!ELEMENT index.configuration.set
  (metadata,
   index.configuration+)
>

<!ELEMENT metadata
  (revision.history)
>

<!-- Revision history is intended to capture the development history
     of the table itself.
  -->
<!ELEMENT revision.history
  (revision*)
>

<!-- The revision element describes a single revision to the table.
  -->
<!ELEMENT revision
  (date,
   author,
   description)
>

<!-- Contains the date the revision was made.
  -->
<!ELEMENT date
  (#PCDATA)
>

<!-- The author of the revision -->
<!ELEMENT author
  (#PCDATA)
>

<!--
     Defines the index configuration for a single
     national language.

  -->
<!ELEMENT index.configuration
  (language,
   description,
   collation.sequence,
   sort.strategies,
   index.groups)
>

<!ELEMENT sort.strategies
  ((sort.between.keys |
    sort.by.members)?,
   (sort.english.before |
    sort.english.after)?,
   (sort.groups.by.keys |
    sort.groups.by.labels)?)
>

<!-- Indicates that the grouping strategy is to sort between the
     index group keys rather than by the explicit group membership.
  -->
<!ELEMENT sort.between.keys
  EMPTY
>

<!-- Indicates that the grouping strategy is to sort by the explicit
     group membership. This is the default grouping strategy.
  -->
<!ELEMENT sort.by.members
  EMPTY
>

<!-- Indicates that English terms should be sorted before the main
     language's entries. This is the default.
  -->
<!ELEMENT sort.english.before
  EMPTY
>

<!-- Indicates that English terms should be sorted after the main
     language's entries.
  -->
<!ELEMENT sort.english.after
  EMPTY
>

<!-- Indicates that the groups should be sorted for presentation by
     their group key value. This is the default.
  -->
<!ELEMENT sort.groups.by.keys
  EMPTY
>

<!-- Indicates that the groups should be sorted for presentation by
     their group label value.
  -->
<!ELEMENT sort.groups.by.labels
  EMPTY
>

<!-- Contains the language code for the national language
     this index configuration applies to.

     Language code is as for xml:lang or specific to
     the application.
  -->
<!ELEMENT language
  (#PCDATA)
>

<!-- Generic description
  -->

<!ELEMENT description
  (para*)
>

<!ELEMENT para
  (#PCDATA)
>

<!ELEMENT collation.sequence
  (use.unicode |
   java.collator.spec)
>

<!-- Indicates that the Unicode-defined or Java locale-defined
     sort order should be used -->
<!ELEMENT use.unicode
  EMPTY
>

<!-- Contains a Java RuleBasedCollator
     (http://java.sun.com/products/jdk/1.2/docs/api/java/text/RuleBasedCollator.html)

     rule set.
  -->
<!ELEMENT java.collator.spec
  (#PCDATA |
   external.collator.spec)*
>

<!-- Specifies the name of the collation specification file to use. This file
     must contain a Java collation spec as used with the RuleBasedCollator
     class.

     The value is interpreted as a Java resource filename resolved relative to the
     class path. Normally this would be relative to the "dtds\" directory, which
     must already be in the classpath for other reasons.

     If more than one external.collator.spec is specified, then all the rule sets
     are used for the collator.

     If external.collator.spec is used, any other text content within java.collator.spec
     is ignored. That is, the intended content model is (#PCDATA | external.collator.spec+).
  -->
<!ELEMENT external.collator.spec
  (#PCDATA)
>

<!-- Defines the groupings used to organize the index entries -->
<!ELEMENT index.groups
  (index.group+)
>

<!-- Defines a single index group -->
<!ELEMENT index.group
  (group.key,
   group.label?,
   group.sort.key?,
   group.members)
>

<!-- The key used in organizing index entries into their proper groups.
     This is the value used, for example, for the top-level xsl:key
     use= string.

  -->
<!ELEMENT group.key
  (#PCDATA)
>

<!-- The text to be displayed for the index group in the composed index.

     The group label is optional. If not specified, the group key is also
     used as the group label.
  -->
<!ELEMENT group.label
  (#PCDATA)
>

<!-- Specifies the characters that make up this index group. Must include
     all case variants, e.g.: <group.members><char.set>a</char.set><char.set>A</char.set>
     </group.members>
  -->
<!ELEMENT group.members
  (char.set* |
   last.member)
>

<!-- Specifies one or more characters that constitute a single logical character
     within the group. For example, "ch" in Spanish would be a single char.set.
     Most char.set elements will contain single characters.
  -->
<!ELEMENT char.set
  (#PCDATA)
>

<!ELEMENT last.member
  (char.set)
>

A typical index configuration is shown in Figure 4.

Figure 4: Typical Index Configuration Document Instance
<index.configuration.set>
  <metadata>
    <revision.history>
      <revision>
        <date>12 Feb 2002</date>
        <author>W. Eliot Kimber</author>
        <description>
          <para>Created</para>
        </description>
      </revision>
    </revision.history>
  </metadata>
  <index.configuration>
    <language>en</language>
    <description>
      <para>Index configuration for English</para>
    </description>
    <collation.sequence>
      <use.unicode/>
    </collation.sequence>
    <sort.strategies/>
    <index.groups>
      <index.group>
        <group.key>A</group.key>
        <group.members>
          <char.set>a</char.set>
          <char.set>A</char.set>
        </group.members>
      </index.group>
      <index.group>
        <group.key>B</group.key>
        <group.members>
          <char.set>b</char.set>
          <char.set>B</char.set>
        </group.members>
      </index.group>
      <index.group>
        <group.key>C</group.key>
        <group.members>
          <char.set>c</char.set>
          <char.set>C</char.set>
        </group.members>
      </index.group>
      ...
    </index.groups>
  </index.configuration>
</index.configuration.set>

While the DTD [document type definition] as presented is the simplest thing that would work for this project, an obvious extension would be to provide explicit markup for storing individual index configurations in separate documents (or XInclude could be used). The DTD as shown was sufficient to satisfy the configuration needs of all the languages we had to support for this project.

Index Generation Support Extension Library

The Saxon-specific index generation support consists of two parts: a generic index support library and a thin Saxon integration library that binds the generic support library to Saxon-specific XSLT extension functions.

Generic Index Support Library

The generic index support library provides an IndexHelper class that exposes the following public methods:
void loadIndexConfiguration(String languageCode)

Loads the index configuration data for the specified language code (locale).

While the index configuration document may contain configuration information for many languages, the configuration for a given language is only loaded when needed.

String getGroupKey(String languageCode, String indexEntry)

Given an index entry and the language it is being indexed in, returns the group key for the group the index entry belongs in. If the entry sorts outside the range of the defined groups, the group key returned is “#NUMERIC”, indicating that the character is not an alphabetic character.

String getGroupLabel(String languageCode, String indexEntry)

Given an index entry and the language it is being indexed in, returns the group label for the group the index entry belongs in.

String getGroupSortKey(String languageCode, String indexEntry)

Given an index entry and the language it is being indexed in, returns the group sort key for the group the index entry belongs in.

java.text.Collator getCollator(String languageCode)

Returns the collator associated with the specified language's index configuration.
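A typical calling sequence looks like this (a usage sketch; the no-argument constructor is an assumption, since only the public methods above are documented here):

import java.text.Collator;

public class IndexHelperUsage {
    public static void main(String[] args) {
        IndexHelper helper = new IndexHelper(); // hypothetical constructor

        // Load the English configuration from the index configuration document.
        helper.loadIndexConfiguration("en");

        System.out.println(helper.getGroupKey("en", "apple"));   // "A"
        System.out.println(helper.getGroupLabel("en", "apple")); // "A" (defaults to the key)
        System.out.println(helper.getGroupKey("en", "42 ways")); // "#NUMERIC"

        Collator collator = helper.getCollator("en");
        System.out.println(collator.compare("ape", "baboon") < 0); // true
    }
}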

Most of the business logic in the library implementation supports the processing of the index configuration document. The IndexHelper also relies on a more general internationalization support library that provides a number of utility functions and also provides features for looking up translated text strings (primarily for creating language-specific generated text in presentation style sheets) [EPIC-I18N].

Saxon Integration Library

The Saxon integration library has two parts: wrapper functions that expose the getGroup*() methods as XSLT extension functions and a set of Saxon TextComparer objects that bind the xsl:sort element to the language-specific collators defined by the index configuration document.

The Saxon extension function implementation is very simple. Each exposed function is just a straight delegation to the corresponding method of the underlying IndexHelper object5.

The TextComparer objects are equally simple: each language-specific comparer class simply requests that language's Collator from the IndexHelper and then delegates the TextComparer's compare() method to the collator. A typical TextComparer is shown in Figure 5.

Figure 5: Typical Text Comparer Implementation
package com.icl.saxon.sort;

import com.isogen.i18nsupport.TranslationSupportUtil;
import com.isogen.i18nsupport.TranslationSupportUtilError;
import java.text.Collator;

public class Compare_ar extends com.icl.saxon.sort.TextComparer {

    static Collator collator;

    public Compare_ar() {
        super();
        try {
            collator = TranslationSupportUtil.getCollatorForLanguageCode("ar");
        } catch (TranslationSupportUtilError e) {
            e.printStackTrace();
        }
    }

    public int compare(java.lang.Object obj, java.lang.Object obj1) {                           
        return collator.compare(obj, obj1);
    }
    
}

Using the Index Support from XSLT

Using the indexing support from XSLT requires, at a minimum, two changes to the index generation code presented above: the specification of the group keys for use in populating and retrieving from the groups key table must use the getGroupKey() method, and the xsl:sort element must specify the national language for the index in order to invoke the appropriate collator. The revised generate-index template is shown in Figure 6.

Figure 6: Internationalized generate-index Template
<xsl:key name="group"
         match="index.marker"

use="indexHelper:getIndexGroupLabel(indexHelper:getIndexGroupKey(&primary;))"/>

<xsl:key name="primary"
         match="index.marker"
         use="&primary;"/>

<xsl:key name="secondary"
         match="index.marker"
         use="concat(&primary;, &sep;, &secondary;)"/>

<xsl:key name="primary-marker"
         match="index.marker[not(secondary) and not(see)]"
         use="concat(&primary;, &sep;, generate-id())"/>

<xsl:key name="secondary-marker"
         match="index.marker[not(see)]"
         use="concat(&primary;, &sep;, &secondary;, &sep;, generate-id())"/>

<xsl:key name="see-also"
         match="index.marker[see.also]"
         use="concat(&primary;, &sep;, &secondary;, &sep;, see.also)"/>

<xsl:key name="see"
         match="index.marker[see]"
         use="concat(&primary;, &sep;, &secondary;, &sep;, see)"/>

<xsl:template name="generate-index">

  <xsl:variable name="terms"
        select="//index.marker[count(.|key('group',
           indexHelper:getIndexGroupLabel(indexHelper:getIndexGroupKey(&primary;)))[1]) = 1]"/>

  <xsl:variable name="sort-lang">
     <!-- get-iso-lang-code translates application-specific
          language codes to the ISO language and country
          codes used by the Java Collators.
       -->
     <xsl:call-template name="get-iso-lang-code"/>
  </xsl:variable>

  <xsl:variable name="alphabetical"
                select="$terms[not(starts-with('#NUMERIC',
                                            indexHelper:getIndexGroupKey(&primary;)))]"/>
  <xsl:variable name="others"
        select="key('group', '#NUMERIC')"/>
  <fo:block>
    <xsl:if test="$others">
      <fo:block font-size="{$body-font-size}"
                font-weight="bold"
                color="{$hp-color}"
                keep-with-next.within-column="always"
                space-before="12pt">
        <hptse:textBefore key="ixpregrp*"/>
      </fo:block>
      <fo:block>
        <xsl:apply-templates select="$others[count(.|key('primary', &primary;)[1]) = 1]"
                             mode="index-primary">

          <xsl:sort lang="{$sort-lang}" select="&primary;"/>
        </xsl:apply-templates>
      </fo:block>
    </xsl:if>
    <xsl:apply-templates
        select="$alphabetical[count(.|key('group',
       indexHelper:getIndexGroupLabel(indexHelper:getIndexGroupKey(&primary;)))[1]) = 1]"
        mode="index-div">
      <xsl:sort lang="{$sort-lang}"
          select="indexHelper:getIndexGroupSortKey(indexHelper:getIndexGroupKey(&primary;))"/>
    </xsl:apply-templates>
  </fo:block>
</xsl:template>

To fully support internationalized documents that may contain multiple languages the logic that selects index markers for processing must be refined to handle the case where the same primary entry text is used for two different languages. In this case, all the primary and secondary entries will be in the key tables but only the entries for the specific output language should be processed. For example, if a document includes text for both English and Spanish, the primary entry text might be used in both languages:

<lang.alts>
  <para xml:lang="en">
    <index.item>
      <primary>standards</primary>
      <secondary>definitions</secondary>
    </index.item>
    ...
  </para>
  <para xml:lang="es">
    <index.item>
      <primary>standards</primary>
      <see>definiciones</see>
    </index.item>
    ...
  </para>
</lang.alts>

In this case, the string “standards” is the key for one entry in the primary key table containing two nodes, one for the English index.item and one for the Spanish index.item. If you only check the language of the first item in the key table entry, you will miss some entries when the target language is not the same as the language of the first entry. If you process all items blindly, you might create primary entries for which there are no secondary entries in a particular language. Therefore, you must check whether any of the index items for a particular primary key are in the target language, and if so, process only those that are in that language.
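Expressed outside of XSLT for clarity, the selection rule is (a Java sketch with illustrative names, not the project's style sheet code):

import java.util.ArrayList;
import java.util.List;

public class LanguageFilter {
    // Hypothetical holder for one index.marker and its effective xml:lang.
    static class Marker {
        final String lang;
        Marker(String lang) { this.lang = lang; }
    }

    /**
     * Given all the markers filed under one primary key, returns only those
     * in the target output language; an empty result means the primary
     * entry should not appear in that language's index at all.
     */
    static List<Marker> selectForLanguage(List<Marker> markers, String targetLang) {
        List<Marker> selected = new ArrayList<Marker>();
        for (Marker m : markers) {
            if (targetLang.equals(m.lang)) {
                selected.add(m);
            }
        }
        return selected;
    }
}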

Removing Duplicate Index Page Numbers

The index generation support extensions enable the generation of FO documents with back-of-the-book indexes in any language. However, they cannot solve the problem of duplicate page numbers in the final composed index.

This is because the index page numbers as generated in the formatting object document are represented by a sequence of fo:page-number-citation elements, one for each index marker for a given index entry. At the time the formatting object document is generated there is no way to know what page a given page-number-citation element will resolve to, because the citation is simply a reference to another formatting object somewhere else in the FO document, and there is no way of knowing before pagination what page a particular formatting object will fall on. Therefore duplicate page number citations cannot be eliminated in advance of page composition. This is a hard limitation of XSL, not a side effect of the indexing approach used in this project.

The ideal solution would be for FO implementations to provide a way to remove duplicate page numbers from sequences of page number citations. However, none of the available commercial tools provide such an extension6.

Therefore we were forced to do our own post-processing of the composed document. For this project the output target was PDF [portable document format]. PDF documents can be manipulated programmatically. For this task we selected a free Java library, PJ [PJ]. The PJ [PDF for Java] library can parse a PDF document and expose its contents through an API that directly reflects the low-level abstract structure (“COS layer”) of PDF documents as defined in the PDF specification.

The basic idea is simple enough: find the start of the index, then for each page of the index iterate over the page's contents looking for numbers followed by commas, copying the data to a new page content structure. Whenever a number is the same as the preceding number, skip it, thereby eliminating it from the output. At the end of each index page, replace the existing page with the new page data. However, the actual algorithm required is quite complex because of the wide variety of ways the same string can be represented in a PDF document.
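The core reduction is simple when expressed over an already-decoded string (a simplified sketch; the real implementation operates on PJ's low-level COS objects rather than plain strings):

import java.util.ArrayList;
import java.util.List;

public class PageNumberReducer {
    /** Reduces "4, 8, 12, 12, 12" to "4, 8, 12". */
    static String reduce(String pageList) {
        List<String> kept = new ArrayList<String>();
        String previous = null;
        for (String token : pageList.split(",\\s*")) {
            if (!token.equals(previous)) { // skip a number equal to its predecessor
                kept.add(token);
            }
            previous = token;
        }
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < kept.size(); i++) {
            if (i > 0) result.append(", ");
            result.append(kept.get(i));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(reduce("4, 8, 12, 12, 12")); // 4, 8, 12
    }
}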

Using the PJ library we faced two main challenges. First, the PJ library as provided had some bugs in its parsing algorithm that made invalid assumptions about the data stream.

Second, the representation of non-Latin-1 languages in PDF doesn't use UTF-8 or a similar encoding. Instead, for each non-ASCII character on a page, it stores a code that represents the appropriate glyph to use from a declared font (referred to as a “character code”). Since PDF is defined as a formatting language, this is an appropriate thing to do.

In order to do any semantic string processing, we must be able to navigate to actual Unicode characters. This is accomplished through a per-page mapping from character codes to the corresponding Unicode characters. This mapping is embedded in the PDF file itself.
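Conceptually the decoding step looks like the following (a sketch with invented names; PJ's actual API and the PDF font machinery are considerably more involved):

import java.util.HashMap;
import java.util.Map;

public class GlyphDecoder {
    /**
     * Applies a per-page character-code-to-Unicode map to a run of glyph
     * codes extracted from the page's content stream.
     */
    static String decode(int[] glyphCodes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : glyphCodes) {
            String mapped = toUnicode.get(Integer.valueOf(code));
            // U+FFFD (the replacement character) marks unmapped codes.
            text.append(mapped != null ? mapped : "\uFFFD");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<Integer, String>();
        map.put(Integer.valueOf(49), "1");
        map.put(Integer.valueOf(50), "2");
        System.out.println(decode(new int[] { 49, 50 }, map)); // "12"
    }
}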

Thus, to implement the algorithm to search for content strings we first had to fix the parsing bugs in PJ and then implement the additional code needed to get from a sequence of font codes to Unicode strings within the PDF document.

Given these extensions to PJ we were then able to post-process PDF documents in any language to remove duplicate page numbers.

Conclusions

The XSLT extensions needed to support index generation were relatively easy to implement. We invested about three person-weeks developing the IndexHelper library and about one person-week implementing the PDF post-processor.

While the need to post-process the composed PDF documents to produce a finished index is unfortunate, we were able to implement the post-processing using available tools (and without the need to write an Acrobat plug-in, which, while a more robust solution in the abstract, would have taken significantly more time because of the need to implement it as a C++ extension).

We also expect the commercial FO implementations to provide a built-in solution to page number sequence reduction in the short term, if for no other reason than that doing so is a requirement to be fully competitive with existing page composition systems.

We have not evaluated the performance implications of the indexing support. The use of XSL by itself presumes that performance is not a top concern. The index processing does not appear to add significant processing time. The PDF post-processing is slower than doing the same processing in the composition engine would be. It is also slower (we presume) than doing the same processing using a C++ Acrobat plugin, simply because the Java code is performing lots of string processing that Acrobat presumably optimizes.

Our experience so far has not exposed any requirement to do more than simple string matching in order to correctly collate index entries for the set of languages we have had to support to date. However, it is quite possible that some languages require more sophisticated textual analysis in order to do correct collation. For such languages, the complexity of the collator would be much greater and would require either a specialized, locale-specific Collator implementation or extensions to the index configuration document type and IndexHelper objects. However, we would expect (or at least hope) that packages such as IBM's ICU4J [ICU4J] would address such challenging languages and provide ready-made processors (as they have, for example, for doing Thai word breaking).

The XSLT index generation processing coupled with the IndexHelper index support extensions appear to provide a complete solution for generating back-of-the-book indexes in any national language for which there is Unicode support. The index configuration document has proven to be relatively easy to create and maintain, given knowledge of the rules for a particular language. The creation of custom collation rules for languages like Korean and Simplified Chinese turned out to be easier than expected because of the availability of online databases of Unicode mappings for these characters. Through the use of tools like Textpad and Unipad it was not difficult to organize the collation rules for any of the languages we had to support.

Notes

1.

The commercial tools include XSL Formatter from Antenna House (www.antennahouse.com) and XEP from RenderX (www.renderx.com). The main open source implementation is the FOP system, part of the Apache XML project (http://xml.apache.org/fop/index.html).

2.

I put “difficult” in quotes because these languages are only difficult from a provincial Western 7-bit ASCII [American Standard Code for Information Interchange] perspective. While all these languages present their own complexities, none of them are particularly intractable, at least in the context of the computing infrastructures available in 2002. They wouldn't be difficult at all if, for example, our early computing infrastructure had been developed primarily in China rather than North America and Europe.

3.

One notable exception is Simplified Chinese, which groups ideographic characters based on their Pinyin transliterations, which are written using the Roman alphabet.

4.

In fact, all the group configuration could be done through Java collation rules. However, we decided that it would be easier for most people to manage the configuration of Latin languages using the XML configuration file.

5.

The implementation presented here is written against the Saxon 6.x API [application programming interface], not the Saxon 7 API. The Saxon 7 API changes the way that collators are integrated, making it possible to have a single collator object that can support any number of languages.

6.

This requirement has been communicated to both Antenna House and RenderX. At the time this paper was submitted (June 2002), Antenna House had provided a possible implementation for us to test. By the time of XML 2002, the PDF post-processing step should no longer be necessary.


Bibliography

[3B2] Advent 3B2 page composition system. http://www.3b2.com/index.html

[Docbook] Docbook XSL-FO style sheet.

[EPIC] Epic Composer page composition system (an optional feature of the Epic Publisher SGML/XML editor). http://www.arbortext.com

[EPIC-I18N] Internationalization of Generated Text For Epic Editor and XSLT Processing, www.isogen.com/papers (prepared for submission to XML 2002 conference).

[ICU4J] International Components for Unicode for Java. http://oss.software.ibm.com/icu4j/

[Muench] The Muenchian Method of using XSLT groups. http://www.jenitennison.com/xslt/grouping/muenchian.html

[PDF] Portable Document Format. http://www.adobe.com

[PJ] PDF for Java (PJ), GNU-licensed PDF library. http://www.etymon.com/

[Textpad] Textpad text editor product. http://www.textpad.com

[UNICODE] The Unicode® Standard. http://www.unicode.org

[Unipad] Unipad Unicode-based text editor product. http://www.unipad.org

[XEP] RenderX XEP product. http://www.renderx.com

[XPP] XyEnterprise XPP page composition system. http://www.xyenterprise.com/xpp.asp

[XSF] AntennaHouse XSL Formatter product. http://www.antennahouse.com

[XSL] Extensible Stylesheet Language (XSL), http://www.w3.org/TR/xsl/

[XSLT] XSL Transformations (XSLT) Version 1.0, http://www.w3.org/TR/xslt


