Localization of Schema Languages

Felix Sasaki
fsasaki@w3.org

Abstract

This paper discusses requirements and solutions for the localization of schema languages. Main requirements are the adaptation of markup names and documentation, the modification of data types and the integration of information which is relevant for internationalization and localization. Existing approaches which respond to these requirements are integrated into a general framework of schema language localization, which can be applied to XML Schema, RELAX NG and XML DTD. In addition, the approach allows for relating instances of localized schemas to instances of the general, locale independent schema. In this way, a common level of data processing is maintained.

Keywords: Schema Languages; Modeling

Felix Sasaki

Until 1999, Felix Sasaki studied Japanese and Linguistics in Berlin, Germany. From 1999 until 2005 he worked in the Department for Computational Linguistics and Text Technology in Bielefeld, Germany. On 1 April 2005, he joined the W3C Internationalization Activity. Since June 2006, he has also been working in the W3C Web Services Activity.

Felix Sasaki [World Wide Web Consortium]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Felix Sasaki. Reproduced with permission.

Introduction

This paper discusses requirements and solutions for the localization of schema languages. Some example requirements are illustrated using the XML Schema document in fig. 1.

Figure 1
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://example.com/purchaseOrder"
 xmlns:po="http://example.com/purchaseOrder">
 <xs:element name="name" type="xs:string"/>
 <xs:element name="street" type="xs:string"/>
 <xs:element name="city" type="xs:string"/>
 <xs:element name="state" type="xs:string"/>
 <xs:element name="zip" type="xs:string"/>
 <xs:element name="country" type="xs:string"/>
 <xs:element name="price">
  <xs:complexType>
   <xs:simpleContent>
    <xs:extension base="xs:decimal">
     <xs:attribute name="currency" type="xs:string"/>
    </xs:extension>
   </xs:simpleContent>
  </xs:complexType>
 </xs:element>
 <xs:element name="language" type="xs:string"/>
 <xs:element name="comment">
  <xs:complexType mixed="true">
   <xs:sequence>
    <xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
 <xs:element name="shipTo">
  <xs:complexType mixed="false">
   <xs:sequence>
    <xs:element ref="po:name"/>
    <xs:element ref="po:street"/>
    <xs:element ref="po:city"/>
    <xs:element ref="po:state"/>
    <xs:element ref="po:zip"/>
    <xs:element ref="po:country"/>
    <xs:element ref="po:price"/>
    <xs:element ref="po:language"/>
    <xs:element ref="po:comment" minOccurs="0"/>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
 <xs:element name="purchaseOrder">
  <xs:complexType mixed="false">
   <xs:sequence>
    <xs:element ref="po:shipTo"/>
   </xs:sequence>
   <xs:attribute name="orderDate" type="xs:date" use="required"/>
  </xs:complexType>
 </xs:element>
</xs:schema>

Example schema

The XML Schema document is motivated by the purchase order example introduced in XML Schema Part 0: Primer [XML Schema 0]. There are several possible targets for localization in fig. 1:

  • Translation of names for elements and attributes. For example, <purchaseOrder> can be translated to <Kaufbestellung> for a German schema user. Translation may seem unimportant for common names like <p> or <head>. However, the more specific a markup vocabulary gets, the more important it becomes to ensure that non-native speakers clearly understand the markup names.
  • Modification of data types. For example, enumeration lists can be modified. A possible target is the data type of the <country> element in fig. 1: A territory name like US could be Vereinigte Staaten for a German user or アメリカ合衆国 for a Japanese user. Other targets in the example schema are currency names in the currency attribute of the <price> element, or language names in the <language> element.
    Date-related data types like date or dateTime could be modified as well. For example, the date value 2007-03-16 could be presented to a German user as 16. März 2007, and to a Japanese user as 2007年3月16日.
  • Document type specific information about internationalization and localization. An example of such information is a description of (non)translatability of textual content. For example, in an instance of the schema in fig. 1, only the content of the <comment> element might need translation.

It is not difficult to implement these requirements ad hoc in a given schema: a schema user just needs to translate all element names, modify data types as appropriate, and provide information about translatability for various element and attribute types. However, some opportunities and needs in the localization of schemas are better addressed with mechanisms that are applicable to schema languages in general:

  • Using existing locale data. A growing amount of locale data has become available in recent years. This allows for implementing the requirements mentioned above in a general manner and reusing the implementation for a variety of target locales. This paper uses data from CLDR [Common Locale Data Repository] [CLDR], especially for the modification of data types. CLDR data is represented in a format called LDML [Locale Data Markup Language] [LDML]. The CLDR data and the LDML format have not been developed for a specific application environment; hence some issues arise when applying them to data types in schema languages. These will be discussed in sec. “Overview of CLDR and LDML”.
  • Applying localization specific information to various schema languages. Localization information, once specified, should be applicable to different schema languages. This paper provides a framework for applying the same information to XML Schema, RELAX NG, or XML DTDs.
  • Providing a mapping of localized information to the locale-unspecific representation. Instance documents of the localized schema need to be transformed to instance documents of a general schema. That means that the <Kaufbestellung> element needs to be recognizable as a <purchaseOrder> element, locale-specific country names (or currency and language names) need to be related to locale-independent identifiers, and a localized date needs to be converted to a non-localized lexical representation. This mapping makes general, locale-independent data access possible1.

The paper is organized as follows: sec. “Background” describes basic terms like locale and locale identification, and introduces the existing approaches and data on which this paper builds. Sec. “Localization of Schema Languages: Realization” describes a format which integrates the existing approaches and allows for using locale data in schema localization. The paper finishes with a description of the implementation of the approach in sec. “Implementation” and a summary and outlook in sec. “Summary and Outlook”.

Background

Internationalization, Localization and Locale

Internationalization is the process of making a product ready for its global use. Localization is the process of the actual adaptation of the product to a specific locale, that is a country, region or market. See the definition of Localization vs. Internationalization [i18n l10n] for further information on these terms.

Taking the area of schema languages, an example for schema internationalization is to provide markup to express directionality of text in scripts with mixed directionality. With such markup, document instances of the schema can be created by users who work with Arabic or Hebrew texts. The translation of element and attribute names mentioned in sec. “Introduction” can be part of the localization of the schema.

A key definition for this paper is the notion of a locale. LDML describes a locale as follows:

[...] a locale is an identifier (id) that refers to a set of user preferences [which ...] provide support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for timezones, languages, countries, and scripts. They can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services.

Unfortunately there is no general agreement about what a locale is. For example LDML and the POSIX locale model differ in many areas. Also, it is not clear whether language should be the center of the locale. Although this is often adequate, there are cases like time zone information, which are independent of a specific language or a distinguished set of languages.

This paper mainly follows LDML by using language as the center of a locale. Nevertheless, there are examples in the paper which do not rely on language like the description of locale specific currency names.

Locale Identification

Since a clear definition of the term locale is not possible, a problem arises: How is a locale identified? Or to put it differently: How can the choice of German versus English translated names, modified data types or other localization information like in fig. 1 be made explicit?

The solution is to use the locale identifiers defined in LDML. Since LDML puts language at the core of its locale model, it naturally applies a standard for language identification and for matching of so-called language tags: the IETF BCP 47 [Best Current Practice 47] [BCP 47]. Previously, BCP 47 was represented by RFC 3066 [RFC 3066]. Recently RFC 3066 was replaced by RFC 4646 [RFC 4646] and RFC 4647 [RFC 4647]. RFC 4646 describes the structure of a language tag, and RFC 4647 describes requirements for matching of language tags. Parts of the syntax of an RFC 4646 language tag and its application in LDML are introduced in fig. 2.

Figure 2
RFC 4646 SYNTAX (part)
Language-Tag  = langtag
                 / privateuse ; private use tag
                 / grandfathered; grandfathered 
   langtag       = (language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse])
 
LDML LOCALE IDENTIFIER
locale_id := base_locale_id options?
base_locale_id := extended_RFC3066bis_identifiers
options := "@" key "=" type ("," key "=" type )*
SAMPLE IDENTIFIER
de_DE@collation=phonebook,currency=DDM

Part of the Syntax of an RFC 4646 language tag, and its application in an LDML locale identifier

RFC 4646 defines a syntax for language tags. These consist of one or several subtags. The subtags provide information about language (language), script (script), region (region) and variants (variant). The values of these subtags are registered in the IANA Language Subtag Registry (http://www.iana.org/assignments/language-subtag-registry). In addition, there are extension (extension) subtags and subtags for private use (privateuse), which are not part of the subtag registry.

LDML extends such a language tag with zero or more keys. In fig. 2 there are keys for a collation (German phonebook order) and the currency (DDM which means "East German Ostmark"). Another difference is that LDML uses the delimiter _ instead of - between subtags.
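The structure of such an identifier can be illustrated with a minimal Python sketch. The function name parse_locale_id is hypothetical, and the sketch only splits the string along the delimiters described above; it does not check the subtags against the registry.

```python
def parse_locale_id(locale_id):
    """Split an LDML-style locale identifier into subtags and key/type options."""
    base, _, option_part = locale_id.partition("@")
    subtags = base.split("_")  # LDML uses "_" where RFC 4646 uses "-"
    options = {}
    if option_part:
        for pair in option_part.split(","):
            key, _, value = pair.partition("=")
            options[key] = value
    return subtags, options


subtags, options = parse_locale_id("de_DE@collation=phonebook,currency=DDM")
print(subtags)   # ['de', 'DE']
print(options)   # {'collation': 'phonebook', 'currency': 'DDM'}
```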

The requirements of schema language localization described in sec. “Introduction” can make use of various parts of a locale identifier:

  • Translation of names for elements and attributes: The translation might be specific to language, script or region. For example, the translation of the element name <city> can target the Japanese language identified via the locale_id "ja", that is, a language subtag. An adequate translation would be 都市. If the translation should use a Latin transliteration, the script Latn needs to be specified. The appropriate locale identifier would then be locale_id "ja_Latn", and the translation would be toshi. An example of the combination of language and region identifier in a locale is locale_id "en_US" versus locale_id "en_GB". These help to differentiate e.g. color from colour.
  • Modification of data types: the CLDR data which will be used for data type modification mainly provides distinctions related to language, for example locale specific definitions for date values in Japanese locale_id "ja" versus German locale_id "de". Hence the language subtag will be used as a locale identifier for indicating the appropriate set of locale data.
    CLDR uses other subtags to specify items to be localized within each set of locale data. For the region subtag, for example, the corresponding items within the locale data are locale display names. In this paper, the locale display names for regions will be applied to localize the <country> element in fig. 1, and the locale display names for languages will be used to localize the <language> element2.
  • Document type specific information about internationalization and localization: For this task, the whole set of subtags provided by RFC 4646 will be applicable. The reason is that the appropriate locale identifier depends very much on the granularity of the localization project. This differs from the name translation and data type modification tasks mentioned above.

Input from Existing Approaches to XML Localization

The following subsections describe existing approaches towards schema localization. These will be introduced here, modified and later used for a general schema language localization approach.

The TEI Approach towards Markup Language Localization

The TEI ODD format [TEI ODD] created by the TEI [Text Encoding Initiative] is used for a literate programming approach towards markup languages. An ODD [One Document Does It All] document provides both markup declarations and their documentation. ODD is used for the creation of the TEI guidelines (that is, the TEI documentation and schemas in the schema languages XML Schema, RELAX NG and XML DTD) themselves. But it is also applied within the Internationalization Tag Set 1.0 specification [ITS 10], see sec. “Information about XML Localization (and Internationalization): ITS 1.0”.

For localization, ODD provides facilities for renaming and adaptation of documentation, as described in a presentation on the internationalization and localization of the TEI [TEI LOC]. The former are relevant for this paper and are exemplified in fig. 3. The element declaration of <city> is associated with translations into German and Japanese.

Figure 3
<define name="city-elem">
 <elementSpec ident="city">
  <altIdent xml:lang="de">Stadt</altIdent>
  <altIdent xml:lang="ja">都市</altIdent> ....</elementSpec>
</define>

Element renaming with ODD

To be able to use translated elements in content models, the ODD approach keeps the names of RELAX NG patterns, like the city-elem pattern above. The content models refer only to these patterns and not to element declarations directly. This approach works since the ODD format is processed one-way: from an ODD document to generated schemas. Hence, there is no need to provide a mechanism for the localization of global element declarations.

This paper differs from the TEI approach by providing such a mechanism, see sec. “Adaptation of Names”. Only in this way is it possible to localize existing schemas. This paper follows the TEI approach by not changing encapsulation mechanisms like patterns in RELAX NG schemas, names of groups and type definitions in XML Schema, or entities in XML DTDs.

Overview of CLDR and LDML

LDML is an XML format to represent locale information. It provides the structure for locale data in CLDR. See http://unicode.org/cldr/repository_access.html for the latest deliverable of CLDR (which includes a specification describing LDML).

A part of LDML is the definition of locale identifiers introduced in sec. “Locale Identification”. LDML defines an inheritance and overriding model for locale identifiers. For example the locale locale_id "en" defines the display name for the currency USD as US Dollar. The locale locale_id "en_US" (English in the territory of the United States) overrides this definition and uses the display name $. In this paper only the locales directly following the neutral root locale are used.
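The inheritance and overriding model can be sketched in Python as a lookup along a fallback chain. This is a simplified illustration: the dictionary LOCALE_DATA and its key format are hypothetical stand-ins for LDML files, and only truncation of subtags followed by the root locale is modeled.

```python
# Hypothetical in-memory excerpt of locale data; real CLDR data is stored in LDML files.
LOCALE_DATA = {
    "root": {},
    "en": {"currency/USD/displayName": "US Dollar"},
    "en_US": {"currency/USD/displayName": "$"},
}


def fallback_chain(locale_id):
    """Return the locale followed by its ancestors, ending with root."""
    parts = locale_id.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + ["root"]


def lookup(locale_id, key):
    """Return the first value found along the fallback chain, or None."""
    for candidate in fallback_chain(locale_id):
        value = LOCALE_DATA.get(candidate, {}).get(key)
        if value is not None:
            return value
    return None


print(fallback_chain("en_US"))                       # ['en_US', 'en', 'root']
print(lookup("en_US", "currency/USD/displayName"))   # $
print(lookup("en_GB", "currency/USD/displayName"))   # US Dollar
```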

An example of CLDR data represented in LDML is given in fig. 4.

Figure 4
<ldml>
 <identity> [...] <language type="en"/>
 </identity>
 <localeDisplayNames>
  <languages>
   <language type="de">German</language> [...] </languages>
  <scripts>
   <script type="Latn">Latin</script> [...] </scripts>
  <territories>
   <territory type="DE">Germany</territory> [...] </territories>
  <variants>
   <variant type="1901">Traditional German orthography</variant>
   <variant type="1996">German orthography of 1996</variant> 
 [...] </variants>
 </localeDisplayNames>
 <numbers>
  <currencyFormats>
   <currencyFormatLength>
    <currencyFormat>
     <pattern>¤#,##0.00</pattern>
    </currencyFormat>
   </currencyFormatLength>
  </currencyFormats>
  <currencies>
   <currency type="USD">
    <displayName>US Dollar</displayName>
   </currency>
  </currencies>
 </numbers>
 <dates> [...] <calendars>
   <calendar type="gregorian">
    <months>
     <monthContext type="format">
      <monthWidth type="wide">
       <month type="1">January</month>
      </monthWidth> [...] </monthContext>
    </months>
    <eras>
     <eraNames>
      <era type="0">Before Christ</era>
     </eraNames> [...] </eras>
    <dateFormats>
     <dateFormatLength type="full">
      <dateFormat>
       <pattern>EEEE, MMMM d, yyyy</pattern>
      </dateFormat>
     </dateFormatLength> [...] </dateFormats>
   </calendar>
  </calendars>
 </dates>
</ldml>

Examples of CLDR data given in LDML

The example is an excerpt from CLDR data for the locale locale_id "en". The locale is identified via the <identity> element as being specific to English, using the nested element <language type="en"/>. For the locale locale_id "en_US", there would be another nested element <territory type="US"/>.

The <localeDisplayNames> contains display names for languages, territories, variants etc. The <numbers> element contains information about currency formatting and currency display names. Date and time related information is represented in the <dates> element. Each set of date and time related information is specific to a calendar, for example <calendar type="gregorian">. For various parts of date and time tokens like years, months, days etc., there are lists of lexical items and patterns which make use of these. See the complete list of fields in patterns at http://www.unicode.org/reports/tr35/tr35-7.html#Date_Format_Patterns. An example lexical item is <month type="1">January</month> used within <monthWidth type="wide">. An example <pattern> is the <dateFormat> of the type <dateFormatLength type="full">. The <pattern> is EEEE, MMMM d, yyyy, which reads as:

  • a week day field entry EEEE like Monday
  • a month field entry of the type wide (MMMM) like February
  • a day field entry of the type Day of the month (d) like 16
  • a year field entry like 2007.

The comma (,) is used as a separator. An example date would be Wednesday, April 18, 2007.
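How such a pattern combines fields and lexical items can be shown with a minimal Python sketch. The tables MONTHS_WIDE and DAYS_WIDE are hypothetical excerpts of the English lexical items, the function name format_date is an assumption, and only the four fields of the pattern above are supported.

```python
import re
from datetime import date

# Hypothetical excerpt of the English lexical items; CLDR lists all months and days.
MONTHS_WIDE = {4: "April"}
DAYS_WIDE = {2: "Wednesday"}  # keyed by date.weekday(), where Monday is 0


def format_date(value, pattern):
    """Format a date using only the fields EEEE, MMMM, d and yyyy."""
    fields = {
        "EEEE": lambda d: DAYS_WIDE[d.weekday()],
        "MMMM": lambda d: MONTHS_WIDE[d.month],
        "yyyy": lambda d: "%04d" % d.year,
        "d": lambda d: str(d.day),
    }
    # Every other character (comma, space) is copied unchanged.
    # The quoting rules of LDML for literal text are not handled here.
    return re.sub("EEEE|MMMM|yyyy|d", lambda m: fields[m.group(0)](value), pattern)


print(format_date(date(2007, 4, 18), "EEEE, MMMM d, yyyy"))
# Wednesday, April 18, 2007
```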

Using CLDR data for localization of XML Schema data types leads to a problem: In some areas, the CLDR data is semantically richer than related XML Schema data types. For example, there is no weekday information directly represented in the lexical space of the XML Schema data type date3. Nevertheless, for the purpose of this paper, this problem is not relevant: A user can produce semantically rich information using the localized data type. Only for the purpose of mapping localized information to the locale unspecific representation described in sec. “Introduction”, information like the weekday will not be taken into account.

Another type of problem arises with CLDR data which does not map directly to a related XML Schema data type. An example is the Japanese calendar data which is part of the locale locale_id "ja" and identified as <calendar type="japanese">. This calendar separates years into 235 eras. Each era starts with a new Emperor. The Gregorian year 2007 maps to the year 19 of the era HEISEI. However, the mapping relation (i.e. the first year of HEISEI maps to 1989) is not available in CLDR.

A solution to this problem might be to apply additional data, which is currently not part of CLDR. For example the mapping data for the Japanese calendar is available at http://source.icu-project.org/repos/icu/icu/trunk/source/i18n/japancal.cpp. The approach described in this paper currently relies only on CLDR.
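The kind of additional data needed can be illustrated with a small Python sketch. The table ERA_START is an assumption supplied for illustration only: it takes year 1 of HEISEI to be Gregorian 1989, which is consistent with 2007 being year 19.

```python
# Assumed era start year (not part of the CLDR data discussed here):
# year 1 of HEISEI is taken to be Gregorian 1989.
ERA_START = {"HEISEI": 1989}


def era_year_to_gregorian(era, era_year):
    """Map a year within an era to the Gregorian year."""
    return ERA_START[era] + era_year - 1


def gregorian_to_era_year(era, gregorian_year):
    """Map a Gregorian year to the year within the given era."""
    return gregorian_year - ERA_START[era] + 1


print(era_year_to_gregorian("HEISEI", 19))    # 2007
print(gregorian_to_era_year("HEISEI", 2007))  # 19
```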

To summarize, for the purposes of data type localization and mapping of localized data types to locale unspecific representations, CLDR provides a great variety of data. However, not all of this data can be used "as is", since it might be semantically too rich or the mapping would need more information than available. Hence, the data type modification has to be specific to a locale and an adequate part of the locale data.

Information about XML Localization (and Internationalization): ITS 1.0

ITS 1.0 is a standard4 for the expression of information related to Internationalization and Localization. It defines 7 data categories which convey different kinds of information:

  • Translate: Which element or attribute content needs (or does not need) to be translated?
  • Localization Note: notes to localizers, e.g. "how to translate a specific part of content", or "expand on the meaning or contextual usage of a specific element,"
  • Terminology: identification of terms and association with additional information like definitions
  • Directionality: specification of the base writing direction of blocks, embeddings and overrides
  • Ruby: short annotation of a base text with e.g. reading information
  • Language Information: Expressing the language of a piece of content
  • Elements Within Text: element nesting information necessary for e.g. basic text segmentation used by translation memory systems.

These data categories can be implemented locally or globally. An example of both approaches for the Translate data category is given in fig. 5.

Figure 5
DOCUMENT 1 (local ITS markup):
<help xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
 <head>
  <title>Building the Zebulon Toolkit</title>
 </head>
 <body>
  <p>To re-compile all the modules of the Zebulon toolkit
you need to go in the
<path its:translate="no">\Zebulon\Current Source\binary</path> directory.
Then from there, run batch file
<cmd its:translate="no">Build.bat</cmd>.</p>
 </body>
</help>
DOCUMENT 2 (global ITS rules):
<help>
 <head>
  <title>Building the Zebulon Toolkit</title>
  <its:rules version="1.0"  
   xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
   <its:translateRule selector="//path | //cmd" translate="no"/>
  </its:rules>
 </head>
 <body>
  <p>To re-compile all the modules of the Zebulon toolkit
you need to go in the
<path>\Zebulon\Current Source\binary</path> directory.
Then from there, run batch file <cmd>Build.bat</cmd>.</p>
 </body>
</help>

Example of the Translate data category implemented locally and globally

In the example documents, the content of the <path> and the <cmd> elements should not be translated. In the first document, this information is conveyed locally via the ITS attribute its:translate="no" on the respective elements. In the second document, an ITS global rule <its:translateRule> is used for the same purpose. Global rules make use of XPath to select pieces of markup, e.g. via the selector="//path | //cmd" attribute. In this way, the global implementation of data categories is independent of a position in a target document.
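How a processor might evaluate such a global rule can be sketched in Python with the standard library. This is a simplification and not an ITS implementation: the function name non_translatable_elements is hypothetical, the sample document is a shortened variant of the second document, and only selectors that are unions of "//name" steps are handled, since xml.etree.ElementTree supports just a subset of XPath.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

DOC = """<help>
 <head>
  <title>Building the Zebulon Toolkit</title>
  <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its">
   <its:translateRule selector="//path | //cmd" translate="no"/>
  </its:rules>
 </head>
 <body>
  <p>Go to the <path>\\Zebulon\\Current Source\\binary</path> directory
  and run <cmd>Build.bat</cmd>.</p>
 </body>
</help>"""


def non_translatable_elements(root):
    """Collect elements selected by translateRule elements with translate="no"."""
    selected = []
    for rule in root.iter("{%s}translateRule" % ITS_NS):
        if rule.get("translate") != "no":
            continue
        # Simplification: treat the selector as a union of "//name" steps.
        for step in rule.get("selector").split("|"):
            name = step.strip().lstrip("/")
            selected.extend(root.iter(name))
    return selected


root = ET.fromstring(DOC)
print([element.tag for element in non_translatable_elements(root)])
# ['path', 'cmd']
```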

This paper will make use of ITS information within element or attribute declarations. That is, the information will be relevant for a document type and all its instances. This approach of ITS information on the schema level makes sense for the ITS 1.0 data categories Translate, Localization Note, Terminology and Elements Within Text. The remaining data categories Directionality, Ruby and Language Information will not be applied on the schema level, since in their case it is unlikely that every instance of an element or attribute has the same ITS 1.0 data category related information5.

Summary on Input to Schema Localization and Potential Implementation Mechanisms

The following table summarizes the input of existing approaches to (schema) localization and their modification made within this paper.

Table 1
Requirement | Input Approach | Modification
Translation of names for elements and attributes | TEI ODD (see sec. “The TEI Approach towards Markup Language Localization”) | Providing a mechanism for translation of global element declarations
Modification of data types | CLDR (see sec. “Overview of CLDR and LDML”) | Omitting CLDR categories which are semantically too rich or which lack information for a mapping to locale-unspecific data types; for this purpose, providing access specific to a locale and an adequate part of the locale data
Document type specific information about internationalization and localization | ITS 1.0 (see sec. “Information about XML Localization (and Internationalization): ITS 1.0”) | Omitting ITS 1.0 data categories which are not useful on a schema level

In the following section, a general approach towards schema language localization will be introduced. Before that, some potential implementation mechanisms are discussed here.

DSRL [Document Schema Renaming Language] [ISO/IEC 19757-8] provides a means to rename names of elements, attributes, processing instructions etc. from a source into a target vocabulary. DSRL basically provides the functionality of Architectural Forms [ISO/IEC 10744]. An example of DSRL is given in fig. 6.

Figure 6
<dsrl:element-name-map
 target="po:purchaseOrder">poloc:Kaufbestellung</dsrl:element-name-map>
<dsrl:attribute-name-map 
 target="po:purchaseOrder[@orderDate]">Lieferdatum</dsrl:attribute-name-map>

Renaming of Elements and Attributes with DSRL

The fragment of a DSRL document shows the renaming of the <purchaseOrder> element to <Kaufbestellung>, and the orderDate attribute to Lieferdatum.

An approach which can be used for the implementation of data type definitions and data type modification is DTLL [Data Type Library Language] [ISO/IEC 19757-5], see fig. 7.

Figure 7
<datatype name="monthNamesGerman">
  <super type="month" />
  <parse name="month">
    <enumeration code="@name"
      values="document('months.xml')/months/month"/>
  </parse>
  <property name="Januar" select="$month/@january" />
  <property name="Februar" select="$month/@february" />
  <property name="März" select="$month/@march" />
</datatype>

The DTLL document specifies the relation of locale-specific month names like März to their general counterpart march. DTLL allows for specifying much more complex relations than simple value mapping and could also be used for implementing other data type modifications described above.

In summary, both DSRL and DTLL could be used to implement some requirements for schema localization, but they are not used in this paper. The reason is that the goal of this paper is to show one framework for specifying localization and internationalization information in a schema. The implementation described in sec. “Implementation” goes a direct way to XSLT, which seems to be simpler than taking a step through DSRL or DTLL.

From a different point of view, the approach of this paper may appear to risk getting in the way of the customization mechanisms already built into a schema language or related technologies. Both perspectives are understandable: the need to reduce the number of involved technologies as much as possible, and the need to have one framework for a specific purpose. This paper puts an emphasis on the latter perspective, taking into account the position of localization workers, who might benefit from a single framework for their needs.

Localization of Schema Languages: Realization

Outline

The outline of the approach is demonstrated in fig. 8.

Figure 8
<loc:localInfo locale="de_DE">
[...]
</loc:localInfo>
regular expression for locale value testing:
(
((([a-z]|[A-Z]){2,3})|(([a-z]|[A-Z]){5,8}))
(_(([a-z]|[A-Z]){4}))?
(_(([a-z]|[A-Z]){2}|\d{3}))?
(_(([a-z]|[A-Z]|\d){5,8})|(\d{1}([a-z]|[A-Z]|\d){3}))?
)
|
(
(([a-z]|[A-Z]){1,3})((_([a-z]|[A-Z]){2,8}){1,2})?
)

Container for locale related information and regular expression for locale identifier check

The container for locale related information is a <localInfo> element with a mandatory locale attribute. The value of that attribute specifies the target locale. The figure contains the regular expression which is used for testing the locale value. It is based on the ABNF in sec. 2.1 of RFC 4646. The difference is that the delimiter - between subtags is replaced with _, to follow the LDML convention mentioned in sec. “Locale Identification”.
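The check of a locale value can be sketched in Python as follows. The pattern is a simplified assumption of this sketch and not the expression shown in the figure: it covers only language, script, region and variant subtags, with _ as the delimiter.

```python
import re

# Simplified pattern (an assumption of this sketch, not the expression in fig. 8):
# language, optional script, optional region, any number of variants.
LOCALE_RE = re.compile(
    r"[A-Za-z]{2,3}"                          # language subtag
    r"(_[A-Za-z]{4})?"                        # script subtag
    r"(_([A-Za-z]{2}|\d{3}))?"                # region subtag
    r"(_([A-Za-z\d]{5,8}|\d[A-Za-z\d]{3}))*"  # variant subtags
)


def is_valid_locale(value):
    """Return True if the whole value matches the simplified pattern."""
    return LOCALE_RE.fullmatch(value) is not None


for candidate in ("de_DE", "ja_Latn", "de_DE_1996", "de-DE"):
    print(candidate, is_valid_locale(candidate))
```

The last candidate is rejected because it uses the RFC 4646 delimiter - instead of _.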

Locale information can be applied as schema annotation or in a separate document. The former is applicable to XML Schema or RELAX NG and will be exemplified in most of the following sections. The latter is used to apply such a description to XML DTDs6. It is exemplified in fig. 9.

Figure 9
<loc:localInformation xmlns:loc="http://example.com/schemalocalization"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 xsi:schemaLocation="http://example.com/schemalocalization localInfo.xsd">
 <loc:localInfo locale="de"
  targetDeclaration="xs:element[@name='purchaseOrder']"
  generalization="Kaufbestellung"
  xmlns:po="http://example.com/purchaseOrder">
  <loc:altIdent>Kaufbestellung</loc:altIdent>
 </loc:localInfo>
 <loc:localInfo locale="de"
  targetDeclaration="xs:element[@name='purchaseOrder']"
  generalization="Kaufbestellung">
  <loc:altDocumentation>Schema zu Kaufbestellungen</loc:altDocumentation>
 </loc:localInfo>
 <loc:localInfo locale="de" 
 targetDeclaration="xs:attribute[@name='orderDate']"
 generalization="Kaufbestellung/@Lieferdatum">
  <loc:dateInfo calendarType="Gregorian" dateFormatLengthType="long"/>
 </loc:localInfo>
 <loc:localInfo locale="de" 
 targetDeclaration="xs:element[@name='language']" generalization="Sprache">
  <loc:localeDisplayNames type="languages"/>
 </loc:localInfo>
 <loc:localInfo targetDeclaration="xs:element[@name='purchaseOrder']"
  translate="no" locale="de">
  <loc:locNote>Localization already available in German. Make sure that the
   existing localization can be reused as much as possible. </loc:locNote>
 </loc:localInfo>
</loc:localInformation>

"Standoff" usage of locale information

The document contains all locale information which will be discussed in the following sections. To be able to apply this information to global or local declarations in an XML Schema document, the XPath expressions in the targetDeclaration attribute can be exploited. An additional, optional generalization attribute provides information for the transformation to an instance of the general schema.
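The resolution of targetDeclaration expressions against a schema can be sketched in Python with the standard library. The function name resolve_targets and the shortened sample documents are hypothetical, and the sketch assumes that each expression is evaluated against all descendants of the schema root.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema",
      "loc": "http://example.com/schemalocalization"}

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="purchaseOrder"/>
 <xs:element name="language" type="xs:string"/>
</xs:schema>"""

STANDOFF = """<loc:localInformation xmlns:loc="http://example.com/schemalocalization">
 <loc:localInfo locale="de"
  targetDeclaration="xs:element[@name='purchaseOrder']">
  <loc:altIdent>Kaufbestellung</loc:altIdent>
 </loc:localInfo>
</loc:localInformation>"""


def resolve_targets(schema_root, standoff_root):
    """Pair each localInfo element with the declarations its XPath selects."""
    pairs = []
    for info in standoff_root.findall("loc:localInfo", NS):
        # Assumption: the expression is evaluated against all descendants.
        path = ".//" + info.get("targetDeclaration")
        for declaration in schema_root.findall(path, NS):
            pairs.append((declaration.get("name"), info.get("locale"),
                          info.findtext("loc:altIdent", namespaces=NS)))
    return pairs


print(resolve_targets(ET.fromstring(SCHEMA), ET.fromstring(STANDOFF)))
# [('purchaseOrder', 'de', 'Kaufbestellung')]
```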

During the development of this approach, XML Schema: Component Designators [SCD] were considered as an alternative to XPath for selecting declarations. This idea was dropped since XPath is applicable to various schema languages, whereas component designators are specific to XML Schema.

Adaptation of Names

An example of this functionality is given in fig. 10:

Figure 10
<xs:element name="purchaseOrder">
    <xs:annotation>
        <xs:appinfo>
            <loc:localInfo locale="de_DE">
                <loc:altIdent>Kaufbestellung</loc:altIdent>
            </loc:localInfo>
        </xs:appinfo>
    </xs:annotation> ... </xs:element>

Renaming of elements

The <localInfo> element contains an <altIdent> element which serves a similar function to the <altIdent> element in the TEI localization approach described in sec. “The TEI Approach towards Markup Language Localization”. The difference is that the <altIdent> element here can be applied to locally declared elements and global elements in an XML Schema.

The mapping to a locale unspecific representation (i.e. from <Kaufbestellung> to <purchaseOrder>) is realized by exploiting the generalization attributes in fig. 9. The XPath expressions can be used to transform an instance of the localized schema to an instance of the general schema.
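A minimal sketch of such a transformation is given below, in Python for illustration only. The name table GENERAL_NAMES is a hypothetical stand-in for the information that would be derived from the generalization attributes; the two namespace URIs are those of fig. 1 and fig. 15.

```python
import xml.etree.ElementTree as ET

LOCALIZED_NS = "http://example.com/purchaseOrderLocalized"
GENERAL_NS = "http://example.com/purchaseOrder"

# Hypothetical table: localized element name -> general element name.
GENERAL_NAMES = {"Kaufbestellung": "purchaseOrder", "Sprache": "language"}


def generalize(root):
    """Rename localized elements in place; unknown names are kept."""
    prefix = "{" + LOCALIZED_NS + "}"
    for node in root.iter():
        if node.tag.startswith(prefix):
            local = node.tag[len(prefix):]
            node.tag = "{%s}%s" % (GENERAL_NS, GENERAL_NAMES.get(local, local))
    return root


doc = ET.fromstring(
    '<Kaufbestellung xmlns="http://example.com/purchaseOrderLocalized">'
    "<Sprache>Deutsch</Sprache></Kaufbestellung>"
)
generalize(doc)
print(doc.tag, [child.tag for child in doc])
```

The sketch only renames elements. Localized values such as Deutsch are left untouched; their conversion belongs to the adaptation of data types.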

Translation of Documentation

This functionality is demonstrated in fig. 11.

Figure 11
<xs:element name="purchaseOrder">
    <xs:annotation>
        <xs:appinfo>
            <loc:localInfo locale="de_DE">
                <loc:altDocumentation>Schema zu Kaufbestellungen</loc:altDocumentation>
            </loc:localInfo>
        </xs:appinfo> ... </xs:annotation> ... </xs:element>

Translation of documentation

This approach is again very similar to the TEI localization approach described in sec. “The TEI Approach towards Markup Language Localization”. Again, the difference is that it can be applied to both global and local markup declarations.

Adaptation of Simple Data Types

Adaptation of the date Data Type

In this section the adaptation of the date data type is exemplified. Only a subset of the fields described in LDML is used:

  • y ("year"). y can appear one or more times. The number of y letters defines the total number of digits (possibly including leading zeros).
  • M ("month"). One M means that at least one digit is used; two M mean that two digits are always used (possibly including a leading zero). Three or four M mean that lexical items are used to represent the month names, taken from the <months> element in fig. 4.
  • d ("day of a month"). One d means that at least one digit is used; two d mean that two digits are always used (possibly including a leading zero).
  • G ("era"), like Anno Domini.
  • E ("week day"). One to three letters stand for the short day name, four for the full name and five for the narrow name. Again the lexical items are taken from CLDR data.

With these fields, dates in the Gregorian calendar can be represented. CLDR provides many other calendars as well. Nevertheless, for many locales there is at least a Gregorian calendar, which allows these fields to be used and eases the use of CLDR information in XML Schema. Fig. 12 demonstrates how the localization of the orderDate attribute is achieved.

Figure 12
<xs:attribute name="orderDate" type="xs:date">
    <xs:annotation>
        <xs:appinfo>
            <loc:localInfo locale="de">
                <loc:dateInfo calendarType="Gregorian" 
 dateFormatLengthType="long"/>
            </loc:localInfo>
        </xs:appinfo>
    </xs:annotation>
</xs:attribute>

Adaptation of the date data type

The <dateInfo> element has two attributes. calendarType describes the calendar to be used (currently only Gregorian). dateFormatLengthType selects the date format (see the <dateFormatLength> element in fig. 4).

A date format pattern like <pattern>EEEE, MMMM d, yyyy</pattern> can be used for two purposes. First, the type of the non-localized declaration can be changed to a localized version. E.g., instead of 2007-03-16, one would have Friday, March 16, 2007. In that case, a regular expression is created which covers the constraints imposed by CLDR. The second usage is the creation of a canonical date representation (i.e. an instance of the data type xs:date) from a localized value, e.g. to create 2007-03-16 from Friday, March 16, 2007. For this purpose some fields in the localized value, like the weekday, have to be omitted. The canonical representation makes it possible to compare values from different locales and to apply date related functions, e.g. in XPath 2.0.
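Both usages can be sketched in a few lines of Python. The sketch is illustrative only: the month and weekday lists are a hypothetical excerpt of German CLDR data, only the fields y, M, d and E are handled (names only in their four-letter form), and quoted literals in patterns are not supported.

```python
import re
from datetime import date

# Hypothetical excerpt of German CLDR data (full names only).
MONTHS = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
          "August", "September", "Oktober", "November", "Dezember"]
DAYS = ["Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag",
        "Samstag", "Sonntag"]


def field_regex(letter, count):
    """Regular expression fragment for one pattern field."""
    if letter == "y":
        return r"(?P<y>\d{%d})" % count
    if letter == "M" and count == 4:
        return "(?P<M>%s)" % "|".join(MONTHS)
    if letter == "E" and count == 4:
        # The weekday is accepted but not captured: it is omitted later.
        return "(?:%s)" % "|".join(DAYS)
    if letter in ("M", "d") and count <= 2:
        digits = "1,2" if count == 1 else "2"
        return r"(?P<%s>\d{%s})" % (letter, digits)
    raise ValueError("unsupported field: " + letter * count)


def pattern_to_regex(pattern):
    """First usage: derive a regular expression from a date pattern."""
    parts = []
    for token in re.finditer(r"([A-Za-z])\1*|[^A-Za-z]+", pattern):
        text = token.group(0)
        if token.group(1):  # a run of identical pattern letters
            parts.append(field_regex(text[0], len(text)))
        else:               # literal text such as ". " or ", "
            parts.append(re.escape(text))
    return re.compile("".join(parts))


def to_canonical(pattern, value):
    """Second usage: create the xs:date lexical form of a localized value."""
    match = pattern_to_regex(pattern).fullmatch(value)
    if match is None:
        raise ValueError("value does not match the pattern")
    month = match.group("M")
    month = MONTHS.index(month) + 1 if month in MONTHS else int(month)
    return date(int(match.group("y")), month, int(match.group("d"))).isoformat()


print(to_canonical("d. MMMM yyyy", "16. März 2007"))
print(to_canonical("EEEE, d. MMMM yyyy", "Freitag, 16. März 2007"))
```

The weekday is matched but plays no role in the canonical value. A stricter processor could additionally check that it agrees with the date.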

Adaptation of Locale Display Names

The adaptation of locale display names essentially amounts to the creation of enumeration lists. The necessary information is exemplified for display names of languages in fig. 13.

Figure 13
<xs:element name="language">
    <xs:annotation>
        <xs:appinfo>
            <loc:localInfo locale="de">
                <loc:localeDisplayNames type="languages"/>
            </loc:localInfo>
        </xs:appinfo>
    </xs:annotation> ... </xs:element>

Adaptation of locale display names

The <localeDisplayNames> element contains a type attribute. It defines what kind of data from CLDR has to be used. In addition to language display names, display names are available for currencies, territories and variants.

Using this information, enumeration lists containing the localized display names can be generated. The generation of locale unspecific display names (e.g. from US Dollar to USD) uses the same information from the <localeDisplayNames> element and applies it in the other direction.
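Both directions can be illustrated with a short Python sketch. The DISPLAY_NAMES table is a hypothetical excerpt of German CLDR data, and the key "currencies" is an assumed value for the type attribute; only "languages" appears in fig. 13.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical excerpt of German CLDR display names, keyed by type.
DISPLAY_NAMES = {
    "languages": {"en": "Englisch", "de": "Deutsch", "ja": "Japanisch"},
    "currencies": {"USD": "US Dollar", "JPY": "Yen"},
}


def enumeration_type(kind):
    """Build an xs:simpleType that enumerates the localized display names."""
    simple_type = ET.Element("{%s}simpleType" % XS)
    restriction = ET.SubElement(
        simple_type, "{%s}restriction" % XS, base="xs:string")
    for name in DISPLAY_NAMES[kind].values():
        ET.SubElement(restriction, "{%s}enumeration" % XS, value=name)
    return simple_type


def to_code(kind, display_name):
    """Other direction: map a localized display name back to its code."""
    reverse = {name: code for code, name in DISPLAY_NAMES[kind].items()}
    return reverse[display_name]


print(ET.tostring(enumeration_type("languages"), encoding="unicode"))
print(to_code("currencies", "US Dollar"))
```

The reverse lookup assumes that display names are unique within one type and locale.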

Document Type specific Information about Internationalization and Localization

Fig. 14 demonstrates how "Translate" and "Localization Note" related information can be provided for the <purchaseOrder> element.

Figure 14
<xs:element name="purchaseOrder">
    <xs:annotation>
        <xs:appinfo>
            <loc:localInfo targetDeclaration="xs:element[@name='purchaseOrder']" translate="no"
             locale="de">
                <loc:locNote>Localization already available in German.
Make sure that the existing localization can be reused as much as possible.</loc:locNote>
            </loc:localInfo>
        </xs:appinfo>
    </xs:annotation> [...] </xs:element>

Information about internationalization and localization

The <localInfo> element contains the translate attribute and the <locNote> element. Their function is identical to the markup described in ITS 1.0. The difference is that in the case described in fig. 14 they apply to all instances of the markup which they are attached to. In the case of the "standoff" usage described in fig. 9, they apply to all element or attribute declarations selected by the targetDeclaration attribute.

How this information is processed depends on the application (e.g. a translation or localization tool). Example applications are given in sec. 2 of ITS 1.0.

Example of a Localized Schema

The example schema in fig. 15 implements the requirements formulated in sec. “Introduction”, using the existing approaches described in sec. “Background”. It can be generated relying on the markup described in sec. “Localization of Schema Languages: Realization”.

Figure 15
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://example.com/purchaseOrderLocalized"
 xmlns:loc="http://example.com/schemalocalization"
 xmlns:poloc="http://example.com/purchaseOrderLocalized">
 <xs:annotation>
  <xs:appinfo>
   <loc:baseSchema uri="purchaseOrderExample.xsd"/>
  </xs:appinfo>
 </xs:annotation>
 <xs:element name="Name" type="xs:string"/>
 <xs:element name="Straße" type="xs:string"/>
 <xs:element name="Stadt" type="xs:string"/>
 <xs:element name="Bundesland" type="xs:string"/>
 <xs:element name="PLZ" type="xs:string"/>
 <xs:element name="Land">
  <xs:simpleType>
   <xs:restriction base="xs:string">
    <xs:enumeration value="Vereinigte Staaten"/>
    <xs:enumeration value="Deutschland"/>
    <xs:enumeration value="Japan"/>
    <!-- ... -->
   </xs:restriction>
  </xs:simpleType>
 </xs:element>
 <xs:element name="Sprache">
  <xs:simpleType>
   <xs:restriction base="xs:string">
    <xs:enumeration value="Englisch"/>
    <xs:enumeration value="Deutsch"/>
    <xs:enumeration value="Japanisch"/>
    <!-- ... -->
   </xs:restriction>
  </xs:simpleType>
 </xs:element>
 <xs:element name="Preis">
  <xs:complexType>
   <xs:simpleContent>
    <xs:extension base="xs:decimal">
     <xs:attribute name="Waehrung">
      <xs:simpleType>
       <xs:restriction base="xs:string">
        <xs:enumeration value="US Dollar"/>
        <xs:enumeration value="Yen"/>
        <xs:enumeration value="Europäische Währungseinheit (XBB)"/>
        <!-- ... -->
       </xs:restriction>
      </xs:simpleType>
     </xs:attribute>
    </xs:extension>
   </xs:simpleContent>
  </xs:complexType>
 </xs:element>
 <xs:element name="Kommentar">
  <xs:complexType mixed="true">
   <xs:sequence>
    <xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
 <xs:element name="Lieferaddresse">
  <xs:complexType mixed="false">
   <xs:sequence>
    <xs:element ref="poloc:Name"/>
    <xs:element ref="poloc:Straße"/>
    <xs:element ref="poloc:Stadt"/>
    <xs:element ref="poloc:Bundesland" minOccurs="0"/>
    <xs:element ref="poloc:PLZ"/>
    <xs:element ref="poloc:Land"/>
    <xs:element ref="poloc:Sprache"/>
    <xs:element ref="poloc:Preis"/>
    <xs:element ref="poloc:Kommentar" minOccurs="0"/>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
 <xs:element name="Kaufbestellung">
  <xs:complexType mixed="false">
   <xs:sequence>
    <xs:element ref="poloc:Lieferaddresse"/>
   </xs:sequence>
   <xs:attribute name="Lieferdatum" use="required">
    <xs:simpleType>
     <xs:restriction base="xs:string">
      <xs:pattern
       value="\d{2}\.\s+(Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s+\d{4}"
      />
     </xs:restriction>
    </xs:simpleType>
   </xs:attribute>
  </xs:complexType>
 </xs:element>
</xs:schema>

Localized Schema 1

The target locale assumed in this example is that of a German user. For such a user, all element names are translated into German, and the types of elements are modified (see for example the Land element). The elements contain no information about the localization of document instances, e.g. the translate attribute. This information is not used for the localized schema, but for instance documents of the locale unspecific schema.

The pattern of the data type for the Lieferdatum attribute is based on the pattern d. MMMM yyyy. However, this pattern is relaxed regarding whitespace, to make the input of data more convenient for the user.
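The effect of the relaxed pattern can be checked with a short Python sketch. It assumes that the regular expression of fig. 15 is meant as one unbroken expression; Python's re module is used here as an approximation of XML Schema regular expressions, which always have to match the complete value.

```python
import re

# Regular expression of the Lieferdatum attribute in fig. 15.
LIEFERDATUM = re.compile(
    r"\d{2}\.\s+(Januar|Februar|März|April|Mai|Juni|Juli|August"
    r"|September|Oktober|November|Dezember)\s+\d{4}"
)


def is_valid(value):
    # XML Schema patterns match the whole value, hence fullmatch.
    return LIEFERDATUM.fullmatch(value) is not None


for candidate in ["16. März 2007", "16.   März  2007",
                  "16.März 2007", "6. März 2007"]:
    print(repr(candidate), is_valid(candidate))
```

Additional whitespace between the fields is accepted, whereas missing whitespace is not. As written, the expression also requires a two-digit day.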

Implementation

The ongoing implementation is realized with XSLT 2.0 [XSLT20]. One stylesheet is used for the creation of localized schemas. It takes as an input:

  • a schema without localization, with optional annotations of markup declarations with regard to renaming, data type modification or further localization / internationalization related markup.
  • optionally, a separate document like the one in fig. 9 with further localization related information. In case of contradictory information (e.g. multiple translations for the same element in schema annotations and standoff), the schema annotations have higher priority.
  • a parameter for the target locale.
  • a parameter for the schema language: DTD, XML Schema or RELAX NG.
  • data from CLDR.
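The priority rule for contradictory information can be sketched as follows, in Python for illustration. The dictionary representation of the locale information and the competing translation Bestellung are hypothetical and serve only to show the rule.

```python
def merge_locale_info(annotations, standoff):
    """Merge locale information per declaration.

    Both arguments map a declaration name to a dictionary of properties.
    Where both sources define the same property, the schema annotation wins.
    """
    merged = {}
    for declaration in sorted(set(annotations) | set(standoff)):
        merged[declaration] = {
            **standoff.get(declaration, {}),
            **annotations.get(declaration, {}),  # applied last: higher priority
        }
    return merged


# Hypothetical input: two contradictory translations of the same element.
standoff = {"purchaseOrder": {"altIdent": "Bestellung", "translate": "no"}}
annotations = {"purchaseOrder": {"altIdent": "Kaufbestellung"}}
result = merge_locale_info(annotations, standoff)
print(result)
```

Properties given in only one of the two sources, like translate here, are kept unchanged.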

If the schema language is DTD, the stylesheet generates a Schematron document which can be used for validation with modified data types. For element and attribute renaming in XML DTDs, a separate script will be used. If the schema language is RELAX NG or XML Schema, the data type modifications and other changes will be made in the schema itself.

For the transformation of instances of a localized schema to instances of the general schema, another stylesheet is used. It takes the same input information as the stylesheet described above, plus the document instance. A third stylesheet is used to create "native" global ITS 1.0 markup from information like that in fig. 14. The output can be processed by ITS 1.0 processors which are independent of the approach described in this paper.
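What such global ITS 1.0 markup could look like is sketched below, in Python instead of XSLT and for illustration only. The selector //po:purchaseOrder and the choice of locNoteType="description" are assumptions; the po prefix is bound to the target namespace of fig. 1.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"
PO = "http://example.com/purchaseOrder"
ET.register_namespace("its", ITS)


def its_rules(selector, translate=None, loc_note=None):
    """Create an its:rules element from "Translate" and "Localization Note" data."""
    rules = ET.Element("{%s}rules" % ITS, version="1.0")
    # Namespace binding for the prefix used in the selector. ElementTree
    # writes this attribute verbatim, so it acts as a declaration.
    rules.set("xmlns:po", PO)
    if translate is not None:
        ET.SubElement(rules, "{%s}translateRule" % ITS,
                      selector=selector, translate=translate)
    if loc_note is not None:
        rule = ET.SubElement(rules, "{%s}locNoteRule" % ITS,
                             selector=selector, locNoteType="description")
        ET.SubElement(rule, "{%s}locNote" % ITS).text = loc_note
    return rules


rules = its_rules(
    "//po:purchaseOrder",
    translate="no",
    loc_note="Localization already available in German.",
)
print(ET.tostring(rules, encoding="unicode"))
```

The resulting rules refer to instances of the general schema, so they can be handed to an ITS 1.0 processor without any knowledge of the localization markup.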

Summary and Outlook

This paper introduced a framework for the localization of schema languages (XML Schema, RELAX NG and XML DTD) which can be applied to modify markup names and data types, and add information about localization and internationalization to a schema. It made use of existing approaches and data and integrated them in a general manner.

Future work on this approach mainly concerns the modification of data types. More data types, like dateTime and time, need to be processed. More information, such as calendars other than the Gregorian calendar, needs to be taken into account. It should also be investigated whether it is possible to modify only the lexical values of data types while keeping the non-localized value internally in a schema processor. This requires a finite set of symbols in the lexical space of localized data types. The data provided by CLDR cannot be applied as is for this purpose, but the approach seems promising.

Notes

1.

An example is access to localized dates such as 16. März 2007 (for a German user) or 2007年3月16日 (for a Japanese user). To make them comparable, both need to be converted to the locale-independent lexical representation 2007-03-16.

2.

An example: the locale display name for the region DE in English (that is, target locale locale_id "en") would be Germany. In Japanese (that is, target locale locale_id "ja"), it would be ドイツ.

3.

To be more precise: the date data type does not represent this information directly in its lexical space. Nevertheless, it is possible to derive weekday information from a date value.

4.

The ITS 1.0 specification is written in the ODD format. However, the ODD source document at http://www.w3.org/TR/2007/REC-its-20070403/itstagset.xml does not make use of the ODD localization facilities mentioned in sec. “The TEI Approach towards Markup Language Localization”.

5.

For example, it is unlikely that text Directionality is identical for all instances of an element, or that all textual content needs the same Ruby annotation.

6.

DTDs offer means to describe locale information within a DTD itself, e.g. via fixed values. That is, standoff information is not the only way to express such information. Nevertheless, this paper proposes such standoff annotation for DTDs, to ease the task of adding information more complex than attribute values, e.g. for localization notes.


Bibliography

[BCP 47] A. Phillips, M. Davis, eds. Tags for Identifying Languages. IETF, September 2006. Available at http://www.rfc-editor.org/rfc/bcp/bcp47.txt.

[CLDR] Common Locale Data Repository. Available at http://unicode.org/cldr/.

[i18n l10n] R. Ishida, S. Miller. Localization vs. Internationalization. Article of the W3C Internationalization Activity, January 2006. Available at http://www.w3.org/International/questions/qa-i18n.

[ISO/IEC 10744] Information Technology - Hypermedia/Time-based Structuring Language (HyTime). International Organization for Standardization, 1997.

[ISO/IEC 19757-5] Information Technology - Document Schema Definition Languages (DSDL) - Part 5: Data Type Library Language - DTLL, ISO/IEC 19757-5. International Organization for Standardization, 2006 (under development).

[ISO/IEC 19757-8] Information Technology - Document Schema Definition Languages (DSDL) - Part 8: Document Schema Renaming Language - DSRL, ISO/IEC 19757-8. International Organization for Standardization, 2006 (under development).

[ITS 10] C. Lieske, F. Sasaki, eds. Internationalization Tag Set (ITS) 1.0. W3C Recommendation April 2007. Available at http://www.w3.org/TR/2007/REC-its-20070403/.

[LDML] Locale Data Markup Language. Unicode Technical Standard #35, November 2006. Available at http://unicode.org/reports/tr35/tr35-7.html.

[RFC 3066] H. Alvestrand, ed. Tags for the Identification of Languages. IETF, January 2001. Available at http://www.rfc-editor.org/rfc/rfc3066.txt.

[RFC 4646] A. Phillips, M. Davis, eds. Tags for the Identification of Languages. IETF, September 2006. Available at http://www.rfc-editor.org/rfc/rfc4646.txt.

[RFC 4647] A. Phillips, M. Davis, eds. Matching of Language Tags. IETF, September 2006. Available at http://www.rfc-editor.org/rfc/rfc4647.txt.

[SCD] M. Holstege, A. S. Vedamuthu, eds. Schema Component Designators. W3C Working Draft 29 March 2005. Available at http://www.w3.org/TR/2005/WD-xmlschema-ref-20050329/.

[TEI LOC] S. Rahtz. Towards an internationalized and localized TEI. Presentation, Kyoto, May 2006. Available at http://tei.oucs.ox.ac.uk/Oxford/2006-05-17-kyoto/i18n.xml.

[TEI ODD] C.M. Sperberg-McQueen, L. Burnard, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium, March 2007 (release 06). Chapter 23 "Documentation Elements". The latest version of TEI P5 is available at http://www.tei-c.org/release/doc/tei-p5-doc/html/.

[XML Schema 0] D. C. Fallside, P. Walmsley, eds. XML Schema Part 0: Primer Second Edition. W3C Recommendation, October 2004. Available at http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/.

[XSLT20] M. Kay, ed. XSL Transformations (XSLT) Version 2.0. W3C Recommendation, January 2007. Available at http://www.w3.org/TR/2007/REC-xslt20-20070123/.


