Datatypes for XML: the Datatyping Library Language (DTLL)

Jeni Tennison
jeni@jenitennison.com

Abstract

This paper looks at why we need to provide datatypes for data in XML documents and describes the Datatype Library Language (DTLL), which is part of DSDL. We'll look at the features of XML data that mean datatypes from programming languages and databases are inappropriate. We'll see that XML Schema falls short of addressing the datatyping requirements of real-life markup languages and how DTLL meets those needs. DTLL has a simple syntax that borrows heavily from RELAX NG, and is fairly straightforward to implement, though the paper discusses a couple of sticky areas. Finally, DTLL can be extended to support validation of more complex datatypes, and to support use in query languages in the future.

Keywords: Datatyping; XSD/W3C Schema; RelaxNG

Jeni Tennison

Jeni Tennison is an independent consultant specialising in XSLT and XML schema development. She trained as a knowledge engineer, gaining a PhD in collaborative ontology development, and since becoming a consultant has worked in a wide variety of areas, including publishing, water monitoring and financial services. She is author of several books including "Beginning XSLT 2.0" (Apress, 2005) and was one of the founders of the EXSLT initiative to standardise extensions to XSLT and XPath. She is an invited expert on the W3C's XSL and XML Processing Working Groups.

Jeni Tennison [Jeni Tennison Consulting Ltd]

Extreme Markup Languages 2006® (Montréal, Québec)

Copyright © 2006 Jeni Tennison. Reproduced with permission.

Introduction

This paper discusses the DTLL [Datatype Library Language], which is undergoing standardisation as Part 5 of DSDL [Document Schema Definition Languages]. DTLL provides a mechanism for defining a library of datatypes for an XML vocabulary, using XML. With the right DTLL implementation, these datatypes can then be used within [RELAX NG] or in other situations where datatypes are useful.

Many of you will be wondering whether we need to bother finding a new way of defining datatypes for XML. Why not just use [W3C XML Schema], or, if that's not good enough, another datatype definition language such as [EXPRESS], or even lexical types as defined in Annex A (SGML Extended Facilities) of [HyTime]? In this paper, I'll explain why XML needs a different kind of datatyping solution and present DTLL as that alternative.

The concept of data typing will be familiar to anyone who's worked with computers. Initially, databases and programming languages needed data types so that they could set aside the right number of bits to store a given value. The size of numbers and strings is important in these circumstances because it determines how many bits are required, hence the emphasis on numerical types such as byte (8 bits), short (16 bits), int (32 bits) and so on.

Of course, as programming languages and databases have become more sophisticated, the anticipated size of data has become less important, and other reasons for data typing have acquired prominence. In particular, declaring the data types of variables and arguments in programming languages allows them to be type checked, arguably preventing bugs in the code, and aids optimisation.

But how do datatypes apply to XML? To some extent, it depends on how you use XML. To some people, XML is a transport mechanism between applications: a way of exposing data in a database, or passing information between two computer programs. In these data-oriented applications, the relevant data types are those we find in databases and programming languages, and what's important is how we move from in-memory representations of that data to a serialised form and back again. When XML is written by machine, there's also more scope for using elements and attributes to expose the structure within values, such as using separate elements for each item in a list, so the only datatypes that are really needed are pretty simple.

For another group of XML users, the XML documents themselves are primary. Where XML documents are created by people rather than machines, we tend to find data types that make XML easier to read and write, such as abbreviations or code lists and values that have an implicit internal structure, such as comma-separated lists or CSS declarations. In these applications, the data types we need are more complex, and rarely those that are needed within a computer program.

XML Schema attempted to satisfy both sets of users in its treatment of datatypes. Thus, the XML Schema datatype hierarchy contains datatypes familiar from programming languages, such as xs:byte and xs:double, as well as those from XML DTDs, such as xs:NMTOKEN.

It will come as no surprise that DTLL, as part of DSDL, is most interested in satisfying the requirements of the latter group of users. DSDL, particularly [RELAX NG] and [Schematron], also has a history of aiming to support what XML users actually need to do, while XML Schema tries to encourage best practice by only supporting certain kinds of constraints. For example, XML Schema encourages the use of standard formats for date/times: the only date/time format it supports is the ISO 8601 format YYYY-MM-DDThh:mm:ss. But in practice, many markup languages use other formats for date/times, such as seconds-after-midnight-on-1st-Jan-1970, the HTTP date/time format FFF, DD mmm YYYY hh:mm:ss ZZZ, or locale-specific date/times such as DD/MM/YY, to make the XML easier for human authors to write and read.

On that basis, before we go any further, let's look at how data types are used in an XML context and what kinds of datatypes are actually used in existing markup languages.

How are Datatypes Used in XML Applications?

Just as we shouldn't define schemas just for the sake of it, we shouldn't define data types just because we can. Schemas of all kinds are around for a purpose, often more than one, and the ways in which we want to use a schema should inform the design of any schema language.

In this section, we'll look at the ways in which schemas, and most particularly datatype definitions, can be used within an XML application. This should help us identify what datatype libraries and individual datatype definitions need to do.

Validation

Validation is probably the first thing you think of when you hear the word "schema". Checking the validity of a document against a schema enables applications to make assumptions about the structure of the document that they couldn't otherwise make, leading to less error-checking in code.

There are two aspects to validity when it comes to checking the type of a piece of data: "is the value allowed?" and "does the supplied value equal another particular value?". For example, if an attribute has a fixed value of 1.0 and is typed as requiring a decimal number then v1.0 isn't permitted because it is not a decimal number, and 2.0 isn't permitted because it's not equal to 1.0, but 1 is allowed because it represents the same decimal number as the fixed value 1.0.

Equality testing provides one of the biggest challenges for datatype definition languages. When testing equality, whitespace can be preserved or normalised away; case can be significant or insignificant; comparisons can be alphabetic, numeric or require more sophisticated knowledge of the fundamental notions behind the data type. (Is the colour black equal to #000000? Is the duration PT24H equal to P1D?)

Equality testing is also particularly important when a datatype library is used with [RELAX NG]. Although RELAX NG doesn't include support for fixed values within its core, it does allow users to enumerate possible values for attributes or elements. For example

<attribute name="color">
  <!-- Be patriotic! -->
  <choice>
    <value type="color">red</value>
    <value type="color">white</value>
    <value type="color">blue</value>
  </choice>
</attribute>

To support this, a datatype library used by RELAX NG must be able to test whether the value supplied for a particular attribute or element equals one of those listed in the RELAX NG schema. In this case, we would want color="#FFFFFF" to be permitted, because the supplied value #FFFFFF is equal to the enumerated value white.
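The comparison a datatype library has to make here can be sketched in Python. This is only an illustration, not part of RELAX NG or DTLL: the `COLOR_KEYWORDS` table and the `color_value` and `allowed` functions are hypothetical names, and the table holds just the three keywords from the schema fragment.

```python
import re

# Hypothetical keyword table; a real colour type would list many more names.
COLOR_KEYWORDS = {
    "red": (255, 0, 0),
    "white": (255, 255, 255),
    "blue": (0, 0, 255),
}

def color_value(lexical):
    """Map a lexical colour onto an (r, g, b) tuple, or None if it is not a colour."""
    text = lexical.strip().lower()
    if text in COLOR_KEYWORDS:
        return COLOR_KEYWORDS[text]
    match = re.fullmatch(r"#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})", text)
    if match:
        return tuple(int(part, 16) for part in match.groups())
    return None

def allowed(supplied, enumerated):
    """True if the supplied value equals one of the enumerated values."""
    value = color_value(supplied)
    return value is not None and any(value == color_value(e) for e in enumerated)
```

With this sketch, `allowed("#FFFFFF", ["red", "white", "blue"])` holds because both #FFFFFF and white map onto the same tuple, even though the strings differ.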

Documentation

One of the most useful aspects of datatypes, and schemas in general, is that they provide documentation of an XML vocabulary for the authors and consumers of documents. It's therefore important that a datatype definition language be human-readable, so that users can understand the permissible values for particular elements or attributes and, if they have a value in mind, how to format that value so that it is acceptable.

Application Support

Annotating the data within an XML document with datatype information can aid applications that process that document. Applications can also examine schemas in order to provide appropriate optimised access to XML documents that are valid against that schema.

Datatype definitions can be used by a data binding framework such as [JAXB] to translate XML data into a corresponding data type within a programming language or database. For example, a data-binding application will know that the value of an attribute labelled as an xs:byte can be mapped to a Java byte. However, these mappings can be carried out by a data-binding technology purely based on the name of the type, so data binding does not require any particular support from a datatype definition language beyond giving names to datatypes.

[XPath 2.0], and the specifications that are based on it, utilise data types in two ways. Firstly, XPath variables are typed, and XPath expressions can be analyzed to check their type safety and for optimisation, arguably making XPath easier to debug and faster to run. Secondly, documents that are queried with XPath have typed values, which means that XPath expressions can often be simpler and more robust than they would otherwise be.

Supporting datatypes that could be used in XPath¹ and similar languages requires more from a datatype definition language than is needed when all we care about is validation. Implementations must be told how to convert from values of one type to values of another, and how to do magnitude comparisons between values (to support less-than and greater-than operators, and sorting).

A final kind of application that may gain benefit from a datatype definition language is the XML editor. Knowing the datatype of an attribute or element could enable XML editors to prompt users for acceptable values. For example, users might be presented with listboxes for enumerated values, calendars to enter dates, sliders for numbers and so on. Here, what's needed is a method of associating a GUI control with the creation of a value, and the serialization of that value into the XML document.

What Does XML Data Look Like?

Let's now consider the kinds of datatypes that we encounter in common XML vocabularies.

XML data appears in attribute values, the content of text-only elements, and, more rarely, in mixed-content elements. In XML terms, data is a sequence of Unicode characters (a string), which can include control characters in XML 1.1. XML has some default whitespace processing: line endings are normalised to #xA and whitespace characters in attribute values are replaced with spaces. However, whitespace characters can always be escaped using character references, so even those that would otherwise be normalised away can appear in data values.

Here, we'll look at well-known markup languages. These are languages that are used extensively, and are hard to change: it's therefore useful for schema languages to support what they do. While they're not very representative of the kinds of markup languages that users generally write, if a language can support the datatypes that these languages contain, it will likely be able to support anything. The XML vocabularies we'll look at are:

  • XML attributes (xml:lang, xml:space etc.)
  • DocBook
  • XHTML
  • SVG
  • MathML
  • Dublin Core
  • XInclude
  • XSLT
  • XSL-FO
  • XML Schema
  • RELAX NG
  • XForms

The next subsections talk about the different kinds of values that are found in these languages and give examples of what they look like.

Standard Atomic Datatypes

These fall into three main categories: strings, numbers and booleans.

Unsurprisingly, strings or textual data appear in most of the markup languages we're looking at. Sometimes whitespace is significant, sometimes it isn't; sometimes case is significant, sometimes it isn't. The actual characters allowed within a string may be constrained (for example, to ASCII characters only to facilitate mapping onto an HTTP header). In some languages, the length of the strings allowed in particular attributes or elements is constrained, but mostly when there is a constraint on length, the string must be a single character. For example, in XHTML the char attribute that specifies the character on which a table column is aligned accepts a single character.

Numbers come in three forms: integers (7), decimals (7.5), and scientific format (7.5E3). It's worth noting that these different formats are distinct from the computer science partition of numbers as integers, decimals and floating point numbers, which is all to do with how the numbers are represented as bits. In XML vocabularies, what matters is the lexical representation of the number: how it is represented in characters.

Boolean values in these markup languages are variously written as true/false, 1/0, or as yes/no.

Enumerated Values

Enumerations of legal values are fairly common in these markup languages. For example, xml:space allows preserve or default, and XInclude's parse attribute allows either xml or text.

Enumerations can be case-insensitive (as is the case with XHTML LinkTypes, for example). With large sets of enumerated values, the list of possibilities is often held remotely. For example, many markup languages refer to IANA registered media types, which are listed on the IANA site. The [genericode] project seeks to provide a standard way of storing code lists and their related information.

Structured values, which we'll look at later, may have parts whose legal values are enumerated. For example, xml:lang can contain language and country codes that are each enumerated separately. On the flip side, sometimes enumerated values are a special subset of allowed values. For example, SVG color keywords are a subset of the colour format that SVG allows.

Enumerated values are often a key into a larger set of information about the particular value. For example, the language value en is used to represent the language English, and the colour value black is equivalent to the RGB notation #000000. This extra information is useful for documentation purposes, as it helps those authoring and using the XML data to understand what the value actually means.

Lists

Lists are common in these markup languages. The most common kind are whitespace-delimited lists, especially as these correspond well to the SGML datatypes NMTOKENS or IDREFS. However, comma-separated lists also exist (for example, XHTML's URI lists are comma-separated, as are media-type lists (by inheritance from CSS2)). SVG list values are usually separated by either whitespace or a comma, and [Dublin Core Separated Values] are semi-colon-separated.

Simple Structured Values

Datatypes start becoming more interesting when we look at structured values. Simple structured values conform to a regular grammar, describable using a regular expression, though perhaps not easily.

One of the most common kinds of structured values are numbers combined with units, such as lengths (36pt, 3px), frequencies (5Hz, 16kHz), angles (90deg), durations (3s, 150ms), proportions (5*) and percentages (25%). Some units are absolute, others relative based on application-specific information.

Other examples of structured values are dates and times, URI references, colours in RGB notation, SVG path data and transformations, MathML group alignment, XInclude's accept and accept-language attributes, XPath 2.0 sequence types, XPath subsets (as in W3C XML Schema) and P3P type names (as in XForms).

It's worth noting that regular expressions alone are sometimes insufficient when validating a structured value. Although it's possible to create a regular expression to validate a date/time, if you try to incorporate leap year checks the regular expression becomes so complex that it's almost impossible to understand.

Complex Structured Values

Markup languages also sometimes use values that don't conform to a regular grammar, and therefore can't be validated using regular expressions, but are typically described using EBNF notation. Examples are XPointers, XPaths, XSLT patterns, XSL-FO expressions, and regular expressions themselves.

Use of Context Information

Whatever the structure of the value, the validity and semantics of a value can depend on where it appears in an XML document. This happens at two levels.

First, the XML Infoset level provides context information that allows interpretation of qualified names and namespace prefixes (through the in-scope namespaces), relative URIs (through the base URI), declared unparsed entities and notations, IDs and IDREFs. For example, to tell whether my:qualified-name is a legal qualified name, I have to know whether the prefix my is declared on the element on which it appears.

Second, the application that processes the markup language may have its own rules about how to interpret particular values. For example, relative lengths are generally interpreted based on the width or height of an enclosing element, or the current font size in place.

Datatype Library Language (DTLL)

So how do we support all the different kinds of data, and all the uses of datatypes that we've looked at above in a generic language? We can't, and we shouldn't try at this stage. Instead, DTLL attempts to tackle the core requirements for validation of XML data and provide a framework that is inherently extensible so that other features can be added later. The core of DTLL supports regular grammars, but not all context-free languages; you can't use core DTLL to validate XPaths, for example, but there is built-in extensibility for implementation-specific support, or support in future versions.

DTLL Fundamentals

DTLL adopts one of the fundamental notions of XML Schema in that it separates the lexical space from the value space. The lexical space defines which strings are acceptable values for attributes and elements of the particular type. For example, a datatype for decimal numbers would allow any string matching the regular expression [0-9]+(\.[0-9]+)?, including 160.0, 000160 or 160.000. The value space defines the semantics of the values from the lexical space, and the datatype definition states how lexical representations map onto values in the value space.
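The split between lexical space and value space can be sketched in Python for the decimal example. This is a minimal illustration rather than anything DTLL defines: the `decimal_value` function is a hypothetical name, and Python's `Decimal` type stands in for the value space.

```python
import re
from decimal import Decimal

# Lexical space: the strings accepted by the regular expression in the text.
DECIMAL_LEXICAL = re.compile(r"[0-9]+(\.[0-9]+)?")

def decimal_value(lexical):
    """Map a lexical representation onto the value space, or None if invalid."""
    if DECIMAL_LEXICAL.fullmatch(lexical) is None:
        return None
    return Decimal(lexical)
```

Here the three strings 160.0, 000160 and 160.000 are different members of the lexical space, but `decimal_value` maps them all onto the same value, so they compare as equal.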

However, DTLL differs from XML Schema in the role and mechanics of the value space. In XML Schema, the value spaces of the primitive datatypes, such as xs:decimal or xs:duration, are defined in prose within the XML Schema Datatypes Recommendation. Subtypes define their own subsets of this value space, but the essential structure of the value space is fixed. It is impossible to define new value spaces (doing so would be equivalent to defining new primitive datatypes).

The definitions of the value spaces of the primitive XML Schema datatypes are used by applications to indicate how values should be represented internally, and, crucially, how values should be compared. For example, the xs:QName datatype has a value space defined as follows.

The "value space" of QName is the set of tuples {namespace name, local part}, where namespace name is an anyURI and local part is an NCName.

This definition indicates that only the namespace URI and the local part of a qualified name are important when it comes to comparing qualified names; that the prefix is ignored. If a schema defines a fixed value for an attribute containing an xs:QName then the prefix specified for that fixed value doesn't matter, only the namespace URI with which the prefix is associated.
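As a rough Python sketch of that comparison, the value of a qualified name can be modelled as a tuple that leaves the prefix out. The `qname_value` function and the example namespace URI used with it are hypothetical; the in-scope namespaces are passed in as a plain dictionary, with the empty prefix standing for the default namespace.

```python
def qname_value(lexical, namespaces):
    """Return the (namespace name, local part) tuple for a lexical QName.

    `namespaces` maps prefixes to namespace URIs; the empty prefix stands
    for the default namespace. An undeclared prefix raises KeyError.
    """
    prefix, _, local = lexical.rpartition(":")
    return (namespaces[prefix], local)
```

Two qualified names with different prefixes, such as my:name and other:name, produce equal tuples whenever both prefixes are bound to the same namespace URI, which is exactly the behaviour the value space definition calls for.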

In DTLL, in contrast with XML Schema, every datatype has its own value space. Each lexical representation of a datatype is mapped onto a sequence of named properties; these properties form the value space for the type. If two lexical representations of a given datatype map onto the same set of property/value pairs, then the values are considered to be equal; if any of the property values are different, then the values are different. Equality between property values is, in turn, based on the type assigned to each property.

For example, the order attribute of the <svg:feConvolveMatrix> element contains one or two integers separated by either a comma or whitespace. If the second integer isn't given, it defaults to the same value as the first integer. So the value space for this datatype consists of a pair of integers, and we want the three values "5", "5, 5" and "05 05" to be equal. In DTLL, the definition for this datatype is

<datatype name="feConvolveMatrix.order">
  <choice>
    <all>
      <regex>[0-9]+</regex>
      <property name="orderX" select="number(.)" />
      <property name="orderY" select="number(.)" />
    </all>
    <all>
      <regex>(?'X'[0-9]+)(\s|(\s?,\s?))(?'Y'[0-9]+)</regex>
      <property name="orderX" select="number($X)" />
      <property name="orderY" select="number($Y)" />
    </all>
  </choice>
</datatype>

Here, the orderX and orderY properties make up the value space: if two lexical representations map to the same pair of values for those properties, then those lexical representations count as equal.
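To make the mapping concrete, here is a Python sketch of the same datatype. It is not how a DTLL processor is required to work: `order_value` is a hypothetical function, the named groups use Python's `(?P<name>...)` syntax, and whitespace is collapsed first to imitate the default normalisation.

```python
import re

SINGLE = re.compile(r"[0-9]+")
PAIR = re.compile(r"(?P<X>[0-9]+)(\s|(\s?,\s?))(?P<Y>[0-9]+)")

def order_value(lexical):
    """Map a lexical value onto its orderX/orderY properties, or None if invalid."""
    target = " ".join(lexical.split())  # collapse whitespace before testing
    if SINGLE.fullmatch(target):
        # One integer: it supplies both properties.
        return {"orderX": int(target), "orderY": int(target)}
    match = PAIR.fullmatch(target)
    if match:
        return {"orderX": int(match["X"]), "orderY": int(match["Y"])}
    return None
```

Comparing the dictionaries returned for two lexical values plays the role of comparing their property/value pairs.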

In XML Schema, the closest you could get would be to define a subtype of xs:token (rather than xs:string, which wouldn't normalize whitespace) with an appropriate pattern facet:

<xs:simpleType name="feConvolveMatrix.order">
  <xs:restriction base="xs:token">
    <xs:pattern value="[0-9]+((\s|(\s?,\s?))[0-9]+)?" />
  </xs:restriction>
</xs:simpleType>

Although this correctly checks the syntax for this datatype, if a fixed value of 3 were defined for an attribute using this datatype, the schema would not allow the equivalent values 3,3 or 03 03. XML Schema allows us to define new lexical spaces, but not new value spaces.

Datatype Definitions

Datatypes in DTLL are defined with a <datatype> element. The elements within the <datatype> element have two main roles: to test a target value to see if it's valid, and to assign values to the properties of the datatype.

Before they are tested against a datatype definition, target values are whitespace normalized. The normalize-whitespace attribute on the <datatype> determines whether and how the whitespace is normalized, with the usual values preserve, replace or collapse (the default).
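The three normalisation modes can be sketched in Python as follows. This assumes that the "usual values" behave like the whiteSpace facet in XML Schema, which the text suggests but does not spell out; `normalize_whitespace` is a hypothetical helper name.

```python
import re

def normalize_whitespace(value, mode="collapse"):
    """Normalise whitespace in a target value before it is tested."""
    if mode == "preserve":
        return value
    # replace: every tab, line feed and carriage return becomes a space
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":
        return replaced
    if mode == "collapse":
        # collapse: additionally squeeze runs of spaces and trim both ends
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError("unknown normalize-whitespace value: " + mode)
```

Under this reading, a definition that relies on the default can write its regular expressions without worrying about leading, trailing or repeated whitespace in the target value.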

Extension elements (any element that isn't in the datatypes namespace, which is currently http://www.jenitennison.com/datatypes and usually associated with the prefix dt) can be used anywhere within a datatype definition. These can be used to provide documentation and examples, or to provide additional tests. The dt:must-understand attribute on an extension element indicates whether an implementation needs to recognise the extension element or not.

Tests

There are four kinds of tests in the core of DTLL: regular expressions (<regex>), list definitions (<list>), validity against other datatypes (<valid>), and general conditions (<condition>).

Regular Expressions

The <regex> element tests the target value against a regular expression. The case-insensitive and ignore-regex-whitespace attributes determine whether the match is case sensitive or not, and whether whitespace within the regular expression can be ignored or not. For example, a case-insensitive list of keywords could be written

<regex case-insensitive="true" ignore-regex-whitespace="true">
  black |
  white |
  red   |
  ...
</regex>

The regular expression syntax used in DTLL is mostly the same as that used in XPath 2.0, which is an extension of that used in XML Schema. However, it's extended to allow subexpressions to be named. The syntax (?'name'group) is used to associate a name with the substring matched by the group.²

If named groups are used, the <regex> element creates a number of variable bindings between the names of the groups and the substrings matched by the groups. This enables users to quickly pull out parts of the target value for further testing or assignment to properties. For example, in the following datatype, the day, month and year parts of a date are identified using the regular expression and assigned to relevant properties

<datatype name="UKDate">
  <regex>(?'d'[0-9]{1,2})[-/.](?'m'[0-9]{1,2})[-/.](?'y'[0-9]{4})</regex>
  <property name="day" select="number($d)" />
  <property name="month" select="number($m)" />
  <property name="year" select="number($y)" />
</datatype>
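For readers who want to experiment with named groups, the following Python sketch rewrites the (?'name'group) syntax into Python's `(?P<name>group)` form and then extracts the properties of a UK date. Both `dtll_regex_to_python` and `uk_date_value` are hypothetical helpers; the rewrite is deliberately naive and ignores the other differences between the XPath 2.0 and Python regular expression dialects.

```python
import re

def dtll_regex_to_python(pattern):
    """Rewrite named groups written as (?'name'...) into Python's (?P<name>...)."""
    return re.sub(r"\(\?'([A-Za-z_][A-Za-z0-9_]*)'", r"(?P<\1>", pattern)

UK_DATE = re.compile(dtll_regex_to_python(
    r"(?'d'[0-9]{1,2})[-/.](?'m'[0-9]{1,2})[-/.](?'y'[0-9]{4})"))

def uk_date_value(target):
    """Return the day, month and year properties, or None if the regex fails."""
    match = UK_DATE.fullmatch(target)
    if match is None:
        return None
    # Each named group plays the role of a variable binding such as $d.
    return {"day": int(match["d"]),
            "month": int(match["m"]),
            "year": int(match["y"])}
```

Because the properties are numbers rather than substrings, 1/2/2006 and 01.02.2006 end up with the same property values.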

Lists

The <list> element tokenises the target value based on a separator (defined using a regular expression in the separator attribute). Each of the items in the list is then tested against a datatype. The datatype can be specified either through reference to another named datatype (using a type attribute and child <param> elements) or with a nested anonymous datatype (using a child <datatype> element).

As an example, the Points datatype in SVG uses comma-or-whitespace separated pairs of coordinates. The following definition can be used

<datatype name="Points">
  <list separator="(\s?,\s?)|\s">
    <datatype>
      <regex>[-+]?(([0-9]+\.?)|([0-9]*\.[0-9]+))([eE][-+]?[0-9]+)?</regex>
    </datatype>
  </list>
  <!-- Make sure the coordinates come in pairs -->
  <regex>[^\s,]+((\s?,\s?)|\s)[^\s,]+(((\s?,\s?)|\s)[^\s,]+((\s?,\s?)|\s)[^\s,]+)*</regex>
</datatype>
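The two steps in this definition, tokenising on the separator and then checking that the items pair up, can be sketched in Python. `points_valid` is a hypothetical function; the separator group is made non-capturing so that `split` returns only the items, and the number pattern treats the exponent sign as optional, as the SVG grammar does.

```python
import re

# Non-capturing separator group, so split() returns only the list items.
SEPARATOR = re.compile(r"(?:\s?,\s?)|\s")
NUMBER = re.compile(r"[-+]?(([0-9]+\.?)|([0-9]*\.[0-9]+))([eE][-+]?[0-9]+)?")

def points_valid(lexical):
    """True if the value is a non-empty list of numbers that come in pairs."""
    target = " ".join(lexical.split())  # collapse whitespace first
    if not target:
        return False
    items = SEPARATOR.split(target)
    if not all(NUMBER.fullmatch(item) for item in items):
        return False
    # Counting the items replaces the second <regex> test in the definition.
    return len(items) % 2 == 0
```

Counting the items is simpler than the second regular expression, but it leans on a programming language; the DTLL definition has to express the same constraint declaratively.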

Validity Testing

It's often useful to test the entirety or just a part of a target value against another datatype. The <valid> element specifies a value to test using the select attribute (defaulting to the target value), and tests it against a datatype specified either through the type attribute (and child <param> elements) or through an anonymous child <datatype> element.

For example, the following datatype definitions might be used to test the value of an xml:lang attribute (or rather, a much simplified version of it):

<datatype name="Language">
  <regex>(?'lang'[a-z]{2})-(?'country'[A-Z]{2})</regex>
  <valid select="$lang" type="TwoLetterLanguage" />
  <valid select="$country" type="TwoLetterCountry" />
</datatype>              
             
<datatype name="TwoLetterLanguage">
  <regex>aa|ab|af|ak|sq|am|...</regex>
</datatype>              
              
<datatype name="TwoLetterCountry">
  <regex>AF|AX|AL|DZ|AS|AD|...</regex>
</datatype>

General Conditions

General conditions can be tested with the <condition> element. The test attribute holds an XPath expression whose effective boolean value must be true for the target value to be valid.³ For example, the UK date datatype we looked at earlier could be tightened up a bit with a couple of conditions:

<datatype name="UKDate">
  <regex>(?'d'[0-9]{1,2})[-/.](?'m'[0-9]{1,2})[-/.](?'y'[0-9]{4})</regex>
  <property name="day" select="number($d)" />
  <property name="month" select="number($m)" />
  <property name="year" select="number($y)" />
  <condition test="$day >= 1 and 31 >= $day" />
  <condition test="$month >= 1 and 12 >= $month" />
</datatype>

We haven't dealt with months having different numbers of days, let alone leap years. Although it's possible to do so with pure XPath 1.0, DTLL offers some mechanisms for combining tests that provide a bit more help; we'll look at those next.

Combining Conditions

By default, the tests specified at the top level of a datatype definition must all be satisfied for a target value to be considered valid. DTLL provides three elements, equivalent to the logical operators and, or and not, to combine tests together: <all>, <choice>, and <except>.

The <choice> element is very powerful: you can use it to provide enumerations of different legal values

<datatype name="TwoLetterLanguage">
  <choice>
    <regex>aa</regex>
    <regex>ab</regex>
    <regex>af</regex>
    <regex>ak</regex>
    ...
  </choice>
</datatype>

or, in combination with the <all> element, to provide if/then conditionals (which are missing from XPath 1.0)

<datatype name="UKDate">
  <regex>(?'d'[0-9]{1,2})[-/.](?'m'[0-9]{1,2})[-/.](?'y'[0-9]{4})</regex>
  <property name="day" select="number($d)" />
  <property name="month" select="number($m)" />
  <property name="year" select="number($y)" />
  <condition test="$day >= 1" />
  <choice>
    <all>
      <condition test="$month = 1 or $month = 3 or $month = 5 or $month = 7 or
                       $month = 8 or $month = 10 or $month = 12" />
      <condition test="31 >= $day" />
    </all>
    <all>
      <condition test="$month = 4 or $month = 6 or $month = 9 or $month = 11" />
      <condition test="30 >= $day" />
    </all>
    <all>
      <condition test="$month = 2" />
      <condition test="28 >= $day" />
    </all>
    <all>
      <condition test="$month = 2" />
      <condition test="$day = 29" />
      <condition test="$year mod 4 = 0" />
      <except>
        <all>
          <condition test="$year mod 100 = 0" />
          <except>
            <condition test="$year mod 400 = 0" />
          </except>
        </all>
      </except>
    </all>
  </choice>
</datatype>
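The logic of this definition is easier to check when it is written out in a programming language. The following Python sketch mirrors it branch for branch: `any` stands in for <choice>, `and` for <all>, and `not` for <except>. The function name `uk_date_valid` is hypothetical.

```python
import re

UK_DATE = re.compile(
    r"(?P<d>[0-9]{1,2})[-/.](?P<m>[0-9]{1,2})[-/.](?P<y>[0-9]{4})")

def uk_date_valid(target):
    """Validate a UK date, including month lengths and leap years."""
    match = UK_DATE.fullmatch(target)
    if match is None:
        return False
    day, month, year = (int(match[name]) for name in ("d", "m", "y"))
    if day < 1:
        return False
    # Each entry mirrors one <all> branch of the <choice>.
    branches = [
        month in (1, 3, 5, 7, 8, 10, 12) and day <= 31,
        month in (4, 6, 9, 11) and day <= 30,
        month == 2 and day <= 28,
        # 29 February: divisible by 4, except centuries not divisible by 400.
        month == 2 and day == 29 and year % 4 == 0
        and not (year % 100 == 0 and year % 400 != 0),
    ]
    return any(branches)
```

The nested <except> elements correspond to the `not (... and ... != 0)` expression in the last branch: 1900 is rejected as a leap year, while 2000 is accepted.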

In fact, a combination of the <choice> and <valid> elements helps break down complex regular expressions. For example, the Points datatype that we looked at earlier is specified in SVG with the following:

list-of-points:
    wsp* coordinate-pairs? wsp*
coordinate-pairs:
    coordinate-pair
    | coordinate-pair comma-wsp coordinate-pairs
coordinate-pair:
    coordinate comma-wsp coordinate
coordinate:
    number
number:
    sign? integer-constant
    | sign? floating-point-constant
comma-wsp:
    (wsp+ comma? wsp*) | (comma wsp*)
comma:
    ","
integer-constant:
    digit-sequence
floating-point-constant:
    fractional-constant exponent?
    | digit-sequence exponent
fractional-constant:
    digit-sequence? "." digit-sequence
    | digit-sequence "."
exponent:
    ( "e" | "E" ) sign? digit-sequence
sign:
    "+" | "-"
digit-sequence:
    digit
    | digit digit-sequence
digit:
    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
wsp:
    (#x20 | #x9 | #xD | #xA)+

The DTLL datatype definitions can match this specification in structure:

<datatype name="coordinate-pairs">
  <regex>(?'coord1'[^\s,]+)((\s?,\s?)|\s)(?'coord2'[^\s,]+)(((\s?,\s?)|\s)(?'pairs'.+))?</regex>
  <valid select="$coord1" type="coordinate" />
  <valid select="$coord2" type="coordinate" />
  <choice>
    <condition test="$pairs = ''" />
    <valid select="$pairs" type="coordinate-pairs" />
  </choice>
</datatype>
              
<datatype name="coordinate">
  <valid select="." type="number" />
</datatype>        
        
<datatype name="number">
  <choice>
    <all>
      <regex>[-+]?(?'int'.+)</regex>
      <valid select="$int" type="integer-constant" />
    </all>
    <all>
      <regex>[-+]?(?'float'.+)</regex>
      <valid select="$float" type="floating-point-constant" />
    </all>
  </choice>
</datatype>        
        
<datatype name="integer-constant">
  <regex>[0-9]+</regex>
</datatype>        
        
<datatype name="floating-point-constant">
  <choice>
    <all>
      <regex>(?'frac'[^eE]+)(?'exp'[eE].+)?</regex>
      <valid select="$frac" type="fractional-constant" />
      <choice>
        <condition test="$exp = ''" />
        <valid select="$exp" type="exponent" />
      </choice>
    </all>
    <all>
      <regex>[0-9]+(?'exp'[eE].+)</regex>
      <valid select="$exp" type="exponent" />
    </all>
  </choice>
</datatype>        
        
<datatype name="fractional-constant">
  <choice>
    <regex>[0-9]*\.[0-9]+</regex>
    <regex>[0-9]+\.</regex>
  </choice>
</datatype>
        
<datatype name="exponent">
  <regex>[eE][-+]?[0-9]+</regex>
</datatype>
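
To make the recursion in these definitions concrete, here is a minimal Python sketch of the same checks. It is only an illustration under my own assumptions: the function names are hypothetical, Python's re module stands in for the DTLL regular expression syntax, and fullmatch is used because DTLL regexes must match the whole value.

```python
import re

# Hypothetical helpers; the patterns mirror the DTLL definitions above.
INTEGER = re.compile(r"[0-9]+")
FRACTIONAL = re.compile(r"[0-9]*\.[0-9]+|[0-9]+\.")
EXPONENT = re.compile(r"[eE][-+]?[0-9]+")
SEP = r"(?:\s?,\s?|\s)"
PAIRS = re.compile(r"([^\s,]+)" + SEP + r"([^\s,]+)(?:" + SEP + r"(.+))?")


def is_floating_point_constant(s):
    # First alternative: fractional-constant with an optional exponent.
    m = re.fullmatch(r"([^eE]+)([eE].+)?", s)
    if m and FRACTIONAL.fullmatch(m.group(1)) and (
        m.group(2) is None or EXPONENT.fullmatch(m.group(2))
    ):
        return True
    # Second alternative: digit-sequence followed by a mandatory exponent.
    m = re.fullmatch(r"[0-9]+([eE].+)", s)
    return bool(m and EXPONENT.fullmatch(m.group(1)))


def is_number(s):
    # Strip an optional sign, then try each branch of the <choice>.
    m = re.fullmatch(r"[-+]?(.+)", s)
    if m is None:
        return False
    body = m.group(1)
    return bool(INTEGER.fullmatch(body)) or is_floating_point_constant(body)


def is_coordinate_pairs(s):
    m = PAIRS.fullmatch(s)
    if m is None:
        return False
    coord1, coord2, pairs = m.groups()
    if not (is_number(coord1) and is_number(coord2)):
        return False
    # An absent 'pairs' group plays the role of the $pairs = '' condition.
    return pairs is None or is_coordinate_pairs(pairs)
```

The recursive call on the remaining pairs corresponds to the <valid select="$pairs" type="coordinate-pairs" /> element; a production implementation would probably iterate instead, to avoid deep recursion on long lists.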

Properties, Parameters and Variables

Properties, parameters and variables all work in much the same way, but have different uses. As we've seen, properties (<property>) define the value space for a datatype, and determine whether two target values are considered equal or not. Parameters (<param>) provide a mechanism to configure a datatype, such as specifying minimum and maximum values for a number. Variables (<variable>) are used within a datatype definition as a temporary store for values.

Each of the <property>, <param> and <variable> elements has the same structure. The name attribute provides a name, and you can use either the value or select attribute to provide a value. The value attribute provides a fixed value, while the select attribute provides a calculated value. If neither a value nor a select attribute is given, then the value defaults to the empty string.

Properties, parameters and variables are all exposed as variables within XPath expressions for use in conditions or the definition of other variables. For example, a string with a maximum, minimum, and/or fixed length could be defined as

<datatype name="string">
  <param name="length" />
  <param name="minLength" select="$length" />
  <param name="maxLength" select="$length" />
  <condition test="$length = '' or 
                   ($minLength = $length and 
                    $maxLength = $length)" />
  <condition test="$minLength = '' or $maxLength = '' or $maxLength >= $minLength" />

  <variable name="actualLength" select="string-length(.)" />
  <condition test="$minLength = '' or $actualLength >= $minLength" />
  <condition test="$maxLength = '' or $actualLength &lt;= $maxLength" />
</datatype>
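
The way the parameters default and constrain one another can be sketched in Python as follows. This is an illustration rather than part of DTLL: the function name is hypothetical, None stands in for the empty string of an unset parameter, and I assume that the ordering check between the minimum and maximum only applies when both are set.

```python
def valid_string(value, length=None, min_length=None, max_length=None):
    """Sketch of the 'string' datatype; None stands in for an unset parameter."""
    # minLength and maxLength default to length, as select="$length" does.
    if min_length is None:
        min_length = length
    if max_length is None:
        max_length = length
    # A fixed length must agree with any explicit minimum and maximum.
    if length is not None and not (min_length == length == max_length):
        return False
    # Assumption: the ordering check only applies when both bounds are set.
    if (min_length is not None and max_length is not None
            and max_length < min_length):
        return False
    actual_length = len(value)  # counts characters, like string-length(.)
    if min_length is not None and actual_length < min_length:
        return False
    if max_length is not None and actual_length > max_length:
        return False
    return True
```

For example, valid_string("abc", length=3) succeeds, while valid_string("abc", length=3, min_length=2) fails for every value because the parameters contradict each other.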

The type of a property, parameter or variable can be specified using either a type attribute combined with nested <param> elements, or a nested <datatype> element specifying an anonymous datatype. If a type is specified, the value of the property, parameter or variable is the string value of the result of the XPath expression, and it must be valid against the given type. Otherwise, the value of the parameter, property or variable keeps whatever type the XPath expression used to set it returns (a string, number, boolean or node-set).

Properties can be defined within different parts of a <choice>. As long as they are named and typed the same, they can be compared. For example, colours used in CSS2 could be defined with:

<datatype name="css:color">
  <variable name="HEX" value="0123456789ABCDEF" />
  <variable name="hex" value="0123456789abcdef" />
  <choice>
    <!-- keywords -->
    <choice>
      <all>
        <regex case-insensitive="true">aqua</regex>
        <property name="red" value="0" />
        <property name="green" value="255" />
        <property name="blue" value="255" />
      </all>
      <all>
        <regex case-insensitive="true">black</regex>
        <property name="red" value="0" />
        <property name="green" value="0" />
        <property name="blue" value="0" />
      </all>
      <all>
        <regex case-insensitive="true">blue</regex>
        <property name="red" value="0" />
        <property name="green" value="0" />
        <property name="blue" value="255" />
      </all>
      ...
    </choice>
            
    <!-- #RGB notation -->
    <all>
      <regex>#(?'r'[0-9a-fA-F])(?'g'[0-9a-fA-F])(?'b'[0-9a-fA-F])</regex>
      <variable name="R" select="translate($r, $hex, $HEX)" />
      <variable name="G" select="translate($g, $hex, $HEX)" />
      <variable name="B" select="translate($b, $hex, $HEX)" />
      <property name="red" 
        select="string-length(substring-before($HEX, $R)) * 17" />
      <property name="green"
        select="string-length(substring-before($HEX, $G)) * 17" />
      <property name="blue"
        select="string-length(substring-before($HEX, $B)) * 17" />
    </all>
            
    <!-- #RRGGBB notation -->
    <all>
      <regex>#(?'rr'[0-9a-fA-F]{2})(?'gg'[0-9a-fA-F]{2})(?'bb'[0-9a-fA-F]{2})</regex>
      <variable name="RR" select="translate($rr, $hex, $HEX)" />
      <variable name="GG" select="translate($gg, $hex, $HEX)" />
      <variable name="BB" select="translate($bb, $hex, $HEX)" />
      <property name="red" 
        select="string-length(substring-before($HEX, substring($RR, 1, 1))) * 16
                + string-length(substring-before($HEX, substring($RR, 2, 1)))" />
      <property name="green" 
        select="string-length(substring-before($HEX, substring($GG, 1, 1))) * 16
                + string-length(substring-before($HEX, substring($GG, 2, 1)))" />
      <property name="blue" 
        select="string-length(substring-before($HEX, substring($BB, 1, 1))) * 16
                + string-length(substring-before($HEX, substring($BB, 2, 1)))" />
    </all>
            
    <!-- rgb(red, green, blue) notation -->            
    <all>
      <regex>rgb\(\s?(?'r'[0-9]+)\s?,\s?(?'g'[0-9]+)\s?,\s?(?'b'[0-9]+)\s?\)</regex>
      <property name="red" select="$r" />
      <property name="green" select="$g" />
      <property name="blue" select="$b" />
    </all>
            
    <!-- rgb(red%, green%, blue%) notation -->
    <all>
      <regex>rgb\(\s?(?'r'[0-9]+(\.[0-9]+)?)%\s?,\s?(?'g'[0-9]+(\.[0-9]+)?)%\s?,\s?(?'b'[0-9]+(\.[0-9]+)?)%\s?\)</regex>
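      <!-- multiply before dividing so that 100% gives exactly 255 in floating point -->
      <property name="red" select="$r * 255 div 100" />
      <property name="green" select="$g * 255 div 100" />
      <property name="blue" select="$b * 255 div 100" />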
    </all>
  </choice>  
</datatype>
The colours red, #f00, #FF0000, rgb(255,0,0), and rgb(100%, 0%, 0%) are equivalent under this definition.
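
The idea that different lexical forms map to the same property values, and are therefore equal, can be sketched in Python. This is an illustration only: the function names and the small keyword table are my own, and the percentage branch multiplies before dividing so that 100% gives exactly 255 in floating-point arithmetic.

```python
import re

# Illustrative subset of the CSS2 keywords.
KEYWORDS = {
    "aqua": (0, 255, 255),
    "black": (0, 0, 0),
    "blue": (0, 0, 255),
    "red": (255, 0, 0),
}

NUM = r"([0-9]+(?:\.[0-9]+)?)"


def color_properties(s):
    """Return the (red, green, blue) properties, or None if s is not valid."""
    if s.lower() in KEYWORDS:
        return KEYWORDS[s.lower()]
    m = re.fullmatch(r"#([0-9a-fA-F])([0-9a-fA-F])([0-9a-fA-F])", s)
    if m:  # #RGB notation: each digit is doubled, i.e. multiplied by 17
        return tuple(int(d, 16) * 17 for d in m.groups())
    m = re.fullmatch(r"#([0-9a-fA-F]{2})([0-9a-fA-F]{2})([0-9a-fA-F]{2})", s)
    if m:  # #RRGGBB notation
        return tuple(int(d, 16) for d in m.groups())
    m = re.fullmatch(r"rgb\(\s?([0-9]+)\s?,\s?([0-9]+)\s?,\s?([0-9]+)\s?\)", s)
    if m:  # rgb(red, green, blue) notation
        return tuple(int(d) for d in m.groups())
    m = re.fullmatch(
        r"rgb\(\s?" + NUM + r"%\s?,\s?" + NUM + r"%\s?,\s?" + NUM + r"%\s?\)", s
    )
    if m:  # rgb(red%, green%, blue%) notation
        return tuple(float(d) * 255 / 100 for d in m.groups())
    return None


def colors_equal(a, b):
    # Two valid values are equal when all of their properties are equal.
    pa, pb = color_properties(a), color_properties(b)
    return pa is not None and pa == pb
```

With this sketch, colors_equal("#f00", "rgb(100%, 0%, 0%)") holds because both values have the properties (255, 0, 0).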

There are two standard parameters that, if declared for a datatype, allow extra context information to be used within a datatype definition. These standard parameters are $dt:in-scope-namespaces, which is a node set of namespace nodes, and $dt:base-uri, which is a string (see Note 4). For example, a datatype definition for the stylesheet-prefix and result-prefix attributes on the <xsl:namespace-alias> element in XSLT would be

<dt:datatype name="namespace-prefix">
  <dt:param name="dt:in-scope-namespaces" />
  <dt:choice>
    <dt:condition test=". = '#default'" />
    <dt:all>
      <dt:variable name="prefix" select="." />
      <dt:condition test="$dt:in-scope-namespaces[name(.) = $prefix]" />
    </dt:all>
  </dt:choice>
</dt:datatype>

Datatype Libraries

A datatype library is a set of named datatype definitions within a <datatypes> wrapper. The <datatypes> element must have a version attribute to indicate the version of DTLL that's being used. Currently, that version is 0.5.

DTLL deliberately borrows several features from [RELAX NG], both because RELAX NG is well-designed and because it is likely to be familiar to the users of DTLL.

<div> elements are used to structure datatype libraries, and unqualified names are resolved using the namespace specified in the nearest ancestor ns attribute (rather than the default namespace of the datatype library document).

If there's more than one definition for a given named datatype, those definitions are merged based on the value of a combine attribute. The combine attribute can have the value choice (in which case they are combined in a <choice> element), or all (in which case they are combined in an <all> element). For example, the floating-point-constant datatype we looked at earlier could be defined as

<datatype name="floating-point-constant" combine="choice">
  <regex>(?'frac'[^eE]+)(?'exp'[eE].+)?</regex>
  <valid select="$frac" type="fractional-constant" />
  <choice>
    <condition test="$exp = ''" />
    <valid select="$exp" type="exponent" />
  </choice>
</datatype>                  
          
<datatype name="floating-point-constant" combine="choice">
  <regex>[0-9]+(?'exp'[eE].+)</regex>
  <valid select="$exp" type="exponent" />
</datatype>
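
The effect of the combine attribute can be sketched in Python by treating each definition as a predicate over strings. The names here are hypothetical and the two regular expressions are simplified stand-ins for the definitions above, not a translation of them.

```python
import re


def regex_type(pattern):
    """Build a predicate that is true when the whole value matches pattern."""
    return lambda value: re.fullmatch(pattern, value) is not None


def combine_definitions(definitions, combine):
    """Merge same-named definitions as a <choice> or an <all> would."""
    if combine == "choice":
        return lambda value: any(d(value) for d in definitions)
    if combine == "all":
        return lambda value: all(d(value) for d in definitions)
    raise ValueError("combine must be 'choice' or 'all'")


# Simplified stand-ins for the two floating-point-constant definitions.
with_fraction = regex_type(r"([0-9]*\.[0-9]+|[0-9]+\.)([eE][-+]?[0-9]+)?")
with_exponent = regex_type(r"[0-9]+[eE][-+]?[0-9]+")

floating_point_constant = combine_definitions(
    [with_fraction, with_exponent], "choice"
)
```

Here floating_point_constant accepts a value if either definition accepts it; had the definitions been combined with all, a value would have to satisfy both.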

Modular datatype libraries are created using an <include> element: the content of the datatype library referenced using the href attribute is included in-place, wrapped in a <div> element, but any datatype definitions within the <include> replace those with the same name from the referenced datatype library.

Extension elements are allowed at the datatype library level as well as within datatype definitions, to provide documentation or extra information. Again, the dt:must-understand attribute can be used to indicate when an extension element must be recognised by the implementation. For example, an implementation might support <xsl:function> elements within a datatype library for the definition of user-defined functions used within the datatype definitions. These should be labelled with dt:must-understand="true" since an implementation that doesn't recognise them wouldn't have implementations for the user-defined functions and therefore couldn't correctly implement the datatype definitions.

Implementation

There are at least three implementations of DTLL under development. My own is a proof-of-concept written in XSLT 2.0. It works by transforming a datatype library into an XSLT 2.0 stylesheet that defines two main functions:
xs:boolean dt:valid(xs:string value, xs:QName datatype[, xs:string* params])

Returns true if the value is valid against the datatype with the specified parameters. The parameters are specified as a sequence of names and values.

xs:boolean dt:equal(xs:string value1, xs:string value2, xs:QName datatype[, xs:string* params])

Returns true if the values are equal according to the datatype with the specified parameters.

You can then import the generated stylesheet into your own in order to validate or test the equality of values.

The stylesheet also defines the following functions:
xs:anyAtomicType+ dt:properties(xs:string value, xs:QName datatype[, xs:string* params])

Returns a sequence that contains information about the properties of the value when considered as a value of specified datatype with the specified parameters.

xs:anyAtomicType dt:property(xs:string name, xs:anyAtomicType+ properties)

Returns the value of the named property.

xs:QName dt:property-type(xs:string name, xs:anyAtomicType+ properties)

Returns the xs:QName of the type of the named property. If the value is of an XPath type, then it returns the xs:QName dt:default.

These functions allow you to get further information about a particular value. You can get the values of particular properties, and if those properties are themselves typed, then properties of properties and so on. Using these functions, it's possible to create functions akin to the XPath 2.0 functions hours-from-dateTime() and namespace-uri-from-QName() for your own datatypes.

Implementation of DTLL has proved mostly straightforward. Of course, implementation in XSLT 2.0 is aided a great deal by the fact that XSLT 2.0 has a built-in implementation of XPath and uses almost the same regular expression syntax as DTLL.

XPath 1.0 expressions are used in many places in DTLL: in <condition>, <valid>, and the variable-binding elements. XSLT 2.0 can obviously process XPath 1.0 expressions but will treat them as XPath 2.0 expressions by default. To ensure that users don't have to include casts that aren't required in XPath 1.0, the implementation uses version="1.0" wherever such expressions are evaluated. This is more lenient than it should be, since it allows datatype libraries to use XPath 2.0 functions and operators that aren't supported in XPath 1.0, and there are a few corner cases of incompatibilities, but it's sufficient for a proof-of-concept. For other implementations, open source XPath 1.0 engines will be very useful.

Implementing regular expression matching for <regex> and tokenizing for <list> is fairly trivial in XSLT 2.0. The only challenge is to translate the given regular expression into one that doesn't include the named groups allowed in DTLL, and to then retrieve the substrings that match those groups and assign them to variables. Implementations in other languages will have to deal with the fact that the regular expression syntax used by XML Schema, XPath 2.0 and DTLL is not the same as that used in other languages, such as Java or Python. However, since there are open-source implementations of XML Schema and XPath 2.0 available, there is, at least, code available to learn from.
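
As an illustration of the named-group translation, here is a minimal Python sketch that rewrites the DTLL (?'name'...) syntax into Python's (?P<name>...) syntax and returns the matched substrings as variable bindings. The function names are hypothetical, and the sketch makes simplifying assumptions: it does not handle an escaped parenthesis directly before ?', and it ignores the other differences between the XML Schema and Python regular expression syntaxes.

```python
import re

# Matches the opening of a DTLL named group such as (?'coord1'
_NAMED_GROUP = re.compile(r"\(\?'([A-Za-z_][A-Za-z0-9_]*)'")


def to_python_regex(dtll_regex):
    """Rewrite (?'name'...) groups as Python's (?P<name>...) groups."""
    return _NAMED_GROUP.sub(r"(?P<\1>", dtll_regex)


def bind_groups(dtll_regex, value):
    """Return a dict of variable bindings, or None if value doesn't match.

    Groups that did not take part in the match are bound to the empty
    string, which is what tests such as $exp = '' rely on.
    """
    m = re.fullmatch(to_python_regex(dtll_regex), value)
    if m is None:
        return None
    return {name: text or "" for name, text in m.groupdict().items()}
```

For example, binding the regex (?'frac'[^eE]+)(?'exp'[eE].+)? against 1.5 gives frac the value 1.5 and exp the empty string.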

Probably the stickiest aspect of the implementation is dealing with circular definitions, which aren't forbidden in DTLL. For example, it's OK to give properties that hold the results of converting a value from one type to another, such as in

<datatype name="decimal">
  <regex>(?'whole'[0-9]+)(\.(?'frac'[0-9]+))?</regex>
  <property name="wholePart" type="integer" select="number($whole)" />
  ...
</datatype>        
        
<datatype name="integer">
  <regex>[0-9]+</regex>
  <property type="decimal" select="." />
</datatype>

With these definitions, if we want to test whether 05 is a legal integer, it passes the regex [0-9]+ but also needs to be a valid decimal. To test whether 05 is a valid decimal, we need to test whether the numerical value of the whole part of the decimal (5) is a valid integer. Testing 5 as an integer means testing 5 as a decimal, and so on. Similarly, the integers 05 and 5 can only be judged equal if they are equal as decimals, and they are only equal as decimals if the numerical values of their whole parts (5 in each case) are equal integers.

So, when testing validity of a target value and equality of two values, an implementation has to keep a stack of what's in the process of being tested to prevent infinite recursion. If it's asked to test the validity of a value that's in the process of being tested, then the value is judged to be valid (at least according to that test; there may be others on which it fails). When testing equality, the stopping condition is when the two values are identical strings, since such values must be equal. If the values don't converge on the same string then they are judged unequal.
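
The validity half of this strategy can be sketched in Python. The model below is a deliberate simplification of my own, not the XSLT implementation: each datatype is a regex plus a list of dependent checks, and a set of (datatype, value) pairs plays the role of the stack of tests in progress.

```python
import re

# Hypothetical, simplified model of the circular decimal/integer definitions:
# name -> (regex, [(other datatype, function deriving the value to test)]).
DATATYPES = {
    "decimal": (
        r"(?P<whole>[0-9]+)(\.(?P<frac>[0-9]+))?",
        [("integer", lambda m: str(int(m.group("whole"))))],  # number($whole)
    ),
    "integer": (
        r"[0-9]+",
        [("decimal", lambda m: m.group(0))],  # the value itself
    ),
}


def is_valid(value, datatype, in_progress=None):
    in_progress = set() if in_progress is None else in_progress
    key = (datatype, value)
    if key in in_progress:
        # Already being tested further up the stack: judge it valid here
        # and let the outer test decide.
        return True
    regex, dependents = DATATYPES[datatype]
    m = re.fullmatch(regex, value)
    if m is None:
        return False
    in_progress.add(key)
    try:
        return all(
            is_valid(derive(m), other, in_progress)
            for other, derive in dependents
        )
    finally:
        in_progress.discard(key)
```

Testing 05 as an integer visits (integer, 05), (decimal, 05), (integer, 5) and (decimal, 5), and stops when it is asked to test (integer, 5) a second time.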

Discussion

The examples in this paper have shown that DTLL is an effective and flexible method of defining datatypes. DTLL addresses many of the problems that users have with XML Schema's method of defining datatypes by giving users a lot more control over the restrictions they place on the lexical space of a datatype, and by letting them define their own value spaces when appropriate.

We have made a deliberate decision in the design of DTLL to focus on 80% of the problem: DTLL addresses validation, rather than providing datatypes for applications such as XPath 2.0; and it scopes out the harder 20% of datatypes, such as XPath expressions. But if DTLL were to address the remaining 20%, how would it do so?

Increasing the range of datatypes supported by DTLL is pretty trivial, and I expect that implementations of DTLL will do so: it's just a matter of adding different kinds of testing elements within datatype definitions. For example, an <edt:ebnf> element could be defined to provide an EBNF definition for a string, which would open the door to validating regular expressions, XPath expressions and other complex structured values. Implementations could also provide elements that test values with programming languages that are more powerful and flexible than XPath, enabling more complex computations to be carried out, for example testing if a value is a prime number or not.

Moving DTLL into a role beyond validation, particularly into being used to supply types to a language such as XPath, is a harder challenge, but possible. There are three aspects to the challenge: dealing with newly created values, providing conversions between datatypes, and providing mechanisms for performing magnitude comparisons between values.

One of the fundamental notions underlying DTLL is that the lexical representation of a value is primary, and the value space secondary. This is fine for validation because we are always supplied with lexical representations of values. XPath, on the other hand, needs to create new values: for example, when a duration is added to a date, the result is a newly created date; when a string is converted to a boolean, the result is a new boolean.

In XML Schema, the canonical lexical representation of a datatype is the way in which a given value should be represented as a string: a mapping from the value space to the lexical space. For example, the canonical lexical representation of the xs:decimal 00160 is 160.0: there are no leading zeros, and there must be at least one decimal place but no other trailing zeros. However, XML Schema runs into problems because it's possible to provide constraints on the lexical representation of a value (defined using the xs:pattern facet in XML Schema) that contradict the rules governing the canonical lexical representation. For example, prices are commonly defined as xs:decimal numbers with two decimal places:

<xs:simpleType name="price">
  <xs:restriction base="xs:decimal">
    <xs:pattern value="[0-9]+\.[0-9]{2}" />
  </xs:restriction>
</xs:simpleType>
Given a price such as 12.00, the canonical lexical representation, 12.0, is not a legal price, so round-tripping the price through XPath produces an error. To get around this problem, users are told not to use patterns that disallow the canonical lexical representation of a value. In this case, users must either drop the requirement for two decimal places or derive their price datatype from xs:token rather than xs:decimal: they must choose between allowing lexical representations they don't want and having comparisons between prices be incorrect.

It would be perfectly possible in DTLL to define a standard property that holds a lexical representation that could be used as a canonical lexical representation if no other lexical representation were provided (i.e. for newly created values). Because the person defining the datatype is in full control of the canonical lexical representation (unlike the users of XML Schema), it is a lot more reasonable to give them responsibility for ensuring that the canonical lexical representation they provide is a legal lexical representation. For example, a price could be defined (see Note 5) as

<datatype name="price">
  <regex>[0-9]+\.[0-9]{2}</regex>
  <property name="value" select="number(.)" />
  <variable name="w" select="floor($value)" />
  <variable name="f" select="concat('0', round(($value - $w) * 100))" />
  <property name="dt:canonical-lexical-representation"
    select="concat($w, '.', substring($f, string-length($f) - 1, 2))" />
</datatype>
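
The arithmetic in this definition can be followed step by step in a small Python sketch. The function name is hypothetical, and floor(x + 0.5) is used to mirror the round-half-up behaviour of the XPath 1.0 round() function, since Python's built-in round() rounds halves to even.

```python
import math


def canonical_price(lexical):
    """Mirror the dt:canonical-lexical-representation property above."""
    value = float(lexical)             # number(.)
    w = math.floor(value)              # floor($value)
    # concat('0', round(($value - $w) * 100))
    f = "0" + str(math.floor((value - w) * 100 + 0.5))
    # substring($f, string-length($f) - 1, 2) keeps the last two characters.
    return f"{w}.{f[-2:]}"
```

The leading '0' is what pads a single-digit fraction: for 12.05 the variable f is '05', while for 12.50 it is '050' and only the last two characters are kept.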

Conversions between datatypes are an interesting area, especially because in most type systems, type hierarchies play a big part in enabling automatic conversions to take place. For example, an integer can always be converted to a decimal because an integer is a decimal. I suspect that users of DTLL will build implicit type hierarchies using the <valid> element. For example, part of the numeric type hierarchy from XML Schema might be written in DTLL as

<datatype name="decimal">
  <regex>(?'whole'[-+]?[0-9]+)(\.[0-9]+)?</regex>
  <valid type="integer" select="$whole" />
</datatype>        
        
<datatype name="integer">
  <regex>[-+]?[0-9]+</regex>
  <valid type="decimal" />
</datatype>        
                
<datatype name="long">
  <valid type="integer" />
  <condition test=". &lt;= 9223372036854775807" />
  <condition test=". >= -9223372036854775808" />
</datatype>        
        
<datatype name="int">
  <valid type="long" />
  <condition test=". &lt;= 2147483647" />
  <condition test=". >= -2147483648" />
</datatype>        
        
<datatype name="short">
  <valid type="int" />
  <condition test=". &lt;= 32767" />
  <condition test=". >= -32768" />
</datatype>        
        
<datatype name="byte">
  <valid type="short" />
  <condition test=". &lt;= 127" />
  <condition test=". >= -128" />
</datatype>
Implicitly, here, every byte is an integer, and therefore providing a byte where an integer is expected would be OK. Would it be legal for an implementation to assume that a <valid> element that tested the target value itself indicated such a type hierarchy? And what about the <valid> element within the definition for the decimal datatype above? In XPath 2.0, decimals are converted to integers by using the whole part of the decimal number (omitting everything after the decimal point). Could <valid> elements of this form be used to encode that kind of mapping rule?
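
If an implementation did choose to read such self-testing <valid> elements as a type hierarchy, the inference itself would be simple. The Python sketch below is hypothetical: the table records, for each datatype above, the type named by a <valid> element that has no select attribute, and the function name is my own.

```python
# Datatype -> the type it validates its own target value against.
# decimal is absent because its <valid> element tests $whole, not the value.
SELF_VALID = {
    "byte": "short",
    "short": "int",
    "int": "long",
    "long": "integer",
    "integer": "decimal",
}


def is_implicit_subtype(datatype, ancestor):
    """Follow self-testing <valid> links from datatype up to ancestor."""
    seen = set()
    current = datatype
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guards against circular definitions
        current = SELF_VALID.get(current)
    return False
```

Under this reading, is_implicit_subtype("byte", "integer") is true, while the link from decimal to integer through $whole is not followed.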

I think it's dangerous to overuse <valid> elements in this way. There's a distinction between the relationships between the lexical representations of two types and the value spaces of those types. Say we had an antiquated gender datatype that was encoded as 0 (male) or 1 (female). Just because the lexical space corresponds to the lexical space for boolean does not mean that we should interpret males as false and females as true!

Previous versions of DTLL have included mapping mechanisms, but more work needs to be done in this area to come up with a syntax that is both easy to use and powerful enough to express the varying kinds of relationships that types can have with each other.

Probably the greatest boon of scoping DTLL only to validation is that it prevents any consideration of magnitude comparisons between values: all we have to worry about is whether two values are equal or not, not which one is larger. It's not obvious that a datatype definition language is the correct place to define comparisons anyway, in that different languages that use those datatypes might have different requirements. For example, while XPath 2.0 uses the built-in XML Schema datatypes, it overrides what XML Schema says about how those datatypes are ordered to avoid dealing with the complexities of partial ordering. For example, strings are unordered in XML Schema, but ordered based on character codepoints in XPath 2.0; XPath 2.0 also defines its own mechanisms for comparing xs:date, xs:time and other date/time datatypes. Other languages that deal with XML data might have alternative methods of dealing with these partially ordered datatypes.

What's probably required here is a method of defining a number of possible collations for a given datatype. These collations could be defined by comparing the properties of a value in a particular order. For example, a collation for a date would mean comparing the year, month and day of the date in that order. Previous versions of DTLL have included the definition of collations, but, again, this needs more time to get it right.
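
A collation of this kind amounts to comparing tuples of property values. The Python sketch below illustrates the idea for a simple year-month-day date; the function names and the date format are assumptions made for the example, not part of DTLL.

```python
def date_properties(lexical):
    """Split a YYYY-MM-DD string into numeric properties."""
    year, month, day = lexical.split("-")
    return {"year": int(year), "month": int(month), "day": int(day)}


def collation_key(properties, order=("year", "month", "day")):
    """Build a sort key by reading the properties in the collation's order."""
    return tuple(properties[name] for name in order)


dates = ["2006-08-07", "2005-12-31", "2006-01-15"]
sorted_dates = sorted(dates, key=lambda d: collation_key(date_properties(d)))
```

A different collation for the same datatype would simply list the properties in a different order, or leave some of them out.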

DTLL is still in the process of standardisation as part of DSDL, but it has gone through several iterations now, and is approaching stability, with at least three implementations under development. The big question is: does it meet your needs?

Notes

1. It's worth noting that XPath 2.0 only supports XML Schema datatypes, and actually defines its own casting and comparison rules. A more flexible query language is a matter of ongoing research.

2. The (?'name'group) syntax is used to name subexpressions in .NET. The Python syntax (?P<name>group) isn't workable within XML because the angle brackets would have to be escaped.

3. DTLL uses XPath 1.0, for ease of implementation and to avoid issues with the datatyping aspects of XPath 2.0. In particular, the implicit type conversions of XPath 1.0 prove very useful when parsing datatype values, which often contain substrings that should be treated as numbers, for example.

4. DTLL only has built-in support for using the context information that is provided when a datatype library is used with RELAX NG. It doesn't have built-in support for providing the names of unparsed entities (necessary for ENTITY and ENTITIES DTD datatypes), notations (necessary for NOTATION DTD datatype), or IDs (necessary for IDREF and IDREFS DTD datatypes). This information is only available if a document is processed with a DTD, and if a document has a DTD, then it can be validated against that DTD in order to check the validity of attributes of these datatypes.

5. We can't use the format-number() function here, since it's part of XSLT rather than XPath.


Bibliography

[DTLL] Datatype Library Language (DTLL) http://www.jenitennison.com/datatypes

[Dublin Core Separated Values] DCMI DCSV: A syntax for representing simple structured data in a text string. http://dublincore.org/documents/dcmi-dcsv/

[EXPRESS] ISO 10303-28, STEP Part 28, Implementation method: XML representation of EXPRESS schemas and data http://www.tc184-sc4.org/SC4_Open/SC4_Work_Products_Documents/STEP_(10303)/Files/p28_n140.PDF

[genericode] genericode http://www.genericode.org/

[HyTime] HyTime: ISO 10744:1997 Hypermedia/Time-based Structuring Language http://www.pms.ifi.lmu.de/mitarbeiter/ohlbach/multimedia/HYTIME/ISO/toc.html

[ISO 11404: Language-Independent Datatypes] Language-Independent Datatypes. ISO 11404. 15 December 1996. http://standards.iso.org/ittf/PubliclyAvailableStandards/s019346_ISO_IEC_TR_11404_1996(E).zip

[JAXB] JSR 222: The Java Architecture for XML Binding (JAXB) 2.0. http://jcp.org/en/jsr/detail?id=222

[RELAX NG] Document Schema Definition Languages (DSDL) — Part 2: Regular grammar-based validation — RELAX NG http://www1.y12.doe.gov/capabilities/sgml/sc34/document/0362_files/relaxng-is.pdf

[Schematron] Document Schema Definition Languages (DSDL) — Part 3: Rule-based validation — Schematron http://www.schematron.com/iso/dsdl-3-fdis.pdf

[W3C XML Schema] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation. 28 October 2004. http://www.w3.org/TR/xmlschema-2/

[XPath 2.0] XML Path Language (XPath) 2.0. W3C Candidate Recommendation. 3 November 2005. http://www.w3.org/TR/xpath20


