A Standards-based XML Schema Implementation Comparison Framework

Henry S. Thompson
Richard Tobin

Abstract

This paper describes a standards-based approach to the task of comparing two implements of the W3C XML Schema language: no special-purpose software is required, nor any common API for accessing the results of schema validity assessment. A combination of XSLT transformations and appeal to an XML manifestation of XML Infosets is all that is required.

Keywords: XSD/W3C Schema; Validating

Henry S. Thompson

Editor of the XML Schema Part 1: Structures specification, co-developer of XSV, the first freely available XML Schema validator

Richard Tobin

Editor of the XML Infoset specification; developer of RXP, a widely used validating XML parser; co-developer of XSV, the first freely available XML Schema validator.

A Standards-based XML Schema Implementation Comparison Framework

Henry S. Thompson [HCRC Language Technology Group; World Wide Web Consortium]
Richard Tobin [HCRC Language Technology Group]

Extreme Markup Languages 2001® (Montréal, Québec)

Copyright © 2001 Henry S. Thompson and Richard Tobin. Reproduced with permission.

Introduction

Although it is currently deemed unlikely that a reference implementation of the W3C XML Schema language will be designated, none-the-less comparing the output of one implementation with another is an obviously useful tool for anyone developing an implementation, or choosing one.

This paper describes a standards-based approach to this task: no special-purpose software is required, nor any common API for accessing the results of schema validity assessment. A combination of XSLT transformations and appeal to an XML manifestation of XML Infosets is all that is required.

Terminology Note: Throughout this paper, the word schema and the phrase XML Schema are used to refer to the W3C XML Schema language, as defined in the published Recommendations [structures] and [datatypes]

Reflected Infosets

The only requirement on an XML Schema processor for its output to be comparable with another's by the method described here is that it be able to reflect the post schema-validation Infoset (PSVI, per [structures]) as an XML document, itself valid with respect to the published XML Schema for such representations of the PSVI [psvi schema].

The comparison architecture

The adjoining figure illustrates the basic shape of the architecture.

Figure 1: Standards-based PSVI comparison architecture
[Link to open this graphic in a separate page]

The comparison starts with two inputs, a source file and a source schema. Two different schema processors (or two different versions of the same processor) undertake schema-validity assessment of the source using the schema, and reflect the resulting PSVI into XML documents, respectively rPSVI1 and rPSVI2. The next figure shows a fragment of such a reflected PSVI.

Figure 2: Reflected PSVI fragment
<element>
 <namespaceName>http://foo</namespaceName>
 <localName>xx</localName>
 <children/>
 <attributes/>
 <namespaceAttributes/>
 <inScopeNamespaces>
   . . .
 </inScopeNamespaces>
 <baseURI>file:/d:/work/XMLinter/ischema/tiny.xml</baseURI>
 <psv:schemaInformation xsi:null='true'/>
 <psv:validationAttempted>partial</psv:validationAttempted>
 <psv:validationContext>
   <pointer ref='g1'/>
 </psv:validationContext>
 <psv:validity>valid</psv:validity>
 <psv:errorCode xsi:null='true'/>
 <psv:schemaNormalizedValue>0</psv:schemaNormalizedValue>
 <psv:typeDefinition>
   <pointer ref='f:type._anon_7'/>
 </psv:typeDefinition>
 <psv:memberTypeDefinition xsi:null='true'/>
 <psv:elementDeclaration>
   <pointer ref='f:elt.xx'/>
 </psv:elementDeclaration>
 <psv:null>false</psv:null>
</element>

The next step (the blue number circle) is not strictly necessary: it is needed only to enable various approaches to displaying the differences detected later. It actually consists of two stages:

  1. an XSLT stylesheet which assigns a unique ID, section number style, to every element in its input;
  2. an XSLT stylesheet which produces an HTML image of its input XML, providing named anchors wherever an ID is found.
The boxes in the diagram labelled nrPSVI1 and nrPSVI2 are the result of this process. Here's an example: the same fragment as displayed above, but laid out and formatted with HTML, in the style made familiar by Microsoft's Internet Explorer default display of XML, with IDs and anchors (although of course you can't see the anchors):

Figure 3: Formatted and IDed reflected PSVI fragment
[Link to open this graphic in a separate page]

The next step in the overall comparison architecture is represented by the blue compare circle: as directed by a comparison control document (not shown in the picture) it compares the two reflected Infoset documents and outputs one of three possible styles of result:

  1. An XLink link base, consisting of one complex XLink for each difference between the two inputs, with two arcs, one to the locus of the difference in one input, the other to the other;
  2. A reconstruction of the original input XML document, presented as HTML with pairs of links to the reflected PSVIs whereever they differ;
  3. A reconstruction of the original input XML Schema, presented as HTML with pairs of links to the reflected PSVIs whereever they differ.

Option (1) would actually be appropriate for a difference computation for any pair of XML documents, not just reflected PSVIs. Options (2) and (3), on the other hand, only make sense for reflected PSVIs.

In option (1), the XLinks use XPointers to the XML reflections of the PSVI, using the IDs added above. In options (2) and (3) the HTML is linked using normal HTML links, i.e. <a href='...'>, pointing to the anchors in the HTML version of the reflected PSVI, e.g. nrPSVI1. Here's a screenshot of a simple version of option (2):

Figure 4: Screen shot of difference display, option (2)
[Link to open this graphic in a separate page]

One pair of difference pointers can be seen in the left-most pane: they both point to the element with ID e1.1.1.10 in the two input documents, and the element with this ID can be seen in the two right-hand panes, with, as indicated, different content. A great deal of work remains to be done in exploring the best way to display differences of this sort.

Comparison details

It is not immediately apparent that XSLT is suitable for comparing two documents: it is normally described as mapping from an input to an output. In the case at hand, the input document is actually a comparison configuration document, and the two documents to be compared are actually passed as invocation parameters.

The comparison algorithm used assumes a reasonable degree of similarity between the two inputs, such as might be expected if they were governed by the same schema or DTD. It is therefore not a general-purpose tree difference algorithm, and does not proceed by the standard approach of some form of Viterbi search through a space of costed insertion, deletion and movement operations. Rather it alternates between set alignment (between sets of attributes, and sets of like-named element children) and value comparison. In particular, it never explores the possibility that element or attribute name differences are not definitive, i.e. that the simplest expression of a difference might be that the labels on two sub-trees have been swapped. In fact order is never considerd at all: the trees are compared as if they were un-ordered.

Aside from the root template, all the templates in the comparison stylesheet are named, and comparison is implemented as a recursive algorithm over pairs of node sets, using XSLT as a functional programming language with a particular affinity for operations on XML document trees. The details which follow assume output option (1), as it is the most general.

To compare two element node sets, A and B

  1. Partition the A set into a set of node sets, where each subordinate set contains all and only like-named elements;
  2. For each such same-name set, compare it with the set of nodes with that name from the B set.

To compare two element node sets, A and B, whose nodes all have the same name

  1. If the sets are of different cardinality, signal an error with respect to the parent nodes and the element name.
  2. Otherwise, if the sets both have cardinality one, compare the individual nodes.
  3. Otherwise use keys to compare the two node sets.

To compare two individual nodes, X and Y

  1. For each attribute of X
    1. If Y has no attribute of the same name, signal an attribute missing error with respect to X vs. Y.
    2. Otherwise, if the attribute's value is different between X and Y, signal an attribute value error with respect to X vs. Y.
  2. For each attribute of Y missing from X, signal an attribute missing error with respect to Y vs. X.
  3. If X and Y have unequal character data content, ignoring ignorable whitespace, signal a content mismatch error with respect to X vs. Y.
  4. Compare the node set of X's element children to the node set of Y's element children.

To compare two element node sets, A and B, using keys

For each node in A, extract its key (see below), and compare it to the single node with the same key from B or, if there is no such node, or there are more than one, signal an unmatched key error with respect to A, A's parent and B's parent.

To extract the key of an element node

  1. Look up the element's qualified name in the key declaration table from the comparison configuration table.
  2. If there is none, signal a missing declaration error with respect to the element.
  3. Otherwise, evaluate the field expressions of the key declaration with respect to the element to assemble a key.

At several points the above process is controlled by the contents of the input comparison configuration file. This has five kinds of information:

  1. Process-namespace declarations. These specify a namespace name, signalling that all elements in that namespace should be compared;
  2. Process-element declarations. These specify a name, signalling that all elements with that name should be compared;
  3. Ignore-attribute declarations. These specify an attribute name and, optionally, an element name, signalling that attributes of that name (on that element) should not be compared;
  4. Follow declarations. These specify an element name and an attribute name, signalling that a matching element should use the referent of the named attribute in place of itself;
  5. Key declarations. These specify an element name and one or more field patterns, signalling that the values of the nodes matching those patterns (typically children or grandchildren) uniquely identify the elements of the given name.
The implication of the first two points above is that by default element nodes are not compared but attribute nodes (of elements which are being compared) are.

Status and further work

The architecture as pictured is completely implemented, with the exception at the time of writing of the box labelled schemaDiff. This is awaiting work of wider significance on an alternative representation for XML Schema components: one which is more verbose and explicit than that in [structures], but more transparent and readable than that in [psvi schema]. It has not yet been widely deployed or tested, but we expect to release it soon as part of the XML Schema Test Collection initiative (see http://www.w3.org/2001/05/xmlschema-test-collection.html).

Although many decisions about this architecture were made with the specific task of comparing reflected PSV Infosets in view, it is potentially of more general utility. The limits of that utility could stand detailed exploration. Some limitations could easily be overcome (using order as a key, in cases where order actually does matter, for example), but others are intrinsic to the recursive descent approach adopted here, at least when coupled with the pecularities of the chosen implementation language (identifying transpositions, insertions or deletions within element sequences as such, for example). On the whole, it seems more likely that this architecture will be of use for data-oriented, as opposed to document oriented, document types.


Acknowledgments

The work reported here was supported by the World Wide Web Consortium and Microsoft Corporation, to whom thanks are owed but no responsibility for error should be attributed.


Bibliography

[datatypes] XML Schema Part 2: Datatypes, Paul V. Biron and Ashok Malhotra, eds. World Wide Web Consortium, Boston, MA. 2001. Available online as http://www.w3.org/TR/REC-xmlschema-2-20010502/

[psvi schema] A schema for serialized infosets, Richard Tobin and Henry S. Thompson. 2001. Available online as http://www.w3.org/2001/05/serialized-infoset-schema.html

[structures] XML Schema Part 1: Structures, Henry S. Thompson et al. eds. World Wide Web Consortium, Boston, MA. 2001. Available online as http://www.w3.org/TR/REC-xmlschema-1-20010502/



A Standards-based XML Schema Implementation Comparison Framework

Henry S. Thompson [HCRC Language Technology Group; World Wide Web Consortium]
Richard Tobin [HCRC Language Technology Group]