This paper describes a standards-based approach to the task of comparing two implements of the W3C XML Schema language: no special-purpose software is required, nor any common API for accessing the results of schema validity assessment. A combination of XSLT transformations and appeal to an XML manifestation of XML Infosets is all that is required.
Although it is currently deemed unlikely that a reference implementation of the W3C XML Schema language will be designated, none-the-less comparing the output of one implementation with another is an obviously useful tool for anyone developing an implementation, or choosing one.
This paper describes a standards-based approach to this task: no special-purpose software is required, nor any common API for accessing the results of schema validity assessment. A combination of XSLT transformations and appeal to an XML manifestation of XML Infosets is all that is required.
Terminology Note: Throughout this paper, the word schema and the phrase XML Schema are used to refer to the W3C XML Schema language, as defined in the published Recommendations [structures] and [datatypes]
The only requirement on an XML Schema processor for its output to be comparable with another's by the method described here is that it be able to reflect the post schema-validation Infoset (PSVI, per [structures]) as an XML document, itself valid with respect to the published XML Schema for such representations of the PSVI [psvi schema].
The adjoining figure illustrates the basic shape of the architecture.
The comparison starts with two inputs, a source file and a source schema. Two different schema processors (or two different versions of the same processor) undertake schema-validity assessment of the source using the schema, and reflect the resulting PSVI into XML documents, respectively rPSVI1 and rPSVI2. The next figure shows a fragment of such a reflected PSVI.
<element> <namespaceName>http://foo</namespaceName> <localName>xx</localName> <children/> <attributes/> <namespaceAttributes/> <inScopeNamespaces> . . . </inScopeNamespaces> <baseURI>file:/d:/work/XMLinter/ischema/tiny.xml</baseURI> <psv:schemaInformation xsi:null='true'/> <psv:validationAttempted>partial</psv:validationAttempted> <psv:validationContext> <pointer ref='g1'/> </psv:validationContext> <psv:validity>valid</psv:validity> <psv:errorCode xsi:null='true'/> <psv:schemaNormalizedValue>0</psv:schemaNormalizedValue> <psv:typeDefinition> <pointer ref='f:type._anon_7'/> </psv:typeDefinition> <psv:memberTypeDefinition xsi:null='true'/> <psv:elementDeclaration> <pointer ref='f:elt.xx'/> </psv:elementDeclaration> <psv:null>false</psv:null> </element>
The next step (the blue number circle) is not strictly necessary: it is needed only to enable various approaches to displaying the differences detected later. It actually consists of two stages:
The next step in the overall comparison architecture is represented by the blue compare circle: as directed by a comparison control document (not shown in the picture) it compares the two reflected Infoset documents and outputs one of three possible styles of result:
Option (1) would actually be appropriate for a difference computation for any pair of XML documents, not just reflected PSVIs. Options (2) and (3), on the other hand, only make sense for reflected PSVIs.
In option (1), the XLinks use XPointers to the XML reflections of the PSVI, using the IDs added above. In options (2) and (3) the HTML is linked using normal HTML links, i.e. <a href='...'>, pointing to the anchors in the HTML version of the reflected PSVI, e.g. nrPSVI1. Here's a screenshot of a simple version of option (2):
One pair of difference pointers can be seen in the left-most pane: they both point to the element with ID e184.108.40.206 in the two input documents, and the element with this ID can be seen in the two right-hand panes, with, as indicated, different content. A great deal of work remains to be done in exploring the best way to display differences of this sort.
It is not immediately apparent that XSLT is suitable for comparing two documents: it is normally described as mapping from an input to an output. In the case at hand, the input document is actually a comparison configuration document, and the two documents to be compared are actually passed as invocation parameters.
The comparison algorithm used assumes a reasonable degree of similarity between the two inputs, such as might be expected if they were governed by the same schema or DTD. It is therefore not a general-purpose tree difference algorithm, and does not proceed by the standard approach of some form of Viterbi search through a space of costed insertion, deletion and movement operations. Rather it alternates between set alignment (between sets of attributes, and sets of like-named element children) and value comparison. In particular, it never explores the possibility that element or attribute name differences are not definitive, i.e. that the simplest expression of a difference might be that the labels on two sub-trees have been swapped. In fact order is never considerd at all: the trees are compared as if they were un-ordered.
Aside from the root template, all the templates in the comparison stylesheet are named, and comparison is implemented as a recursive algorithm over pairs of node sets, using XSLT as a functional programming language with a particular affinity for operations on XML document trees. The details which follow assume output option (1), as it is the most general.
|To compare two element node sets, A and B||
|To compare two element node sets, A and B, whose nodes all have the same name||
|To compare two individual nodes, X and Y||
|To compare two element node sets, A and B, using keys||
For each node in A, extract its key (see below), and compare it to the single node with the same key from B or, if there is no such node, or there are more than one, signal an unmatched key error with respect to A, A's parent and B's parent.
|To extract the key of an element node||
At several points the above process is controlled by the contents of the input comparison configuration file. This has five kinds of information:
The architecture as pictured is completely implemented, with the exception at the time of writing of the box labelled schemaDiff. This is awaiting work of wider significance on an alternative representation for XML Schema components: one which is more verbose and explicit than that in [structures], but more transparent and readable than that in [psvi schema]. It has not yet been widely deployed or tested, but we expect to release it soon as part of the XML Schema Test Collection initiative (see http://www.w3.org/2001/05/xmlschema-test-collection.html).
Although many decisions about this architecture were made with the specific task of comparing reflected PSV Infosets in view, it is potentially of more general utility. The limits of that utility could stand detailed exploration. Some limitations could easily be overcome (using order as a key, in cases where order actually does matter, for example), but others are intrinsic to the recursive descent approach adopted here, at least when coupled with the pecularities of the chosen implementation language (identifying transpositions, insertions or deletions within element sequences as such, for example). On the whole, it seems more likely that this architecture will be of use for data-oriented, as opposed to document oriented, document types.
The work reported here was supported by the World Wide Web Consortium and Microsoft Corporation, to whom thanks are owed but no responsibility for error should be attributed.
[psvi schema] A schema for serialized infosets, Richard Tobin and Henry S. Thompson. 2001. Available online as http://www.w3.org/2001/05/serialized-infoset-schema.html