Comparing Markup Languages

Jeni Tennison
jeni@jenitennison.com

Abstract

Our industry needs to be able to compare markup languages to help to create mappings and transformations between such languages; to create pipelining applications that can automatically put together appropriate transformations from one language to another; and to design better markup languages. A prototype Java application (the Comparer) evaluates two document instances that contain the same information and provides both a node-by-node comparison and an "overall similarity metric" covering both syntactic and semantic similarity.

Keywords: Transforming; Mapping; Markup Languages; Semantics

Jeni Tennison

Jeni is an independent consultant and author on XML, XSLT and related technologies. She has a background in knowledge engineering: her PhD was on developing ontologies collaboratively over the web. Her interest in representing information led to XML, and the requirement to support different views of information to XSLT. She seems to spend most of her time answering people's email about XSLT and XML Schemas. She lives in Nottingham, England, with one man, two cats, three games consoles, four computers, and lots of Lego.

Comparing Markup Languages

Jeni Tennison [Jeni Tennison Consulting Ltd]

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 Jeni Tennison. Reproduced with permission.

Introduction

XML [Extensible Markup Language] is named after its extensibility, which gives us the ability to create markup languages tailored to suit our data and processing requirements. And, indeed, our ability to create our own markup languages has led to us using XML for representing a huge variety of information in many different ways.

XML's power comes from the fact that it gives us a common syntax. This fixed syntax enables us to use APIs [Application Programming Interfaces] such as SAX [Simple API for XML] and DOM [Document Object Model] and generic tools such as XSLT [Extensible Stylesheet Language: Transformations] processors and schema validators. It also enables us to focus on the semantics of our messages and data, such that our effort can go into the hard problems such as what information we need to exchange and how it is best represented.

However, when two applications communicate it is not enough that they share the same syntax; for meaningful communication, they must share at least a portion of the same semantics. For example, I can identify words and sentences in a Finnish paragraph, but I have no idea what they mean. Without a shared semantics, communication cannot be effective.

The downside of the freedom to create our own markup languages is that my markup language might have very different semantics from your markup language, even if we're actually talking about the same thing. When we exchange XML messages, as humans we might have some hope of understanding each other's messages, as long as we share a natural language and we haven't used too many abbreviations. Computers, which don't have much in the way of common sense, let alone understanding of natural language, have a much more difficult time. To an application written for my markup language, messages written in your markup language are meaningless.

The most straight-forward, and currently dominant, approach to this problem is to say that we shouldn't be using different markup languages if we're talking about the same thing. The aim of this approach is to centralise, standardise and homogenise our use of markup languages. For example, a centralised solution to the problem of mapping between multiple markup languages would be to create a single representation that covers the varying conceptual models of these markup languages, and translate between them via this central hub.

I'd argue that a centralised approach is not a realistic approach to take with XML. For one thing, if different groups have sufficiently different conceptualisations of a domain that they create their own markup languages, it's unlikely that creating a central, consensual, representation is going to be possible. In addition, diversity is a pre-requisite of evolution and progression - the most useful markup languages will emerge in areas where there is variety and competition.

In this paper, I'm going to explore some of the implications of working with diverse markup languages and introduce a technique for comparing markup languages. In the next section, I'm going to highlight some of the tasks that are important when working with diverse markup languages and show how automated comparisons could help with these tasks. I will then outline a prototype tool for comparing instance documents, discuss the algorithm that it uses for comparing nodes, and give a couple of illustrations of the tool in use. Finally, I will discuss the next step in developing the tool and using it within other applications.

Motivation

Let's assume that we accept that different groups of users will use different markup languages to represent information on the same topic area. In this environment of multiple, decentralised, markup languages, there are three main tasks that will need to be addressed:

  • creating mappings/transformations between markup languages
  • creating pipelines that link together existing mappings/transformations
  • developing and changing markup languages over time

In this section, I'm going to look at what each of these tasks entails, and look at how the ability to automatically compare markup languages might help with each of them.

Mapping and Transformation

If information is stored in one markup language, and you need it in another markup language, you need to have some way of mapping between those markup languages - of transforming from one to another.

There are multiple mapping and transformation technologies available for XML, in particular XSLT [XSLT] and XQuery [XQuery], but also several other languages such as XSLScript [XSLScript], Templatotron [Templatotron] and AF:NG [Architectural Forms: A New Generation] [AF:NG]. A large number of transformations are also done by manipulating XML through DOM or SAX, or even through basic string parsing.

However, it's not enough to have the technology; developers also need to be able to use the technology - to create the instructions that drive the transformation engines. Here, tool support is growing, particularly in the form of mapping tools, which now exist in abundance. Mapping tools give developers a simple, quick way to write transformations: the code for the straight-forward, one-to-one mappings is written automatically. This frees the developer to concentrate on the harder aspects of the transformation, which often require an understanding of those semantics of a markup language that arise by convention (rather than being represented in XML structures) and might well only be articulated in natural-language documentation.

Typically the developer is presented with two trees, generally created from DTDs [Document Type Definitions], schemas or sample instance documents, one for the source markup language and one for the result. The developer's job is to draw lines between these two trees to indicate how the elements, attributes and values from the source map on to the elements, attributes and values in the result.

When a developer uses a mapping tool they are using their knowledge and intelligence to compare markup languages. Drawing lines between nodes isn't difficult; drawing the right lines can be. There is scope here for some automation, for the developer to be provided with prompts about what mappings might be useful between the various nodes. Even a little bit of automation of the simpler mappings would allow developers to concentrate on the harder mappings where human intelligence and common sense is absolutely necessary.

To support such automation, it's necessary to define an algorithm that compares markup languages on a node-by-node basis. Such a comparison must be able to tell whether the element called <name> in the source is most similar to the element called <author> in the result, or whether an attribute called value in the source is probably equivalent to the textual content of a <price> element in the result. Even a rough metric would be useful if it assists the user in narrowing down the possible mappings to choose from.

Pipelines

In a world where multiple markup languages for the same domain coexist, it's rarely practical to develop a dedicated transformation between all pairs of markup languages. Instead, transformations can be constructed on demand by pipelining several existing transformations together. In hub-and-spoke arrangements, every transformation is achievable through two transformations - from markup language A to the central markup language to markup language B. More general networks, where there is no single central language to go through, require more complex pipelines that may involve several transformations.

The idea of constructing pipelines has been around for a little while, but has been gaining more ground recently, with a good implementation in Cocoon [Cocoon] and the XML Pipeline W3C [World Wide Web Consortium] Note [XML-pipeline].

The pipelines in both Cocoon and the XML Pipeline Definition Language are non-ambiguous by design. A pipeline document states that in order to get from markup language A to markup language B, you must go through markup language C. It is illegal to state that it is also possible to go through markup language D.

The killer pipeline application will be one that can deal with ambiguity and can identify the shortest route between two markup languages. When transforming from markup language A to markup language B, where there are mappings from A to both C and D and from both C and D to B, such an application should be able to tell whether it is better to go through markup language C or markup language D.

To work out which two transformations are going to be the most effective, such an application must be able to work out the "distance" between the various markup languages - a metric that it can use to compare different possible pipelines in order to choose which to use. The "distance" could be measured in terms of the speed of such a transformation, but it is easy to transform information to gibberish very quickly, so that is unlikely to be an effective measure.

One measure that could be useful is the amount of information that is lost through the transformation. The less loss there is in a pipeline, the more effective the pipeline.

Information can be lost in two ways during a transformation. First, the content itself can simply be discarded; an application might ignore the metadata associated with a book, for example. Second, the information can be retained, but in an arrangement such that it is difficult to extract that information into its original structure again. For example, the transformation from: <name><forename>Jeni</forename><surname>Tennison</surname></name> to: <author>Tennison, Jeni</author> is lossy, not because content is discarded but because the semantic labels for the parts of the content are lost.

Looking from the other direction, we can think about the amount of information that is gained through the transformation. When we consider the reverse transformation, from the content of the <author> element to the content of the <name> element, we can see that information about the meaning of the words is added en route. However, this transformation has its own downside, in that its complexity is likely to be both computationally expensive and prone to bugs.

A good pipeline, then, is one that minimises the amount of change that a document has to go through to get from one markup language to another. Losing information is bad, but even transformations that gain information are to be avoided if that information is unnecessary. There is no point in transforming the <author> element into a <name> element if the structure of the name in the final document is just the same as it was in the initial one.

Thus a metric that summarises the overall similarity between two markup languages gives a good indication of the "distance" in terms of transformation between the two markup languages, and could assist a pipeline engine in choosing the best series of transformations.

Markup Language Development

Part of recognising that different markup languages can be used to represent the same information is recognising that those markup languages can change over time, by merging with each other, by breaking down into separate modules, or simply by evolving.

If two or more markup languages are very similar, there's a good argument for merging them - the higher the adoption of a particular markup language, the better for all concerned. The first challenge is to recognise where markup languages are similar to each other. Markup languages that appear very similar in terms of their vocabulary might actually be built around fundamentally different conceptual models and thus be very hard to merge; similarly, markup languages that use completely different terminology for their elements and attributes could rest on the same conceptual foundations and therefore be easy to bring together. Here, as in automating pipelines, a metric that summarises the similarity between markup languages could be used to identify the likelihood of a successful merger.

When a decision has been made to bring two markup languages together, the focus has to shift onto consensus building. Again, comparisons between elements and attributes in the two markup languages can help to identify areas where two parties need to concentrate their efforts and those where the decisions are easy to make. This same process occurs during the development of a markup language, where different developers have different ideas about how it should be constructed - to achieve consensus, it helps to know where you differ.

A third type of evolution that markup languages go through, demonstrated with XHTML [Extensible Hypertext Markup Language] [XHTML], is modularisation. Modularisation is useful where the same markup language is used in different ways by different groups of users, since it allows them to pick out the parts of the markup language that they find useful, while ignoring the rest. Comparing documents that use the "same" markup language could help here, to identify communities of practice through statistical analysis such as cluster analysis.

Application

For each of the tasks discussed above, it appears that a semi-automated method of comparing markup languages could help:

  • to identify mappings between nodes in different markup languages, assisting in the development of transformations and to highlight areas of disparity when combining, developing or modularising markup languages
  • to provide an overall metric for the similarity between two markup languages, enabling pipeline applications to choose the best series of transformations for a particular goal, and to give an assessment of the utility of merging markup languages

A related challenge is present in knowledge engineering, where those developing and using ontologies need to compare conceptualisations, merge ontologies, and track changes during ontology development. Ontologies are a lot like markup languages: they demonstrate one, supposedly consensual, way of structuring the information that's interesting about a particular domain. Concepts in ontologies are like element declarations in schemas; properties in ontologies like attribute declarations in schemas; relations in ontologies like element content in schemas.

When comparing ontologies (for example [Maedche & Staab, 2001]; [Noy & Musen, 2000]), knowledge engineers look at two kinds of similarity: syntactic similarity and semantic similarity. Syntactic similarity is based on the names of the concepts, while semantic similarity is based on the overlap between their slots and relations. I will follow this method for comparing markup languages - syntactic similarity based on the names of elements or attributes, and semantic similarity based on the types of the content and values of those elements or attributes - while recognising that it does not capture all the subtleties of semantic equivalence.

Dividing similarity into syntactic and semantic similarity means that the result of a comparison between any two nodes can fall into one of four categories [Shaw & Gaines, 1989], as shown in the following table:

Table 1

                        same syntax       different syntax
same semantics          consensus         correspondence
different semantics     conflict          contrast

For example, two elements correspond to each other if they hold the same information but have different names. Two elements conflict with each other if they are named the same thing, but hold different information.
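
As an illustration, a comparison could be assigned to one of these four categories from its syntactic and semantic similarity scores. The following Java sketch is mine rather than the Comparer's code, and the 50% thresholds are hypothetical:

// Hypothetical categorisation of a comparison into the four quadrants of
// Table 1; the 0.5 thresholds are illustrative only, not taken from the tool.
enum Category { CONSENSUS, CORRESPONDENCE, CONFLICT, CONTRAST }

static Category categorise(double syntacticSimilarity, double semanticSimilarity) {
    boolean sameSyntax = syntacticSimilarity >= 0.5;
    boolean sameSemantics = semanticSimilarity >= 0.5;
    if (sameSyntax && sameSemantics) return Category.CONSENSUS;
    if (!sameSyntax && sameSemantics) return Category.CORRESPONDENCE;
    if (sameSyntax) return Category.CONFLICT;   // same name, different information
    return Category.CONTRAST;                   // different name, different information
}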

In terms of mappings, consensus or correspondence between two nodes indicates that there is a mapping between them, while conflicting nodes might lead to a false mapping being made. While developing markup languages, correspondence indicates places where terminological decisions have to be made, while conflict requires detailed discussion. Areas of contrast highlight places where a markup language should leave room for user extensions to fulfil the requirements of different communities.

Implementation

To try out the idea of comparing nodes from two documents to create a metric, I've created a small prototype application in Java. The Java application consists of three classes:
Comparison
   represents a comparison between two nodes

ComparisonConfiguration
   holds configuration options, in particular weightings used when calculating similarity metrics

Comparer
   an application that takes two instance documents and produces an XML document that summarises the comparisons made between them

The Comparison class provides an interface for interactive comparisons (as would be needed in a mapping tool), but the current prototype simply performs standalone comparisons through the Comparer application. The resulting XML document is presented in an interactive way, however, through an XSLT stylesheet and client-side scripting in Internet Explorer. Figure 1 shows an overview of the result of a comparison between two very simple documents using a stylesheet designed to display comparisons on a node-by-node basis:

Figure 1: Simple comparison

In Figure 1, I'm comparing two very simple XML documents, test1.xml and test2.xml. These two documents have an overall similarity rating of 40% - this is the similarity between the document elements, <name> and <author>. In Figure 1, I'm looking at the comparisons for the <forename> element, highlighted in grey on the left. There are two relevant comparisons - one with the first attribute and one with the last attribute of the <author> element, listed in the middle, with those nodes highlighted (very faintly in the case of the last attribute) in the code on the right.

Figure 2 focuses on the details of the similarity, shown in the middle of the table, for clarity. Here, I'm looking at the details about the comparison with the first attribute. This comparison has an overall similarity of 40%, based on a syntactic similarity of 5% and a semantic similarity of 55%. This similarity rating is broken down some more; I'll go into that in more detail in the rest of this section.

Figure 2: Simple comparison details

This interface assigns different colours to different comparisons, based on whether they demonstrate consensus (green), correspondence (yellow) or conflict (red). The intensity of the colour indicates the strength of the similarity; for example, a bright yellow indicates that there is a strong correspondence between two nodes.

In doing the comparisons, I've only considered three types of nodes: elements, attributes and text nodes; comments and processing instructions aren't included in the comparison. The algorithm that I've used has come about from a combination of logic and testing. There are several areas where it can be improved, which I'll discuss later in the paper.

An assessment of the overall similarity between two nodes is based on a combination of their syntactic and semantic similarity. Between two text nodes, I've calculated the overall similarity as exactly the same as the semantic similarity. For comparisons involving elements and attributes, I've calculated the overall similarity between two nodes as a combination of their syntactic and semantic similarity, weighted in favour of semantic similarity (in a 3:7 ratio - this ratio can be adjusted through the ComparisonConfiguration interface).
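
As a minimal sketch (my reconstruction, not the prototype's code), the combination for comparisons involving elements and attributes amounts to the following, with the 3:7 ratio as the default weighting:

// Weighted combination of syntactic and semantic similarity for comparisons
// involving elements or attributes; the 0.3/0.7 weights reflect the default
// 3:7 ratio and would be adjustable through ComparisonConfiguration.
static double overallSimilarity(double syntactic, double semantic) {
    return 0.3 * syntactic + 0.7 * semantic;
}

For example, the 5% syntactic and 55% semantic similarity shown in Figure 2 combine to 0.3 × 5% + 0.7 × 55% = 40%, the overall figure reported there.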

Measuring Syntactic Similarity

The first step in comparing two nodes, then, is to work out their syntactic similarity. The syntactic similarity of a text node and either an element or attribute is always 0; the syntactic similarity of two elements, two attributes or an element and attribute is based on their names.

Names of elements and attributes are qualified names, of course, so comparing names has to happen at two levels: comparing local names and comparing namespace URIs. The similarity between namespace URIs only makes a difference if the local names are the same to start with: just because two elements or attributes have the same namespace URI doesn't mean that they are syntactically similar, although similar local names are assessed as more similar if the namespace URIs are the same.

To assess the similarity of the local parts of the names, I use an algorithm based on the syntactic similarity measure for two strings developed in [Maedche & Staab, 2001]. Their syntactic similarity measure is calculated from the edit distance between two strings [Levenshtein, 1966] - the number of edits that need to be made to change one string to another. They calculate the syntactic similarity measure as the following, pinned to a minimum of 0:

(length of shortest string - edit distance between strings) / 
length of shortest string

This measurement deals well with the similarity between names such as Purchase-Order and PurchaseOrder, which would be assigned a similarity of 12/13, or around 90%. However, in tests I found that this measure led to a high number of false positives, especially with short names: based on this measure, the similarity between name and type is 25%. Conversely, it deals badly with shortened names - the similarity between p and para is 0. It also gives no "special dispensation" to differences in case: compared case-sensitively, NAME and name also have a similarity of 0.

Since short names, abbreviations, and case differences are common in XML naming (as opposed to in the naming of concepts in ontologies, where all three are rare), I wanted to devise a measure that would deal with these similarities to give results that are more intuitive.

The biases in the syntactic similarity measure as proposed by [Maedche & Staab, 2001] appear to be the result of two factors:

  • how the edit distance is calculated
  • how the edit distance is normalised, using only the length of the shortest string

Normally, edit distance is based on the number of changes from one string to another, where each addition, deletion or replacement counts as a single change. To shift the balance, I calculate the edit distance by counting addition, deletion and case changes as single changes, but if one letter is replaced by another (aside from its lowercase or uppercase equivalent), this counts as two changes - an addition and a deletion.

With this scheme, the maximum edit distance between two strings is the sum of their lengths, so to create a number between 0 and 1, the difference between that sum and the edit distance can be divided by the sum of the lengths of the two strings. However, I again found that this led to a high proportion of what I considered to be false negatives; squaring the result makes low scores proportionally lower and reduces this number.

The syntactic similarity measure that I use is then:

(sum of string lengths - edit distance between strings)² / 
(sum of string lengths)²

I find that this gives similarity measures that are more intuitively correct. The similarity between name and type is about 5%; the similarity between p and para is around 15%; and the similarity between NAME and name is 25%. Very similar names still score very highly: the similarity between PurchaseOrder and Purchase-Order is 90%, and the similarity between author and Author is 85%.
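
A minimal sketch of this modified measure, reconstructed from the description above (it is not the prototype's code), might look like this in Java:

// Modified edit distance: insertions, deletions and case-only changes cost 1;
// any other replacement costs 2 (a deletion plus an insertion).
static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            char x = a.charAt(i - 1);
            char y = b.charAt(j - 1);
            int subst = (x == y) ? 0
                      : (Character.toLowerCase(x) == Character.toLowerCase(y)) ? 1  // case change
                      : 2;                                                           // other replacement
            d[i][j] = Math.min(d[i - 1][j - 1] + subst,
                      Math.min(d[i - 1][j] + 1,     // deletion
                               d[i][j - 1] + 1));   // insertion
        }
    }
    return d[a.length()][b.length()];
}

// Syntactic similarity: the difference between the summed lengths and the
// edit distance, squared and normalised against the summed lengths squared.
static double syntacticSimilarity(String a, String b) {
    double sum = a.length() + b.length();
    if (sum == 0) return 0;
    double diff = sum - editDistance(a, b);
    return (diff * diff) / (sum * sum);
}

With these definitions, syntacticSimilarity("NAME", "name") comes out at exactly 25% and syntacticSimilarity("p", "para") at 16%, in line with the figures quoted above.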

Semantic Similarity

Creating a measurement for the semantic similarity between two nodes is a much more challenging task, at least for some combinations of node types. These comparisons can be divided into three classes: those that involve comparing simple values (two text nodes, two attribute nodes or an attribute and a text node), those that involve comparing a simple value and a complex value (an element and an attribute or text node), and those that involve comparing complex values (two elements).

In this section, I'll go through each possible kind of comparison in turn, explaining how I've calculated it.

Similarity Between Simple Values

The easiest semantic comparison is between two text nodes, two attribute nodes, or a text node and an attribute. The only semantic feature of these nodes is their string value; I've therefore calculated these semantic similarities as being the syntactic similarity between their two string values.

For the purposes of comparing two strings in this way, I use the same syntactic similarity measure as that for comparing local names, as discussed previously. This time, as well as the special treatment of replacements that involve changing the case of a character, changes that involve replacing whitespace characters (spaces, tabs, newlines or carriage returns) with each other count as one change, rather than two, when calculating the edit distance.
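
A sketch of the substitution cost used for this string-value comparison (again my reconstruction rather than the prototype's code):

// Substitution cost when comparing string values: case-only changes and swaps
// between whitespace characters cost 1; any other replacement costs 2.
static int substitutionCost(char x, char y) {
    if (x == y) return 0;
    if (Character.toLowerCase(x) == Character.toLowerCase(y)) return 1; // case change
    if (isWhitespace(x) && isWhitespace(y)) return 1;                   // whitespace swap
    return 2; // any other replacement counts as a deletion plus an insertion
}

static boolean isWhitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}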

Similarity Between Simple and Complex Values

The similarity between a simple value and a complex value is more complicated. I calculate it using the following algorithm:

  1. compare the string value of the element with the simple value, using the usual method of comparing strings (as for two simple values)
  2. create a list of all the subnodes of the element, where the subnodes are attributes and child elements or text nodes
  3. perform comparisons between the simple value and each of these subnodes
  4. identify the best comparison between the simple value and a subnode
  5. take the higher of the string-value comparison and the best subnode comparison, and divide it by the number of subnodes

At a basic level, the more structure an element holds, the less similarity it has with some text or an attribute. At the same time, the algorithm recognises that an element can have much the same semantics as an individual text node or attribute while still containing internal markup.
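
A rough Java sketch of this algorithm follows; it is an illustration of the steps rather than the prototype's code, and compare() and stringSimilarity() are hypothetical stand-ins for the node comparison and the string comparison described earlier.

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.*;

// Rough sketch of the simple-to-complex comparison described above.
static double simpleToComplexSimilarity(Element element, String simpleValue) {
    // 1. compare the element's string value with the simple value
    double best = stringSimilarity(element.getTextContent(), simpleValue);
    // 2. list the subnodes: attributes plus element and text node children
    List<Node> subnodes = new ArrayList<Node>();
    NamedNodeMap atts = element.getAttributes();
    for (int i = 0; i < atts.getLength(); i++) subnodes.add(atts.item(i));
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);
        if (child.getNodeType() == Node.ELEMENT_NODE ||
            child.getNodeType() == Node.TEXT_NODE) subnodes.add(child);
    }
    // 3. and 4. compare the simple value with each subnode; keep the best score
    for (Node subnode : subnodes) {
        best = Math.max(best, compare(subnode, simpleValue));
    }
    // 5. divide the best score by the number of subnodes
    return subnodes.isEmpty() ? best : best / subnodes.size();
}

In the examples below, this is why the extra email attribute halves the similarity in the fourth case: the best match (the name attribute, at 70%) is divided by two subnodes.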

If we look at some examples, and their calculated semantic similarity, these features become clear. Imagine comparing a text node or attribute with the value "Jeni Tennison" with the following:

  • <name>Jeni Tennison</name> - 100% (both the comparison between the string values and the comparison between the child text node and the simple value score 100%, and there is only one subnode)
  • <name value="Jeni Tennison" /> - 70% (the overall similarity between the value attribute and the text node is 70% due to their syntactic dissimilarity)
  • <name><forename>Jeni</forename><surname>Tennison</surname></name> - 45% (the string value of the element and the text node are compared)
  • <person name="Jeni Tennison" email="jeni@jenitennison.com" /> - 35% (the additional attribute halves the similarity)
  • <person name="Jeni Tennison">jeni@jenitennison.com</person> - 35%
  • <name forename="Jeni" surname="Tennison" /> - 20% (the surname attribute is most similar to the string)

Similarity Between Complex Values

Calculating the similarity between complex values adds yet another level of complexity.

The algorithm that I use compares each subnode (attribute or child) of one element with each subnode of the other element. From these comparisons, the best match for each of the subnodes is found, if there is one, and the semantic similarity of the two elements is the average of these best matches.

If there are no good matches between the subnodes (a good match is currently one that provides a greater than 30% similarity - this cut-off point can be altered through the ComparisonConfiguration), then the semantic match between the elements is calculated based on the match between their string values, provided they have a string value. If the two elements do not have string values, then their semantic similarity is 0 (otherwise all empty elements end up being very similar to each other!).

In addition, if the subnodes don't match, the Comparer application continues to make comparisons down the hierarchy - each element is compared with the subnodes of the other element. These comparisons don't contribute towards the semantic similarity of the comparison between the two elements themselves, but it means that the tool can detect and report similarities even if the similar elements occur at different levels in the subtree.
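
Under the same assumptions as before (compare() and stringSimilarity() are hypothetical stand-ins, and GOOD_MATCH_THRESHOLD represents the configurable 30% cut-off), a rough sketch of the element-to-element comparison is:

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Node;

static final double GOOD_MATCH_THRESHOLD = 0.3; // adjustable via ComparisonConfiguration

// Rough sketch of the semantic similarity between two elements, given their
// subnodes (attributes and children) and their string values.
static double complexSimilarity(List<Node> subnodesA, List<Node> subnodesB,
                                String stringValueA, String stringValueB) {
    // find the best match for each subnode of the first element
    List<Double> bestMatches = new ArrayList<Double>();
    for (Node a : subnodesA) {
        double best = 0;
        for (Node b : subnodesB) {
            best = Math.max(best, compare(a, b));
        }
        if (best > GOOD_MATCH_THRESHOLD) bestMatches.add(best);
    }
    // if there are good matches, the semantic similarity is their average
    if (!bestMatches.isEmpty()) {
        double sum = 0;
        for (double match : bestMatches) sum += match;
        return sum / bestMatches.size();
    }
    // otherwise fall back to the string values, provided both elements have one
    if (stringValueA.length() == 0 || stringValueB.length() == 0) return 0;
    return stringSimilarity(stringValueA, stringValueB);
}

For simplicity this sketch only finds best matches for the subnodes of the first element, and it does not show the additional comparisons further down the hierarchy that the previous paragraph describes.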

Demonstration

In this section, I'll give a couple of examples of the Comparer application that indicate some of its features and limitations.

Comparing Similar Documents

This first example shows two documents that are basically the same, except for a few syntax changes. This example demonstrates how a mapping tool could identify mappings between two instance documents, and suggest mappings to the user as they build up a transformation.

The two documents both represent a series of transactions on a bank account, including information about the date of the transaction, the type of the transaction, the amount and the payee. In statement1.xml, the date, type and payee of the transaction are held in attributes and the value as the content of a <Trans> element, as follows:

<Trans>
   <Trans date="2001-03-01" type="DD" payee="TV Licence">8.91</Trans>
   <Trans date="2001-03-01" type="DD" payee="British Gas">22.00</Trans>
   <Trans date="2001-03-09" type="SO" payee="Rent">633.00</Trans>
   <Trans date="2001-03-10" type="WD" payee="Cash">50.00</Trans>
   <Trans date="2001-03-15" type="CR" payee="Interest">0.58</Trans>
   <Trans date="2001-03-03" type="CR" payee="Client">400.00</Trans>
</Trans>

In statement2.xml, on the other hand, all four pieces of information about the transaction are held in subelements of a <transaction> element, as follows:

<transactions>
   <transaction>
      <date>2001-03-09</date>
      <type>SO</type>
      <payee>Rent</payee>
      <value>633.00</value>
   </transaction>
   <transaction>
      <date>2001-03-03</date>
      <type>CR</type>
      <payee>Client</payee>
      <value>400.00</value>
   </transaction>
   <transaction>
      <date>2001-03-10</date>
      <type>WD</type>
      <payee>Cash</payee>
      <value>50.00</value>
   </transaction>
   <transaction>
      <date>2001-03-01</date>
      <type>DD</type>
      <payee>British Gas</payee>
      <value>22.00</value>
   </transaction>
   <transaction>
      <date>2001-03-01</date>
      <type>DD</type>
      <payee>TV Licence</payee>
      <value>8.91</value>
   </transaction>
   <transaction>
      <date>2001-03-15</date>
      <type>CR</type>
      <payee>Interest</payee>
      <value>0.58</value>
   </transaction>
</transactions>

The overall similarity between the two files is calculated as about 55%. Viewing the similarities for the third <Trans> element shows some details of the comparison:

Figure 3: Finding mappings for a Trans element

The third <Trans> element is similar to all six of the <transaction> elements, to varying degrees. The closest similarity is with the first of the <transaction> elements, with which there's a high semantic similarity (85%) and a lowish syntactic similarity (30%), indicating correspondence. The lowish syntactic similarity arises because the names of the two elements aren't the same and string comparisons give no special weighting to abbreviations. The high semantic similarity arises because the information that the two elements encode is the same - it's the same transaction, albeit represented differently. To a user, the fact that the third <Trans> element corresponds to the first <transaction> element indicates that there is some reordering going on between the two documents, as well as renaming and switching from elements to attributes.

Looking at the date attribute of the same <Trans> element shows the mappings between nodes at a lower level:

Figure 4: Finding mappings for a date attribute

The date attribute of the third <Trans> element is most similar to the <date> element of the first <transaction> element, as expected. There are multiple other comparisons, with other <date> elements, their text node children, and other elements that have similar names (such as type and value), though most of them are very weak. Note that these comparisons are carried out without knowledge about the context in which the <date> element occurs, only on its value.

Comparing Dissimilar Documents

Most mappings are of the type in the previous demonstration, where the differences between two documents are primarily syntactic - moving information between elements and attributes, or reordering it, for example. However, it's interesting to see whether such a tool can also isolate difficulties that arise between two markup languages that demonstrate a different conceptual view.

This example gives a very simple example of this by comparing an XML Schema [XML Schema] schema and a RELAX NG [Regular Expression Language for XML: Next Generation] [RELAX NG] schema for the same markup language. XML Schema and RELAX NG are very different schema languages with the same purpose (validating and documenting other markup languages), so it's interesting to see where they differ and where there are similarities between them.

The XML Schema document is as follows:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="name">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attributeGroup ref="contact" />
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

<xs:attributeGroup name="contact">
  <xs:attribute name="email" type="xs:string" use="required" />
  <xs:attribute name="web" type="xs:anyURI" use="required" />
</xs:attributeGroup>

</xs:schema>

While the RELAX NG schema for the same markup language is as follows:

<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

<start>
   <element name="name">
      <ref name="contact" />
      <data type="string" />
   </element>
</start>

<define name="contact">
  <attribute name="email">
    <data type="string" />
  </attribute>
  <attribute name="web">
    <data type="anyURI" />
  </attribute>
</define>

</grammar>

The schemas have been purposefully created so that they are similar in approach, and are very simple, but the overall similarity between them is still low - around 15%.

Looking more closely, the comparison tool manages to identify similarities between, for example, the <xs:attributeGroup> reference and the <ref> element:

Figure 5: Finding mappings for the xs:attributeGroup element

However, it is unable to find anything in the RELAX NG schema that corresponds particularly well to the use attribute of the <xs:attribute> element in the XML Schema schema:

Figure 6: Finding mappings for the use attribute

And, while it finds several possibilities, there's no strong match in the XML Schema schema for the <start> element in the RELAX NG schema:

Figure 7: Finding mappings for the start element

Again, this doesn't tell us anything we don't already know - RELAX NG doesn't use attributes to indicate occurrence, and XML Schema doesn't allow you to state which element should be the document element - but it's good to see that the comparison tool can identify these differences between the markup languages.

Conclusion

This paper has described three areas where automated comparisons between markup languages could be useful:

  • assisting developers in creating mappings or transformations between two markup languages
  • providing a metric that pipeline processors could use to decide which series of transformations to use
  • identifying similarities and differences between markup languages to help their designers review alternatives

I discussed these three areas to motivate the development of a tool for comparing markup languages. There are also a couple of other areas which do not provide much motivation in and of themselves, but which are potentially interesting uses of the tool.

First, a metric of similarity between two markup languages could provide a mechanism for assessing the quality of a transformation between those two markup languages. Say that you had a metric that gave the similarity between two markup languages as 80%. If you wrote a transformation between those markup languages, and measured the similarity between the source and the result of that transformation, you should hit the 80% mark; if you don't, then there's likely to be something wrong with your transformation script.

Second, node-by-node comparisons can be used to indicate differences between versions of the same instance document. The comparisons that I've illustrated above have been between two instance documents that represent the same information in different markup languages. It's possible to use the same tool to compare two instance documents that hold different information in the same markup language; the tool can indicate where the differences between them lie, in a much fuzzier manner than existing differencing tools such as DeltaXML [DeltaXML].

In this paper, I've also put forward a simple algorithm that can compare two instance documents to provide node-by-node comparisons and an overall similarity metric. There are a number of additions to this algorithm that would make it more accurate:

  • better recognition of abbreviations and alternative naming conventions
  • dictionaries for comparing markup languages in different natural languages
  • using ordering information when calculating semantic similarity, such that the fact that the <Trans> and <transaction> elements in the earlier example were in different orders subtracts from the semantic similarity of the wrapper elements
  • recognising cases where a type/role/class attribute maps to the name of an element
  • making use of ID and IDREF attributes, or identity constraints in schemas, to recognise the semantic similarity between nesting and referencing information

It would also be interesting to build these comparisons into an interactive mapping tool, in which comparisons were incremental, such that information about "approved" mappings increases the similarity between ancestors and descendants of the mapped nodes.

The current Comparer application uses instance documents that contain the same information as the source of the comparisons. Another approach could study the DTDs or schemas for the markup languages to identify similarities.

Instance documents are the only method of identifying mappings accurately - DTDs and schemas provide a lot of information about the syntax of a markup language, enough to check that syntax, but they do not convey much about its semantics. For example, as far as a schema is concerned one decimal number with two decimal places is the same as another decimal number with two decimal places, even if those numbers represent entirely different concepts (one a height and one a price, for example). In addition, in many markup languages it's possible to use different representations for the same set of information - schemas don't provide everything that there is to know about how a markup language is used.

On the other hand, using instance documents as the basis of a comparison means that the accuracy of the comparison is highly reliant on the instance documents that are used. For example, the XML Schema and RELAX NG schemas that I compared earlier in the paper were purposefully designed to be fairly similar in structure, but it would be very possible to design them differently internally (one as a deep nested structure and the other as a flat referencing structure), which would have a big impact on the overall assessment of their similarity. To make any general statements about the similarities between two markup languages as a whole, you would have to run multiple comparisons between pairs of instance documents, and choosing or constructing those instance documents could be as much, or more, work than comparing the markup languages by eye. Using information from schemas for the two markup languages could provide a fairer comparison, covering the entirety of the markup languages; these comparisons could be augmented, perhaps, with a smaller number of instance document pairs than would otherwise be required.

To conclude, my aim with this paper is to draw attention to the fact that we need to work with, rather than against, a diverse set of markup languages. We need tools and techniques that tackle the hard problems that arise when information is represented in a different form from the one that's required, and these tools and techniques shouldn't rely on an all-embracing consensual conceptual model being available. A metric for similarity between markup languages (or parts of markup languages) might help, and this paper has demonstrated that it's possible to create a simple but fairly effective algorithm for such a metric. Building similar algorithms into existing tools, and developing applications built around the availability of such metrics, is the next step.


Bibliography

[AF:NG] Cowan, J. (2002) Architectural Forms: A New Generation URL: http://home.ccil.org/~cowan/XML/afng.html

[Cocoon] Apache Project (2002) Apache Cocoon URL: http://xml.apache.org/cocoon/

[DeltaXML] Monsell EDM Ltd (2001) DeltaXML - Managing Change in an XML Environment URL: http://www.deltaxml.com/

[Levenshtein, 1966] Levenshtein, V.I. (1966) Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8), 707-710

[Maedche & Staab, 2001] Maedche, A. & Staab, S. (2001) Similarity Measures and a Comparison Study AIFB Internal Report no.408, Karlsruhe. URL: http://ontobroker.semanticweb.org/ontos/report-aifb-408.pdf

[Noy & Musen, 2000] Noy, N.F. & Musen, M.A. (2000) PROMPT: algorithm and tool for automated ontology merging and alignment. In Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-2000), Austin, TX, 450-455. URL: http://www-smi.stanford.edu/pubs/SMI_Report/SMI-2000-0831.pdf

[RELAX NG] Clark, J. & Murata, M. (2001) RELAX NG Specification, OASIS Committee Specification. URL: http://www.oasis-open.org/committees/relax-ng/spec-20011203.html

[Shaw & Gaines, 1989] Shaw, M.L.G. & Gaines, B.R. (1989) Comparing conceptual structures: consensus, conflict, correspondence and contrast. Knowledge Acquisition, 1, 341-363. URL: http://ksi.cpsc.ucalgary.ca/articles/KBS/COCO/

[Templatotron] Nic, M. (2002) Templatotron URL: http://www.zvon.org/ZvonSW/Templatotron/Output/index.html

[XHTML] Altheim, M., Boumphrey, F., Dooley, S., McCarron, S., Schnitzenbaumer, S. & Wugofski, T. (2001) Modularization of XHTML, W3C Recommendation. URL: http://www.w3.org/TR/xhtml-modularization/

[XML Schema] Thompson, H.S., Beech, D., Maloney, M. & Mendelsohn, N. (2001) XML Schema Part 1: Structures, W3C Recommendation. URL: http://www.w3.org/TR/xmlschema-1/

[XML-pipeline] Walsh, N. & Maler, E. (2002) XML Pipeline Definition Language, W3C Note. URL: http://www.w3.org/TR/XML-pipeline/

[XQuery] Boag, S., Chamberlin, D., Fernandez, M.F., Florescu, D., Robie, J., Siméon, J. & Stefanescu, M. (2001) XQuery 1.0: An XML Query Language, W3C Working Draft. URL: http://www.w3.org/TR/xquery

[XSLScript] Tchistopolskii, P. (2001) XSLScript URL: http://www.pault.com/pault/prod/XSLScript

[XSLT] Clark, J. (1999) XSL Transformations (XSLT), W3C Recommendation. URL: http://www.w3.org/TR/xslt


