Processing references to inaccessible documents: Constructing identifiers with Relax NG and XSLT

Matthijs Breebaart


When information products contain references (such as hyperlinks) to other information resources, sometimes the source document points to a fee-based repository although the same resource is available elsewhere free of charge. These fee-based references can be effectively replaced by references to the same materials in no-fee repositories. Our approach to this money-saving exercise raises issues of information identification and information equivalence. In our approach, we focus on the information transferred in an identifier. This information is captured in a set of property / value pairs. Domain-specific Relax NG schemata are used to capture these properties and values. XSLT stylesheets are used to transform the XML fragments into a more compact URI form.

Keywords: XSLT; RelaxNG

Matthijs Breebaart

Matthijs Breebaart is an information architect at the Dutch Tax and Customs Administration, Centre for Professional Development and Communication.

Extreme Markup Languages 2005® (Montréal, Québec)

Copyright © 2005 Matthijs Breebaart. Reproduced with permission.


An increasing amount of information is published for free on the internet. It is tempting for customers of commercial publishers to try to leverage this free information. However, there is no such thing as a free lunch. Knowledge workers continue to rely on the specialized bits of information that only commercial publishers supply. They also like an integral view on the information instead of multiple silos. If organizations want to combine publicly available information with commercial products, they have to do the integration themselves.

These days, mentioning the words "integration" and "documents" inevitably leads to XML. It is generally agreed upon that publishers should at least supply XML documents, and preferably custom schema support and XLink. Unfortunately, XML does not solve all integration worries. One area that needs attention is processing references from XML documents to other documents. The main issue here is not so much the use of XML but the way identifiers are used. XML cannot solve this problem, because identifiers work on a different level.

A hypothetical scenario illustrates the problem. Suppose that document aDoc was bought from a commercial publisher. This aDoc contains references to a cDoc by the same publisher. If bDoc were a public equivalent of cDoc, it would make sense for an organization to use bDoc and only acquire aDoc. In this case, the reference to cDoc must be replaced by a reference to bDoc. For this to happen, our organization needs to determine that bDoc and cDoc are equivalent. An important source of information for determining equivalence is the identifier.

This paper starts with an exploration of the issues of identifiers and equivalence. Next, a case study is described in which references to Dutch legislation from comments and other documents are processed. The paper finishes with a comparison of other approaches and a discussion of the techniques used.

From document URLs to medium-neutral identifiers

Integration is all about identity. Things that have the same identifier should be merged. Identity also plays a key role in referencing. A reference from aDoc to bDoc requires the use of some identifier for bDoc. With the advent of the web and XML, URLs became a popular mechanism for referencing. This is not surprising. However, they are not perfect. It is not always clear what a URL is identifying. Suppose that an author uses a URL to point to a related resource. In many cases, the author is using the document identified by the URL as a proxy for the information contained in the document. In other cases, the hyperlink is about the document itself.

The difference between "URL as a proxy for an idea" and "URL as a document" might seem academic but has far-reaching consequences [1]. In the first case, a URL of an equivalent document could replace the URL. In the second case, the document itself is required. A "URL as a proxy for an idea" could be considered a medium- or language-independent version of a document. Of course, there is no such thing as a medium-neutral document: it is an abstraction.

The reason for introducing these kinds of abstractions is that they allow grouping and the subsequent attachment of references from other documents or concepts. This way, document management is simplified. Obviously, it is required that the concept representing the group is addressable. In other words, a concept needs an identifier.

Organizations tend to have their own definition of what constitutes a concept and how to identify them. Some organizations use naming schemes, where the part of the name before an extension represents the concept. Others put different renderings in the same folder. In these cases, the concept definition is fairly implicit. The alternative is to use a model to formally describe the abstractions that are possible in a domain. An example would be IFLA's "Functional requirements for bibliographic records" [IFLA 1998] with its Item, Manifestation, Expression, and Work entities. In comparison, content management tools generally offer limited support for specifying and sharing these kinds of models.

The lack of possibilities for sharing models is one of the reasons that organizations keep using document URLs for references to documents. This is generally not a problem within the walls of one's own organization. However, in the transfer to other organizations, implicit semantics are lost. If the receiver has access to the documents identified by URLs, they can at least read them. If not, the receiver is out of luck.

Therefore, integration would benefit if document-URLs were replaced by more abstract identifiers representing the concepts the documents are implementing. What would such abstract identifiers look like?

Abstract identifiers

The idea of a concept representing a group of documents implies some procedure for determining group membership. This is a thorny issue. One approach is to define a set of property / value types, process a set of documents and determine the values of the properties for each document. Each resulting set of property / value pairs is a concept representing a document, and when two sets are identical the two documents are part of the same group [2].

Thus, the easy answer to the question what an abstract identifier looks like is: "a set of property / value pairs". A more elaborate answer includes details on the implementation side of the problem.
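As a rough illustration of "a set of property / value pairs", the following Python sketch groups document addresses whose property sets are identical. The addresses, property names and values are hypothetical and only serve the example; the paper does not prescribe this representation.

```python
from collections import defaultdict


def concept_key(properties):
    """Turn a dict of property/value pairs into a hashable, order-independent key."""
    return frozenset(properties.items())


def group_by_concept(documents):
    """Group document addresses whose property/value sets are identical."""
    groups = defaultdict(list)
    for address, properties in documents.items():
        groups[concept_key(properties)].append(address)
    return list(groups.values())


# Hypothetical documents: two publishers' copies of one article, plus another article.
documents = {
    "http://publisher-a.example/ib2001/art3.1": {"regulation": "76619", "artikel": "3.1"},
    "http://publisher-b.example/laws/76619/3-1": {"artikel": "3.1", "regulation": "76619"},
    "http://publisher-a.example/ib2001/art3.2": {"regulation": "76619", "artikel": "3.2"},
}
groups = group_by_concept(documents)
```

The two copies of article 3.1 end up in one group even though their addresses differ, because only the property / value pairs are compared.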

The first thing that needs to be done is the modeling of the set of property / value pairs. A model describes each property, the possible values, and the relationships between properties. Each concept is an instance of the model.

The next step is to determine a procedure for serialization of the concept. There are several options here. One way would be to use a system where identifiers are handles for retrieving descriptions about the thing being identified. In these kinds of systems, the information is split into an identifier part and a description part.

Many designers like their identifiers to be meaningless because it allows them to select an algorithm or procedure that outputs short and predictable identifiers. The former characteristic optimizes the bandwidth and storage necessary, the latter makes life for processing applications easier.

However, there is no rule specifying that identifiers have to be meaningless. For example, the 10-digit ISBN consists of four distinct parts [3]: country of origin or language code, publisher, item and checksum. Someone in the know would be able to deduce some properties of the book being identified by looking at the ISBN itself.
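To make the ISBN example concrete, here is a small Python sketch that splits a hyphenated ISBN-10 into its four parts and verifies the checksum (the weighted sum of the ten characters must be divisible by 11). The function names are ours, and the sketch assumes the hyphens are present, since the parts have variable length.

```python
def isbn10_parts(isbn):
    """Split a hyphenated ISBN-10 into its four parts."""
    group, publisher, item, check = isbn.split("-")
    return {"group": group, "publisher": publisher, "item": item, "check": check}


def isbn10_is_valid(isbn):
    """Check the ISBN-10 checksum: weights 10 down to 1, sum divisible by 11."""
    chars = isbn.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10 and may only be the check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0
```

For example, `isbn10_parts("0-306-40615-2")` exposes the publisher part "306" without consulting any external description.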

The ISBN example shows that it is possible to shift information from the description part to the identifier part and vice-versa. The designers of the ISBN number could have used a meaningless identifier and put the information about publisher etc. in an accompanying description. They decided not to do this. Apparently, there is a continuum from meaningless, dumb identifiers with a rich description to meaningful, smart identifiers without a description.

So, identifiers carry information that is embedded in either the identifier itself, in the description that's returned when resolving the identifier, or in some combination of the two. Processing takes place at two levels. An identifier is matched as a literal value against some database. If this fails, a processor needs more information that might be available by resolving the identifier and processing the description. This secondary processing would benefit greatly if the description was structured instead of a freeform text.
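The two processing levels can be sketched as follows in Python. The `known` table and the `resolve` callable are hypothetical stand-ins for a local database and a resolution service; the sketch assumes that resolving returns a structured description (a dict) or None.

```python
def match_identifier(identifier, known, resolve):
    """Find the internal target for an identifier.

    known:   dict mapping identifier strings to internal targets
    resolve: callable returning a structured description (dict) or None
    """
    # Level 1: match the identifier as a literal value.
    if identifier in known:
        return known[identifier]
    # Level 2: resolve the identifier and compare structured descriptions.
    description = resolve(identifier)
    if description is None:
        return None
    for other, target in known.items():
        if resolve(other) == description:
            return target
    return None
```

The second level only works because the descriptions are structured; comparing free-form text for equality would rarely succeed.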

The implication is that the real answer to our question what an abstract identifier looks like is "it depends on technical, organizational, and process requirements". Therefore, we need to take a closer look at the problem domain.

Case study: legal information in the Netherlands

For many organizations, access to legal information is important. In the Dutch Tax and Customs Administration (DTCA), legal information consists of laws at different levels of government, case law, explanations by professionals (comments), and training materials. These documents are highly interconnected.

Managing connections is difficult because of the peculiarities of Dutch legislation. Laws contain articles, optionally organized in chapters, parts, etc. Over time, the legislator modifies articles. Although knowledge workers are generally interested in the current variant of a law, they occasionally need a snapshot of the law as valid at a date years ago.

The legislator does not assign a digital, persistent, unique identifier to (a variant of) an article. After an initial full version of a law, only modifications of articles are published. Commercial publishers apply each modification and publish so called consolidated versions of the law.

DTCA buys additional materials like comments from multiple [4] commercial publishers. Because there is no published identifier for (variants of) a law, each publisher uses the URLs of their own - internally consolidated - version.

This means that a comment from publisher A about article X includes a hyperlink to publisher A's implementation of X. Not surprisingly, publisher B would use a different URL for article X. The result is that DTCA is having difficulties providing its knowledge workers with an integral view [5], and substantial replication of information exists on its intranet [6].

It is expected that the integration problem will be solved at the source (by the legislator assigning identifiers). However, these kinds of processes take time. As an interim measure, DTCA participated in a series of meetings with publishers and other large customers to discuss the possibility of an open standard for referring to Dutch regulations. The aim was to standardize the way legal concepts like "Law income tax 2001, article 3.1" would be described by the publishers. If publishers supplied these kinds of standardized descriptions, their customers would be able to do a much better job of automatically processing the statements and integrating comments and other documents about legislation with a single source for the legal texts.

One might ask why any commercial publisher would want to do something like that. They would have to do more work and receive less money in return. Luckily, there is more to it. Regarding the "less money" part, the benefits of better and easier integration might tempt customers to pay more for the content they're really interested in [7]. Minimizing the "do more" part is also important. A solution should be easy to implement and easy to administer for the publishers. This is an important requirement considering the fact that many publishers have complex internal production systems that are not always easy to extend or modify.

In the previous section, it was stated that it is possible to distinguish between a model and the serialization. In this case, the model would have to capture the relevant concepts in the context of referring to Dutch legislation. For instance, the model has to define the notions of law and article because otherwise no description would be possible containing these two notions. The model is like a forecast of the diverse kinds of references to Dutch law. It is described next.

A model for references to Dutch legal information

The central entity in the model is a law, for instance "The law on Income Tax 2001". A law consists of a number of articles (optionally) organized by chapters, parts, or other containers. During its lifetime, modifications to articles are published. Each modification causes a new variant of the article. An article with a specific period of validity is called a consolidation.

It should be possible to distinguish between consolidations and the texts that caused them (publications by the legislator). Therefore, both are added to the model. Each entity carries a set of properties. Some properties (like date-enacted) are only relevant to consolidations, while others (like date-published) are relevant to publications.

There are some complicating factors. Generally, authors point to a single consolidation, like "article 1.1 of the law IB2001 as valid on Jan. 1 2003". However, they might also want to say something about "article 1.1 of the law IB2001 regardless of validity interval". In this case, they are referring to a set of consolidations. Additionally, they might want to refer to something like "article 1.1 of the law IB2001 as valid on Jan. 1 2003 and later" or "... until a certain date".
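The four kinds of reference (a single date, regardless of interval, from a date onwards, until a date) can all be read as an interval query over consolidations. The Python sketch below illustrates this under two assumptions of ours: the repeal date is treated as inclusive, and the article history shown is invented for the example.

```python
from dataclasses import dataclass
from datetime import date

OPEN_ENDED = date(9999, 12, 31)  # sentinel for "not repealed", as in the schema


@dataclass(frozen=True)
class Consolidation:
    article: str
    enacted: date
    repealed: date = OPEN_ENDED


def select(consolidations, article, start=None, end=None):
    """Return the consolidations of an article whose validity overlaps [start, end].

    start == end == d -> the consolidation valid on d
    only start        -> valid on that date and later
    only end          -> valid until that date
    neither           -> every consolidation, regardless of validity interval
    """
    lower = start or date.min
    upper = end or OPEN_ENDED
    return [c for c in consolidations
            if c.article == article and c.enacted <= upper and c.repealed >= lower]


# Hypothetical history of article 1.1: one variant until end of 2002, one from 2003.
history = [
    Consolidation("1.1", date(2001, 1, 1), date(2002, 12, 31)),
    Consolidation("1.1", date(2003, 1, 1)),
]
```

A query with both bounds set to the same date yields a single consolidation; leaving both bounds out yields the whole set.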

A second complication is the possibility of retroactive application. Occasionally, the legislator decides to change a text and set the date-enacted to some date in the past; the second variant supersedes the first one. From the point of view of a reference, it might be necessary to refer to either variant [8]. The model should allow for the distinction between both variants.

The existence of variants and retroactive application illustrates the need for authors to be as precise as possible when referring to laws. Suppose that I've written a comment about some article. After a while, the legislator enacts a new variant. If I was only referring to the article in general, I would probably not mind a change. On the other hand, if I'm referring to information specific to the variant [9], I want to be notified. With the proper references (set of consolidations versus a specific consolidation), a content management system could be programmed to send out notifications with a varying priority.
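A notification rule of this kind might look like the Python sketch below. The reference records, the "c" / "v" kind codes and the "high" / "low" priorities are assumptions made for the illustration, not part of any existing content management system.

```python
def notifications(references, regulation, article):
    """Return (document, priority) pairs for references affected by a new variant.

    Each reference is a dict with the keys "document", "kind", "regulation" and
    "article"; kind "c" points at one consolidation, kind "v" at a set of them.
    """
    result = []
    for ref in references:
        if (ref["regulation"], ref["article"]) != (regulation, article):
            continue
        # A comment tied to one specific variant is more likely to be outdated.
        priority = "high" if ref["kind"] == "c" else "low"
        result.append((ref["document"], priority))
    return result


# Hypothetical references held by the system.
references = [
    {"document": "comment-1", "kind": "c", "regulation": "76619", "article": "3.1"},
    {"document": "comment-2", "kind": "v", "regulation": "76619", "article": "3.1"},
    {"document": "comment-3", "kind": "c", "regulation": "76619", "article": "3.2"},
]
```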

Another complicating factor is exception handling. Although legislators are supposed to follow a "meta" law for writing laws, a number of exceptions exist that need to be handled. For instance, almost every regulation assigns unique numbers to articles within the regulation [10], but some laws renumber articles within each chapter. In the latter case, saying something like "law X article Y" is not going to be good enough. Considering the fact that many laws are active for decades, quite a few exceptions need to be handled.
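The renumbering exception can be illustrated with a small Python sketch. The regulation below is hypothetical; it restarts article numbering in each chapter, so an article label alone is ambiguous while a path through the structure is not.

```python
def locate(tree, path):
    """Follow a list of (type, label) pairs down a nested structure tree."""
    node = tree
    for step in path:
        node = node[step]
    return node


def count_label(tree, step):
    """Count how often a (type, label) pair occurs anywhere in the tree."""
    if not isinstance(tree, dict):
        return 0
    return sum((key == step) + count_label(child, step) for key, child in tree.items())


# Hypothetical regulation that restarts article numbering in every chapter.
regulation = {
    ("hoofdstuk", "1"): {("artikel", "1"): "text of chapter 1, article 1"},
    ("hoofdstuk", "2"): {("artikel", "1"): "text of chapter 2, article 1"},
}
```

Here `count_label(regulation, ("artikel", "1"))` finds two candidates, whereas the path `[("hoofdstuk", "2"), ("artikel", "1")]` identifies exactly one text.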

From a model to a schema

The textual description in the previous section needs to be formalized. One way of doing this is to use an XML schema. There are several advantages to using an XML schema. Both W3C XML Schema and Relax NG have access to extensive datatype libraries, including regular expressions. These libraries offer powerful tools for describing the (kinds of) values that are valid. An additional benefit is the ability to annotate XML schemas with XHTML text or SVG images. This way, the textual description and the XML schema could be generated from a single source with different XSLT transformations. A third benefit is the built-in possibility for extension and specialization. Both Relax NG and W3C XML Schema offer opportunities for this kind of modification. A fourth benefit is the availability of tools.

In the project, Relax NG was used as the schema language of choice. The main reasons for doing so were readability (with the compact syntax), support for W3C datatypes, support for specialization, availability of documentation, and the availability of James Clark's Trang for generation of XML Schemas. Trang was able to translate our RNC document to a W3C XML Schema file, which could be used by those preferring W3C XML Schema. We did have to refrain from Relax NG specific features like co-occurrence constraints.

The following Relax NG compact schema was written [11]:

grammar {
	start = element regulationIdentifier {
		## key distinction between consolidation and publication
		(consolidationIdentifier | publicationIdentifier),
		attribute schema-id {text}?
	}

	consolidationIdentifier = element consolidationIdentifier {
		element regulationID {
			## optional description of the kind of value used for identifying a regulation
			attribute schema {text}?,
			text
		},
		## either a set of consolidations or a single consolidation
		(cSet | consolidation)?
	}

	# Lines marked "inferred" follow the stylesheet and its sample output;
	# their order and optionality are assumptions.
	cSet = element cSet {
		structureLocation?,                                           # inferred
		element validity-interval { date-enacted?, date-repealed? }   # inferred content
	}

	consolidation = element consolidation {
		structureLocation?,                                           # inferred
		element validity-interval { date-enacted?, date-repealed? },  # inferred content
		## necessary when dealing with retroactive variants
		element source_publication { publicationIdentifier }?         # inferred content
	}

	publicationIdentifier = element publicationIdentifier {
		element number {string},
		element year {xsd:gYear},
		element description {text}?,
		publicationtype?,                                             # inferred
		structureLocation?                                            # inferred
	}

	## formal publication designation: staatsblad or staatscourant
	publicationtype = element type {("stb" | "stcrt")}

	structureLocation = element structure {
		structureType,                                                # inferred
		label,                                                        # inferred
		## recursion required for dealing with non-unique article numbers
		structureLocation?                                            # inferred
	}

	# the enumeration may contain further values that are not reproduced here
	structureType = element sType {
		("artikel" | "paragraaf" | "afdeling" | "hoofdstuk" | "regeling" |
		"bijlage" | "enig-artikel" | "titeldeel" | "wijzig-artikel" |
		"sub-paragraaf" | "boek" | "deel" | "tree" | "structuurtekst")
	}

	label = element sLabel {text}
	date-enacted = element date-enacted {xsd:date}
	date-repealed = element date-repealed {("9999-12-31" | xsd:date)}
}

Some schema techniques are notable. Instead of choosing element types like article, paragraph, the sType element is constrained to use a value from an enumeration. This way, the schema is easily extensible.

From a schema to an instance

The nice thing about an XML schema is that it validates XML fragments. For example, the concept "consolidation article 3.1 in law IB2001 as valid at Jan. 1 2002" looks in XML form like:


The set of consolidations "article 3.1 in law IB2001 from jan. 1 2002" looks a bit different:


Each XML fragment provides sufficient information to point to a concept in Dutch legislation. It is not concerned in any way with the regulation text or its schema. It is up to the receiver to map the information to a (location in a) text document or some database pointer.
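For illustration, a fragment of this kind can be assembled with a few lines of Python. This is a sketch only: the element names are taken from the schema and the stylesheet, the regulation number 76619 from the stylesheet results, and the order of the child elements is an assumption.

```python
import xml.etree.ElementTree as ET


def consolidation_fragment(regulation_id, s_type, s_label, enacted, repealed=None):
    """Build a regulationIdentifier fragment that points at one consolidation."""
    root = ET.Element("regulationIdentifier")
    ident = ET.SubElement(root, "consolidationIdentifier")
    ET.SubElement(ident, "regulationID").text = regulation_id
    consolidation = ET.SubElement(ident, "consolidation")
    structure = ET.SubElement(consolidation, "structure")
    ET.SubElement(structure, "sType").text = s_type
    ET.SubElement(structure, "sLabel").text = s_label
    interval = ET.SubElement(consolidation, "validity-interval")
    ET.SubElement(interval, "date-enacted").text = enacted
    if repealed is not None:
        ET.SubElement(interval, "date-repealed").text = repealed
    return root


fragment = consolidation_fragment("76619", "artikel", "3.1", "2002-01-01", "9999-12-31")
xml_text = ET.tostring(fragment, encoding="unicode")
```

Note that the fragment carries no text of the regulation itself, only the properties needed to point at it.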

The regulationID element in the example contains an integer value representing a regulation. We could also have used a citation like "Wet inkomstenbelasting 2001" or "Wet IB 2001". Because there are fewer than 10,000 regulations in the Netherlands and not that many new laws each year, it makes sense to use a meaningless identifier for this part and use a description with the elements "citation_title" {text} and "short_title" {text}. The combination of number and description is easily made available in an XML document or a database.

The other elements follow a different approach. Because active laws are modified hundreds of times each year, it would be a procedural nightmare to assign a unique identifier to all possible locations. Therefore, a more descriptive approach was taken for the locations within each regulation. In this case, the properties in the model are used to specify a location.

From an XML fragment to a URI-like identifier

The XML fragment could be used directly in the documents that are exchanged. However, in many cases a less verbose version would be preferable. This leads to the question how to translate the XML fragment to a URI form.

This proved not that difficult. The hierarchy that characterizes an XML document is realized with a ":" separator. At every level, one or more property-value pairs are used. Property-value pairs are separated by a "&" character. It looks like:

identifier := property "=" value ("&" property "=" value)*

An example would be something like:


It's a bit long. Luckily, compression is easy. For instance, we could throw away property names and promote some values to property names: "structure.sType=artikel&structure.sLabel=3.1" becomes "artikel=3.1". Combined with truncation, this would lead to:


Which begins to look like a useful identifier.
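The two compression rules can be sketched in Python as follows. The long-form property names "date-enacted" and "date-repealed" are assumptions based on the schema's element names; only the "structure.sType" / "structure.sLabel" pair is quoted from the text above.

```python
def compress(pairs):
    """Apply the compression rules to an ordered list of (property, value) pairs."""
    out = []
    pending_type = None
    for prop, value in pairs:
        if prop == "structure.sType":
            # Rule 1: the sType value is promoted to a property name ...
            pending_type = value
        elif prop == "structure.sLabel" and pending_type is not None:
            # ... and paired with the sLabel value.
            out.append(f"{pending_type}={value}")
            pending_type = None
        elif prop.startswith("date-"):
            # Rule 2: truncation, "date-enacted" -> "e", "date-repealed" -> "r".
            out.append(f"{prop[5]}={value}")
        else:
            out.append(f"{prop}={value}")
    return "&".join(out)


long_form = [
    ("structure.sType", "artikel"),
    ("structure.sLabel", "3.1"),
    ("date-enacted", "2002-01-01"),
]
short_form = compress(long_form)
```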

The transformation of the XML document to an identifier can be specified in many ways. One advantage of XML is that XSLT can be used. The following XSLT document specifies a transformation from the XML document to the identifier, and includes all rules for compression.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>

  <!-- a leading "p" marks a publication identifier -->
  <xsl:template match="regulationIdentifier">
      <xsl:if test="publicationIdentifier">p</xsl:if>
      <xsl:apply-templates />
  </xsl:template>

  <!-- element names follow the schema; the property names in the output are Dutch -->
  <xsl:template match="publicationIdentifier">
      <xsl:if test="number">&amp;nummer=<xsl:value-of select="number" /></xsl:if>
      <xsl:if test="year">&amp;jaartal=<xsl:value-of select="year" /></xsl:if>
      <xsl:if test="type">&amp;soort=<xsl:value-of select="type" /></xsl:if>
      <xsl:if test="description">&amp;kenmerk=<xsl:value-of select="description" /></xsl:if>
      <xsl:if test="structure">&amp;<xsl:apply-templates select="structure"/></xsl:if>
  </xsl:template>

  <xsl:template match="consolidationIdentifier">
      <xsl:apply-templates />
  </xsl:template>

  <!-- "v:" for a set of consolidations, "c:" for a single consolidation -->
  <xsl:template match="regulationID">
      <xsl:choose>
          <xsl:when test="following::cSet">v:</xsl:when>
          <xsl:when test="following::consolidation">c:</xsl:when>
      </xsl:choose>
      <xsl:value-of select="."/><xsl:apply-templates />
  </xsl:template>

  <xsl:template match="consolidation|cSet">&amp;<xsl:apply-templates /></xsl:template>

  <!-- nested structure elements are joined with ":" -->
  <xsl:template match="structure">
      <xsl:value-of select="sType" />=<xsl:value-of select="sLabel" />
      <xsl:if test="structure">:<xsl:apply-templates select="structure"/></xsl:if>
  </xsl:template>

  <xsl:template match="validity-interval"><xsl:apply-templates /></xsl:template>

  <!-- truncation: the sixth character of the element name ("e" or "r") is the property -->
  <xsl:template match="date-repealed|date-enacted">
      <xsl:if test="string(.)">&amp;<xsl:value-of
                              select="substring(local-name(),6,1)" />=<xsl:value-of
                              select="." /></xsl:if>
  </xsl:template>

  <xsl:template match="source_publication"><xsl:apply-templates /></xsl:template>

  <xsl:template match="text()" />
</xsl:stylesheet>

Applying the stylesheet to the two sample XML documents leads to the following results:

1. c:76619&artikel=3.1&e=2002-01-01&r=9999-12-31
2. v:76619&artikel=3.1&e=2002-01-01

The stylesheet is straightforward. The one thing to take into account is whitespace construction in the result tree. Careful placement of the instructions within each xsl:template is required to prevent line breaks in the output.
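For readers without an XSLT processor at hand, the consolidation branch of the transformation can be approximated in Python. This is a sketch, not a replacement for the stylesheet: the two sample fragments are reconstructions that assume the element names and child order implied by the schema and the results above, and publication identifiers are not handled.

```python
import xml.etree.ElementTree as ET


def structure_text(structure):
    """Render a structure element as sType=sLabel, joining nested levels with ':'."""
    text = f"{structure.findtext('sType')}={structure.findtext('sLabel')}"
    nested = structure.find("structure")
    return text if nested is None else f"{text}:{structure_text(nested)}"


def to_uri(root):
    """Serialize a consolidationIdentifier fragment to the compact URI form."""
    ident = root.find("consolidationIdentifier")
    prefix, parts = "", []
    for kind, name in (("v", "cSet"), ("c", "consolidation")):
        body = ident.find(name)
        if body is None:
            continue
        prefix = f"{kind}:"
        structure = body.find("structure")
        if structure is not None:
            parts.append(structure_text(structure))
        for date_name in ("date-enacted", "date-repealed"):
            value = body.findtext(f"validity-interval/{date_name}")
            if value:
                # Same truncation as the stylesheet: "date-enacted" -> "e".
                parts.append(f"{date_name[5]}={value}")
        break
    return "&".join([prefix + ident.findtext("regulationID")] + parts)


# Assumed sample fragments (reconstructed, see the note above).
STRUCTURE = "<structure><sType>artikel</sType><sLabel>3.1</sLabel></structure>"
SINGLE = ET.fromstring(
    "<regulationIdentifier><consolidationIdentifier><regulationID>76619</regulationID>"
    f"<consolidation>{STRUCTURE}<validity-interval><date-enacted>2002-01-01</date-enacted>"
    "<date-repealed>9999-12-31</date-repealed></validity-interval></consolidation>"
    "</consolidationIdentifier></regulationIdentifier>")
SET = ET.fromstring(
    "<regulationIdentifier><consolidationIdentifier><regulationID>76619</regulationID>"
    f"<cSet>{STRUCTURE}<validity-interval><date-enacted>2002-01-01</date-enacted>"
    "</validity-interval></cSet></consolidationIdentifier></regulationIdentifier>")
```

Because the string is assembled from a list of parts, the whitespace problem mentioned for the stylesheet does not arise here.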

Constructing and processing identifiers

One of the aims was to provide an easy to implement solution. The use of XML schema, XML fragments and XSLT transformations provides some advantages.

Publishers need to add a single step to their processing pipeline. Considering the fact that the necessary information is already available [12], the complexity of generating a conforming URI is deemed relatively low. It is up to the publisher to decide which technique is used. The widespread use of XML in the publishing industry guarantees the availability of suitable tools.

Customers will encounter conforming URIs when processing documents they've acquired. If the URI form is used, the specifications allow them to understand what is being pointed to [13]. DTCA is currently building a system in which the URIs are connected to internally available URLs. In this system, the procedure for processing document references is as follows:

  • Suppose that the document under consideration is comment A from publisher B. The first step is to check whether the document has been processed before. The system is queried for the document address.
  • If the document was not processed before, it is added to the system. The document address is added as an occurrence.
  • If the document contains URI form references to legislation, each target is tested for existence in the system. The URI is used to query the system.
  • If found, the system is told to make a connection between the found topic and the topic representing the medium neutral document added earlier in the processing.
  • If not found, a new location must be added to the system. A program is invoked to process the properties embedded in the URI. For instance, the script will start with checking whether a topic representing the regulation is already available. In this case, the regulation number is used in a query.
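The steps above can be sketched in Python with an in-memory stand-in for the system. The class and attribute names are hypothetical, and plain dictionaries take the place of whatever topic store DTCA actually uses.

```python
class ReferenceSystem:
    """In-memory stand-in for the system that connects URIs to internal topics."""

    def __init__(self):
        self.documents = {}      # document address -> set of connected URI-form targets
        self.locations = {}      # URI-form identifier -> properties of the location
        self.regulations = set() # regulation numbers already known

    def process(self, address, uris):
        # Step 1: has this document been processed before?
        if address in self.documents:
            return False
        # Step 2: add the document, with its address as an occurrence.
        self.documents[address] = set()
        for uri in uris:
            # Steps 3 and 5: look the target up; add it when it is missing.
            if uri not in self.locations:
                self.add_location(uri)
            # Step 4: connect the target to the document.
            self.documents[address].add(uri)
        return True

    def add_location(self, uri):
        head, *pairs = uri.split("&")
        kind, regulation = head.split(":", 1)
        # The regulation number is handled first, as described in the last step.
        self.regulations.add(regulation)
        self.locations[uri] = {"kind": kind, "regulation": regulation,
                               **dict(pair.split("=", 1) for pair in pairs)}
```

Processing the same document address twice has no effect the second time, which mirrors the first step of the procedure.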

In the scenario above, it was presupposed that the processing application would recognize the URIs and apply the right processing steps. There are several ways of guaranteeing this. The easiest way is to assign a prefix. With the prefix "lex", for example, a complete identifier would look like "lex://v:76619&artikel=3.1&e=2002-01-01".
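Recognizing the prefix and unpacking the properties takes only a few lines of procedural code. The Python sketch below assumes the "lex://" prefix and the property names produced by the stylesheet; nested structure parts joined with ":" are left unparsed in the value.

```python
PREFIX = "lex://"


def parse_lex(identifier):
    """Return the properties embedded in a lex identifier, or None if it is not one."""
    if not identifier.startswith(PREFIX):
        return None
    head, *pairs = identifier[len(PREFIX):].split("&")
    kind, regulation = head.split(":", 1)
    properties = {"kind": kind, "regulation": regulation}
    for pair in pairs:
        name, value = pair.split("=", 1)
        properties[name] = value
    return properties
```

An application can use the None result to pass unrecognized references through untouched.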

A more robust approach is connecting the identifier to a document explaining its semantics. In this scenario, the identifier uses a prefix, for instance "lex". The header of the document containing the identifier would then declare the connection between "lex" and the document describing the semantics, for instance <meta name="prefix.lex" content="http://someserver/somedoc.xml" />. In our case, the second approach is easy to realize. Both the model and the transformation are available as XML. The example file somedoc.xml might contain or point to the model and transformation documents. This way, someone receiving the XML document but unaware of the "lex" prefix contained in it would be able to find out what is meant by this identifier.

Comparison to other approaches

This paper is about the construction and exchange of identifiers. Some notable influences were:

  1. Norme in Rete identifiers
  2. IVOA identifiers
  3. Digital Object Identifiers

Norme in Rete identifiers

In the legal world, the Italian government project Norme in rete ("rules and regulations online") served as an inspiration [14]. Part of this project was the delivery of a set of rules for creating URN names for Italian legal texts. These names consist of several parts (agency, kind of text, other characteristics, version). An example of a standardized name is:


A BNF description was used to formalize the construction of these names. The main difference with Norme in rete is that we replaced the BNF descriptions with XML schemas for the reasons noted in the section "From a model to a schema".


IVOA identifiers

An IVOA (International Virtual Observatory Alliance) [IVOA 2004] identifier has an equivalent XML-tagged form and a URI-compliant form. The XML form looks like:

The URI form looks like:

The IVOA specification contains an ABNF definition for the URI form and an XML schema for the XML form. We adopted the idea of using multiple forms, but replaced the ABNF specification by an XSLT transformation. The main reason for doing so is that the combination of an annotated XML schema with an accompanying XSLT transformation is more easily transferable.


Digital Object Identifiers

A Digital Object Identifier (DOI) [15] is a globally unique and persistent number assigned to an information resource. An example of a DOI is:

Each DOI consists of a prefix that identifies the content owner and a suffix that identifies the content itself. A key component of the DOI system is the Handle system [16].

Although the principles behind DOIs are sound, we did not use them in this project. There were several reasons for this. Firstly, working with DOIs adds significant overhead. For instance, a Registration Agency needs to be contacted for a block of suffixes. As this was an interim solution, we did not want to incur the overhead. For the same reason we did not look for URN registration.

A second problem with DOIs is assignment of identifiers to older laws. It would be relatively easy to assign DOIs to new laws, but more complex to process decades of older laws. A more descriptive scheme makes life easier because identifiers for older laws are only created whenever necessary. One could argue that the approach in this paper allows for lazy, "just-in-time" creation of identifiers. A related problem is the exploding number of possible locations. Some regulations contain thousands of variants. In combination with sets of consolidations, a very large number of possible legal concepts exists. Should each legal concept be assigned a DOI? The approach taken avoids this question entirely.

One possible approach would be to use DOIs for the regulation-level identifiers only. This way, the idea of the "law income tax 2001" would be assigned a DOI number. A DOI application profile could capture the precise semantics of the metadata (citation, title). This way, DOI numbers are integrated in the larger identifier [17].

Summary and conclusion

The approach outlined in this paper could be considered a bottom-up approach to the question of integration. The lack of a central registry providing unique, persistent identifiers forced us to look at alternative solutions.

We set out in two directions. First, a model was built describing concepts in legislation that authors might want to refer to. As with all models, a balance between usability and detail is necessary. In our case, existing references were analyzed and several meetings with publishers and customers were organized.

Second, rules for constructing identifiers were devised. In our solution, we tried to embed as much information as possible in the identifier itself. This way, the need for a central authority managing identifiers and descriptions is reduced. The resulting identifiers are more or less self-describing and autonomic. In an environment where documents may last for decades, this is a benefit.

The combination of XML schema and XSLT worked surprisingly well in our case study. XML schema is sufficiently expressive to describe the properties and values in the model, and XSLT enables us to generate compact URI-like identifiers. An important benefit of using XML technologies is that model and transformation are easy to exchange over the web.



1. See for instance [Pepper, Steve & Sylvia Schwab 2003] for an overview of this issue with regard to the semantic web.

2. The "set of property / value pairs" was inspired by the TMRM notion of SIP.

4. It is unlikely that DTCA could or would want to buy all required documents from one publisher. Besides, contracts generally run for three years, while some laws exist for decades.

5. It is not possible to get an overview of available comments about article X while reading article X.

6. For instance, there are at least 6 complete versions of the Law on Income tax available on the intranet, each with a different set of supporting documents.

7. In other words: pay more per page.

8. Although it was modified, its existence cannot be denied. In the tax world, someone might have appealed a decision that was based on the earlier text. The appeal might be added to the person's file. Years later, it should be possible to retrace the text that formed the basis for the appeal.

9. Like "article X is about definition of income" versus "according to article 1.1 a levy of E100 should be applied".

10. Often like "chapternumber.articlenumber" (1.1). Insertions are handled like 1.7a.

11. Element names were (loosely) translated from Dutch with help of the MetaLex schema (see

12. They are referring to their own documents.

13. Unfortunately, XSLT does not enable the translation from URI form to XML form. However, it is not particularly difficult to write a procedural program to that end with the XSLT source and the XML schema available.

14. See Due to limited knowledge of Italian, the information in this section might be incomplete, outdated or incorrect.

17. The current schema is ready for this kind of usage (with the optional schema attribute of the regulationID element).


[IFLA 1998] Functional Requirements for Bibliographic Records. Final Report.

[IVOA 2004] IVOA identifiers version 1.1, proposed recommendation, 2004 June 21.

[Moore, Graham 2002] Identities & Names in Knowledge Management. XML Europe 2002.

[Pepper, Steve & Sylvia Schwab 2003] Curing the web's identity crisis: subject indicators for RDF.
