Advanced approaches to XML document validation

Petr Nalevka
petr@nalevka.com
Jirka Kosek
jirka@kosek.cz

Abstract

Validation of XML documents is a challenging task whose aim is to ensure interoperability of such documents in various environments. With the massive use of different XML languages, automated validation is often the only way to ensure compliance with different standards and recommendations. That's why it is important to use expressive validation languages and powerful, convenient validation tools and techniques to discover as many standard violations as possible automatically. This is the aim of the Relaxed project, which is introduced in this text. The text examines the approaches Relaxed uses to face today's pressing validation issues and challenges, e.g. maximizing validation results by using expressive validation languages, schema modularization and reusability, effective compound document validation and validation of non-XML languages.

Keywords: NVDL; Validating

Petr Nalevka

Petr Nalevka is an IT consultant with over 8 years of extensive experience across a variety of IT projects using contemporary enterprise-level web-based and server-side technologies. The highlights include development and architecture of a modular billing system for large ISPs, a distributed system for stock index calculation and other projects for governmental institutions or international banks.

Moreover, Petr works on development and maintenance of Relaxed and JNVDL, both open source projects. Relaxed is a web document validation service and JNVDL is a Java implementation of the NVDL international standard for compound document validation. This work is sponsored by an academic grant.

Jirka Kosek

Jirka Kosek is a freelance XML consultant and teacher at the University of Economics in Prague. He has over ten years of experience in providing XML consultancy and training. Jirka is an active member of several standardization bodies: OASIS (DocBook TC and RELAX NG TC), W3C (XSL WG and ITS WG) and ISO/IEC JTC1/SC34 (DSDL, Topic Maps).

Jirka Kosek is the author of several books about Web technologies. He has also written numerous articles for IT developer magazines. In his free time he contributes code to the DocBook XSL stylesheets open-source project.

Petr Nalevka [University of Economics, Prague, Department of Information and Knowledge Engineering]
Jirka Kosek [University of Economics, Prague, LISp (Laboratory for Intelligent Systems Prague)]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Petr Nalevka and Jirka Kosek. Reproduced with permission.

Introduction

When exchanging XML documents between different systems, validation is important to ensure interoperability. Validation helps to keep XML documents compliant with the standards and recommendations under which they are issued. Because of limited knowledge and resources, keeping documents standard-compliant relies in most cases solely on automated validation. That's why it is important to use expressive schema languages, which are able to formalize as many as possible of the constraints expressed in the different language specifications, together with powerful and user-friendly validation tools and techniques.

The ever-growing use of compound XML documents1 creates new challenges for automated validation. It is obviously more difficult to interpret a compound document than a standalone one, as the processing application needs to consider every vocabulary fragment in the context of other vocabularies and adjust its interpretation accordingly. Compound document validation needs to face new levels of complexity emerging from the fact that different vocabularies can be combined in various ways, only some of which are permissible. That's why validation of compound documents becomes a demanding task which requires special approaches.

This text introduces a universal validation tool called Relaxed (http://relaxed.sourceforge.net) which makes maximum use of modern validation approaches and techniques to deliver comprehensive validation results and to allow straightforward compound document validation. Relaxed focuses mostly on Web documents, where validation plays a major role. The Web means using XML documents on a massive scale in a heterogeneous environment, where the same document needs to be correctly interpreted on distinct platforms using different clients. Not only interoperability but also accessibility is very important in the case of Web documents, and that places additional demands on the quality of automated validation and the expressivity of the schema languages used.

Relaxed project introduction

Relaxed is an automated validation tool for XML documents which focuses mainly on validation of Web documents. It is open source and has been developed and is maintained by the authors of this text. Initially, Relaxed aimed to provide a validation service which would overcome some of the limitations of the widely used W3C validator. The W3C validation service relies solely on DTDs to define various constraints. Such an approach lacks expressive power and namespace support. In contrast, Relaxed uses modern expressive validation languages to describe as many constraints as possible and to deliver comprehensive validation results to Web document authors, helping them keep their documents as standard-compliant as possible. Today, Relaxed has become a universal validation platform which aims to support all sorts of validation tasks, including validation of predefined or custom compound languages.

Part of the Relaxed project is an HTML 4.0 / XHTML 1.0 schema written from scratch using Relax NG with embedded Schematron rules. Many additional and even complicated restrictions have been expressed thanks to the combination of those two languages, which makes those schemas more powerful than the official DTDs provided by W3C. HTML 4.0 and XHTML 1.0 are today's most widespread standards. In addition, Relaxed is able to validate some of the WAI's WCAG 1.0 restrictions using its own schemas.

Relaxed also features support for validation of compound documents based partly on Relax NG, but mainly on NVDL (Namespace-based Validation Dispatching Language), an international standard for compound document validation. There are predefined schemas for validation of e.g. XHTML 1.0 + SVG 1.1, XHTML 1.0 + MathML 2.0, XHTML 1.0 + MathML 2.0 + SVG 1.1 documents ready to be used, and users may easily create ad-hoc NVDL schemas for their custom compound languages. NVDL support is provided by JNVDL, a Java-based implementation of the NVDL specification which was developed as part of the Relaxed project.

In addition to the schemas, the Relaxed project also includes an extensible validation engine written in Java. The engine supports grouping and annotating schema resources to make them easily accessible to users, and preprocessing validated instances with filters. Filters are used, for example, to convert legacy SGML-based HTML 4.01 documents into XML so that they can be validated by the XML-oriented schema languages used in Relaxed, or to force a particular doctype definition for validated instances. Forcing a doctype controls Relaxed's doctype-specific document handling. This feature makes it possible, for example, to dispatch strict, transitional or frameset HTML documents to be validated against a strict, transitional or frameset schema respectively.

Relaxed validation capabilities are accessible to Web document authors through a Web-based interface. Authors may specify input documents using a URL or upload them directly to the Relaxed server. The validation process can be adjusted using several user options. The Relaxed user interface is described in Section “The user interface”.

An exhaustive description of all Relaxed features, project architecture, usage, schemas involved and their expressive power can be found in [RLXD] and [HTML-VAL].

Seeking schema language expressiveness

Shortly after the XML 1.0 standard was launched, it became apparent that DTDs lack several critical features needed in many XML applications. The two most important missing features were support for data types and namespaces. DTD does not include the concept of data types. Every element or attribute value is considered to be an almost arbitrary string. It is not possible to require content to look like a number, a date or a string with a given length. If such constraints are defined in the language specification, DTD does not allow them to be validated automatically.

Missing namespace support restricts DTDs to single-namespace documents only. Even the use of prefixes in such documents is problematic. This is an important drawback, as combining vocabularies has become the preferred approach for extending XML languages. The limitations of DTD demonstrate that it matters which schema language is used to describe a particular vocabulary. It makes sense to choose the most expressive and easiest to maintain language.

Several new schema languages were created to overcome DTD limitations. Many of them were just prototypes or proprietary ones. Only two new schema languages gained broader acceptance: W3C XML Schema [XMLSCH-ST] and Relax NG [RNG]. Both of those languages have very good support for data typing and namespaces. At the same time there are also big differences between them. A formal comparison of DTD, W3C XML Schema and Relax NG can be found in [SCHTAX]. To summarize briefly, Relax NG is the most expressive language and offers the greatest flexibility in modularizing and combining schemas. This is the reason why Relax NG is very popular for creating complex document-oriented schemas like TEI (http://www.tei-c.org/) or DocBook (http://docbook.org). W3C XML Schema enforces unambiguity2 and thus is very popular in scenarios where unambiguous mapping from XML to object or database representation is required.
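
To illustrate the datatype gap, the following sketch expresses the three kinds of constraints mentioned above (a number, a date and a string with a given length) in Relax NG with the W3C XML Schema datatype library. The invoice vocabulary and its element names are invented for this illustration only and do not come from Relaxed.

```xml
<!-- Hypothetical vocabulary; only the datatype usage matters here. -->
<element name="invoice"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- A calendar date such as 2007-08-07 -->
  <attribute name="issued">
    <data type="date"/>
  </attribute>
  <!-- A decimal number -->
  <element name="total">
    <data type="decimal"/>
  </element>
  <!-- A string of exactly 8 characters -->
  <element name="code">
    <data type="string">
      <param name="length">8</param>
    </data>
  </element>
</element>
```

In a DTD, all three values could only be declared as character data, so a document with issued="yesterday" would pass validation.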

All the previously mentioned schema languages are so-called grammar-based languages. They define the grammar of an XML vocabulary by enumerating all elements and their content models. This concept is based on defining just simple parent-child relations and thus may not be sufficient in all situations. Some complex constraints require relationships to be defined between nodes in completely different contexts across the document. In some cases, those constraints may only be defined using rule-based schema languages such as Schematron. A Schematron schema consists of a set of rules which may be expressed, for example, using XPath expressions that are evaluated against the validated document.

Different schema languages are more or less suitable for constraining different document facets. In order to gain better validation results, it is reasonable to combine several schema languages and validate a document against all of them. The combination of Schematron with Relax NG or W3C XML Schema is an example of such a powerful constraint language. Moreover, the extensibility of both W3C XML Schema and Relax NG makes it possible to embed Schematron rules directly into the grammar-based schema.
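
A minimal sketch of such embedding in Relax NG might look as follows. It assumes the ISO Schematron namespace and a validator that extracts embedded rules from the grammar; the rule itself (in-page links must point to an existing target) is only an illustration and is not taken from the Relaxed schemas.

```xml
<!-- Module fragment meant to be included from a driver schema. -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:sch="http://purl.oclc.org/dsdl/schematron"
         ns="http://www.w3.org/1999/xhtml">

  <!-- Binds the html prefix used in the XPath expressions below. -->
  <sch:ns prefix="html" uri="http://www.w3.org/1999/xhtml"/>

  <define name="a">
    <element name="a">
      <!-- Embedded rule: relates this link to a node elsewhere in the document. -->
      <sch:pattern>
        <sch:rule context="html:a[starts-with(@href, '#') and string-length(@href) &gt; 1]">
          <sch:assert test="substring(@href, 2) = //@id or substring(@href, 2) = //html:a/@name">
            An in-page link must point to an existing id or named anchor.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <optional>
        <attribute name="href"/>
      </optional>
      <text/>
    </element>
  </define>
</grammar>
```

Relax NG treats elements from foreign namespaces as annotations, so the grammar stays valid for processors that know nothing about Schematron.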

To prove the power of Relax NG and Schematron for Web document validation, XHTML 1.0 has been redefined from scratch using this combination of languages as part of the Relaxed project. Such a definition is able to validate more constraints than the original schema and thus keeps validated documents closer to standard compliance.

Reformulation of XHTML in Relax NG and Schematron

Relax NG and Schematron bring not only expressivity but also elegance of use, good tool support, the possibility to easily integrate3 both languages and great support for modularity. That's why Relax NG and Schematron are the languages of choice for defining complex and modular schemas.

Relaxed XHTML schemas4 demonstrate the power of Relax NG modularity. When looking at W3C's XHTML modularization implementations using DTD or XML Schema, a specific model or driver schema is needed for every module combination used. In Relax NG, modularity is much more straightforward. All desired modules can be simply included on the fly without any further preparations. In addition, new modules can be introduced easily without altering any other involved module.

Example 1. Modularity in RELAX NG

Suppose that a separate hypertext module should add the possibility to use anchors (the a element) for linking within inline5 elements. This can be accomplished by a simple definition that adds the a element to the list of elements which are permitted at the inline level.

<define name="Inline.class" combine="choice">
  <ref name="a"/>
</define>

There is no need to completely redefine the content model in which a occurs as is necessary in DTD and W3C XML Schema.

The trick here is done through the combine="choice" method which basically extends the HTML Inline.class definition to contain either some element from the inline model defined elsewhere or the anchor element.
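
To make the effect of the combining method concrete, consider two fragments from different modules. The base content of Inline.class shown here (em and strong) is a simplified assumption, not the actual Relaxed definition.

```xml
<!-- Base module: the one definition without a combine attribute -->
<define name="Inline.class">
  <choice>
    <ref name="em"/>
    <ref name="strong"/>
  </choice>
</define>

<!-- Hypertext module: contributes one more alternative -->
<define name="Inline.class" combine="choice">
  <ref name="a"/>
</define>
```

When both modules are included in one grammar, the two definitions behave like the single definition below.

```xml
<define name="Inline.class">
  <choice>
    <ref name="em"/>
    <ref name="strong"/>
    <ref name="a"/>
  </choice>
</define>
```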

The HTML 4.01 / XHTML 1.0 specification defines three language subsets (strict, transitional and frameset) and each of them has its own monolithic schema. Those three schemas contain a huge amount of duplication. Most of the definition is repeated in an unchanged form in all of them. This approach is error-prone and difficult to maintain. One small change in the shared language subset requires definition modifications across all three schemas. Such schemas are also less readable and difficult to understand. For instance, to find out which elements are shared by all three subsets, it is necessary to go through all of the schemas.

Relaxed schemas solve all the previously outlined problems. They define the three language subsets just by including the right modules. Common modules are shared among all the subsets. There is no duplication, and separation into modules brings better readability and easier maintenance. With such a modular architecture, it is easy to fine-tune the level of restriction for every validation process just by including the right modules into the final schema.
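
The idea can be sketched with two driver schemas. The file and module names below are hypothetical and do not reflect the actual layout of the Relaxed schemas; the start pattern is assumed to be defined in the structure module.

```xml
<!-- xhtml-strict.rng (hypothetical driver): shared modules only -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.w3.org/1999/xhtml">
  <include href="modules/structure.rng"/>
  <include href="modules/text.rng"/>
  <include href="modules/hypertext.rng"/>
  <include href="modules/forms.rng"/>
</grammar>
```

```xml
<!-- xhtml-transitional.rng (hypothetical driver): same modules plus legacy markup -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.w3.org/1999/xhtml">
  <include href="modules/structure.rng"/>
  <include href="modules/text.rng"/>
  <include href="modules/hypertext.rng"/>
  <include href="modules/forms.rng"/>
  <include href="modules/legacy.rng"/>
</grammar>
```

The subsets then differ only in their list of includes, and a fix in a shared module reaches every subset at once.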

Extending Relax NG schemas with modules

With a good initial schema organization it is easy to extend Relax NG schemas to define additional language constructs or even foreign vocabulary fragments6 in some context of the validated documents. Thanks to well-designed modularity in Relax NG, making such extensions is straightforward in comparison to other schema languages, and well-designed modules with extended definitions may even be reused in different schemas.

As an extension example, let's consider the XHTML schema to be extended to allow MathML fragments. As MathML is normally not defined within XHTML, any MathML occurrence would be automatically rejected. Example 2 demonstrates how to allow foreign elements within some context of XHTML documents using wildcard named patterns (anyName). The foreignElement definition allows an arbitrary tree of foreign (non-XHTML) elements and attributes wherever referenced.

Example 2. Wildcard named pattern

<define name="foreignElement">
  <element>
    <anyName>
      <except>
        <nsName ns="http://www.w3.org/1999/xhtml"/>
      </except>
    </anyName>
    <zeroOrMore>
      <choice>
        <attribute>
          <anyName>
            <except>
              <nsName
                  ns="http://www.w3.org/1999/xhtml"/>
            </except>
          </anyName>
        </attribute>
        <text/>
        <ref name="foreignElement"/>
      </choice>
    </zeroOrMore>
  </element>
</define>

Referencing foreignElement within the inline or block level content model would cause the schema not to report any MathML fragment within this context as an error. But an XHTML + MathML schema is expected to validate any MathML fragment against its schema, as defined in Example 3. This example shows a simple MathML module. Whenever such a module is included into the XHTML modular schema, occurrence of MathML in the inline or block level content model is allowed and its subsequent validation against mathml2.rng is enabled. If the module is not included, the schema rejects any foreign elements by default. This demonstrates how easy it is to extend a schema and make it less restrictive just by including a new module.

Example 3. MathML module for the XHTML schema

<grammar ns="http://www.w3.org/1999/xhtml">
<!-- xhtml-mathml-module -->

<define name="Block.class" combine="choice">
  <externalRef href="../mathml/mathml2.rng"
       ns="http://www.w3.org/1998/Math/MathML"/>
</define>

<define name="Inline.class" combine="choice">
  <externalRef href="../mathml/mathml2.rng"
       ns="http://www.w3.org/1998/Math/MathML"/>
</define>

</grammar>

The combine="choice" combining method extends the HTML Block.class and Inline.class definitions to contain either some element from the inline or block content model or a MathML fragment.

Additional modules may also make a schema more restrictive when included. This can be done by overriding a definition completely when importing a module. Relax NG validates documents strictly which means every construct which is not defined is implicitly forbidden. This model makes it easier to make a schema less restrictive using extension modules. Schematron on the other hand validates documents laxly and thus additional modules always make the schema more restrictive. That is another reason to use those two languages in combination as they can significantly simplify modularization of definitions.
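
A minimal sketch of such a restrictive override is shown below. It assumes a hypothetical xhtml-transitional.rng that contains a definition named font; neither name is taken from the Relaxed schemas.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.w3.org/1999/xhtml">
  <include href="xhtml-transitional.rng">
    <!-- Replaces the included definition of the same name. -->
    <define name="font">
      <notAllowed/>
    </define>
  </include>
</grammar>
```

A definition placed inside include replaces the definition with the same name in the included grammar. Because notAllowed matches nothing, every choice that references font loses that alternative, which makes the resulting schema stricter without touching the included file.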

Using Schematron to enforce additional checks

Schematron does not only simplify modularization, but there are several other reasons why embedding Schematron patterns may help to enhance the schema definition. Some restrictions are simply inexpressible using grammar based languages like Relax NG as shown in Example 4.

Example 4. Selected options

Sometimes grammar-based languages cannot express what is otherwise easily expressible by a rule-based language. The following rule ensures that select elements without a multiple attribute cannot have more than one selected option.

<sch:rule context="html:select">
  <sch:report test="not(@multiple) and count(html:option[@selected]) > 1">
    Select elements which aren't marked as multiple may not have more than one selected option.
  </sch:report>
</sch:rule>

Combining Relax NG and Schematron adds considerable value. It may not only increase the expressiveness, but also simplify the schemas and make them more human-readable. For example, a simple XPath expression may be used instead of many lines of Relax NG definitions scattered across several schema modules. The two languages have very distinct philosophies and each is more or less suitable in different situations and for different purposes. Smart decisions about where to use which language can significantly improve the schemas by getting the best of both.

Only thanks to the expressive power of Schematron could some advanced and vague constraints contained within WCAG 1.0 (Web Content Accessibility Guidelines) be formalized within the Relaxed project. The Schematron WCAG module within Relaxed allows automated validation of some of the checkpoints. This feature may help Web document authors to make their documents more accessible.
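
As an illustration of how a vaguely worded checkpoint might be approximated by a rule, the following sketch reports links whose text is just "click here", which relates to WCAG 1.0 checkpoint 13.1 (clearly identify the target of each link). It is an assumption made for this text, not a rule copied from the Relaxed WCAG module, and it expects the html prefix to be bound to the XHTML namespace by the enclosing schema.

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="html:a[@href]">
    <!-- translate() lowercases the link text, normalize-space() trims it. -->
    <sch:report test="translate(normalize-space(.),
                      'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                      'abcdefghijklmnopqrstuvwxyz') = 'click here'">
      Link text should identify the target of the link (WCAG 1.0, checkpoint 13.1).
    </sch:report>
  </sch:rule>
</sch:pattern>
```

Such a rule can only catch known bad phrases, so it complements a human accessibility review rather than replacing it.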

Support for datatypes

Datatypes are a big advantage of RELAX NG over DTD. DTD's datatypes are very elementary and unable to fully express the complexity of HTML datatype requirements. XML Schema datatypes bring a set of thirty seven carefully selected types which can be used within RELAX NG. This set may be further restricted by setting intervals or by using regular expressions.

Relaxed schemas reflect most of the HTML datatype requirements including lengths and multilengths, characters, pixels, targets, font sizes, colors and many more.

Example 5. HTML target and tabindex datatype

<define name="Target.datatype">
  <data type="string">
    <param name="pattern">_(blank|self|parent|top)|[A-Za-z].*</param>
  </data>
</define>
<define name="tabindexNumber.datatype">
  <data type="nonNegativeInteger">
    <param name="pattern">[0-9]+</param>
    <param name="minInclusive">0</param>
    <param name="maxInclusive">32767</param>
  </data>
</define>
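
In the same spirit, a color datatype could be sketched as follows. The name Color.datatype merely follows the naming style of Example 5 and is an assumption, as is the simplification that color names are matched in lowercase only (HTML 4.01 treats them as case-insensitive).

```xml
<define name="Color.datatype"
        xmlns="http://relaxng.org/ns/structure/1.0"
        datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <choice>
    <!-- Hexadecimal notation such as #FFFFF0 -->
    <data type="token">
      <param name="pattern">#[0-9A-Fa-f]{6}</param>
    </data>
    <!-- The sixteen color names defined by HTML 4.01 -->
    <data type="token">
      <param name="pattern">aqua|black|blue|fuchsia|gray|green|lime|maroon|navy|olive|purple|red|silver|teal|white|yellow</param>
    </data>
  </choice>
</define>
```

A value outside both alternatives, for instance a color name that HTML 4.01 does not define, is reported as a datatype error.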

Example 6. Schematron datatypes

Datatypes may be also expressed using XPath functions in Schematron rules.

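<!-- normalize-space() is the XPath 1.0 function that trims and collapses whitespace -->
<assert test="string-length(normalize-space(.)) &gt; 0">
Must contain a value.
</assert>
<!-- false for non-numeric content (NaN never equals NaN), true for integers including 0 -->
<assert test="floor(number(.)) = number(.)">
Int datatype.
</assert>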

Relaxed validation power in action

This section demonstrates the enhanced validation power of the Relaxed XHTML schema. When the demonstration XHTML document shown in Example 7 is validated using the official W3C DTD for XHTML 1.0 or using the W3C validator7, no validation errors appear in the output and the document is considered to be perfectly valid.

Example 7. According to W3C validator this document is valid

01 <?xml version="1.0" encoding="utf-8"?>
02 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
03                       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
04 <html xmlns="http://www.w3.org/1999/xhtml">
05   <head><title>W3C validator limitations demo</title></head>
06   <body>
07    <h1>Datatypes</h1>
08    <table border="10%">
09      <tbody>
10        <tr><td><font color="Ivory">B</font></td></tr>
11     </tbody>
12   </table>
13   <h1>Nested forms</h1>
14   <form name="form2" action="process.form">
15     <div>
16       <form action="process.subform">
17         <p>Something is wrong</p>
18       </form>
19     </div>
20   </form>
21   <h1>NAME and ID inconsistency</h1>
22   <form name="form1" id="form2" action="process.form">
23     <p>Something is wrong</p>
24   </form>
25 <a name="form2">Something is wrong</a>
26 </body>
27 </html>

A different situation occurs when validating the document using the Relaxed validation service8. Thanks to the expressive power of Relax NG and Schematron, four different errors are detected.

    Errors detected in the demonstration document using Relaxed
  • At line 8, the border size of the table may not be specified as a percentage.
  • At line 10, Ivory is not an allowed font color.
  • At line 16, form elements cannot have any nested forms.
  • At line 22, the id and name attribute values have to be the same when used on the same element.
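
The last two errors are typical candidates for Schematron rules. The following sketch shows how they might be formulated. It is an illustration written for this text and not necessarily the way the Relaxed schemas implement the checks; the list of elements in the second rule follows the XHTML 1.0 elements that carry both id and name.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:ns prefix="html" uri="http://www.w3.org/1999/xhtml"/>

  <!-- A form may not appear anywhere inside another form. -->
  <sch:pattern>
    <sch:rule context="html:form">
      <sch:report test="ancestor::html:form">
        Form elements cannot have any nested forms.
      </sch:report>
    </sch:rule>
  </sch:pattern>

  <!-- Kept in a separate pattern: within one pattern a node is
       checked only by the first rule whose context matches it. -->
  <sch:pattern>
    <sch:rule context="html:a | html:applet | html:form | html:frame
                       | html:iframe | html:img | html:map">
      <sch:assert test="not(@id and @name) or @id = @name">
        The id and name attribute values have to be the same when used on the same element.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Applied to Example 7, the first rule reports the form at line 16 and the second one the form at line 22.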

Compound documents

Compound document is a modern name for an XML document that consists of elements and attributes from different mark-up vocabularies. In other words, by combining two or more different XML languages in a single document we create a compound document. This was made technically possible thanks to XML Namespaces9.

At first, some people considered namespaces to be a hostile element polluting XML with additional complexity without any reasonable need for it. At that time, the world of mark-up vocabularies was dominated by overgrown monolithic languages, but soon they faced serious extensibility issues. The set of problems a language addresses grows and evolves over time. To keep up with the changing requirements, the monolithic approach constantly pollutes the vocabulary with new, narrowly specific mark-up. This results in extensive, difficult to learn and difficult to maintain languages which are intended to solve all sorts of problems, but do not solve any of them in a satisfactory manner.

Recently, it has become more and more obvious that some extensibility problems may be solved better by composing several different single-purpose languages than by further extending a monolithic language. If there already exists a widely adopted and understood vocabulary which solves part of our problem, it makes good sense to reuse it rather than to introduce something new. With this approach, we gain immense flexibility, as for every specific problem we can adopt a specific combination of vocabularies.

Isolated single-purpose languages are easier to maintain and, what is even more important, they can be easily reused in distinct applications. The domain of such languages is narrow and their aim is well defined, which helps to keep the language free of indiscriminate extensions.

Nowadays there are many different applications of compound documents in many different areas. Once adopted, composition of different vocabularies seems to be a natural and convenient approach to many language extension problems. Here is a short enumeration of some areas where compound documents are used: templating languages (XSLT), XML-based data exchange protocols (SOAP), office documents (ODF), Web-based rich client applications (SVG, XForms, MathML embedded in HTML), semantic Web languages (RDF, RDFS, OWL) and many more. With widespread adoption of compound document solutions, validation of such documents becomes an important issue which needs to be addressed.

Validation of compound documents

XML namespaces technically allow the presence of multiple vocabularies inside one XML document, but there are many other issues which need to be addressed before adopting a compound document solution. Having descriptions of the syntax and semantics of the particular vocabularies is insufficient for the client application to handle compound documents correctly. The syntax and semantics of the compound language need to be defined as well.

Different vocabulary fragments can be combined in many different ways, and even if the isolated fragments are meaningful (with respect to their language semantics) and syntactically correct, the combination of such fragments can be difficult to interpret or can even be semantically empty. For example, RDF metadata should not be placed into the body section of HTML documents, as it is not intended to be rendered. Such an RDF fragment would cause the client application difficulties with interpretation. It makes more sense to allow RDF only within head.

This implies that there are two major issues concerning compound documents. First, semantics for different compound languages need to be created to make them correctly interpretable by different applications. Second, the way different languages are combined needs to be constrained to allow only meaningful combinations. This requires some kind of "meta-schema" to express such constraints.

Today, automated validation is absolutely essential for standalone XML languages to ensure their syntactical correctness and thus interoperability. But it is even more important for compound documents, as they bring additional interpretation complexity. To make compound documents applicable in a heterogeneous environment (such as the Web), it is absolutely essential to provide powerful compound document validation tools and techniques. This requires schema languages able to cope with multiple namespaces and validation engines able to check document instances against such schemas. The Relaxed validation service is exactly such a tool.

XML Schema or RELAX NG is insufficient

One approach to the compound document validation problem is to use the namespace support included in today's mainstream schema languages, e.g. in RELAX NG. Such an approach was roughly demonstrated in Section “Extending Relax NG schemas with modules”. Namespace-aware schema languages use qualified names to define elements and attributes, thus compound document schemas can be created easily just by specifying the appropriate namespace for elements from different vocabularies.

Example 8. Allowing RDF inside the XHTML head section (RELAX NG)

<element name="head"
         ns="http://www.w3.org/1999/xhtml">
  <element name="title"
           ns="http://www.w3.org/1999/xhtml">
    <text/>
  </element>
  <interleave>
    ... other XHTML head elements ...
    <optional>
      <element name="RDF"
               ns="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        ...
      </element>
    </optional>
  </interleave>
</element>

Unfortunately, using namespace-aware schema languages for compound document validation brings many significant drawbacks which limit this solution to simple use cases. There is no standard way of adding new vocabulary modules into an existing compound schema. Such a task requires knowledge of the implementation details of the compound schema, and usually different modules need to be altered to allow seamless integration with the new module.

Moreover, in most cases we cannot simply reuse the existing schemas for the vocabularies we would like to combine. Standalone language schemas often aren't well prepared to be combined with other schemas. Usually they don't have the level of modularity and abstraction which is needed for their seamless integration. In addition, they are frequently written in different schema languages or even in languages which aren't namespace-aware at all (for example DTD). This implies that schemas for different vocabularies first need to be converted to the same namespace-aware schema language and slightly modified before they can be used as modules of the compound definition.

Implementing a large compound schema for several complex languages using namespace-aware schema languages would require special knowledge and a long implementation time, and it would also lead to maintenance issues. As different languages evolve over time, the modified or converted schemas need constant updating.

To demonstrate the issues, let's consider the following example. A compound document schema is to be implemented for XHTML with embedded SVG and MathML in all block level and inline elements and RDF in the head section. Using a namespace-aware schema language (for example XML Schema) may cause several problems. First, the official XHTML DTDs need to be converted into XML Schema. This needs to be done in a specific way to prepare abstract classes for the head section and for block and inline elements, to make them easily extensible through additional modules. Further, the XML Schema for MathML can be partly reused, but not in an unchanged form. It needs to be modified to make it a module of the parent XHTML schema. The module needs to be further tailored in a specific way to allow MathML just in the context of the block and inline elements. A similar task needs to be done for SVG and RDF. Another big problem occurs when SVG is preferred to be the parent language instead of HTML. In this case, all modules need to be duplicated and rewritten.

In general, every time a new vocabulary needs to be incorporated into the compound definition, its official schema first needs to be converted and modified. Moreover, different vocabulary modules need to be constantly synchronized with new versions of the languages. This is an error-prone approach, as different definitions are being duplicated.

To conclude, the namespace-aware schema language concept is applicable in simple cases, but it is not a solution which can be considered for complex scenarios. Reusability of existing schemas is an important requirement which is not satisfied at all by today's namespace-aware schema languages. That's why the Relaxed project uses a different approach to compound document validation which enables one hundred percent reusability of existing single-namespace schemas. Moreover, this approach is independent of the implementation details of the individual vocabulary schemas and of the schema languages used.

NVDL

To solve the compound document validation issues mentioned in Section “XML Schema or RELAX NG is insufficient”, the Relaxed project uses NVDL (Namespace-based Validation Dispatching Language), which is Part 4 of the ISO/IEC 19757 DSDL (Document Schema Definition Languages) international standard. NVDL is a simple "meta-schema" language which makes it possible to control processing and validation of compound documents. Figure 1 demonstrates a particular validation dispatching process decomposed into several phases. The NVDL schema and the compound document instance shown in Example 9 are potential participants in such a process.

The essence of NVDL is dividing XML document instances into sections, each of which contains elements or attributes from a single namespace. A tree of such sections is first constructed for every validated instance (see Example 10). Sections are further combined or manipulated in various ways to create so-called validation candidates.

Manipulation of sections is achieved through rules and their corresponding actions defined in an NVDL script. Actions are executed on a particular section whenever the section matches a certain rule, usually when the section's namespace matches the rule's namespace wildcard.
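
The matching step can be illustrated with a short Python sketch. It assumes that the rule's ns value may contain a wildcard character (taken here to be `*`) which stands for any sequence of characters; the function name ns_matches is hypothetical and does not belong to any NVDL implementation.

```python
import re

def ns_matches(rule_ns, section_ns, wildcard="*"):
    """Return True if section_ns matches the rule's namespace pattern.

    Assumption: the wildcard stands for any (possibly empty) run of characters.
    """
    # Escape the literal parts and join them with a "match anything" pattern.
    literal_parts = rule_ns.split(wildcard)
    pattern = ".*".join(re.escape(part) for part in literal_parts)
    return re.fullmatch(pattern, section_ns, flags=re.DOTALL) is not None

XHTML = "http://www.w3.org/1999/xhtml"
XFORMS = "http://www.w3.org/2002/xforms"

print(ns_matches(XHTML, XHTML))                    # exact namespace
print(ns_matches("http://www.w3.org/*", XFORMS))   # wildcard pattern
print(ns_matches(XHTML, XFORMS))                   # different namespace
```

A real dispatcher would additionally have to decide which rule wins when several patterns match the same section; the sketch leaves that question open.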

There are several actions defined in NVDL; e. g. attach for attaching sections back to their parent, unwrap to handle wrapped sections and validate to send a particular validation fragment to a particular validator.

In most cases, after executing actions, single-namespace validation candidates are obtained. Such fragments are finally filtered for redundancy and independently sent for validation against different subschemas (see note 10 and Example 11).

Figure 1: NVDL validation process at a glance

Example 9. NVDL validation process

The following example shows an NVDL schema and a compound document instance which relate to Figure 1. In this case, NS1 represents the XHTML namespace and NS2 stands for the XForms namespace.

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xf="http://www.w3.org/2002/xforms">
<head>
<xf:model>
  <xf:instance>...</xf:instance>
  <xf:submission id="form" method="post"  action="getStockQuote.do"/>
</xf:model>
</head>

<body>
  <xf:group ref="stockquote">
    <xf:input ref="symbol"><xf:label>Symbol</xf:label></xf:input>
    <br />
    <xf:submit submission="form"><xf:label>Get Quote</xf:label></xf:submit>
  </xf:group>
</body></html>

To achieve behavior consistent with Figure 1, the following NVDL schema is applied to the previous compound document instance. Using such a schema, the NVDL dispatcher first sends the root XHTML fragment for validation after filtering (unwrapping) any descendant XForms fragments and attaching any descendant XHTML fragments. XForms sections are handled in a similar way by filtering (unwrapping) any descendant XHTML. For any XHTML document instance with embedded XForms, the following NVDL schema causes one pure XHTML fragment to be sent for validation against xhtml.xsd and one or more pure XForms fragments to be validated using xforms.rng.

<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0">
<namespace ns="http://www.w3.org/1999/xhtml">
  <validate schema="xhtml.xsd"><!-- validate unwrapped XHTML -->
    <mode>
      <namespace ns="http://www.w3.org/2002/xforms">
        <validate schema="xforms.rng"><!-- validate unwrapped XForms -->
          <mode><!-- attach descendant XForms together and unwrap any XHTML -->
            <namespace ns="http://www.w3.org/2002/xforms"><attach/></namespace>
            <namespace ns="http://www.w3.org/1999/xhtml"><unwrap/></namespace>
          </mode>
        </validate>
        <unwrap><!-- unwrap next XForms fragment -->
          <mode><!-- attach descendant XHTML together and unwrap any XForms -->
            <namespace ns="http://www.w3.org/2002/xforms"><unwrap/></namespace>
            <namespace ns="http://www.w3.org/1999/xhtml"><attach/></namespace>
          </mode>
        </unwrap>
      </namespace>
    </mode>
  </validate>
</namespace>
</rules>

Example 10. Decomposing sections

The instance shown in Example 9 is decomposed into the following section tree after applying the NVDL schema from the same example.

ES1 <html><head>ref to ES2</head>
<body>ref to ES4</body>
</html>

ES2 <xf:model>...</xf:model>

ES3 <br />

ES4 <xf:group ref="stockquote"><xf:input ref="symbol">...</xf:input>
ref to ES3
<xf:submit submission="form">...</xf:submit></xf:group>
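
The decomposition can be approximated with the following Python sketch. It is only an illustration built on the standard xml.etree.ElementTree module: the names ns_of and decompose are hypothetical, attribute sections are ignored, and sections are numbered in the order in which they are opened, which need not coincide with the ES numbers used above. The sample instance is a shortened variant of the document from Example 9.

```python
import xml.etree.ElementTree as ET

def ns_of(elem):
    """Namespace URI of an element ('' when it has none)."""
    return elem.tag[1:].split("}")[0] if elem.tag.startswith("{") else ""

def decompose(root):
    """Split a tree into element sections, each holding one namespace only.

    Returns a list of (namespace, element) pairs; a child from a foreign
    namespace is replaced by a <ref section="N"/> placeholder.
    """
    sections = []

    def new_section(src):
        copy = ET.Element(src.tag, dict(src.attrib))
        sections.append((ns_of(src), copy))
        index = len(sections) - 1
        fill(src, copy, ns_of(src))
        return index

    def fill(src, dst, section_ns):
        dst.text = src.text
        for child in src:
            if ns_of(child) == section_ns:
                # Same namespace: the child stays inside the current section.
                sub = ET.SubElement(dst, child.tag, dict(child.attrib))
                fill(child, sub, section_ns)
            else:
                # Foreign namespace: open a new section and leave a reference.
                sub = ET.SubElement(dst, "ref", {"section": str(new_section(child))})
            sub.tail = child.tail

    new_section(root)
    return sections

DOC = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xf="http://www.w3.org/2002/xforms">
<head><xf:model><xf:submission id="form"/></xf:model></head>
<body><xf:group ref="stockquote"><xf:input ref="symbol"/><br/><xf:submit submission="form"/></xf:group></body>
</html>"""

for index, (namespace, element) in enumerate(decompose(ET.fromstring(DOC))):
    print(index, namespace, element.tag)
```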

Example 11. Dispatching validation fragments to validators

After executing attach and unwrap actions on the section tree shown in Example 10, the following resulting fragments are created and sent independently for validation.

<html><head></head>
<body><br /></body>
</html> -> xhtml.xsd

<xf:model>...</xf:model> -> xforms.rng

<xf:group ref="stockquote"><xf:input ref="symbol">...</xf:input>
<xf:submit submission="form">...</xf:submit></xf:group> -> xforms.rng
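
For this particular schema, the combined effect of the attach and unwrap actions can be imitated by projecting the document onto one namespace at a time. The following Python sketch does exactly that. It is hard-wired to the two-vocabulary case of Example 9 and is not a general NVDL interpreter; the names project, lift and candidates are hypothetical, and text inside unwrapped elements is simply dropped.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
XFORMS = "http://www.w3.org/2002/xforms"

def ns_of(elem):
    """Namespace URI of an element ('' when it has none)."""
    return elem.tag[1:].split("}")[0] if elem.tag.startswith("{") else ""

def lift(elem, ns):
    """Elements in ns found at or below elem, with foreign wrappers removed."""
    if ns_of(elem) == ns:
        return [project(elem, ns)]
    kept = []
    for child in elem:
        kept.extend(lift(child, ns))
    return kept

def project(elem, ns):
    """Copy elem, attaching descendants in ns and unwrapping all others."""
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text = elem.text
    for child in elem:
        copy.extend(lift(child, ns))
    return copy

def candidates(root, host_ns, guest_ns):
    """One host fragment plus one fragment per outermost guest element."""
    result = [project(root, host_ns)]

    def walk(elem):
        for child in elem:
            if ns_of(child) == guest_ns:
                result.append(project(child, guest_ns))
            else:
                walk(child)

    walk(root)
    return result

DOC = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xf="http://www.w3.org/2002/xforms">
<head><xf:model><xf:submission id="form"/></xf:model></head>
<body><xf:group ref="stockquote"><xf:input ref="symbol"/><br/><xf:submit submission="form"/></xf:group></body>
</html>"""

for fragment in candidates(ET.fromstring(DOC), XHTML, XFORMS):
    print(ET.tostring(fragment, encoding="unicode"))
```

The first fragment printed is the pure XHTML document with the br element lifted into body; the remaining two are the XForms model and group, each free of XHTML.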

When compared to other compound document validation approaches, NVDL offers many advantages. It features a standardized and easy-to-understand language for defining compound document validation dispatching processes. With NVDL, different vocabularies may easily be allowed, banned or sent for further validation depending on the particular context in which they occur within the validated instance. For an exhaustive description of the NVDL language semantics refer to [NVDL].

The ability to create single-namespace fragments means that subschemas do not need to deal with foreign namespaces at all. Single-namespace schemas are easier to write and, importantly, also easy to reuse for various compound languages. Moreover, NVDL is not bound to a specific schema language; different schema languages can be used in combination during a single validation process. Subschemas may be written in any preferred schema language, e. g. RELAX NG, XML Schema, Schematron or DTD. This is again important in terms of reusability, because in the real world XML vocabularies are usually described using different schema languages. NVDL allows those schemas to be reused as they are. There is no need to convert or modify them.

Given a set of subschemas for different vocabularies, it is straightforward to create NVDL definitions for any combination of those vocabularies without introducing any changes to the subschemas themselves.

Using RELAX NG or XML Schema exclusively for compound document validation leads to uniformity, as it forces users to convert the schemas for different vocabularies to the same language. NVDL, on the other hand, means variety, as it allows users to choose the schema language which best suits the needs of the particular vocabulary. There is absolutely no need to choose a mainstream language.

Section “XML Schema or RELAX NG is insufficient” demonstrated how difficult it is to create a compound definition using a namespace-aware schema language. Let's use NVDL to create the same compound schema: XHTML with embedded SVG, MathML and RDF. In this case, there is no need for several experts to work on the task for days. As illustrated in Example 12, one person can create such an NVDL script in a matter of minutes. The reason is that existing schemas can be fully reused without making any changes to them.

Example 12. NVDL schema for XHTML with embedded SVG, MathML and RDF

<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0" startMode="root">
  <mode name="root">
    <namespace ns="http://www.w3.org/1999/xhtml"><!-- XHTML is the parent language -->
      <validate schema="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
        <context path="head" useMode="head"/>
        <context path="div|li|p...all block level elements" useMode="block_inline"/>
        <context path="a|em|span|...all inline elements" useMode="block_inline"/>
      </validate>
    </namespace>
  </mode>
  <mode name="block_inline"><!-- rules for block and inline context -->
    <namespace ns="http://www.w3.org/2000/svg">
      <validate schema="http://www.w3.org/TR/2002/WD-SVG11-20020108/SVG.xsd"/>
    </namespace>
    <namespace ns="http://www.w3.org/1998/Math/MathML">
      <validate schema="http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd"/>
    </namespace>
  </mode>
  <mode name="head"><!-- rules for head context -->
    <namespace ns="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <validate schema="http://www.w3.org/2000/07/rdf.xsd">
        <mode><!-- attach any descendant foreign fragment -->
          <anyNamespace><attach/></anyNamespace>
        </mode>
      </validate>
    </namespace>
  </mode>
</rules>

In the NVDL script in Example 12, subschemas are referenced directly at their original locations using URLs. The script tells the NVDL engine that the only acceptable parent language is XHTML and that other vocabularies are forbidden in that context. Plain XHTML is extracted from the validated document and sent for validation against the official W3C DTDs. RDF sections may only occur in the context of the head element. Any foreign vocabulary contained inside the RDF fragment is attached to it before being sent for validation. SVG and MathML fragments are allowed only in block and inline elements. Any other vocabulary in any other context of the document is rejected.

This simple example demonstrates the power of NVDL. Modifying the NVDL script to allow any other vocabulary in some context is a simple and straightforward task. In addition, the script contains only the required information about the compound language. Anything related to the grammar of the particular vocabularies is encapsulated in the subschemas where it really belongs. This makes NVDL schemas not only easy to design and maintain, but also easy to read and understand.

Note that Example 12 demonstrates the use of the NVDL context construct, which makes it possible to apply specific handling to sections found at a given path within their parent section. Several paths separated by | may be used within one context condition. Paths used in the example are relative, but absolute paths may be used as well.
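
How such a context condition could be evaluated is sketched below in Python. The sketch rests on assumptions about the path semantics: a relative path is taken to match when it equals the trailing element names leading to the place where the section occurs, and an absolute path (starting with /) is taken to match only the complete chain from the root of the parent section. The function name context_applies is hypothetical.

```python
def context_applies(path_expr, element_names):
    """Check an NVDL-like context path against a chain of element names.

    element_names lists local names from the root of the parent section down
    to the element that directly contains the section being dispatched.
    """
    for alternative in path_expr.split("|"):
        alternative = alternative.strip()
        steps = [step for step in alternative.split("/") if step]
        if not steps:
            continue
        if alternative.startswith("/"):
            # Absolute path: the whole chain has to be identical.
            matched = element_names == steps
        else:
            # Relative path: only the innermost names have to agree.
            matched = element_names[-len(steps):] == steps
        if matched:
            return True
    return False

print(context_applies("head", ["html", "head"]))
print(context_applies("div|li|p", ["html", "body", "p"]))
print(context_applies("/html/head", ["html", "head"]))
print(context_applies("head", ["html", "body"]))
```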

Relaxed validation service

The new generation of the Relaxed validation service uses JNVDL (http://jnvdl.sourceforge.net) internally for all validation dispatching tasks. The previous version of Relaxed relied on RELAX NG for compound document validation. JNVDL is a Java-based implementation of the NVDL specification developed as part of the Relaxed project. Making JNVDL an integral part of Relaxed makes compound document validation an implicit feature of the validation service, bringing all the advantages described in Section “NVDL”. Predefined compound schemas are easy to maintain, and new schemas may easily be implemented and added to the repository. Because NVDL is an easy-to-understand language, Relaxed even allows Web document authors to create their own custom NVDL scripts and use them to validate their specific compound documents.

Real life issues with NVDL

Using our own implementation of the NVDL standard within a Web-based validation service gave us a lot of experience with issues related to real-life use of the NVDL technology. One of the issues which the JNVDL implementation faced is related to the fact that validators tend to report error locations using line and column numbers. When the validated instance is parsed and turned into validation fragments, the original position of elements and attributes is inevitably lost. There are two reasons for that. First, today's parsers lack round-tripping support, so whitespace which is considered irrelevant is neither reported nor preserved. Second, different validation fragments are taken from different places within the original document. As they are sent to validators separately, the original position is lost.

Such behavior may confuse users as the error line numbers reported by the particular validators aren't related to the original document but to the particular fragment context. To interpret the information correctly, users would need to deduce the line numbers from the original position of the validation fragments created by JNVDL.

To overcome those difficulties, JNVDL provides a proprietary round-tripping extension. This extension preserves whitespace from the original document, and before validation fragments are sent to the particular validators, they are modified so that elements and attributes occur on the same lines as in the original document. Further, if an XML fragment is extracted from the middle of the document, JNVDL adds the appropriate number of empty lines before it to keep the fragment at the same location.
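
The padding idea can be shown with a few lines of Python. This is only a sketch of the principle, not JNVDL code: the function names are hypothetical, and a well-formedness error reported by the standard library parser stands in for a validation error reported by a real validator.

```python
import xml.etree.ElementTree as ET

def pad_to_original_line(fragment_text, first_line):
    """Prefix a fragment with empty lines so it starts on first_line (1-based)."""
    return "\n" * (first_line - 1) + fragment_text

def error_line(xml_text):
    """Line number reported by the parser for the first error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position[0]
    return None

# A fragment that is deliberately broken on its second line.
fragment = "<group>\n<input></group>"

# Reported relative to the fragment itself.
print(error_line(fragment))
# Reported relative to the original document, assuming the fragment began on line 5.
print(error_line(pad_to_original_line(fragment, 5)))
```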

Because some whitespace within XML documents is irrelevant, using line numbers to locate elements and attributes is problematic or even error-prone. Validation API designers should consider using a different mechanism to locate errors. XPath, for example, is a good candidate, as it is whitespace-independent and precise.
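
The following Python sketch illustrates why a path-based location is robust against whitespace changes. It computes a simple positional path for every element; the function name positional_paths is hypothetical, and the example deliberately avoids namespaces, since qualified names would additionally require prefix bindings in the path.

```python
import xml.etree.ElementTree as ET

def positional_paths(root):
    """Return a positional path such as /doc[1]/p[2] for every element."""
    paths = []

    def walk(elem, path):
        paths.append(path)
        seen = {}  # how many siblings with the same name came before
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return paths

compact = "<doc><p>a</p><p>b<em>c</em></p></doc>"
spread = "<doc>\n\n  <p>a</p>\n  <p>b\n    <em>c</em></p>\n</doc>"

# The two documents differ in line layout only, so the paths are identical.
print(positional_paths(ET.fromstring(compact)))
print(positional_paths(ET.fromstring(compact)) == positional_paths(ET.fromstring(spread)))
```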

There are also other issues related to real-life use of NVDL. Sometimes, namespace-based dispatching is not sufficient, as namespaces do not differentiate between different language versions or variants. As an example, consider the strict, transitional and frameset variants of XHTML. All three language variants share the same namespace (http://www.w3.org/1999/xhtml), but each of them needs to be validated against a different schema. A similar situation occurs with different language versions. Both XSLT 1.0 and XSLT 2.0 share the same namespace, but XSLT 2.0 defines many additional elements and attributes and even changes the content model of some elements.

Example 13. Version information inside XSLT 2.0 stylesheet

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  ...
</xsl:stylesheet>

Example 14. Version information in XHTML document

<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml">
  ...
</html>

To face versioning issues, JNVDL offers a proprietary extension which allows enhanced conditions to be used in NVDL namespace rules. Not only the namespace of a particular input fragment determines when a particular NVDL rule is triggered, but also the doctype of the validated document or an arbitrary XPath expression. This mechanism should cover most issues with vocabulary versioning, which is handled inconsistently across different languages.

Example 15. JNVDL validation dispatching based on XPath expression

The version attribute value is evaluated and, depending on the result, the appropriate fragment is sent to be validated against either an XSLT 1.0 or an XSLT 2.0 schema.

<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"
       xmlns:jnvdl="http://jnvdl.sf.net">
  <namespace ns="http://www.w3.org/1999/XSL/Transform"
             jnvdl:useWhen="@version = '1.0'">
    <validate schema="xslt1.xsd"/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/XSL/Transform"
             jnvdl:useWhen="@version = '2.0'">
    <validate schema="xslt2.rng"/>
  </namespace>
</rules>
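
The outcome of these two rules can be mimicked in Python as follows. The sketch only reproduces the decision taken for a stylesheet root element; it is not how JNVDL evaluates jnvdl:useWhen, and the function name schema_for is hypothetical. The schema file names are those used in the rules above.

```python
import xml.etree.ElementTree as ET

XSLT_NS = "http://www.w3.org/1999/XSL/Transform"
SCHEMA_BY_VERSION = {"1.0": "xslt1.xsd", "2.0": "xslt2.rng"}

def schema_for(stylesheet_text):
    """Pick the subschema for an XSLT stylesheet from its version attribute."""
    root = ET.fromstring(stylesheet_text)
    if not root.tag.startswith("{" + XSLT_NS + "}"):
        return None  # not in the XSLT namespace, so neither rule applies
    return SCHEMA_BY_VERSION.get(root.get("version"))

stylesheet = (
    '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
    'version="2.0"/>'
)
print(schema_for(stylesheet))
```

A version value which is covered by neither rule yields no schema in the sketch; what a dispatcher should do with such a fragment is a policy decision left to the NVDL script.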

For more information about JNVDL refer to [JNVDL].

The user interface

Users may access Relaxed validation functionality through an HTML user interface with their Web browser. The only mandatory parameter to start a validation process is the URL of the source document or a locally stored file path. But users can specify a number of additional validation parameters as well. At first, they choose what schema to use to validate their document. Schemas in the Relaxed schema repository are grouped into several predefined groups; e. g. Web, DocBook, SemanticWeb and others. Within each group, users may further select a particular schema to be used. For example in the Web group, users may validate against standalone XHTML 1.0, XHTML 1.0 with WCAG 1.0, XHTML+SVG and many other predefined schemas contained in the Relaxed schema repository.

Each group may contain additional parameters for adjusting the validation process. For example, the "doctype" select box in the Web group allows users to force Relaxed to handle their documents as a different document type than declared.

Some of the additional validation parameters are shared across all schema groups. Examples are the "view source" parameter, which appends the complete source of the document to the bottom of the validation output and links the error messages to the corresponding source lines, and the "brief output" parameter, which hides messages with a low severity level when checked. Further, users may choose to disable or enable loading of external entities, to use a predefined entity set from HTML, MathML or ISO, or to process XInclude within validated instances.

Figure 2: Relaxed user interface

In addition to predefined schema groups, Relaxed has been enhanced to support validation against user-defined schemas. NVDL is a simple and easy-to-understand language for defining compound document validation, and that's why it makes sense to enable users to define their own ad-hoc schemas, to have full control over the validation dispatching process and to validate their own custom compound documents. Users may reference external subschemas or use subschemas which are part of the Relaxed schema repository available at http://<relaxed-server>/schema/*.

To give NVDL tenderfoots a quick start, Relaxed offers the "namespace restaurant" feature. In the namespace restaurant, users may choose vocabularies which are present in their custom compound documents from the vocabulary menu. Finally Relaxed generates a simple NVDL schema which allows and validates all the selected vocabularies in any context of the validated instance and rejects any other vocabularies. Such schema may be further edited and modified by the user before finally being used in the validation process.
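
What such a generated schema might look like can be sketched in Python. The generator below is an illustration only and does not reproduce the actual Relaxed implementation: the function name generate_nvdl and the schema locations in the usage example are hypothetical. Each selected vocabulary gets a namespace rule with a validate action, and every other namespace is rejected.

```python
from xml.sax.saxutils import quoteattr

NVDL_NS = "http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"

def generate_nvdl(selected):
    """Build an NVDL script from a mapping of namespace URI to schema location."""
    lines = [f'<rules xmlns="{NVDL_NS}">']
    for ns_uri, schema in selected.items():
        # quoteattr escapes the value and adds the surrounding quotes.
        lines.append(f"  <namespace ns={quoteattr(ns_uri)}>")
        lines.append(f"    <validate schema={quoteattr(schema)}/>")
        lines.append("  </namespace>")
    # Everything the user did not pick from the menu is rejected.
    lines.append("  <anyNamespace><reject/></anyNamespace>")
    lines.append("</rules>")
    return "\n".join(lines)

# Hypothetical menu selection; the schema file names are placeholders.
print(generate_nvdl({
    "http://www.w3.org/1999/xhtml": "xhtml.rng",
    "http://www.w3.org/2000/svg": "svg.rng",
}))
```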

Figure 3: Namespace restaurant

Legacy HTML support

For the Web schema group, Relaxed keeps backward compatibility with HTML 4.x documents, which are not XML-based. For that reason the TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) library has been integrated into the Relaxed project. TagSoup is a SAX-compliant parser which allows standard XML tools to be applied to real-life HTML documents. Importantly, the TagSoup architecture guarantees well-formed output under all circumstances, without throwing any syntax errors. This means that Relaxed always gets well-formed input from any HTML 4.x document.

TagSoup repairs, for example, missing end tags, unknown entities, attribute minimization, overlapping tags and other XML well-formedness violations. A modified version of TagSoup used within Relaxed also reports some SGML violations which are later corrected, so that end users are informed about all errors. Support for other non-XML languages is planned in Relaxed, for example support for validation of HTML 5.

Example 16. Fixing overlapping tags

before TagSoup:

<p> <i> Hello,</p> world! </i>

after TagSoup:

<p> <i> Hello,</i></p><i> world! </i>
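
The repair shown above can be reproduced with a small Python sketch based on the standard html.parser module. It is not TagSoup's algorithm, only one simple strategy with the same effect on this input: when an end tag arrives for an element which is not the innermost open one, the inner elements are closed first and reopened afterwards. The class and function names are hypothetical, and only a handful of empty elements are special-cased.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}  # assumed empty elements

class OverlapFixer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of (name, serialized start tag)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value or "")) for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
            return
        self.open_tags.append((tag, f"<{tag}{attr_text}>"))
        self.out.append(self.open_tags[-1][1])

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self.open_tags]:
            return  # stray end tag: nothing to close
        reopened = []
        while self.open_tags[-1][0] != tag:
            inner = self.open_tags.pop()  # close overlapping inner elements
            self.out.append(f"</{inner[0]}>")
            reopened.append(inner)
        self.open_tags.pop()
        self.out.append(f"</{tag}>")
        for inner in reversed(reopened):  # restart them after the closed element
            self.open_tags.append(inner)
            self.out.append(inner[1])

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        while self.open_tags:  # supply missing end tags
            self.out.append(f"</{self.open_tags.pop()[0]}>")

def fix_overlaps(markup):
    fixer = OverlapFixer()
    fixer.feed(markup)
    fixer.close()
    return "".join(fixer.out)

print(fix_overlaps("<p> <i> Hello,</p> world! </i>"))
```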

Future work

There is a wide area open for further extensions and improvements of both projects, JNVDL and Relaxed. A large number of different compound languages could be formalized in NVDL and made part of the Relaxed predefined schema repository to enable their out-of-the-box validation. The Relaxed schema repository could be enhanced with additional schema groups to cover other areas of compound document usage, e. g. SOAP, JSPs and others.

The Relaxed GUI could be enhanced for a better user experience by making it easier to use and by providing additional features, e. g. validation of several linked Web documents in one process or a graphical interface for easier NVDL editing. A searchable, annotated, user-maintained schema repository could be added to the Relaxed interface as well, allowing the Relaxed community to maintain and share their own compound document schemas. Relaxed validation error output could be made more verbose by implementing annotation support in the validators which are used and by annotating schemas in the repository.

Conclusions

This article discusses today's most pressing issues of XML document validation and proposes ways to face them. It shows how to maximize validation results using expressive schema languages and their combinations, how to modularize schema definitions to improve readability and maintenance, how to approach compound document validation smartly while reusing existing single-namespace schemas (whatever schema languages they are written in) and how to use XML-based validation tools and techniques to validate non-XML documents, e. g. HTML 4.01.

Moreover, this article introduces the Relaxed project, which is basically a Swiss Army knife for XML document validation tasks. It examines the most important features of Relaxed and also introduces the Web user interface of the validation service, which is available to authors of XML documents for validating their documents automatically. JNVDL, an open source implementation of the NVDL standard for compound document validation developed within the Relaxed project, is also briefly discussed. This article mentions issues which needed to be faced when bringing a new specification (NVDL) into real-life usage.

Both Relaxed and JNVDL are open source projects hosted at SourceForge, where the source code repository and binary packages are available.

Notes

1.

Compound documents are XML documents which consist of several mark-up vocabularies.

2.

This rule is called UPA (Unique Particle Attribution) in a W3C XML Schema terminology.

3.

Schematron can be easily embedded into any context of any Relax NG schema

4.

Those schemas were originally derived from the work of James Clark, "Modularization of XHTML in RELAX NG" (http://www.thaiopensource.com/relaxng/xhtml/).

5.

Inline elements in HTML may typically contain only text and other inline elements.

6.

Relax NG is namespace aware and thus supports validation of compound documents

7.

W3C validation service is accessible at http://validator.w3.org

8.

Relaxed validation service is accessible at http://relaxed.vse.cz

9.

In general, the Namespaces in XML recommendation [NS] solves problems of collision and recognition of elements and attributes from different mark-up vocabularies within one XML document. This is achieved through qualified names, which make elements or attributes from different vocabularies distinguishable even if they use the same local names.

10.

Subschema is defined in the NVDL specification as a schema referenced by the NVDL script.


Bibliography

[HTML-VAL] Nalevka, P.: Doplnkova validace HTML a XHTML dokumentu. University of Economics, Prague, 2003. Available at:

[HTML4] Raggett, D., Le Hors, A., Jacobs, I.: HTML 4.01 Specification. W3C, 1999. Available at:

[JNVDL] Kosek, J., Nalevka, P.: NVDL – a Breath of Fresh Air for Compound Document Validation In: XTech 2007 Proceedings. WWW 2007. May 15-18, 2007. Paris, France. Available at:

[MTHML] Carlisle, D., Ion, P., Miner, R., Poppelier, N.: Mathematical Markup Language (MathML) Version 2.0 (Second Edition). W3C, 2003. Available at:

[NS] Bray, T., Hollander, D., Layman, A., Tobin, R.: Namespaces in XML 1.0 (Second Edition). W3C, 2006. Available at:

[NVDL] Document Schema Definition Languages (DSDL) Part 4: Namespace-based Validation Dispatching Language NVDL. ISO/IEC 19757-4. 2006. Available at:

[RDF] Beckett, D.: RDF/XML Syntax Specification (Revised). W3C, 2004. Available at:

[RLXD] Kosek, J., Nalevka, P.: Relaxed – on the Way Towards True Validation of Compound Documents. In: WWW 2006 Proceedings. WWW 2006. May 23-26, 2006. Edinburgh, Scotland. Available at:

[RNG] Clark, J., Murata, M.: RELAX NG Specification. OASIS Committee Specification, 2001. Available at:

[SCHTAX] Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML Schema Languages using Formal Language Theory. 2004. Available at:

[SCHTR] Jelliffe, R.: The Schematron Assertion Language 1.5. Academia Sinica Computing Centre, 2002. Available at:

[SVG] Ferraiolo, J., Fujisawa, S., Jackson, J.: Scalable Vector Graphics (SVG) 1.1 Specification. W3C, 2003. Available at:

[WCAG1] Chisholm, W., Vanderheiden, G., Jacobs, I.: Web Content Accessibility Guidelines 1.0. W3C WAI, 1999. Available at:

[XHTML1] XHTML 1.0: The Extensible HyperText Markup Language (Second Edition). W3C, 2002. Available at:

[XHTMLMOD] Altheim, M., McCarron, S., Boumphrey, F., Dooley, S., Schnitzenbaumer, S., Wugofski, T.: Modularization of XHTML. W3C, 2001. Available at:

[XML] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F.: Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C, 2006. Available at:

[XMLSCH-DT] Biron, P., Malhotra, A.: XML Schema Part 2: Datatypes Second Edition. W3C, 2004. Available at:

[XMLSCH-ST] Thompson, H.S., Beech, D., Maloney, M., Mendelsohn, N.: XML Schema Part 1: Structures Second Edition. W3C, 2004. Available at:

[XSLT] Clark, J.: XSL Transformations (XSLT) Version 1.0. W3C, 1999. Available at:



Advanced approaches to XML document validation

Petr Nalevka [University of Economics, Prague, Department of Information and Knowledge Engineering]
petr@nalevka.com
Jirka Kosek [University of Economics, Prague, LISp (Laboratory for Intelligent Systems Prague)]
jirka@kosek.cz