On-the-fly Validation of XML Markup Languages using off-the-shelf Tools

Mikko Saesmaa
Pekka Kilpeläinen

Abstract

Validation of XML documents is often treated as a major operation, performed only at major transitions in the document's life cycle: after it has been created, or when it enters a new stage of processing. Users editing XML documents, on the other hand, would appreciate instantaneous feedback on the correctness of the document each time anything changes. Such on-the-fly validation can be implemented in an XML editor using the current version of Java and freely available XML tools. Our experience is that on-the-fly validation can be implemented easily, without introducing observable delays even on relatively large documents. To demonstrate this, we have built an experimental XML editor which validates documents on the fly after every modification. The editor supports editing of DTDs, as well as validation against DTDs and against schemas written in W3C XML Schema and RELAX NG.

Keywords: Editing/Authoring; Validating

Mikko Saesmaa

Mikko Saesmaa is an assistant of Computer Science at the University of Kuopio, Finland. He received his MSc in Computer Science in 2004. He is currently doing his postgraduate studies.

Pekka Kilpeläinen

Pekka Kilpeläinen is a professor of Computer Science at the University of Kuopio, Finland. He received his PhD in Computer Science at the University of Helsinki in 1993. His research interests are centered around the theory and practice of processing structured documents and XML. In academia, Prof. Kilpeläinen has been involved, for example, in designing tools such as a structured-text search tool called sgrep, an SGML transformation language called TranSID, and a declarative XML conversion language called XW.

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Mikko Saesmaa and Pekka Kilpeläinen. Reproduced with permission.

Introduction and Overview

In XML editing, there is a constant need to know whether the document is valid against its DTD or schema. To meet this need, an editor has to support on-the-fly validation, i.e., it has to validate the document every time its content changes. There should be no noticeable delays, so as not to annoy or distract the user, which raises questions about efficiency and scalability. The user would also like the editor to support as wide a range of XML markup languages as possible. This raises an architectural issue: how to make the editor flexible enough that validators are easy to plug in.

Our objective was to test the implementation of on-the-fly validation of XML documents with an experimental editor, which we built using Java and its API for XML Processing (JAXP) [Sun06a]. Since we used traditional validators instead of incremental ones (for ongoing research, see recent articles on incremental validation [BML04], [BPV04], and [BLS06]), an interesting question was how efficient on-the-fly validation would be with different document sizes. Our experimental editor, called Xeditor, can use several different validators for different XML markup languages, and provides on-the-fly validation to the user regardless of which of these validators is chosen. Presently it houses three validator implementations: the Xerces2 Java Parser (Xerces) [Apa06], the Sun Multi-Schema XML Validator (MSV) [Sun06b], and Jing [Cla03].

The original inspiration for our experiment came from a couple of articles in the TAG newsletter [Tra99a][Tra99b], which demonstrated how to program a validating XML editor using JavaScript and the XML DOM implementation of the Microsoft XML Parser. The resulting editor is lightweight and implements on-the-fly validation, but it is tied to the Microsoft XML parser and therefore lacks modularity and portability. In addition, no information is given on how well the implementation scales with respect to document size. A more recent product is the Architag XRay2 XML Editor (Note 1). On the downside, no specific information is given about its inner workings and, at present, it has limited schema support. Another XML editor, in active development, is the Oxygen XML Editor (Note 2). It incorporates a long list of features, including on-the-fly validation against DTDs and several schema languages, and it also supports adding custom validators. This editor is built using Java, like our experimental editor, so it benefits from the vast array of validator implementations made for this platform. Unfortunately, no information is available about its inner workings or scaling issues either. Two further editors, mentioned in [BPV04], are XMLMind (Note 3) and XMLSpy (Note 4). The inner workings of both of these editors remain unknown. For XMLMind, some information is shared about RELAX NG validation, which is supported using a trimmed version of Jing.

Xeditor supports validation against DTDs and against schemas written in XML Schema (XSD) [FaW04] and RELAX NG (RNG) [ClM01]. It is also easy to add support for any other schema language that has a JAXP 1.3 compliant validator implementation. The user interface of Xeditor, shown in Figure 1, consists of a top menu, an editing view, an information area, and a bottom information panel that displays the status of the document. In the figure, the user is editing the RELAX NG schema for RELAX NG itself. The terminating 'r' in the start tag of the top-level 'grammar' element has been deleted, and the corresponding error message is immediately displayed in the bottom bar. Note that the current user interface has been designed for experimenting with the instantaneous-feedback architecture; eventual end users of an XML editor would require a more polished look and feel.

Figure 1

The user interface of the Xeditor.

The rest of this paper is organized as follows. In Section 2 we discuss the technical architecture of Xeditor. In Section 3 we discuss the efficiency and scalability of Xeditor with selected test results. Finally, in Section 4 we summarize and discuss further research.

Supporting On-the-fly Validation

The Java Development Kit (JDK) comes with an easy-to-use yet powerful API for XML Processing (JAXP), which enables applications to parse and to transform XML documents. The architecture of JAXP is based on the factory design pattern, which allows the processor implementations to be selected at run time. Released in 2005, JAXP version 1.3 introduced a new schema-independent Validation Framework, which separates validation from the parsing process; in earlier versions, validation was an integral part of parsing. (An article by Brett McLaughlin [McL05] provides a nice introduction to the use of the JAXP 1.3 validation functionality.)

Validating XML document instances against their DTD was already straightforward in JAXP 1.2, using a SAX XMLReader created by a JAXP SAXParserFactory with validation turned on. In the following subsections we instead discuss three features of Xeditor that are neatly implemented using JAXP 1.3.

Validating instances against XSD/RNG schemas

The JAXP validation API consists of three main classes: SchemaFactory, Schema and Validator. The SchemaFactory class is a schema compiler. It reads external representations of schemas and prepares them for validation. The Schema class represents compiled schemas, which are immutable in-memory representations of a grammar. From a Schema object, a Validator object can be created, which can then be used to check an XML document against the schema. Figure 2 demonstrates how these classes are used in Xeditor. To support multiple schema languages, a separate SchemaFactory instance is configured for each language (notice the two SchemaFactory objects in Figure 2, the first for XML Schema and the second for RELAX NG). A suitable validator implementation is located automatically when a SchemaFactory is instantiated. When the user chooses a schema in Xeditor, the schema is compiled with the appropriate SchemaFactory instance (using its newSchema method). The resulting Schema object is used to create the actual Validator, which can then be used to validate document instances against this particular schema.

If the user chooses to change the schema, a new Schema is simply created by the appropriate SchemaFactory instance and used to create a new Validator instance. Only one Schema object and one Validator object are maintained at a time, since the time needed to recompile a schema is easily hidden in the time the user takes to select the schema from a file menu. Finally, the Validator is prepared for validating document instances by setting an ErrorHandler, which displays any error messages in Xeditor's information bar.

Figure 2

Initializing the validator for a chosen schema.
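
The initialization described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Xeditor's actual code: the class name SchemaValidationSketch and its methods are hypothetical, the schema and the document are passed as strings, and messages are collected in a list that stands in for the information bar. Only W3C XML Schema is exercised, because the JDK bundles no RELAX NG implementation; for RELAX NG the language constant would be XMLConstants.RELAXNG_NS_URI, with a JAXP-compliant validator on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: one compiled schema and one reusable Validator. */
class SchemaValidationSketch {
    private final Validator validator;

    /** Compiles the schema once; schemaLanguage is e.g. XMLConstants.W3C_XML_SCHEMA_NS_URI. */
    SchemaValidationSketch(String schemaLanguage, String schemaText) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(schemaLanguage);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(schemaText)));
            this.validator = schema.newValidator();
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
    }

    /** Validates the document text in memory and returns the collected error messages. */
    List<String> validate(String documentText) {
        final List<String> messages = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void error(SAXParseException e) { record(e); }
            @Override public void fatalError(SAXParseException e) { record(e); }
            private void record(SAXParseException e) {
                messages.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
        });
        try {
            validator.validate(new StreamSource(new StringReader(documentText)));
        } catch (SAXException | IOException e) {
            // The parser stops after a fatal error; keep its message if nothing was recorded.
            if (messages.isEmpty()) {
                messages.add(String.valueOf(e.getMessage()));
            }
        }
        return messages;
    }
}
```

Changing the schema then amounts to constructing a new SchemaValidationSketch, while each call to validate reuses the same Validator for the currently chosen schema.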

Support for established markup languages, such as XSLT [Cla99], and the schema languages XML Schema and RELAX NG themselves was easy to implement, using publicly available schemas and validators as black boxes. Presently Xeditor has Jing and MSV as RELAX NG validators and Xerces as an XML Schema validator. Both MSV and Xerces are used through JAXP 1.3 APIs, but due to certain difficulties in implementation, Jing is used through its native APIs.

The editing view of Xeditor is based on the Java Swing Document interface. The Document interface represents in this case the entire text of the document instance. On-the-fly functionality of Xeditor was realized by implementing a DocumentListener interface, listening to changes made to the document instance and calling the Validator with the document instance as a parameter any time there is a change (see Figure 3). Any errors encountered while validating are intercepted with a SAX ErrorHandler implementation and reported to the user. Efficiency is essential in order to avoid disruptive delays. For this reason the document instance is passed to the Validator as an in-memory stream, instead of circulating it through an external file.

Figure 3

Implementation of on-the-fly functionality in Xeditor.
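
The listener arrangement can be sketched as follows. This is an illustrative sketch rather than the code of Xeditor: the class OnTheFlyListenerSketch, its status field (standing in for the information bar) and the helper statusAfterTyping are hypothetical names. As a simplification, no ErrorHandler is set, so the Validator falls back to its default behaviour of throwing on the first error, and the sketch reports that exception's message.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical listener that revalidates the whole text after every edit. */
class OnTheFlyListenerSketch implements DocumentListener {
    private final Validator validator;
    private String status = "not validated yet"; // stand-in for the information bar

    OnTheFlyListenerSketch(Validator validator) {
        this.validator = validator;
    }

    String status() {
        return status;
    }

    @Override public void insertUpdate(DocumentEvent e) { revalidate(e.getDocument()); }
    @Override public void removeUpdate(DocumentEvent e) { revalidate(e.getDocument()); }
    @Override public void changedUpdate(DocumentEvent e) { /* attribute changes only */ }

    private void revalidate(Document doc) {
        try {
            // Pass the text as an in-memory stream instead of writing it to a file.
            String text = doc.getText(0, doc.getLength());
            validator.validate(new StreamSource(new StringReader(text)));
            status = "valid";
        } catch (SAXException | IOException | BadLocationException ex) {
            status = "error: " + ex.getMessage();
        }
    }

    /** Demo helper: types text into a fresh document and returns the resulting status. */
    static String statusAfterTyping(String xsdText, String typedText) {
        try {
            Validator validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsdText)))
                    .newValidator();
            OnTheFlyListenerSketch listener = new OnTheFlyListenerSketch(validator);
            Document doc = new PlainDocument();
            doc.addDocumentListener(listener);
            doc.insertString(0, typedText, null); // notifies the listener synchronously
            return listener.status();
        } catch (SAXException | BadLocationException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

In an editor, the same listener would be registered on the Document of the text component, so that every insertion or removal triggers a validation of the current buffer.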

The relative merits of XML Schema and RELAX NG are a matter of considerable debate. It was therefore satisfying to observe that validation with either of them was equally easy to support, so that users can make their own choice.

Validating XSD/RNG schemas

The validation of schemas works similarly to the validation of instances against schemas. Meta schemas are available for both XSD and RNG, and they can be used as schema files in the validation process. In this way, the schema that the user is editing is the document instance, which is validated against a meta schema of the corresponding schema language. An example of this was shown in Figure 1, where the schema for RELAX NG is being edited and validated against a meta schema for RELAX NG. As a result, an error was found and displayed instantly to the user.

A JAXP SchemaFactory class could also be used as a standalone schema validator, by giving it the document instance as a schema to be compiled. A SchemaFactory reports any errors it encounters through a SAX ErrorHandler implementation. Applying a tailored schema language compiler in this way could give us more specific error messages. On the other hand, it could be less efficient, because the schema would be compiled in addition to being validated. Furthermore, it would break the homogeneous architecture of validating schemas in Xeditor. For these reasons, we decided against using the SchemaFactory class this way.
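
For comparison, the alternative that was decided against could look roughly like the following sketch. It is purely illustrative and is not part of Xeditor; the class and method names are hypothetical. The edited schema text is handed to SchemaFactory.newSchema, and whatever the schema compiler reports through the ErrorHandler is collected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: checks a schema by compiling it. */
class SchemaCompileCheckSketch {

    /** Returns the messages reported while compiling schemaText; empty if it compiles cleanly. */
    static List<String> check(String schemaLanguage, String schemaText) {
        final List<String> messages = new ArrayList<>();
        SchemaFactory factory = SchemaFactory.newInstance(schemaLanguage);
        factory.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void error(SAXParseException e) { messages.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) { messages.add(e.getMessage()); }
        });
        try {
            // The compiled Schema is discarded; only the reported errors are of interest.
            factory.newSchema(new StreamSource(new StringReader(schemaText)));
        } catch (SAXException e) {
            if (messages.isEmpty()) {
                messages.add(String.valueOf(e.getMessage()));
            }
        }
        return messages;
    }
}
```

The sketch makes the trade-off visible: every check pays for a full schema compilation whose result is thrown away.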

Editing DTDs

We wanted Xeditor to support editing of DTDs and to provide immediate feedback on their correctness, too. A couple of obvious problems had to be overcome to reach this goal. To start with, the DTD formalism is not an XML-based markup language, and thus DTDs cannot be validated against any XML DTD or schema. Of course, any XML processor has to be able to check that the declarations given in the prolog of a document instance are well-formed. Thus, a first solution would have been to wrap the DTD being edited inside the internal DTD subset of a dummy document, and to pass this document to the parser. This would have been only a partial solution, though: we wanted to support editing of independent sets of declarations, which could be stored and used as external DTD subsets. XML poses slightly different constraints on the contents of external and internal DTD subsets, e.g., with respect to the use of parameter entities. Thus we need to pass the declarations to the parser as an external DTD subset for them to be processed correctly.

Storing the contents of the editor in an external file, to be loaded by the parser after potentially every keystroke, would cause noticeable delays. The trick we applied is to pass the contents of the editor buffer to the parser as a DTD subset which is logically an external entity of the document, but physically an in-memory object. This is realized by implementing a SAX EntityResolver and registering it with the parser, such that when the parser invokes its resolveEntity method for the entity of interest, the contents of the editor buffer are returned to the parser. As with the validation of instances and schemas discussed earlier, any error messages are displayed by an ErrorHandler in the information bar of the Xeditor user interface.

A dummy document that is used for passing the DTD entity to the parser is shown in Figure 4, and the organization for passing the DTD as an in-memory external entity to the parser is shown as a diagram in Figure 5.

Figure 4
<!DOCTYPE foo SYSTEM "Xeditor DTD" [<!ELEMENT foo EMPTY> ]> <foo/>

The dummy document used for validating DTDs. The string 'foo' is replaced with a random string.

Figure 5

Passing the DTD to the parser as an in-memory external entity.
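
A minimal sketch of this arrangement is given below. It is not the code of Xeditor: the class DtdCheckSketch is hypothetical, and the sketch assumes an absolute system identifier (urn:x-xeditor:dtd-buffer) in place of the "Xeditor DTD" string of Figure 4, so that the identifier reaches the resolver unchanged and can be compared directly. The root element name is generated randomly, as in Figure 4, to make a clash with the names declared in the edited DTD unlikely.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: checks DTD text as an in-memory external subset. */
class DtdCheckSketch {
    // Assumed identifier for the editor buffer; any other entity is resolved normally.
    private static final String BUFFER_ID = "urn:x-xeditor:dtd-buffer";

    static List<String> check(String dtdText) {
        final List<String> messages = new ArrayList<>();
        // Random root name, so that the dummy declaration does not collide with the DTD.
        String root = "r" + UUID.randomUUID().toString().replace("-", "");
        String dummy = "<!DOCTYPE " + root + " SYSTEM \"" + BUFFER_ID + "\" "
                + "[<!ELEMENT " + root + " EMPTY>]><" + root + "/>";
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // makes sure the external subset is read
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // Hand the editor buffer to the parser when it asks for the external subset.
            reader.setEntityResolver((publicId, systemId) ->
                    BUFFER_ID.equals(systemId) ? new InputSource(new StringReader(dtdText)) : null);
            reader.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { messages.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) { messages.add(e.getMessage()); }
            });
            reader.parse(new InputSource(new StringReader(dummy)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            if (messages.isEmpty()) {
                messages.add(String.valueOf(e.getMessage()));
            }
        }
        return messages;
    }
}
```

Because the buffer is delivered as an external subset, declarations that are legal only there, such as a parameter-entity reference inside an element declaration, are accepted.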

An alternative approach to validating DTDs would be first to translate them into an XML markup form, and then to validate the translated version against an appropriate DTD or schema. For example, NekoDTD [Cla04] is a tool that translates DTDs to corresponding XSD or RNG schemas. There are some obstacles to using such a translating approach for on-the-fly validation, though. First, translation normally requires the source to be correct. A document that is currently being edited is ill-formed most of the time, which makes translation to another form very difficult. Another problem is that even when a translation is possible, it tends to lose some information about the source. In our case we wanted to provide immediate feedback to the user in the form of error messages pertaining to the markup that the user is currently editing. Error messages pertaining to the result of a translation would not be meaningful to the user, and would thus need to be translated back so that they refer to the DTD notation visible to the user. For these reasons we preferred to utilize the capabilities of an XML parser to check the correctness of the DTD syntax directly.

Efficiency and Scalability

On-the-fly validation was surprisingly easy to add to Xeditor using JAXP 1.3. We also wanted to know how efficient and scalable this kind of validation would be in practice. A natural hypothesis was that the time usage would depend linearly on the length of the document scanned before encountering an error. Taking into account that any delay longer than 100 ms is noticed by an average user, the interesting point was to see how big the constant factor would be with different validator implementations, since it determines the size of documents that can be edited without observable delays.

Test arrangements

To test the efficiency and scalability of on-the-fly validation in Xeditor, we coded a separate Java class for timing purposes. In it, we used Java's Date class and its getTime method to measure the time spent on validation in milliseconds. This was done by instantiating two Date objects, the first just before and the second just after the validation process, reading the time from each with the getTime method, and then calculating the difference between the end and start times. This measurement took place every time the document changed.
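
The measurement described above can be sketched as follows. The class name ValidationTimerSketch and the use of a Runnable to represent the validation step are assumptions made for illustration; the timing class actually used in the tests is not shown in this paper.

```java
import java.util.Date;

/** Hypothetical timing helper following the Date-based scheme described above. */
class ValidationTimerSketch {

    /** Runs the given validation step and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable validation) {
        Date start = new Date();   // taken just before validation
        validation.run();
        Date end = new Date();     // taken just after validation
        return end.getTime() - start.getTime();
    }
}
```

A call such as ValidationTimerSketch.timeMillis(() -> runValidation()), where runValidation is a placeholder for the actual validation call, would then be made on every document change. Date.getTime has millisecond resolution at best, which is one reason for repeating each measurement and averaging the results; System.nanoTime would be a finer-grained alternative.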

The tests were conducted on a fairly modest Intel-based desktop PC with a 1.7 GHz Pentium 4 processor and 512 MB of main memory. We used Java version 1.5.0_06 running on the GNU/Linux operating system (Fedora Core release 4).

For each test case we loaded a test document instance in Xeditor, chose a validator and a schema file, and then made a number of errors at certain points in the document to see how long it would take to validate the document up to each of these error points. The first batch of errors was made at 20% of the document length, the second at 40%, the third at 60%, the fourth at 80% and the fifth at 100%. An error batch consisted of making the same error ten times at the same place. From these measurements we manually removed obvious deviations and then calculated the average value for each of the five error points. (We consider the type of error to be immaterial to the time usage, which is dominated by the part of the document that is successfully scanned before the error is encountered.)

Test results

The validation performance test, whose results are shown in Figure 6, was arranged as follows. We used the official XML Schema schema file (Note 5), which is 2,535 lines long, as the document instance in the editor. We then made errors at regular intervals (on lines 507, 1,015, 1,521, 2,029 and 2,535) while validating it with Jing, MSV and Xerces. Jing and MSV were in RELAX NG mode and used a RELAX NG schema for W3C XML Schema (Note 6) as the schema file. Xerces was in native XML Schema mode and used the previously mentioned XML Schema schema as the schema file.

Figure 6

Performance of on-the-fly validation of various initial portions of XMLSchema.xsd.

We observed in our testing that, as expected, the time usage depends linearly on the length of the document scanned before encountering an error. This was true for all three validators. We also noticed repeatedly during testing that Jing was faster than both MSV and Xerces. Taking into account the complexity of the XML Schema schema, and the fact that we used traditional validators in our on-the-fly implementation, the results are surprisingly good on moderate hardware.

One of our tests aimed to see how efficient these validators are with large documents, that is, how large a document can be edited with the on-the-fly validation of Xeditor before the response times grow too long (say, above 100 ms). In this test we used an XML Schema document from the HL7 V3 distribution called voc.xsd (Note 7). The document consists of 847,156 bytes on 18,170 lines. We again made errors at 20%, ..., 100% from the start of the document (that is, on lines 3,633, 7,264, 10,895, 14,526, and 18,170). The validators and schema files used were the same as in the test described earlier. The test results are shown in Figure 7. With Jing as the validator, we could edit this large document without disruptive delays in response times. This was not possible with MSV or Xerces as the validator, because the response times grew too long. Of course, it is rather unusual to use editors for such large documents.

Figure 7

Performance of validating initial portions of a large document (voc.xsd).

Discussion

We have discussed the design and some experiences of an architecture that supports immediate validation of XML documents that are being edited, against schemas expressed using XML DTD, XML Schema, or RELAX NG. For example, the validation of schema documents written in XML Schema or RELAX NG, and of XSLT stylesheets was easy to realize by plugging in schemas of these languages into the editor. A recent version (1.3) of the JAXP interface makes it relatively easy to implement such a system using Java and freely available XML tools. Further, the architecture allows support for other schema languages to be included easily, too, provided that a JAXP interface for a validator of the language is available. Our experiments verified that the approach can be implemented efficiently, without observable delays while editing a document, and that the implementation scales well also to relatively large document instances. An arrangement to check the syntactic correctness of stand-alone DTDs (which are actually not XML documents) was also implemented and discussed.

The architecture also provides an interesting test-bed for validator implementations. While the editor-based approach does not support rigorous and systematic testing with collections of test cases, the immediacy of its feedback makes it easy to check how validators behave on multiple variations of document structures. For example, we observed, without intentionally trying to compare validator functionality, that MSV and Jing spot slightly different errors in documents modeled by a common schema.

We are planning to use Xeditor as a tool for teaching XML markup languages to our students. The immediate feedback provided by on-the-fly validation should help them tackle the syntactic details of the various XML-based languages. Verifying this hypothesis would be an interesting topic for further study.

Another direction for further study would be to make the validation incremental, that is, to let the validator restrict its checking to the immediate neighborhood of the modification. This should increase the scalability of the system drastically, since in most cases the time of re-validation would not depend on the total size of the document at all. Incremental XML validation in general has been considered by others (see [BML04], [BPV04], and [BLS06]). What we would find interesting is to continue studying the practical usability of standard validator interfaces: how can they be applied, or how do they need to be extended, in order to validate only the part of the document that is affected by the latest modification?

Notes

1. Available at http://www.architag.com/xray/.

2. Available at http://www.oxygenxml.com/.

3. Available at http://www.xmlmind.com/xmleditor/

4. Available at http://www.altova.com/products/xmlspy/xml_editor.html/

5. Available at http://www.w3.org/2001/XMLSchema.xsd

6. Available at http://www.jenitennison.com/schema/xmlschema.rng

7. Available at http://www.hl7.org/v3ballot/html/processable/coreschemas/voc.xsd


Bibliography

[Apa06] Apache Software Foundation. Xerces2 Java Parser, 2006. http://xerces.apache.org/xerces2-j/.

[BLS06] D. Barbosa, G. Leighton, and A. Smith. Efficient incremental validation of XML documents after composite updates. In Database and XML Technologies, Proceedings of the 4th International XML Database Symposium, XSym 2006, pages 107-121, Seoul, Korea, September 2006. Springer-Verlag.

[BML04] D. Barbosa, A. O. Mendelzon, L. Libkin, L. Mignet, and M. Arenas. Efficient incremental validation of XML documents. In ICDE '04: Proceedings of the 20th International Conference on Data Engineering, page 671, Washington, DC, USA, 2004. IEEE Computer Society.

[BPV04] A. Balmin, Y. Papakonstantinou, and V. Vianu. Incremental validation of XML documents. ACM Trans. Database Syst., 29(4):710-751, 2004.

[Cla03] J. Clark. Jing, a RELAX NG validator in Java. Thai Open Source Software Center Ltd, 2003. http://www.thaiopensource.com/relaxng/jing.html.

[Cla04] A. Clark. CyberNeko DTD Converter, 2004. http://people.apache.org/~andyc/neko/doc/dtd/.

[Cla99] J. Clark. XSL Transformations (XSLT) Version 1.0. W3C Recommendation, November 1999. http://www.w3.org/TR/xslt/.

[ClM01] J. Clark and M. Murata. RELAX NG Specification. OASIS, December 2001.

[FaW04] D. C. Fallside and P. Walmsley. XML Schema Part 0: Primer Second Edition. W3C Recommendation, October 2004. http://www.w3.org/TR/xmlschema-0/.

[McL05] B. McLaughlin. JAXP validation. IBM developerWorks, 2005. http://www-128.ibm.com/developerworks/java/library/x-jaxpval.html.

[Sun06a] Sun Microsystems, Inc. Java API for XML Processing (JAXP), 2006. http://java.sun.com/webservices/jaxp/index.jsp.

[Sun06b] Sun Microsystems, Inc. Sun Multi-Schema XML Validator, 2006. http://www.sun.com/software/xml/developers/multischema/.

[Tra99a] B. E. Travis. Real-time XML Editor: A Technology Preview. TAG Newsletter, 13(4), 1999.

[Tra99b] B. E. Travis. Real-time XML Editor: Part II. TAG Newsletter, 13(5), 1999.


