Validation algorithm for attribute-element constraints of RELAX NG

Makoto Murata
eb2m-mrt@asahi-net.or.jp
Haruo Hosoya
hahosoya@kurims.kyoto-u.ac.jp

Abstract

Patterns of RELAX NG can represent interdependencies between attributes and elements. Such patterns are useful for schema authors. However, they make validator implementation algorithmically challenging, since naive approaches easily blow up even for typical inputs. James Clark has provided a derivative-based validation algorithm for handling such interdependencies. This paper shows another algorithm, which is based on automata rewriting. When the validator encounters an attribute set, it rewrites automata that are constructed from patterns. Then, the validator examines an element sequence by executing the rewritten automata for the sequence. Advantages of our algorithm are compactness and portability.

Keywords: RelaxNG; Validating

Makoto Murata

Murata graduated from Kyoto University. He was a member of the original XML WG. He has advocated tree automata for SGML/XML since 1994. He designed RELAX Core and RELAX Namespace, which have led to RELAX NG of OASIS and ISO/IEC and DSDL Part 4, respectively.

Haruo Hosoya

Hosoya obtained his Ph.D. in computer science from Tokyo University in 2000. He has designed and implemented XDuce.

Validation algorithm for attribute-element constraints of RELAX NG

Makoto Murata [IBM, Tokyo Research Laboratory]
Haruo Hosoya [Kyoto University, Research Institute for Mathematical Sciences]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 Murata Makoto & Haruo Hosoya. Reproduced with permission.

Introduction

Schema languages for XML have evolved by increasing expressiveness, allowing finer- and finer-grained controls over the structure of documents. At first, the designers of schema languages (DTD [BrPaSpe], W3C XML Schema [Fallside01], DSD [KlaMoSch], and RELAX Core [Murata] mainly paid attention to elements. On the other hand, the treatments of attributes have been rather simplistic in early schema languages. Recently, James Clark, in his schema language TREX [Clark01a], proposed a description mechanism for attributes that has a comparable power to existing ones for elements. RELAX NG [ClaMu] inherits this mechanism from TREX.

The kernel of Clark’s proposal is a uniform and symmetric mechanism for representing constraints on elements and those on attributes; constraints on elements and those on attributes can be freely mixed in RELAX NG patterns [Clark01b]. We call this mechanism attribute-element constraints. The expressive power yielded by attribute-element constraints is quite substantial. For example, by using this mechanism, we can specify a constraint that allows different subelements, depending on attribute values. Such a situation may arise when we want to specify different permissible subelements for <div class="chapter"> and <div class="section">, where the former case allows section in the subelements while the latter case does not. As another example, we can specify that either an attribute or element is used to represent some piece of information. This ability would be needed when we want to allow an attribute name to represent a simple string name or an element name to represent a composite value of first and last names.

Exactly because of the ability to mix constraints for elements and those for attributes, development of validation algorithms becomes challenging. As an example, we consider the following RELAX NG pattern. Observe that attribute patterns follow element patterns and that the presence or absence of attributes determines the presence or absence of elements. If we intend to avoid exhaustive backtracking, validation against this pattern is not straightforward.

((element a {empty}+, attribute a{empty})?,
                (element b {empty}+, attribute b{empty})?

Naively, one may attempt to use traditional automata techniques, as usually done in the case of element-only validation. This is possible since elements are ordered sequences. However, as XML defines, attributes are unordered sets; therefore, we cannot easily find a valid ordering of attributes and elements. For example, given an attribute set {a="", b=""} and an element sequence <a/><b/>, we have to find one of the possible orderings, namely <a/> a="" <b/> b="", and then validate this ordered sequence against the above pattern.

Another attempt might be to transform attribute-element constraints so that attribute constraints are isolated from element ones. This transformation is done by moving attribute constraints to the beginning of patterns. Such separation allows validation of {a="", b=""} followed by that of <a/><b/>. However, this turns out to blow up in common examples, as will be discussed in Section 5.

We have developed a validation algorithm that overcomes this difficulty. Our validation algorithm uses a "two-phase" strategy. We use a variant of traditional automata that mix both attribute and element constraints. In the first phase, we validate attributes sets, rewriting the automaton into one that represents only the element constraints. In the second phase, we validate element sequences against the rewritten automaton. By this technique, we can achieve quite a simple validation algorithm that does not incur a blow-up.

Related works

James Clark [Clark02] has designed a different validation algorithm for attribute-element constraints and implemented it in his validator (i) for TREX and RELAX NG. His algorithm is based on derivatives of regular expressions ([Brzozowski], [BerSet]) and proceeds by rewriting a pattern each time it encounters an attribute or element. Meanwhile, our algorithm allows preprocessors to generate automata from a given schema. Validators use these automata rather than the original RELAX NG schema.

Our algorithm provides two advantages. First, validators can be very compact, since handling of schemas is delegated to preprocessors. In fact, we have build a compact implementation (Miaow [Murata02]) that works on cellular phones. Second, we can support many platforms by porting validators only. An implementation (Bali [Kawaguchi]) by Kohsuke Kawaguchi supports several platforms.

The derivative-based algorithm handles not only attribute-element constraints but also interleave of RELAX NG. On the other hand, our algorithm has to be further extended for handling interleave. Such an extension can be provided by shuffle automata, but is outside the scope of this paper. (Shuffle automata date back to 1970, but we reference to a recent paper [JedSze] for convenience.) As of this writing, Bali uses shuffle automata for implementing interleave, but Miaow does not support interleave yet.

An earlier version [HoMu02a] of this paper, which was presented at an informal workshop PLAN-X, covers both validation and boolean operations. A full version [HoMu02b] is also publicly available. However, this paper is intended to be more accessible and useful for implementors of RELAX NG validators. Wherever possible, we use the compact syntax of RELAX NG. We also show which restriction in RELAX NG allows our validation algorithm.

RELAX NG schemas

We use a variation of the simple syntax of RELAX NG. Any RELAX NG schema can be normalized to an equivalent schema in the simple syntax, as specified in Section 4 of [ClaMu].

For simplicity, we do not consider text, data, value, list, and mixed, although they can easily be handled in our framework. Furthermore, we do not consider interleave of RELAX NG.

We introduce two changes to the simple syntax.

First, every ref element appears as the second child of some element or attribute element. For example, consider a RELAX NG schema (in the compact syntax) as below.

default namespace = ""

start = foo
foo = element FOO { bar | foo }
bar =
  element BAR {
    attribute baz { empty },
    foo
  }

This schema is rewritten as below:

default namespace = ""

start = element FOO { _foo }
_foo =
  element BAR { _bar }
  | element FOO { _foo }
_bar =
  attribute BAZ { _baz },
  element FOO { _foo }
_baz = empty

In other words, (1) we do not specify element elements as children of define elements, but rather specify them as parents of ref elements, and (2) we introduce a define for the content of each attribute element. Is is obvious that this rewriting is possible.

Second, every attribute occurs as the child of oneOrMore and this oneOrMore does not have another oneOrMore as an ancestor. For example, the above schema is rewritten as below. Observe that the addition of + does not change the semantics of this pattern.

default namespace = ""

start = element FOO { _foo }
_foo =
  element BAR { _bar }
  | element FOO { _foo }
_bar =
  attribute BAZ { _baz }+
  element FOO { _foo }
_baz = empty

This rewriting is guaranteed by “7.1.2. oneOrMore pattern” and the second paragraph of “7.3. Restrictions on attributes” in the RELAX NG specification. For example, the following patterns cannot be rewritten, but the former is prohibited by 7.1.2 and the latter is prohibited by 7.3.

(element FOO {_foo}, attribute BAZ{_baz})+ 
(attribute * {_emp})

Attribute-free validation

We first consider validation for attribute-free schemas. To validate values against element expressions, we introduce element automata, which are variations of usual string automata. Element automata play critical roles in most validation algorithms [MuLeMa]. Later in this section, we introduce attribute-element automata by extending element automata.

An attribute-free schema is a RELAX NG schema without attribute elements. Since the body pattern of each define consists of group, choice, oneOrMore, and empty, we can assume that body pattern is a regular expression whose atoms are of the form N[x], where N is a name class and x is a name of define. For example, in the last schema shown in Section 3, the first define has an atom Foo[_foo].

From each of the body pattern of define element, we can construct an automaton whose transition labels are also of the form N[x]. Formally, we first define automata in the usual way: an automaton M on an alphabet Σ is a tuple (Q, q init,Q fin , δ) where Q is a set of states, q init (∈Q) is an initial state, Q fin (⊂Q) are a set of final states, and δ⊂ (Q ×Σ× Q) ∪ (Q × Q) is a transition relation [HoUl]. Note that we have allowed null transitions by incorporating Q × Q. Then, an element automaton is an automaton over {N[x] | NS, xX} , whereS is a set of name sets and X is a set of variables. Well-known algorithms for creating automata from regular expressions can be used for creating element automata from element expressions by assuming N[x] as symbols.

We give a brief sketch of a validation algorithm for attribute-free schemas, although understanding of this sketch is not required to read the rest of this paper. We validate a value by assigning variables to every element (including every descendant element) in this value. The variable assignment must “conform” to the schema. That is, for each element, let x be the variable assigned to this element and M be the element automaton constructed from the element expression defined asx in the schema. We execute the automaton for the children of the element in such a way that each child element a[v] matches a transition labeled N[x] when N contains a and x is assigned to this element a[v]. If we can successfully assign variables to all elements, we report that the value is valid. Although this algorithm may be too naive in that it tests all possible assignments, there is a known algorithm that takes only linear time in the size of the input value. Interested readers are referred to a note [MuLeMa].

Validation by separating attributes and elements

Attribute-element constraints pose a significant challenge. We cannot directly use element automata described above since our values are set-sequence pairs whereas element automata accept just sequences.

A naive solution is to separate constraints on attribute sets and those on element sequences. For example, we can handle

((attribute a {emp}+ | element a {emp}),
 (attribute b {emp}+ | element b {emp}))
by first converting it to the following expression
(((attribute a {emp}+, attribute b {emp}+),    empty) |
 (attribute a {emp}+,  element b {emp}) |
 (attribute b {emp}+,  element a {emp}) |
 (empty,    (element a {emp}, element b {emp}))) 
where emp is defined as emp = empty.

Observe that this pattern is a choice of c-e pairs, where c describes sets of attributes and e describes sequences of elements. After separating attribute constraints and element constraints, we can easily validate attribute sets and element sequences. Given an attribute setα and an element sequence β, we select those c-e pairs such that α is valid against c. We then validate β against e of each pair. Just like validation for attribute-free schemas, this is done by constructing element automata and executing it against β.

However, this approach causes combinatory explosion easily. For example, the above example created 4 (= 22) pairs, since one is constructed for each subset of {attribute a{emp}+, attribute b{emp}+}. If we apply the same separation to

((attribute a {emp}+ | element a {emp}),
 (attribute b {emp}+ | element b {emp}),...
 (attribute y {emp}+ | element y {emp}),
 (attribute z {emp}+ | element z {emp}))
we will have 226 pairs.

Attribute-element automata

We extend our attribute-free validation technique for handling attributes as well as elements. This extension introduces three steps: (1) we introduce special atoms (called “non-existent attribute declaration”) to the body pattern of each define element, (2) we create an automaton (called “attribute-element automaton”) from the define body, and (3) given an attribute set, we create an element automaton from this attribute-element automaton by rewriting some of its transitions. In an actual implementation, we can perform the first and second steps statically, i.e., before receiving XML documents . On the other hand, the third step can only be done dynamically. The rest of validation is the same as in attribute-free validation.

Below, for the ease of explanation, we first present the second and third steps. If we perform these two steps only, we sometimes report invalid values valid. We then present the first step as a remedy to this problem.

The second step creates an attribute-element automaton. Recall that an attribute always occurs as the child of some oneOrMore and that every ref element appears as the second child of some element or attribute element. We can, thus, assume that “atoms” of define bodies are of the form @N[x]+ or N[x]. For example, in the last schema shown in Section 3, the second define has atomsFOO[_foo]. and@BAR[_bar].

We can construct an automaton from the body pattern of each define. Some transitions of this automaton have @N[x]+ as labels, and others have N[x] as labels.

The third step creates an element automaton by rewriting some transitions of the constructed automaton. Given an attribute set α, we replace each @N[x]-transition by a null transition if (1) more than one attribute in α is valid against @N[x]+ and (2) no other attributes in α have names in N. Otherwise, we remove this transition. We preserve those transitions having N[x] as labels.

To illustrate, we use the aforementioned expression:

((attribute a {emp}+ | element a {emp}),
 (attribute b {emp}+ | element b {emp}))

An automaton constructed from this pattern is shown below:

[Link to open this graphic in a separate page]

Suppose that we would like to validate <... a=""><b/></...>. Since a="" is the only attribute, we replace the transition labeled @a[emp]+ by a null transition, and remove the transition labeled @b[emp]+. The element automaton obtained by this rewriting is shown below. Since the element sequence <b/> is accepted by this element automaton, we can correctly report that<... a="";><b/></...> is valid.

[Link to open this graphic in a separate page]

However, the element automaton also accepts <a/> <b/> and we, thus, mistakenly report that <... a=""><a/><b/><...> is valid. Furthermore, <... f=""><a/><b/><...> (where f is undeclared) is also mistakenly reported valid: although we remove the transitions for @a[emp] and @b[emp], the element automaton still accepts <a/> <b/>. What we missed is that, although we ensured that attributes required by patterns are present in instances, we did not ensure that attributes not required by patterns are absent in instances.

To overcome this problem, we introduce the first step. This step saturates the body pattern of each define by introducing non-existent attribute declarations. A non-existent attribute declaration in the compact syntax is of the form non-existent-attribute Nwhere N is a nameclass of RELAX NG. Informally, non-existent-attribute N means that no attributes have names inN. First, for each define in a given schema, we prepend a non-existent attribute declaration to the body pattern of the define so that this non-existent attribute declaration ensures that attributes not required by the body pattern are absent. Second, for each pattern of the form c1 | c2 occuring in c, we prepend a non-existent attribute declaration to c1 so that this non-existent attribute declaration ensures that attributes required by c2 but not required by c1 are absent. Likewise, we prepend another non-existent attribute declaration to c2.

Accordingly, we modify the second and third steps. We assume that non-existent-attribute N is an atom !@N. Then, an element-attribute automaton constructed by the second step has transitions having !@N as labels and transitions having N[x] or @N[x]. The third step is to rewrite these transitions. Given an attribute set α, we replace each !@N-transition by a null transition if no attributes in α have names in N. Otherwise, we remove this transition.

As an example, we again consider:

((attribute a {emp}+ | element a {emp}),
 (attribute b {emp}+ | element b {emp}))

By introducing non-existent attribute declarations, we saturate this pattern as below:

((non-existent-attribute (*-(a|b)),
 (attribute a {emp}+ | (non-existent-attribute a, element a {emp})),
 (attribute b {emp}+ | (non-existent-attribute b, element b {emp}))
where (*-(a|b)) denotes the set of names except a and b. From the saturated expression, we create an automaton shown below.

[Link to open this graphic in a separate page]

Now, this automaton has transitions labeled @a[emp]+ and @b[emp]+, as well as transitions labeled !@a, !@b, and !@(* - (a|b)). We call such automata attribute-element automata.

Suppose that we want to validate <... a=""><a/><b/></...>. As previously, we replace the transition labeled @a[emp]+ by a null transition and remove the transition labeled @b[emp]+. We further replace the transition labeled !@b by a null transition, since there are no attributes named b. We also replace the transition labeled !@(* - (a|b)) by a null transition, since no attributes have names other than a or b. But we remove the transition labeled !@a, since there is an attribute named a. The element automaton obtained by this rewriting (shown below) accepts <b/> only. Thus, we do not mistakenly report that <... a=""><a/><b/></...> is valid.

[Link to open this graphic in a separate page]

Let us make sure that undeclared attributes lead to invalidity. Suppose that an element has an attribute named f. Then, the transition labeled !@(* - (a|b)) is removed. Since this transition is the only transition from the start state, the created element automaton accepts no element sequences.

In actual implementations, the third step and the execution of element automata does not have to be separated. We can execute the third step while examining element sequences. That is, for each attribute or non-existent-attribute transition starting from one of the current states, we determine whether or not we can use that transition. In fact, this is the algorithm used for implementing Miaow. On the other hand, if we would like to report validation errors as soon as possible, our third step has to be more elaborated: whenever we encounter an attribute, we have to execute the third step and then check that the rewritten automata accepts a non-empty set.

We do not prove correctness of our algorithm even informally. However, it is important to point out that this correctness depends on the first paragraph of “7.3. Restrictions on attributes” in the RELAX NG specification. This paragraph prohibits duplicate attributes, such as (attribute a {empty}, attribute a{empty}). Had the RELAX NG not introduced this restriction, our algorithm would not work correctly.

It remains to formally describe the three steps.

The first step is provided by a saturating operator. This operator introduces non-existent attribute declarations for each alternative of patterns.

sat(element N {x}) = element N {x},
sat(attribute N{x}+) = attribute N{x}+,
sat(c1 | c2) = (non-existent-attribute (att(c2) - att(c1)), sat(c1))
                | (non-existent-attribute (att(c1) - att(c2)), sat(c2)),
sat(c1, c2) = (sat(c1), sat(c2)),
sat(empty) = empty,
sat(e+) = e+

Next, we extend the saturating operator for schemas. For each define in a given schema, we replace the body pattern p of this define by

(non-existent-attribute (* - att(p)), p)

The second step creates an attribute-element automaton from the body pattern of each define element. Formally, an attribute-element automaton M is an automaton over symbols N[x], @N[x]+, and !@N. We assume that a saturated pattern is a regular expression whose atoms are N[x], @N[x]+, or !@N. By applying a standard algorithm, we can construct an attribute-element automaton from the body pattern of each define, which is saturated.

sat(G)(x) for everyx

Finally, we present the third step as an operator for rewriting attribute-element automata. Given a set α of attributes, the rewriting operator creates an element automaton from an attribute-element automaton (Q, q init,Q fin, δ). The element automaton borrowsQ, q init, and Q fin as the state set, start states, and final states, respectively. However, the transition relation of the element automaton is created by rewriting δ.

First, we preserve transitions for elements. These transitions have labels of the form N[x.] Second, we rewrite transitions for attributes. For convenience, we use αN to denote a subset of α such that αN contains all attributes in α having names in N. For each transition in δ having @N[x] as a label, we introduce a null transition if αN is non-empty and the value of any attribute in αN is valid against x. Third, we rewrite transitions for non-existent attribute declarations: for each transition in δ having a label of the form !@N, we introduce a null transition if αN is empty.

Conclusions and future work

In this paper, we have presented an automaton-based algorithm for handling attribute-element contraints of RELAX NG. On the basis of this algorithm, we have implemented a RELAX NG validator called Miaow. We have also used our algorithm for the XDuce language [HosoyaPierce00]. In our experiences, our algorithm allows simple, compact, and portable implementations.

An important omission is interleave of RELAX NG. We believe that shuffle automata provide a solid basis for implementing interleave, and Kohsuke Kawagauchi has already used shuffle automata for implementing his RELAX NG validator called Bali.


Acknowledgments

We are grateful to the members of the RELAX NG TC of OASIS. In particlar, Kohsuke Kawaguchi implemented our algorithm.


Bibliography

[BerSet] Berry, G., and R. Sethi. “From regular expressions to deterministic automata”. Theoretical Computer Science, 48(1):117–126, 1986.

[BrPaSpe] W3C. Extensible Markup Language (XML). T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler, eds. 2000. http://www.w3.org/XML/.

[Brzozowski] Brzozowski, J. “Derivatives of regular expressions”. Journal of the ACM, 11(4):481–494, Oct. 1964.

[ClaMu] Clark, J., and M. Murata. RELAX NG. OASIS Committee Specification. 2001. http://www.relaxng.org.

[Clark01a] Clark, J. TREX: Tree Regular Expressions for XML. 2001. http://www.thaiopensource.com/trex/.

[Clark01b] Clark, J. The Design of RELAX NG. 2001. http://www.thaiopensource.com/relaxng/design.html.

[Clark02] Clark, J. 2002. http://www.thaiopensource.com/relaxng/implement.html.

[Fallside01] W3C. XML Schema Part 0: Primer. W3C Recommendation. D. Fallside, ed. 2001. http://www.w3.org/TR/xmlschema-0/

[HoMu02a] Hosoya, H., and M. Murata. Validation and boolean operations for attribute-element constraints. PLAN-X, 2002.

[HoMu02b] Hosoya, H., and M. Murata. Validation and boolean operations for attribute-element constraints. http://www.kurims.kyoto-u.ac.jp/~hahosoya/papers/attelm.ps.

[HosoyaPierce00] Hosoya, H., and B. Pierce. XDuce: A Typed XML Processing Language, In Int’l Workshop on the Web and Databases (WebDB), 2000.

[HoUl] Hopcroft and Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.

[JedSze] Jedrzejowicz, J., and A. Szepietowski. “Shuffle languages are in P”. Theoretical Computer Science, 2001.

[Kawaguchi] Kawagauchi, K. 2002. http://www.kohsuke.org/relaxng/bali/doc/.

[KlaMoSch] Klarlund, N., A. Møller, and M. I. Schwatzbach. DSD: A Schema Language for XML. In ACM SIGSOFT Workshop on Formal Methods in Software Practice, 2000.

[MuLeMa] Murata, M., D. Lee, and M.Mani. Taxonomy of XML schema languages using formal language theory. In Proceedings of Extreme Markup Language 2001, 2001.

[Murata] Murata, M. RELAX (REgular LAnguage description for XML). 2001. http://www.xml.gr.jp/relax/.

[Murata02] Murata, M. XML ’02. http://www.idealliance.org/papers/xml02/ slides/makoto/makoto.ppt.



Validation algorithm for attribute-element constraints of RELAX NG

Makoto Murata [IBM, Tokyo Research Laboratory]
eb2m-mrt@asahi-net.or.jp
Haruo Hosoya [Kyoto University, Research Institute for Mathematical Sciences]
hahosoya@kurims.kyoto-u.ac.jp