Supporting UPA and restriction on an extension of XML Schema

Matthew Fuchs, Ph.D.
Allen Brown, Ph.D.

Abstract

We describe an extension of regular expresions using arbitrary all-groups and numeric exponents (as in W3C XML Schema). We are able to provide polynomial-time algorithms for testing UPA and content model subsumption for these extended regular expressions. The algorithm handles numeric exponents (as opposed to the Kleene operators of traditional regular expressions), but comparisons of exponents are exponential in the depth of exponent nesting. The algorithm for UPA testing with all groups in USCM is polynomial in the size of the expression, but uses push-down automata — much more complex machinery than used here for that purpose, and current known algorithms for subsumption involve unrolling the numeric exponents. This is exponential in the size of the exponents — in other words, large exponent values are much more onerous than small ones. As XML Schema uses numeric exponents this is a significant issue for validators.

We also show how to extend these algorithms to handle XML Schema’s wildcards and substitution groups.

Keywords: Validating; XSD/W3C Schema

Matthew Fuchs, Ph.D.

Matthew Fuchs, Ph.D., is senior architect at Westbridge Technology. Before that, he was chief scientist for XML-related technologies at CommerceOne, an industry leader in electronic commerce. He co-authored the “Schema for Object Oriented XML” and designed its object-oriented features (and the software that exploits them). He received his Ph.D. from NYU in 1995, where his work on mobile object systems started his fixation on using XML (and its SGML predecessor) as a metalanguage for describing agent communication languages. Dr. Fuchs was a founding member of the W3C working group that created XML and is a member of the XML Schema Working Group. Before CommerceOne, he was a researcher at Walt Disney Imagineering and at WVU’s Concurrent Engineering Research Center.

Allen Brown, Ph.D.

Dr. Allen L. Brown, Jr. is a Software Architect at Microsoft with a primary focus on the development of international standards related to Web Services. In the years immediately prior to joining Microsoft, he co-founded and developed a venture-backed startup enterprise focused on digital media management. More than half of Dr. Brown’s nearly four decades of professional involvement in computing was spent at the Xerox Corporation where he served, among other things, as a Research Fellow in its Corporate Research and Technology Division and as CTO of its XSoft Division. While much the Xerox era was devoted to co-authoring 75+ papers and 24 U.S. patents, Dr. Brown also contributed directly to products, including architecting some of the key facilities of the landmark Xerox ‘Star’ system. He has also held academic appointments as Professor of Computer and Information Science at Syracuse University and James Welling Clark Professor at the George Washington University. Dr. Brown earned undergraduate degrees in Mathematics and Chemical Engineering, and a doctoral degree in Artificial Intelligence, all from the Massachusetts Institute of Technology.

Supporting UPA and restriction on an extension of XML Schema

Matthew Fuchs, Ph.D. [Westbridge Technolgy]
Allen Brown, Ph.D. [Microsoft Corporation]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 Matthew Fuchs, Ph.D. & Allen Brown, Ph.D. Reproduced with permission.

Introduction

While XML Schema (WXS) provides many facilities not available with DTDs, it also provides some features that are problematic for developing validators. At the same time, there are desirable features which it has not provided. In the former category are:

  • Numeric exponents – the ability to specify ranges for the number of times a part of a content model may be repeated. A simple example would be the following element declaration:
    <element name="foo" type="bar" minOccurs="10" maxOccurs="20"/>

    This describes an element that must appear at least 10 times, but no more than 20 times.
  • Restriction – the XML Schema type system makes it possible to relate two different content models, where the second (the derived type) is intended to match a subset of the element sequences matching the first (the base type). The rules for specifying when a content model is a valid restriction of another are famously baroque and do not cover all pairs where one is a subset of the other. In other words, restriction is overly restrictive.
  • The UPA [Unique Particle Attribution] Constraint – uniquely identifies a particle in the content model for each element or attribute validated. Given any particle in a content model and an input, there is at most one terminal matching the input. Algorithms of complexity O(n2) are known for this problem when the content model is an ordinary regular expression (using sequence, choice, and the Kleene operations *, ?, and +).
These items interact in that the best (published) algorithm available to determine the subset question when numeric exponents or all-groups (generally refered to as &-groups) are present has been to expand the exponents into explicit disjuncts of sequences — a process exponential in the size of the input — before determining the subset relationship, which itself is then only polynomial. The latter category includes:
  • General all groups – all groups in content models (groups where every member of the group must appear once, but in any order) existed in SGML but were initially banned in XML. XML Schema permits limited all groups at the top level of a content model but no where else. All groups were initially banned because they translate into ordinary regular automata (where there is no memory of which elements have been previously matched) by generating transitions for all combinations — once again an exponential blowup.

Finally, there is empirical evidence, demonstrated by the number of inconsistencies among validators, that the initial specification is not completely clear on any number of points.

It is imperative for XML Schema to determine if there are reasonable algorithms for determining these questions. Otherwise, the specification has, essentially, unusable features.

This paper describes algorithms for calculating whether a content model obeys UPA and whether one content model correctly restricts another based on the model built. The restriction algorithm has an exponential component where two chains of nested numeric exponents need to be compared, but it is only exponential in the length of the chains, rather than the size of the exponent values.

We will demonstrate our algorithms on an extension of traditional regular expressions. In the remainder of the paper, we wil explain the extended syntax and define a few important properties of the terminals in these expressions. Then we will provide the UPA algorithm. This works for simple terminals. Next, we will show how to extend this for substitution groups and wildcards as found in XML Schema. Next, we examine the subsumption issue, first for expressions with &-groups, and then for expressions with numeric exponents.

Evaluating the Unique Particle Attribute constraint

We start by having a set of terminals, which we will represent using az. These correspond to element names in XML Schema.

A schema regular expression is defined by the following grammar:

particle -> terminal
| particle{m, n} (Where 0 ≤ m ≤ n ≤ unbounded — we will never consider m = unbounded or n = 0)
| (particle1, ..., particlen)
| (particle1| ...| particlen)
| (particle1& ...& particlen)
We will use α – ω to refer to particles generally. This extends traditional regular expressions by adding the &-group and allowing m to take values other than 0 or 1, and n to take values other than 1 and unbounded (exponents {0, 1} corresponds to ?, {0, unbounded} to *, and {1, unbounded} to +).

In order to determine if a regular expression follows UPA, we will need the following five properties of a particle:

  1. The particles contained in p, particles(p), where a particle is any terminal or particle recursively defined.
    particles(p) => if p = (α1,...,αn) then {p} ∪ (∪ni = 1 particles(αi))
    else if p = (α1&...&αn) then {p} ∪ (∪ni = 1 particles(αi))
    else if p = (α1|...|αn) then {p} ∪ (∪ni = 1 particles(αi))
    else if p = α{m,n} then {p} ∪ particles(α)
    else if p is a terminal, {p}
  2. The opacity of a particle, opaque(p). A particle is opaque if it cannot match the empty string. Otherwise, it is transparent. This is defined as follows:
    opaque(p) => if p = α{0, n}, (α any particle) then false
    else if p = (α1, ..., αn) or p = (α1 & ... & αn) then \/ni = 1 opaque(αi)
    else if p = (α1| ... | αn) the /\ni = 1 opaque(αi)
    else if p = α{m, n}, m > 0, then opaque(α)
    else if p = a, a a terminal, then true
    else if p is the empty sequence, then false.
  3. The first set, first(p), represents all those particles that can match the first terminal in a string matching this particle. We define it as follows:
    first(p) => if p = a, a a terminal, then {a}
    else if p = α{m, n} then first(α)
    else if p = (α1|...|αn) or p = (α1&...&αn) then ∪ni = 1first(αi)
    else if p = (α1,...,αn) then
    if opaque(α1) then first(α1)
    else first(α1) ∪ first((α2,...,αn))
  4. The follow set, follow(p), represents all those particles that can match a letter in a string immediately following a portion of the string matched by the particle, p. Unlike the other properties defined here, follow(p) can only be defined in terms of the expression containing p. We, therefore, define follow(Γ, p, s), where Γ is a schema regular expression and s a set, and define follow(p) = follow(Γ, p, {}), where Γ is the entire regular expression under consideration.
    follow(Γ, p, s) => if p = Γ then s
    else if Γ = Δ{m, m}, and opaque(Δ) then follow(Δ, p, s)
    else if Γ = Δ{m, n}, m < n or not opaque(Δ) then follow(Δ, p, s ∪ first(Δ))
    else if Γ = (α1|...|Δ|...|αn), where p ∈ particles(Δ) then follow(Δ, p, s)
    else if Γ = (α1&...&Δ&...;&αn), where p ∈ particles(Δ) then follow(Δ, p, s)
    else if Γ = (α1,...,Δ,αm...,αn), where p ∈ particles(Δ) then
    if opaque((αm,...,αn)) then follow(Δ, p, first((αm,...,αn)))
    else if not opaque((αm,...,αn)) then follow(Δ, p, s ∪ first((αm,...,αn)))
  5. The confusion set, confusion(p) gives the set of particles in p which could be in conflict with follow(p). In other words, α ∈ particles(p) shows up in confusion(p) if there is some particle, β ∈ particles(p), such that follow(β) ⊃ follow(p) and α ∈ follow(β). The set confusion(p) is the union of the follow sets of particle of P having β in their follow sets (i.e., particles that could be confused with a particle in the follow set).
    confusion(p) => (∪q ∈ particles(p) follow(p, q, {δ})) - {δ} where not (δ ∈ particles(p)) and δ ∈ follow(p, q, {δ})

As an example, consider the following regular expression:

((a | b),(((c | d) & (e, f{0,1}){2,5}){6,6} | (g & h{0,5})))

In order to better discuss it, let’s first subscript every particle in the expression:

((a0 | b1)2,(((c3 | d4)5 & (e6, f7{0,1}8)9{2,5}10)11{6,6}12 | (g13 & h14{0,5}15)16)17)18

In this, we have the entire expression being opaque, but particles 8 and 15 are not. Because the entire expression is a sequence, and the first particle in it (2) is opaque, the first set of the expression is {a, b}. The follow set of 2 — coincidentally, the first set of 17 — is the first set of the rest of the sequence. Since it is a choice, it is the union of the first sets of the branches (particles 12 and 16). Both branches are &-groups, so they return the union of their first sets, or {c3, d4, e6, g13, h14}. Only some particles have non-empty confusion sets. Particle 9 has {f7} in its confusion set, as f7 is in follow(e6). Because of the optionality, confusion(10) has both e6 and f7 in it, as does confusion(12).

Plan of attack

Ultimately, a regular expression is for matching strings of terminals to determine membership in a language. This is done by matching terminals in the input against successive terminal particles in the expression. A regular expression has the UPA if, for any input string, at any point there is only particle in the regular expression which could be used to match the next item in the input. The regular expression (a, b) has UPA because it will only match an ‘a’ followed by a ‘b’. The regular expression (a{0,1}, a) does not because when the first ‘a’ shows up, it could match either the first a? particle or the second a particle. Determining if a regular expression has UPA is not difficult if we use standard regular expressions as described above, but numeric exponents complicate the picture. In the first case, we need only check that there are no two different particles with the same terminal either in the first set of the expression or in the follow set of any particle in the expression. (As an example of why first sets alone are insufficient, consider (a, a{0,1}, (b | a))). In the latter case, though, we have examples like:

  1. (a0{4,8},a1)
  2. (a0{8,8},a1)
  3. ((a0,b1{0,1}){8,8},b2)
  4. ((a0 & b1 & (c2, d3{0,1})), d4)
  5. ((a0 & b1 & c2), b3)
  6. (a0 & b1 & (c2, b3{0,1}))

In the first case, we have a UPA violation because the fourth match of a0 can be followed by either a fifth repetition or a1. The follow function rightly puts both in follow(a0). However, in the second case, there is no ambiguity (8 repetitions of a0 followed by one a1). But in the third case, we again have an ambiguity. On the eighth repetition, a0 can be followed by either b1 or b2 — but if we make example two work, then b2 is not in follow(a0).

In the fourth case, we have a UPA violation between d3 and d4. Therefore, we’d want the follow set of the &-group to be in the follow set of the individual branches. However, if we do that, then we’d also need to make sure that the first sets of the other branches are not also included, because, otherwise, c2 would have both b1 and b3 in its follow set in the fifth example. But that would then miss the UPA violation in the sixth example between b1 and b3.

Sources of UPA violations

Because of the issues with the two problematic constructs of &-groups and numeric exponents of the form {m,m} we need to delve further into aspects of UPA violations. There are two sources of UPA violations involving a particle that are particularly germane to our analysis.

External violations

Given a particle, there are a limited number of reasons that it can cause a UPA violation. Take the particle, P, in the following:

(α, β{0,1}, P, γ) or (α, (β | P), γ) or (α, (β & P), γ)

A violation can only result from a confusion among terminals. Either:

  1. One terminal is before P and one is within P. If so, then the terminal in front of P must be optional (i.e., β). For that to be the case, there must be a member of first(β) that’s also in first(P). Note that we only need first(P) to check this.
  2. Both terminals are in P. Then P itself violates UPA regardless of context.
  3. One terminal is in P and one is in follow(P). This means there is an optional terminal, o, somewhere in P, and it is in the follow set of some other terminal, t, whose follow set also includes γ. Terminal t can either be inside P or precede P (such as β). If t is inside P, then o must be in confusion(P). If t precedes P, then P must be optional, and first(P) shares a terminal with follow(P).
  4. Neither terminal is in P. Then P is optional, and (in this case) first(β) shares a terminal with first(γ).
Note that if P itself does not violate UPA, then we only need consider its opacity, its first set, and its confusion set.

Internal ambiguity

Suppose we have a particle, P, and P obeys UPA. Under what circumstances does P{m,n} not have UPA? At a minimum, if P{m,n} does not have UPA, then (P, P) does not have it either, although the reverse is not necessarily true (because of repeated particles). But if P obeys UPA, and (P, P) does not, then there must be some particle in P whose follow set would include the first set of the second P, and that follow set already includes a terminal in first(P). By definition, that follow set is included in follow(P). This is invariant with the number of repetitions of P.

Therefore, we can test for whether P{m,n} obeys UPA by intersecting first(P) and confusion(P).

Likewise, we can test for (α1 & ... & αn). Assuming each αi obeys UPA, either ∪ni = 1first(αi) contains the same terminal with different subscripts, or there are αj and αk, j not = k, where (αj, αk) violates UPA. Once again, this means there is some terminal in both confusion(αj) and first(αk). However, because a branch in an &-group never follows itself, it is not an error if (αj, αj) violates UPA because that would never occur in a single instance of the group — it would only happen if the &-group has an exponent, and then confusion(αj), being a subset of the confusion set of the group as a whole, would be compared with the first set of the group as a whole, exposing the issue. So we only need compare confusion(αj) with the first sets of the other branches.

Final UPA algorithm

We, therefore, end up with the following algorithm:

UPA(α) => if α = a then
if bi ∈ follow(a)and bj ∈ follow(a) -> i=j then true
else if α = β{m,n} then UPA(β) /\ (first(β) ∩ confusion(β) = {})
else if (α = (β1,...,βn) or α = (β1|...|βn)) and ∩ni = 1 first(βi) = {} then /\ni = 1UPA(βi)
else if α = (β1&...& βn) and ∩ni = 1 first(βi) = {} then /\ni=1(UPA(βi) /\ (confusion(βi) ∩ (∪j not = ifirst(βj)) = {})
else false.

Testing subsumption

As with UPA testing, new features like numeric exponents and &-groups add to the complexity of testing a subsumption relationship between two regular expressions, where by subsumption we mean all the strings accepted by the subsumed expression are also valid in the subsuming expressing, but not necessarily the reverse.

Without these features, given two regular expressions, α and β, that obey UPA, it is possible to determine whether α subsumes β in a straightforward way. If we consider the set of strings accepted by α L(α) and the set of strings accepted by β, L(β) as subsets of the set of all strings, then L(β) subset L(α) iff L(β) ∩ ~L(α) = {}. In other words, every string in L(β) should also be in L(α). ~L(α) is the set of strings not in L(α), so there should be no string that’s both in ~L(α) and L(β). This can be determined by constructing an automaton from each regular expression, inverting the final states of the automaton from α, and constructing the cross-product automaton, an automaton whose states are pairs of states, one from each source automaton. This can be done in time quadratic with the two input automata, as there is only one state which has both start states, and from any state pair and for any terminal, there can be at most one successor state pair. Since we’ve inverted the accepting states in the one automaton, the subsumption holds if there is no state pair in the resulting automaton where both members of the pair is an accepting state from their respective automata.

We will approach this issue in two stages. We will first extend the above algorithm to handle &-groups and then extend that to additionally handle numeric exponents.

Subsumption with &-groups

The straightforward subsumption story is complicated by &-groups because, even with UPA, at any point there may be more than one terminal that can match from the other machine. Consider the following pair:

  1. (a{0,1}, (d & c & b), b)
  2. (c, b, d, b)

Clearly, the second expression is subsumed by the first. The a is optional, (b, c, d) is one of the six possible paths through the & group, and b is the final terminal to be matched. However, the c and d branches of the &-group both have two transitions to a b, so the naive algorithm fails. A naive fix would be to break apart the group, listing all the alternatives:

(a{0,1}, ((d , c , b) | (d , b , c) | (c , d , b) | (c , b , d)(b , c , d)(b , d , c)), b)

However, this is exponential (actually O(n!)) in the size of the orignal expression, although the resulting expression still obeys UPA.

Any solution not expanding the content model needs to keep track of which branches of the &-group have been traversed at any point and only move from the &-group when all the required branches have been matched. Such a solution would also need to keep track of nested &-group — unlike choice, &-groups are commutative but not associative, i.e., ((a & b) & (c & d)) is not the same as ((a & c) & (b & d)), as the second allows (a, c, b, c), but the first requires that a be adjacent to b. This would imply the use of a stack.

To summarize, we need a means of indicating all the branches of an &-group such that we can keep track of which branches have been traversed and which have not. Further, this should not require rewriting the content model — either in advance or during execution of the algorithm — as that would return us to exponential behavior.

We will handle each &-group as a bit-vector of the length of the number of branches. We will divide this vector in two parts — the upper part corresponds to the opaque branches and the lower part to the transparent branches. For example, suppose we have the &-group (a? & b & c &d?) and want to see if it subsumes the sequence (b, a, c). We have two opaque branches and two transparent ones. We rewrite this as (b3 & c2 & d1? &a0?), corresponding to the bit-vector 1111, or 15, for all four branches, or 12, for just the opaque branches. We assign each branch to the corresponding bit. Let us call these the max and min values for the group. On entering the &-group we start with a bit-vector of 0000 — no branches completed. We cannot “leave” the group unless we’ve matched at least the required branches, so we’d need a bit vector of at least 1100. At each branch check, the bit vector, as a binary value, must increase so we know we’ve finished a new branch. Transitions corresponding to choices where the appropriate bit is 0 are active. Those where the bit is 1 are inactive.

To show the process, we first match b with b3 and xor the bit-vector with 23, where the power is the number of the matched branch. This gives a bit vector or 1000. Next we match the a with a0, and the bit vector becomes 1001. Finally, we match the c with c2, giving a bit-vector of 1101. As 1100 ≤ 1101 ≤ 1111, the subsumption holds.

On the other hand, if we try to check (b, a, d), our final bit-vector is 1011. Since 1011 < 1100, the subsumption doesn’t hold. Finally, if we try to check (b, b, d), we’d check the first b and get 1000. Checking the second b, we’d xor 1000 with 1000 with a result of 0000. As 0000 < 1000, we’d know we’ve tried to match the same branch twice and the subsumption fails.

At this point, we need to describe how to build an automaton for an extended regular expression. In addition to the usual transitions, we need to keep track of the following when traversing a terminal:

  • When entering an &-group, push an empty bit-vector on the stack, or push().
  • For any state transition from the end of a branch, update the top bit-vector, or update(n), where n is a power of 2. This returns true if the result of an update increases the value of the bit-vector and false otherwise.
  • For any state transition leaving the group, ensure the bit-vector’s value is between the min and the max and then pop it, or test(min, max).

Note that there may be more two transitions from a state with the same terminal. However, one will return to the &-group (due to UPA) and the other will leave it. Each has an associated test condition, and only one can succeed.

Each transition, t, in our automaton, therefore, has properties:

  1. pre(t) is a state, the state the machine must currently be in for this transition to be applied.
  2. post(t) is the state the machine is in after the transition.
  3. terminal(p) is the terminal that this transition matches.
  4. action(p) is a list of push(), update(n), or test(min, max). It may be empty if no actions are to be performed for that transition.
This automaton does not have provision for numeric exponents — they’ll be handled in the sequel.

Consider the following expression: (0a1, ((b02 & c13){0,1}0 & ((d4, e5) | e6)1), e7), where the subscripts are for identifying the states we’ll produce for the automaton and the superscripts are for the branches of the &-groups. By the discussion above, we start at state 0 and produce the following transitions:

  1. (0, 1, a, [])
  2. (1, 2, b, [push(), push(), update(0)])
  3. (1, 3, c, [push(), push(), update(1)])
  4. (1, 4, d, [push()])
  5. (1, 6, e, [push()])
  6. (2, 3, c, [update(1)])
  7. (2, 4, d, [test(3,3), update(0)])
  8. (2, 6, e, [test(3,3), update(0)])
  9. (2, 7, 3 [est(3,3), update(0), test(2,3)])
  10. (3, 2, b, [update(0)])
  11. (3, 4, d, [test(3,3), update(0)])
  12. (3, 6, e, [test(3,3), update(0)])
  13. (3, 7, e, [test(3,3), update(0), test(2,3)])
  14. (4, 5, e, [])
  15. (5, 2, b, [update(1), push()])
  16. (5, 3, c, [update(1), push()])
  17. (5, 7, e, [update(1), test(2, 3)])
  18. (6, 7, e, [update(1), test(2,3)])
The only accepting state is 7.

Comparing two automata

When comparing two automata, we are building a new automaton that is a subset of the product of the two input automata. Let us call the automaton for the initial content model B (for base) and the one we are testing as R (for restriction). At each point, there will be some set of possible transitions from a current state in the automaton for R for which corresponding transitions must exist in B. There are two sets of important subcases:

  • At least one of the transitions is into an &-group.
  • None of the transitions is into an &-group.
The latter case corresponds to traditional automata — the former to our special cases. All in all, this creates nine cases. For each, transitions can be one of:
  • All “ordinary” — there are no push commands associated with the transition, so it does not enter an &-group. We will call these simple transitions.
  • All lead into &-groups. The number of groups may actually be nested. We will call transitions leading into an &-group “&-group transitions”.
  • A mix of the above, where there was a choice (or transparent particle) in the original content model between an &-group and some other particles.

There is also an important asymmetry between B and R for determining valid subsumptions. In particular, if some state in R is matched to a state in B by the algorithm, and one transition from the state in R goes to an &-group, then one of the following must be true:

  • All the transitions in B are simple choices. For example, (a & b & c) is subsumed by (a | b | c){1, unbounded}. Likewise, ((a & b & c) | d | e) is subsumed by (a | b | c | d | e){1, unbounded}. Note, additionally, there must be exactly one branch in B for each branch in R — (a & b & c) is not subsumed by (a | (b, c)) because it doesn’t allow the ordering (b, a, c).
  • At least the transitions in B corresponding to an &-group transitions in R to the same &-group are also to an &-group, and no other choices may be in that &-group. For example, ((a & b & c) | d | e) is subsumed by ((a & b & c & f{0,1}) d | e){1, unbounded} but not by ((a & b & c & d) | e){1, unbounded}, as the latter wouldn’t allow the d to come after the e. Therefore, if the algorithm is matching transitions from an &-group in R and an &-group in B, it can’t leave the &-group in R unless it also leaves B. To aid in this, we introduce two predicates on states:
    1. &-group-only(q) returns true iff all the unmatched transitions leaving state q enter an &-group. Matched transitions are those that have already been examined by the altgorithm and used in a transition in the output.
    2. mixed(q) returns true if some of the unmatched transitions leaving state q enter an &-group, while others do not.
Finally, if the current transition in R is part of an &-group, then clearly there must always be transitions in B corresponding to all the untraversed choices in the &-group in R. Because of the structure of choice groups, this means one of the following must be true:
  1. The choice group in B is one or more groups, each containing all the choices in R, i.e., either (a | ... |b){1,unbounded} or ((a |...| b)...(a |...| b)), where the second is a list as long as the choices in the &-group in R.
  2. The choice group is a tree of all the alternatives, where at each level there are at least those choices left from the previous level, i.e, if the grup in R is (a & b & c), then the group in B would be ((a, ((b,c)|(c,b))) | (b, ((a,c)|(c,a))) | (c, ((b,a)|(a,b)))).
If there is a sequence of choices similar to the first choice above, but not all of them contain all possible choices, then one of them, say the 3rd, doesn’t contain some choice, say c. Then the sequence doesn’t support those alternatives with c in the 3rd position. Therefore, every choice group in the sequence must contain all the alternatives.

As mentioned this is very asymmetric — if all transitions from R are simple, then it doesn’t matter whether transitions in B are simple or &-group transitions. If the transition in R matches a simple transition in B, then we proceed as usual. If it matches an &-group transition in B, then the successor states in R must eventually match all the branches in B, but they only need to match one ordering of them. As a result, when traversing states for an &-group, it is not necessary to check all sequences.

This provides enough information to generate an algorithm. Because we need to check all alternatives for each state, we need to explicitly manage the &-group stacks for both B and R. As with the traditional algorithm, we invert the states in the base, create a machine from the cross-product of the states of the input machines, and determine that there are no states in the resulting machine that include accepting states from both input machines.

cross-product(qR, qB, R-stack, B-stack, output) is
if (&-group-only(qR) and mixed(qB)) then fail.
if &-group-only(qR) and not &-group-only(qB then
either
for each transition (qR,q'R,α,actionsR) there is a transition (qR,q'R,α,actionsR)
or
for each active transition (qR,q'R,α,actionsR) there is a transition (qR,q'R,α,actionsR) and not ((qR,qB),(q'R,q'B)α) ∈ output
or fail.
for each active transition (qR, α, q'R, actionsR) do
save top(R-stack) -> hold-R. (As the list of commands may include multiple push() and update() commands, this may involve storing more than one item from the stack. The process is reversed after the recursion.)
find transition an active transition (qB, α, q'B, actionsB) or fail.
save top(B-stack) -> hold-B.
if (not ((qR,qB), α, (q'R,q'B)) ∈ output) then
if the number of test() operations in actionsR ≠ number of test() operations in actionsB then fail.
add ((qR,qB), α, (q'R,q'B)) to output.
apply actionsR to R-stack.
apply actionsB to B-stack.
cross-product(q'R, q'B, R-stack, B-stack, output).
update R-stack from hold-R.
update B-stack from hold-B
We start the algorithm with the start states from both machines, two empty stacks, and an empty set for the output states.

An example

As a simple example, we test whether the expression above subsumes (a, ((d, e) | e), e). If we show it as above, (0a1, ((d2, e3) | e4), e5) we can see that the transitions are:

  • (0, 1, a, [])
  • (1, 2, d, [])
  • (1, 4, e, [])
  • (2, 3, e, [])
  • (3, 5, e, [])
  • (4, 5, e, [])
The only accepting state is 5.

We start at state 0 for both machines. Since we invert the accepting states of the base machine, its accepting states are 0 – 6, but not 7.

  1. There is only one transition in R for state 0, and only one for B. We add ((0, 0), (1, 1), a) to the output. Neither stack is changed.
  2. We recurse with R state 1 and B state 1. There are two transitions in R from state 1. We start with (1, 2, d, []). This matches (1, 4, d, [push()]) in B. We add ((1,1), (2,4), d) to the output machine and push 0 on the B stack.
  3. We recurse with R state 2 and B state 4. There is one transition in R - (2, 3, e, []). There is, likewise, only one transition in B - (4, 5, e, []). We add ((2,4), (3, 5), e) to the output.
  4. We recurse with R state 3 and B state 5. The transition in R is (3, 5, e, []). In B, we have (5, 7, e, [update(1), test(2,3)]). We need to test the B transition. We xor 10 binary with the top of the B stack, or 00, to get 10. This is in the acceptable range (10 ≤ 10 ≤ 11). As we are transitioning out of the group, we pop the B stack (now empty) and add ((3,5), (5,7), e) to the output.
  5. We recurse again with R’s state 5 and B’s state 7. There are no transitions, so we return to the nearest branch (fixing the stacks as we go up), at 2 above. We take R’s other transition, (1, 4, e, []). This matches (1, 6, e, [push()]) in B. We add ((1, 1), (4, 6), e) to the output, and push 0 on the B stack.
  6. We recurse with 4 and 6. In R, there is only (4, 5, e, []). In B, this matches (6, 7, e, [update(1), test(2, 3)]). As with step 4, we arrive at the end, adding ((4, 6), (5, 7), e) to the output.
  7. We terminate as there are no choices left.
The states of the output machine are ((0,0), (1,1), (2,4), (3, 5), (5,7), (4,6)). As the only accepting state in R is 5, and the only non-accepting state in B is 7, we conclude that B subsumes R.

Subsumption with numeric exponents

Numeric exponents add significant complexity to subsumption testing. In particular, the nesting of numeric exponents, both in B and R, potentially require testing a large number of alternatives to ensure for each path through R there is at least one path through B. And yet, in many cases, this is not necessary because of overlap in the areas covered by the exponents.

We will first give a couple of example of exponents and valid and invalid content models. From that, we will be able to extract some important aspects of the problem.

For example, consider the following: (a{4,5}{2,3}). Here we have a sequence of either 4 or 5 repetitions of a, repeated 2 or 3 times. This means the following are legal:

  • aaaaaaaa
  • aaaaaaaaa
  • aaaaaaaaaa
  • aaaaaaaaaaaa
  • aaaaaaaaaaaaa
  • aaaaaaaaaaaaaa
  • aaaaaaaaaaaaaaa
In other words, sequences of 8 – 10 or 12 – 15 a’s. However, if we have, instead, (a{4,5}{6,7}), then we have a series of 24 – 30 (from 6 X 4 to 6 X 5) and a series of 28 – 35. But since these overlap, we can just consider this a series of 24 to 35. It actually takes a certain amount of effort to construct examples that do not have this overlap.

More generally, regarding subsumption, consider two content models, (ρ{m0,,n0}...{mi,ni}) and (β{m'0,n'0}...{m'j,n'j}). We wish to establish if the latter subsumes the former. Clearly we must first establish if β subsumes ρ. Then we must determine that all possible ranges of ρ are also possible ranges of β. We can enumerate the ranges by choosing all the different m and n for each subscript — from (m0*m1*...*mi, n0*m1*...*mi) to (m0*n1*...*ni, n0*n1*...*ni). The maximum number of distinct ranges of ρ is 2i and of β is 2j. If β subsumes ρ, then we only need to determine that each range of ρ is contained within a range of β. We will use ranges(n) to refer to the ranges of the first n values in a list of exponents.

The more complex case is when the scoping is not so clear. For example, if B is (a | b){4,12} and R is (a{2,3},b{5,7}), then the iterations of a are insuficient without the additional iterations of b. Also consider B as (((a | b){3,5},c{0,1}){6,9},d) and R as ((a,b){20,25},c,d). Here, the repetitions of (a,b) in R must account for both the inner and outer exponents in B.

We can consider each terminal and each transition in B as having a nesting depth, depending on how many exponents are crossed from source to sink. Before a transition may be taken, all the “obligations” at higher levels must be fulfilled. For example, in B above, the d is at level 0 (not in the scope of any exponent) while a, b, and c are at level two. Therefore, the transition from a to d is at level 0, and the exponent at levels 1 and 2 must have been satisfied before that transition could be crossed. At the other end, there is a transition from a to b at level 2, so that is counted against the exponent at that level ({3,5}), as well as a transition at level 1, so the number of iterations of a’s and b’s is either between 3 and 5, or over 18.

As with automata for &-group, we will need a stack for each exponent and some extra commands to attach to each transition. Each entry on the stack has an exponent and a counter to track the number of iterations completed. The counter has a pair to keep track of min and max iterations from the other automaton and a single number to track the number of matches from the innermost particle. In this case, there are three kinds of command:

  1. push({m,n}) pushes an entry on the stack, initially ({m,n}, 0). These commands are put on transitions entering the scope of an exponent.
  2. increment increases the innermost counter. These commands are put on transitions that complete the particle contained in an exponent.
  3. check(n) tests whether the counter satisfies the innermost n exponents.
  4. upto(m) gives the maximum nesting level that a transition can affect.
It is possible for a single transition to have commands of all three types.

For example, the expression (0((a1 | b2){10,11}2,c3{0,1}2){6,9}1,d4) translates to the following transitions:

  1. (0, 1, a, [push({6,9}), push({10,11}), increment])
  2. (0, 1, b, [push({6,9}), push({10,11}), increment])
  3. (1, 1, a, [increment])
  4. (1, 2, b, [increment])
  5. (1, 3, c, [check(2), upto(1), increment])
  6. (1, 4, d, [check(1)])
  7. (2, 1, a, [increment])
  8. (2, 2, b, [increment])
  9. (2, 3, c, [check(2), upto(1), increment])
  10. (2, 4, d, [check(1)])
  11. (3, 1, a, [push({10,11}), increment])
  12. (3, 2, b, [push({10,11}), increment])
  13. (3, 4, d, [check(1)])

The algorithm follows:

cross-product(qR, qB, stackR, stackB, output) is
for each transition (qR,q'R, α, actionsR) do
for each transition (qB,q'B, α, actionsB) where
((qR,qB),(q'R,q'B), α) not ∈ output
do
({mR,nR}, iterR) = top(stackR);
({mB,nB}, iterB) = top(stackB);
if check(iR) = actionsR[0] and check(iB) = actionsB[0] then do
for {kR,lR} in rangesR(i)
for {kB,lB} in rangesB(iB)
if lR * iterB > kB * iterR and upto(jB) = actionsB[1] then
newk = max(0, kR * iterB - kB * iterR)
newl = lR * iterB - kB * iterR
for {newm, newn} in rangeB(jB + 1)
pop stackB to jB
({mB,nB},iterB) = top(stackB);
set top of stackB to ({mB - (iterB + newk div newm),nB - (iterB + newl div newm)}, 0)
add ((qR, qB), (q'R, q'B), α) to output.
pop stackR
apply remaining actions to stacks.
cross-product(q'R, q'B, stackR, stackB, output)
restore top of stacks.
else if iterR * kB > iterB * kR
add ((qR, qB), (q'R, q'B), α) to output.
pop stackR
set top of stackB to ({(iterR * kB - iterB * kR) div iterR, iterR * lB - (iterB * lR) div iterR}, 0)
if (iterR * lB - iterB * lR) - (iterR * kB - iterB * kR) ≥ 0 then
cross-product(q'R, q'B, stackR, stackB, output)
else if iterR * kB ≤ iterB * kR ≤ iterB * lR ≤ iterR * lB then
pop stackB
pop stackR
apply remaining actions to stacks.
cross-product(q'R, q'B, stackR, stackB, output)
restore top of stacks.
else if check(i) = actionsR[0] and q'B ≤ qB (iteration) then
if iterB * nR < nB do
save top of both stacks.
apply actionsR to stackR.
set top of stackB to ({mB - iterB * mR,nB - iterB * mR}, 0)
apply actionsB after the initial increment action to stackB.
cross-product(q'R, q'B, stackR, stackB, output)
restore top of stack.
else if check(i) = actionsR(0) and
mR ≤ iterR ≤ nR then
save top of both stacks.
apply actions to stacks.
cross-product(q'R, q'B, stackR, stackB, output)
restore both stacks.
else if check(j) = actionsB[0] then do
save top of both stacks.
apply actions to stacks.
if mB ≤ iterB ≤ nB then
cross-product(q'R, q'B, stackR, stackB, output)
restore both stacks.
else
save top of both stacks.
apply actions to stacks.
cross-product(q'R, q'B, stackR, stackB, output)
restore both stacks.

An example

We’ll now give an example of this algorithm in action. We’ll use the expression ((a, b){40,43}, c, d), which translates to the following automaton:

  1. (0, 1, a, [push({40, 43})]
  2. (1, 2, b, [increment])
  3. (2, 1, a, [])
  4. (2, 3, c, [check(1)])
  5. (3, 4, d, [])

First, we flip accepting states in B. As the only accepting state was 4, we now have 4 as the only non-accepting state. We then start with state 0 from both machines. For each, there is one transition across a. We push ({40, 43}, {0, 0}) on the R stack and both ({6, 9}, {0, 0}) and ({10, 11}, {0, 0}) on the B stack, then increment B, so the B stack is [({10, 11}, {1, 1}),({6, 9}, {0, 0})]. We are in states 1R and 1B. We add ((0, 0), (1, 1), a) to the output. Now the only transitions are across b in both. We execute (1, 2, b, [increment]) in R and then B. The R stack is now [({40, 43}, {1, 1})] and the B stack is [({10, 11}, {2, 2}),({6, 9}, {0, 0})]. We add ((1, 1), (2, 2), b) to the output.

There are now two possibilities in R, either transition 2 or 3 above. If we traverse the a path, we bring ourselves back to state 1 in both machines, but we add ((2, 2), (1, 1), a) to the output.

If we traverse the c, we find ourselves in the most complex part of the algorithm. The R transition is 4, above. The B transition is 5. Top of the R stack is ({40, 43}, {1, 1}), while top of the B stack is ({10, 11}, {2, 2}). As 40 * 2 > 10 * 1, and 1 < 2 in check(2, 1), we take the first branch. We set newk to 70 and newl to 76. The only values for newm and newn are 10 and 11. As 6 ≤ 7 ≤ 7 ≤ 9, we set the B stack to [({0, 2}, {0, 0})]. We add ((2, 2), (3, 3) c) to the output and continue. We perform the last command from the B transition, increment, and pop the top of the R stack. The R stack is now empty, and the B stack is [({0, 2}, {1, 1})]. From here, there is one transition in R, across d, and one corresponding one in B. In R, we end up in state 4 with an empty stack. In B, we check the first entry on the stack. As 0 ≤ 1 ≤ 1 ≤ 2, we pass and pop the stack. So we add ((3, 3), (4, 4), d) to the output. Checking the output, we see there is no state with an accepting state from both input machines, so we conclude that B subsumes R.

Combining the two algorithms

It is possible to combine the two algorithms to provide a subsumption test that allows arbitrary mixing of &-groups and numeric exponents. The central notion is to iterate alternatively through transitions with actions for &-groups and those with actions for numeric exponents. Consider (a & b{3, 5}){6, 8}. The b particle must meet the minimum for the inner numeric exponent before considering the a, so the update transition through the a comes after the check(2) command. Conversely, it must complete all the branches of the &-group before advancing in the outer one, so the test command would be come before the check command. Therefore, an algorithm combining the two must first need to check all check commands to determine which transitions are active (after updating the stack) whenever the state in R is &-group-only.

Extension to XML Schema

The algorithms as described work on terminals. However, XSD content models generally use non-terminals, such as references to substitution groups and namespace wildcards, which resolve to sets of qualified names. It is the qualified names that are the actual terminals. In this section, we will outline how the algorithms given above can be extended to cover the XSD case.

We can see how XSD “terminals” reference sets in the following examples:

  1. The declaration <element ref="foobar"/> matches not just the element named foobar, but any element in its substitution group.
  2. The particle <any namespace="http://example.com"> matches any element in the given namespace.
  3. An element in the substitution group of foobar can also be in the namespace "http://example.com".

Clearly, in determining UPA or subsumption, we will need to consider these sets and their intersections.

While namespaces are flat, substitution groups form a tree — if R is in the substitution group of B, then the substitution group of R is a subset of the substitution group of B. Each node of this tree is a set; the structure is determined by declarations in the schema. From our perspective, we needn’t worry about the particulars.

For UPA, the issue is no longer whether the same terminal appears twice, but whether any two sets named contain a common value. Since every non-empty set has a non-empty intersection with itself, we can update the UPA algorithm so that whenever two set names appear in a first or follow set together, we check that they have an empty intersection. That way there can never be an occasion where the same qualified name appears in two transitions from the same state.

For subsumption testing, when we compare two transitions, such as (qR,q'R,ρ,[...]) from R and (qB,q'B,β,[...]) from B, it is not necessary that ρ and β be the same terminal, but only that ρ be a subset of β. That ensures that any qualified name matched by the transition in R is also matched by the transition in R.

It is an unfortunate side effect of this approach that UPA testing, in particular, needs to be done for every combination of element declarations a content model appears in, rather than just once. This limits the reusability of schema constructs. It also allows some very unexpected constructs; for example, a wildcard can be a restriction of a substitution group if it just so happens that all the elements defined in the namespace are also in the substitution group. There are ways to address this issue, but they are beyond the scope of this paper.

Conclusion

We’ve shown algorithms for UPA and subsumption testing for an extension to regular expressions covering numeric exponents and general all-groups. Numeric exponents are a new feature added by XSD, and all-groups were a feature of SGML dropped by XML 1.0 to simplify development. Both of these are important features where previously published algorithms were very expensive. While we’ve concentrated on the traditional notion of terminals, we have shown how XSD’s wildcards and substitution groups can be handled.

The run-time performance of these algorithms is very reasonable compared to alternatives. For example, naive algorithms that would unroll the numeric exponents and then determinise them in some form to test UPA would run in doubly exponential time — firstly, unrolling the exponents is O(n * ed), where n is the size of a group, e is the size of the exponents, and d is their nesting depth. Given minimal and maximal values, the resulting automaton is highly non-deterministic, even where the original had UPA (for example, (a{0,10}{0, 10}) expands to 100 (102) optional a’s). Any attempt to determinise this using the traditional subset construction method, AHU, is again exponential in the size of the input (as there may be an exponential blowup in states), giving O(2n * ed ), which is fairly large. Any subset determination on this result is huge compared to the input.

In our case, the performance of the UPA algorithm remains quadratic with the input — only slightly worse than the algorithm on just kleene algorithms. The performance of the subsumption algorithm is only exponential in the depth of the nesting of exponents, or O(n*2d). This is comparatively reasonable.


Bibliography

[AHU] Aho, Alfred, et al., The Design and Analysis of Computer Algorithms, Addison-Wesley Pub., 1974.

[CHID00] Chidlovskii, Boris, Using Regular Tree Automata as XML Schemas, In Proceedings of IEEE Advances in Digital Libraries 2000, J. Hoppenbrouwers et al., eds., Washington, USA, 2000.

[FD] Brown, Allen, et al., XML Schema: Formal Description, World Wide Web Consortium, Sept. 2001, http://www.w3.org/TR/xmlschema-formal/.

[MUR95] Murata Makoto, “Forest-Regular Languages and Tree-Regular Languages”, Technical Report, Fuji Xerox, Japan, 1995.

[MUR99] Murata Makoto, “Hedge Automata: A Formal Model for XML Schemata”, 1999.

[RNG] Relax NG Specification, James Clark and Murata Makoto, OASIS, 2001.

[TATA] Comon, Hubert, et al., Tree Automata Techniques and Applications, 1998.

[USCM] Neumann, Andreas, Unambiguity of SGML Content Models – Pushdown Automata Revisited, 3rd Int. Conf. on the Developments in Language Theory (DLT’97), July 1997, Thessaloniki, Griechenland, In Proceedings of the 3rd International Conference Developments in Language Theory, Symeon Bozapalidis, Hrsg., S. 507-518, Aristotle University of Thessaloniki, 1997.

[WXS] Thompson, Henry, et al., XML Schema Part One: Structures, World Wide Web Consortium, May 2001, http://www.w3.org/TR/xmlschema-1/.



Supporting UPA and restriction on an extension of XML Schema

Matthew Fuchs, Ph.D. [Westbridge Technolgy]
Allen Brown, Ph.D. [Microsoft Corporation]