Given a regular language L, the Brzozowski derivative of L with respect to some string s is a regular expression which defines what strings can follow s in strings appearing in L. The calculation of Brzozowski derivatives is a convenient and beautiful tool for the algebraic manipulation of content models. It can provide convenient solutions to several problems arising in developing, manipulating, and validating content models in W3C XML Schema: validation, checking the uniqueparticleattribution constraint, and testing whether the language accepted by one content model is a subset of that accepted by another.
Keywords: Modeling; Schema Languages; Processing
XML Source  PDF (for print)  Author Package  Typeset PDF 
One of the main advances of SGML and XML over earlier methods of document representation is the idea that userdefined validity conditions for a document type could be formulated explicitly, and that document validity could be assessed automatically. Strategies for understanding, analysing, and implementing those constraints are thus a core problem for the use of markuplanguage technologies. Many discussions of these topics limit themselves to a brief discussion of contentmodel notation and its meaning; deeper discussions may discuss the equivalence between content models and finitestate automata, possibly showing how to translate content models into automata. Vocabulary designers, users, and implementors would all benefit from more serious attention to other techniques of working with content models, notably including the notion of the Brzozowski derivative.
This paper describes some applications of Brzozowski derivatives to problems which arise in connection with schemas for XML vocabularies. It begins with a brief introduction to Brzozowski derivatives and a description of how to extend them to cover content models in XML Schema 1.0, and then shows how to use derivatives to validate document instances against content models, check the Unique Particle Attribution constraint, handle a weakened form of that constraint, check whether one content model accepts a subset of the language accepted by another, and define either of two competing meanings for extensions of the XML Schema 1.0 allgroup.
Software developers can use Brzozowski derivatives to reduce the cost of operations in certain circumstances, or to develop a simple wellunderstood reference implementation against which to check the behavior of faster but more complex code. Vocabulary developers can use Brzozowski derivatives to analyse and understand different ways of formulating particular content models; in particular, it may be very useful in vocabulary design to be able to check mechanically whether one content model accepts only a subset of what another content model accepts.
Brzozowski derivatives (in what follows they will often be called just “derivatives”) were introduced by Janusz Brzozowski in 1964 [Brzozowski 1964].
Brzozowski defines derivatives thus:
Given a set R of sequences and some finite sequence s, the derivative of R with respect to s is denoted by D_{s}R and is D_{s}R = {t  st ∈ R}.
In words, the derivative of R with respect to s is the set of strings t which can follow s in sentences of R, or: the set of strings t such that the concatenation of s and t is a sentence in R.
Regular sets of strings can, of course, be denoted by regular expressions, and Brzozowski's contribution was to show how, given (1) a regular expression E denoting the language R and (2) a string s, to calculate a regular expression D denoting the derivative of R with respect to s. He also proved (3) that of all the derivatives of an expression, only a finite number would be distinct from each other in terms of recognizing different languages, and (4) that even if equal expressions are not always detected, there will still be only a finite number of dissimilar derivatives, if certain simple tests of similarity are performed; he then showed (5) how to construct a finitestate automaton from the set of characteristic derivatives thus identified.
Brzozowski derivatives have applications to schemas and can simplify discussions (and implementations) of validation, unique particle attribution, and similar problems. They can also be used to define ‘weakened wildcards’ (see below).
In their work on content models, Anne BrüggemannKlein and Derick Wood have regularly pointed out the relations between regular expressions, finitestate automata (particularly Glushkov automata), and Brzozowski derivatives [BrüggemannKlein 1993a], [BrüggemannKlein 1993b], [BrüggemannKlein/Wood 1998]. An even earlier mention (with, however, no references to Brzozowski or to any other literature) appears in an unattributed white paper prepared at Software Exoterica [Wilmott 1991], where derivatives are treated as a tool for algebraic manipulations of content models and proofs of contentmodel equivalence.
The earliest suggestion, however, that Brzozowski derivatives ought to be applied in validation software appears to be due to Joe English [English 1999], who provides a terse summary and urges their obvious utility for handling complex content models. Surprisingly, Joe English's own Haskellbased XML parser, HXML, appears not to use Brzozowski derivatives; not being a validating parser, it has no occasion to use them. But the suggestion was taken up by a number of others: Martin Schmidt adapts English's code fragment and applies Brzozowski derivatives in a validating XML parser written as a Master's thesis project and forming part of the Haskell XML Toolbox [Schmidt 2002]. More prominently, James Clark uses derivatives in his RelaxNG validator, Jing [Clark 2002]. According to at least one report ([Foster 2003]), Brzozowski derivatives are also used in Sun's MSV validator [Kawaguchi 2003]. Bob Foster has described (mostly in blog articles) a number of applications of Brzozowski derivatives, including validation of expressions with integerrange exponents on subexpressions [Foster 2004].
Both English and Clark credit their acquaintance with the idea to Mark Hopkins [Hopkins 1994], who uses Brzozowski derivatives in a set of regularexpression tools posted to Usenet in 1994, and stresses the advantage of not having to build large finite state automata for short inputs.
Brzozowski derivatives have an obvious importance to those concerned with Relax NG: they help reduce the cost of Relax NG's nondeterminism and cooccurrence constraints. But they also have applications to other schema languages, and deserve to be better understood among those working with them, including especially those working with XML Schema (hereinafter XSD).^{1}
This section explains the extension of Brzozowski derivatives to XSD content models. No knowledge of Brzozowski's original formulation is required.
Brzozowski worked with a regularexpression notation which differs interestingly both from standard regular expressions and from XSD content model notation. So to apply the notion of derivatives to XSD content models, we need to extend Brzozowski's rules. This is easily done.
For concision, I will use the following notation as a shorthand instead of writing content models in the transfer syntax defined by the XML Schema specification [W3C 2001b]. A content model expression is one of the following:
These expressions denote regular sets of sequences of XML elements as follows:
In examples, the notations familiar from SGML and XML content models will occasionally be used: for some expression E, E* ≡ E{0, unbounded}, E+ ≡ E{1, unbounded}, E? ≡ E{0, 1}.
Content models are either contentmodel expressions or “all”expressions. An “all”expression is:
Some examples may help make things clear.
Note that while in XSD all of the operators  and , and & are nary (i.e. they take an arbitrary number of arguments), we can treat them without loss of generality as binary operators, all of which obey the associative law. That is, for any content model expressions E, F, and G, we have
(E  (F  G)) = ((E  F)  G) = (E  F  G)
and
(E, (F, G)) = ((E, F), G) = (E, F, G)
and
(E & (F & G)) = ((E & F) & G) = (E & F & G)
The last set of equations requires some discussion. They will seem foreign to those used to the SGML & operator, which is not associative, but natural to those accustomed to working with the interleave operator of Relax NG and some recent discussions of content models by computer scientists. In SGML, the expression (a & (b & c)) is not at all the same as ((a & b) & c): the former requires an a to appear at the beginning or end of the sequence, while the latter requires c to appear at the beginning or end, so that a c b is accepted by the first expression, but not the second. In RelaxNG, the two expressions are equivalent. In XSD, the issue does not arise, since XSD does not allow allexpressions to nest within allexpressions and so does not need to specify either the associative or the nonassociative interpretation.
The & operators of SGML and RelaxNG also differ in another way: as suggested by the term interleave, RelaxNG allows subparts of the left and righthand expressions to be interleaved. In RelaxNG, for example, the expression (a, b) & (d, e) accepts the sequence a d b e, as well as any other in which both a occurs before b and c occurs before d, and nothing else occurs. In SGML, the expression accepts only the two sequences “a b d e” and “d e a b”. Here, too, XSD restricts the operator in such a way that the XSD operator can be interpreted either way: in XSD, all arguments of the & operator must be simple contentmodel expressions, so that the question of interleaving subparts of the matching strings does not arise.
Let it suffice for now to observe that the validation semantics assigned to allexpressions below have been formulated in such a way that the binary &operator to which we translate XSD allexpressions does obey the associative law.
It is useful to know, for an expression E, whether ε is in L(E) (i.e. whether the language accepted by E includes the empty string). Following BrüggemanKlein [BrüggemannKlein 1993a], I will say that if ε is in L(E), then E is nullable, and define the predicate ν(E) which is true if and only if E is nullable.
The definition of nullable for contentmodel expressions is:
Some algebraic manipulations will become simpler if we define a related function, which following Brzozowski I call δ. For all regular expressions E,
First we define the derivative of an expression E with respect to a single expanded name x. This we specify to be identical by definition to the derivative with respect to a sequence consisting just of one element whose expanded name is x.
For positive integers, the  operator decrements, but it respects special cases: k = k  1 for integer k > 0, but 0 = 0, and unbounded = unbounded.^{5}
To calculate the derivative of E with respect to a sequence x_{1}, x_{2}, x_{3}, ..., x_{n}, we first take the derivative with respect to x_{1}, then the derivative of that expression with respect to x_{2}, and so on until we are at the end of the sequence. Algebraically,
D_{s} E = D_{<x1, x2, ... xn>} E = D_{xn} (... (D_{x3} (D_{x2} (D_{x1} E))) ...)
The following identities are useful. For any expressions E, F, and G,
In what follows, we'll (silently) make use of these identities to simplify expressions generated by taking derivatives.
Brzozowski proves that every regular expression is guaranteed to have a finite number of distinct (i.e. nonequivalent) derivatives. The set of distinct derivatives of E with respect to the strings in Σ* (for some input alphabet Σ) is called the characteristic derivatives of E.
Brzozowski also defines a test of similarity for regular expressions: expressions are similar if they can be transformed into each other using only the first three identities listed in section 34. Since in practice the detection of equivalence may be expensive, it is encouraging to note that Brzozowski also proves that every regular expression is guaranteed to have a finite number of dissimilar derivatives. The consequence is that even if we fail to detect all equivalences among the derivatives of an expression, we can still avoid having the number of characteristic derivatives grow without bound.
One complication should be addressed here: like all standard work on regular languages, Brzozowski assumes that the input alphabet Σ is finite. Content models, however, are regular expressions over an infinite set, namely XML element types (or element type names). The following discussion silently makes the usual adjustment: for content models without wildcards, it suffices to assume an alphabet including all the names used in the content model, plus one name which matches nothing in the content model. Since in the absence of wildcards all nonmatching names behave the same way, the single nonmatching name successfully represents the infinite set of possible inputs which match nothing in the content model. When wildcards are present, it suffices to include one otherwise unknown name in each namespace mentioned in any wildcard, plus one name in a namespace unmentioned in any wildcard. In this way we can work with finite input alphabets which correctly stand in for the infinite input alphabet provided by XML.
This section identifies some important tasks in XSD processing and shows how Brzozowski derivatives provide convenient and clear ways to formulate and perform the task.
We can use derivatives to validate without ever constructing anything recognizable as an automaton: for content model E and sequence s, E accepts s if and only if ν(D_{s} E).^{6}
For strongly deterministic expressions, the time it takes to calculate the derivative of E with respect to a single input symbol is at most linear in the size of E, and the derivative is never larger than E.^{7} This means that overall, validation of a sequence will be at worst linear in the size of the sequence. (Note that formal proofs of these claims remain a task for future work.)
For example, let us consider the strongly deterministic content model E =
( (script  style  meta)*, ( (title, (script  style  meta)*, (base, (script  style  meta)*)?)  (base, (script  style  meta)*, (title, (script  style  meta)*)) ) )
For nonstronglydeterministic expressions, the derivative may be larger than the original expression and will take the form of a disjunction. If the expression is weakly deterministic, then the number of disjuncts will correspond to the number of states the (nonstronglydeterministic) Glushkov automaton could be in at the given state in the parse.
For example, consider the content model E = (e{1,5}, b{0,2}){1,5}. A straightforward calculation of the derivatives shows how the second and later occurrences of “e” in an input string cause more and more growth in the size of the derivative:
(e{1,5}, b{0,1}){1,5}
e{0,4}, b?, (e{1,5}, b?){0,4}
e{0,3}, b?, (e{1,5}, b?){0,4}  e{0,4}, b?, (e{1,5}, b?){0,3}
e{0,2}, b?, (e{1,5}, b?){0,4}  e{0,4}, b?, (e{1,5}, b?){0,3}  e{0,3}, b?, (e{1,5}, b?){0,3}  e{0,4}, b?, (e{1,5}, b?){0,2}
(e{1,5}, b?){0,4}  (e{1,5}, b?){0,3}  (e{1,5}, b?){0,3}  (e{1,5}, b?){0,2}'
e?, b?, (e{1,5}, b?){0,4}  e{0,4}, b?, (e{1,5}, b?){0,3}  e{0,3}, b?, (e{1,5}, b?){0,3}  e{0,4}, b?, (e{1,5}, b?){0,2}  e{0,2}, b?, (e{1,5}, b?){0,3}  e{0,4}, b?, (e{1,5}, b?){0,2}  e{0,3}, b?, (e{1,5}, b?){0,2}  e{0,4}, b?, (e{1,5}, b?)?
SGML DTDs, XML 1.0 DTDs, and XSD 1.0 all share a rule which restricts the forms of regular expressions legal in content models, in order to ensure that they do not require unbounded lookahead. ISO 8879 requires that content models be “unambiguous”. The sense assigned to ambiguity is somewhat broader than the sense usual in formal language study: the expression a?, a is ambiguous in the SGML sense, because when a parser is confronted with a sequence beginning with an a, lookahead is required to know whether it matches the first a in the expression, or the second. It is not ambiguous in the usual formal sense: for any word in the language, there is only one parse tree. In part to avoid the confusion caused by this unusual usage of the term ambiguity, the XML 1.0 spec requires instead that content models be deterministic. Determinism, however, may be defined in either of two ways, weak or strong. To avoid the confusion caused by the two flavors of determinism, XSD 1.0, in turn, imposes what it calls the Unique Particle Attribution constraint. All three of these names denote essentially the same requirement: that it be possible, without lookahead in the document instance, to know which token (or particle) in a content model is matched by a token in the document instance, and that there be at most one such token. This is what BrüggemannKlein calls weak determinism, as opposed to strong determinism, which additionally requires that every repetition of a token be licensed by at most one occurrence indicator in the expression.^{11}
Previous formulations of this constraint have either provided an algorithm translating expressions into automata, required that this translation be performed, and appealed to properties of the resulting automata (so BrüggemannKlein and Wood in [BrüggemannKlein 1993a] and [BrüggemannKlein/Wood 1998]), or else they have refrained from requiring such a translation, but then found themselves unable to specify precisely what the constraint is, relying instead on nudges and winks and hopes that the reader understands what is meant (so ISO 8879 [ISO 1986] and the XML specification [W3C 2004]).^{12} Using Brzozowski derivatives, however, the UPA constraint of XSD 1.0 can be specified constructively and precisely, without reference to the corresponding automata.^{13} Two approaches offer themselves.
The first idea is this: we define a property which holds of some contentmodel expressions: determinism with respect to initial x, where x is an input signal (for us: an expanded name). The UPA constraint can now be expressed thus:
We can now define the function idet(x, E), which is true if and only if E is deterministic with respect to initial x.
This approach was anticipated in part by Sam Wilmott [Wilmott 1991], but the treatment of repetition and of sequences differs there.^{14}
A few simple examples may illustrate the method. First, let E be a{2,4}, a. If we test the first few derivatives of E for initial determinism with respect to a, we have:
When n = m, however, the situation is different, as shown by a second example. Let E be a{2,2}, a. There are five characteristic derivatives of E (for any alphabet containing a and at least one other symbol, here represented by b). We have:
As final example, let E = (a, a?){2,4}. We have:
In some cases, the work needed to test an expression for initial determinism can be reduced. This can be important for large maximum numbers of repetitions, especially when expressions with repetition operators nest. The expression (e{0,1000}){0,1000}, for example, has 1,000,002 (1000 × 1000 + 2) distinct derivatives.^{16}
The key to reducing the cost of the calculation is noticing that for 1 < n < m and any expression E, the expression E{n, m} is deterministic for initial x if and only if E{n, m} is deterministic for initial x. This means that n can be reduced to 1, and m to 2, without changing the result of the determinism calculation (but saving some arbitrarily large number of derivatives in the process). When the minimum and maximum numbers of repetitions are the same, then they can be reduced to 2 but not to 1. That is, if n = m > 2, then E{n, m} is deterministic for initial x if and only if E{n, m} is deterministic for initial x.
More formally: every subexpression E of the form F{n,m} can be replaced by another expression E′, constructed so that E obeys the UPA constraint if and only if E′ does, but E′ has fewer characteristic derivatives. The value of E′ is:
It's intuitively clear that for any expression E we can construct the E′ described. Informally, it's not hard to see that E is weakly deterministic if and only if E′ is: in expressions of the form F{n,m}, there are two possible sources of nondeterminism other than nondeterminism within F itself, each unaffected by the reduction of n to 1 or 2 and m to 2:
It would be desirable to have a more formal proof that the reduction in exponents (a) reduces the number of distinct derivatives and (b) preserves strong and weak determinism / nondeterminism. Such a formal proof remains a task for future work.
A second approach is to define a function c which counts the number of tokens in the expression which can match an initial symbol x:
The XSD Unique Particle Attribute constraint, the XML 1.0 determinism rule, and the SGML ambiguity rule then all amount to this:
To illustrate, let us repeat the first example from the previous section: let E be a{2,4}, a. For the first few derivatives of E, we have:
In discussions of XSD and the Unique Particle Attribution constraint, various alternative validation behaviors have occasionally been proposed; one of the most popular ideas is ‘weakened wildcards’. This is a weakened form of the Unique Particle Attribution constraint, which allows content models in which both an extended name and a wildcard can match an input element, with the proviso that in cases of such competition, it is the extended name, not the wildcard, which is used to match the input element. Competition between element tokens or between wildcards is still prohibited.
Either of the two approaches to the UPA constraint described above can be extended to handle weakened wildcards.
The idet predicate can be specialized in the obvious way into two predicates, edet (which is true just in case there is at most one element particle in the content model which can match the input symbol) and wdet (true if there is at most one matching wildcard). The weakened form of the constraint is then:
The c function can similarly be specialized to make functions that count just element tokens (ecount) or just wildcard tokens (wcount) which can match an initial x. The weak wildcard rule then says that for every member F of the characteristic derivatives of E and every symbol x in the input alphabet, it must be the case both that ecount(x,F) ≤ 1 and that wcount(x,F) ≤ 1.
This topic has already been treated, using different formalisms, by Henry Thompson and Matthew Fuchs [Thompson/Tobin 2003], [Fuchs/Brown 2003]. Brzozowski derivatives provide a different way of considering the problem, which seems in some respects more convenient and concise.
Section 441 below extends the notation described above in section 31, adding (as Brzozowski does) the operators 'and' and 'not' to the expression language and showing how to take their derivatives. They are redundant and do not extend the expressive power of the language, but they are convenient to have when considering subsumption.
Section 442 expresses the subsumption test in several ways using derivatives, notably:
B subsumes R if and only if for all strings s, δ(D_{s} (R and ~B)) = ∅
(N.B. this definition of subsumption corresponds to shallow local validation, not to deep validation.)
Section 443 describes a relatively simple way to test subsumption: find the set of characteristic derivatives of (R and ~B) and test each one for nullability. Some speculations about complexity are also included.
Section 444 works in detail through some examples.
Section 445 is a series of proofs showing that the various formulations of subsumption offered in section 442 are in fact equivalent.
It will be convenient, before we proceed, to extend the expression language defined earlier. In addition to the expressions defined above, the following are also legal expressions:
We need to define how to take derivatives with respect to expressions of these shapes. Brzozowski proves that for all strings s,
It is a well known result that for two regular expressions B and R, it is readily decidable whether B subsumes R, i.e. whether the language accepted by R is or is not a subset of that accepted by B. Since content models are so similar to regular expressions, it has seemed plausible that the same relation is decidable for content models. In the worst case, it is always possible to translate the two content models first into regular expressions and then into finite state automata exploit the standard algorithms. The presence of wildcards, the all connector, and repetition operators other than the Kleene star, however, has complicated the story somewhat. Uncertainty about the possible effect of these complications contributed to the decision to define complex type restriction not in terms of the subset relation between languages accepted by two types, but instead in terms of a complex set of constructive rules for checking content models of substantially similar structure. It is thus particularly significant that Brzozowski derivatives offer a way to define and check the subsumption relation between two content models.
To test whether expression B subsumes expression R, we want to test any one of the following propositions:
A proof that these expressions are equivalent is given in section 445 below.
The simplest way I can find to define the subsumption test using derivatives is this. We wish to establish that for all strings s over the input alphabet, δ(D_{s} (R and ~B)) = ∅.
Brzozowski proves that any expression E has "a finite number of types of derivatives", where two expressions E and F are of the same type if and only if L(E) = L(F), i.e. the two expressions are equivalent. The n different types (equivalence classes) of derivatives of E are called the characteristic derivatives of E. Brzozowski defines a straightforward way to generate all the characteristic derivatives of E, which involves examining increasingly long strings over the input alphabet until a fixed point is reached. The algorithm runs something like this:
We'll use variables n (current string length), X (the set of characteristic derivatives found so far), and f (a Boolean flag).
The finiteness of the set of characteristic derivatives gives us one simple way to test proposition (4):
If no characteristic derivative of (R and ~B) is nullable, then B subsumes R.
There may be some way to do this by analysing R and B separately, and performing some calculation on their two sets of characteristic derivatives, but in the general case this doesn't seem to simplify things; in the worst case, R has n characteristic derivatives and B has m, and we have to check n × m combinations. This is exactly what taking the characteristic derivatives of (R and ~B) does: since for any string s we have
D_{s} (R and ~B) = (D_{s} R) and (D_{s} ~B) = (D_{s} R) and ~(D_{s} B)
we end up calculating the derivatives of R in parallel to those of B, with the exception that we don't necessarily generate all n × m pairs of characteristic derivatives.
It seems clear that calculating the set of characteristic derivatives of (R and ~B) takes about as much work as (and is isomorphic to) calculating the M_{R} and M_{B} (the Glushkov automata of R and B) and then calculating, for all strings over the alphabet, whether the string is in B and not R.
Which in turn is about the same amount of work it takes to calculate the two Glushkov automata, negate the automaton for B, and calculate the automaton which is the intersection of the M_{R} and M_{~B}.
It may be possible to reduce the cost of translation by translating content models not into standard finite state automata but into what might be called ‘counterextended automata’ (described in an unpublished paper [SperbergMcQueen 2004]). In the worst case, however, counterextended automata explode in size in much the same way as a naive expansion of content models into standard regular expressions. I still don't know a good theory explaining when and how using extended automata will be cheaper than expanding to standard expressions and using standard automata, and by how much. This is another topic for further work.
Consider first the pair of expressions
We test by seeing whether among the characteristic derivatives of E = (R and ~B) there is any which is nullable. The characteristic derivatives are found thus:
For n = 0:
For n = 1:
For n = 2:
For n = 3, all strings produce D_{s} E = ∅, which is not nullable.
Note that (as expected) the sequences “ab” and “ba” constitute counterexamples: this derivation is not legal because they are locally valid against R but not against B.
If we rewrite the base type as B′ = (a? & b? & c?) then we get these results:
For n = 0, 1, 2:
For n = 3, all strings get D_{s} E = ∅. Nullable = false.
As may be seen, the characteristic derivatives of (R and ~B′) are:
We obtain the characteristic derivatives of E thus:
For n = 0, 1, 2:
For n > 3, for all strings s, D_{s} E = ∅. So the characteristic derivatives are those given above, none of them nullable. So B subsumes R. Q.E.D.
The characteristic derivatives are all reached by strings of length < 3:
B subsumes R. Q.E.D.
It may be worth working an example with larger content models, just to give an idea of how this algorithm behaves when the expressions are not quite so simple.
Let's take
B does subsume R (at least that was the intent when they were written), but it's not easy to see this at a glance. I won't go through the algorithm in full detail, but here are some notes.
Expression E has, if my calculations are correct, 18 characteristic derivatives, one each calculable as D_{s} E for s in the 13 prefixes of the string “abcdbcdbcdef”, four more for the last five prefixes of “abcdbcdbcdbcdef” (one of those prefixes has the same derivative as a known one), and one (the empty set) which applies to all other strings. Some of these are given below; there seemed little point in listing all of them.
The derivatives steadily get more and more complex as the string gets longer and exercises more and more heavily the nondeterminism in the exponents of the base type:
For n = 0,
For n = 1:
For n = 2:
(((c, d, b){2,3}), c, d, e, f) and ~( ( (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,3}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?)) ), f)
For n = 3:
(d, b, ((c, d, b){1,2}), c, d, e, f) and ~( ( (d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,3}))  (d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?))  (d, ((b, c, d){0,4}), (e?)) ), f )
When we reach n = 11, the derivative has reached alarming proportions:
(c, d, e, f) and ~( ( (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,3}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?))  (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?))  (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?))  (c, d, ((b, c, d){0,2}), (e?))  (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?))  (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?))  (c, d, ((b, c, d){0,2}), (e?))  (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?))  (c, d, ((b, c, d){0,2}), (e?))  (c, d, ((b, c, d)?), (e?)) ), f)
Alert eyes will note that the rudimentary expression simplifier used to produce the derivative just shown has not fully simplified some of these expressions: a smarter simplifier eliminates some of the subterms in the negated expression, producing
(c, d, e, f) and ~( ( (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,3}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,4}), (e?))  (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,3}), (e?))  (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d){0,2}), (e?))  (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,2}))  (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?))?))  (c, d, ((b, c, d)?), (e?)) ), f)
For both n = 11, s = “abcdbcdbcde”, and n = 14 s = “abcdbcdbcdbcde”, there is a marked simplification of the derivative: D_{s} E =
f and ~( ( ((((b, c, d){0,5}), (e?)){0,3})  ((((b, c, d){0,5}), (e?)){0,2})  ((((b, c, d){0,5}), (e?))?)  ε ), f)
Note that the “e” at the end of each input string dramatically reduces the uncertainty in the expression.
The two strings which correspond to valid strings in R have a derivative of interesting form:
It's interesting to see that while we do get rather large expressions in the course of the work, we don't in fact seem to have quite as much growth as the exponents in the base type would have led us to expect. A rewriting of the base type as a standard regular expression will have a Glushkov automaton with 67 states, and the similar standard Glushkov automaton for R will have 16 states. The worst case size of the FSA for their intersection will have 1072 states (67 × 16). A cautious intersection method, which constructs states only when they prove to be needed, won't construct nearly that many, of course; I infer that working with Brzozowski derivatives gives us a similarly thrifty approach.
In section 442, I claimed that the following ways of formulating the subsumption test are all equivalent:
Note that for all strings s and all expressions E, s is in L(E) if and only if ε is in L(D_{s}E). This follows from the definition of derivatives: D_{s} E is defined specifically as the set of strings t such that the concatenation st is in L(E). For all strings s, the concatenation sε = εs = s, so sε is in L(E) if and only if s is.
By substitution of like for like in (1), we now have
Note that for all sets E, δ(E) = ε if and only if ε is in E. Substituting like for like in (1′), we get (2). Q.E.D.
An expression E denotes the empty set if and only if
By the definition of derivatives, we then have (5) equivalent to
The normal logical operations on (6) give us
Substituting the expression (R and ~B) for E, we have the following translation of (3), which will be shown equivalent to (2):
The definition of derivation with respect to and gives us:
The definition of δ turns this first into
De Morgan's law then give us:
The definition of δ(~E) then gives us
And this is equivalent, by the definition of material implication, to
which is identical to (2).
Q.E.D.
As noted above (section 31), the allexpressions of XSD 1.0 are restricted in such a way that they can readily be realized by, or interpreted as, either the & operator of ISO 8879 or the interleave operator of Relax NG.
It is sometimes suggested that the restrictions on XSD allexpressions be lifted; if this is done, it will be necessary to define precisely the intended semantics. This section shows how derivatives can be used to specify either the noninterleaving or the interleaving interpretation.
In section 33, a rule was given for constructing the derivatives of &expressions. If the arguments of the & are allowed to be arbitrary expressions, rather than only simple contentmodel expressions and &expressions, then the resulting expressions have an interleave semantics.
Consider the expression E = (a, b+) & ((c*  d+), e). In the interleaving interpretation of the operator, it should accept any string consisting of one a followed by one or more b symbols, interleaved with a single e preceded either by one or more d symbols or zero or more c symbols. So the strings “abeb” and “dabeb” (to name two) are included.
Tracing the validation process on the first of these examples may serve to illustrate the process.
Since ν(D_{abeb} E) = ν(b*) is true, it is clear that E accepts the sequence “abeb”.
Note the way in which the original subexpression “a, b+” is gradually consumed in steps 1 and 2, with the remainder reappearing in the &expression in the derivative. The subexpression “(c*  d*), e” reduces, in step 3, to the empty set (which is removed when the expression is simplified).
To suppress the interleave interpretation, it suffices to change the definition of the derivative D_{s} (F & G) to require that once F or G is started, it must be completed before the other expression can be recognized. The new rule is:
The result is that while the sequence “abde” is accepted, the sequence “abeb” is not:
It will be seen that the operator thus defined is not associative: in the expression “((F & G) & H)” the strings recognized by F and by G must be adjacent, since once the subexpression “(F & G)” is begun, it must be completed before the H can be recognized, while in the expression “(F & (G & H))” it is the G and the H which must be adjacent. It is an open question whether a noninterleaving & operator can be defined in an associative way.
This paper has reviewed the basic concepts of Brzozowski derivatives and illustrated their applicability to problems arising in checking and using XML Schema content models. Brzozowski derivatives provide a convenient way of validating documents against content models whose translation into finitestate automata would be prohibitively large. They allow us to define the SGML and XML rules governing determinism of content models concisely, in terms of properties of the content models themselves and without requiring that the expressions be translated into automata. Brzozowski derivatives can also be used in the concise formulation of alternative constraints, for example weakened wildcards or less restrictive versions of the XML Schema allexpression. And finally, derivatives provide a simple and direct way to test two content models to see whether one accepts any sequences not accepted by the other. In some cases, the use of Brzozowski derivatives may require less work than the use of finitestate automata; in others, even when the theoretical complexity of the problem is unaffected, the algorithm using derivatives may be more easily specified, coded, and understood than the corresponding algorithm using automata.
Despite their use in MSV and Bob Foster's essay on numeric exponents in content models [Foster 2004], Brzozowski derivatives appear not to be widely known among those working on XSD. I was unaware of Foster's work myself until after the first version of this paper had been completed. And MSV appears not to exploit all of their applications to XSDrelated problems: among the parts of XSD not implemented by MSV are two addressed here by means of derivatives. 

For convenience, I will use the same notation to refer both to individual sequences and to the singleton sets consisting solely of those sequences. so in what follows, ε will be used indifferently both for the set containing just the empty sequence and the empty sequence itself. When a sequence contains just a single item, the same notation may be used to denote that item, the lengthone sequence containing just that item, and the singleton set containing just that sequence. This does not seem to lead to any ambiguities, and it helps keep the notation simpler. But if any reader of this document finds any ambiguities, the author would be glad to be informed so I can mend my text, or my notation. 

In some contexts, it may be useful to regard expanded names as denoting elements with a particular expanded name and a particular type. 

Or, using δ, we can say more compactly: “D_{x} (F, G) = ((D_{x} F), G)  (δ(F), D_{x} G)” 

Hence the special case for E = F{0,unbounded}. 

Recall that D_{s} E denotes, by definition, the set of strings which when appended to s will produce members of E. If ν(D_{s} E), then ε is in D_{s} E, which means that if we concatenate s with ε, the result is in E. But ε is the identity with respect to concatenation: for all sequences t, tε = t. If concatenating s and ε gives a string in E, then s is already in E, without the concatenation. 

Informally, a regular expression is weakly deterministic if it obeys the rule known in SGML as the ‘ambiguity rule’, in XML as the ‘determinism rule’, and in XSD as the ‘Unique Particle Attribution constraint’, which requires that there never be any doubt or nondeterminism, in sequences which match the expression (or prefixes of such sequences), about which symbol in the expression matches each symbol in the sequence. A regular expression is strongly deterministic if it is not only deterministic which symbol in the expression is matched by each symbol in the input, but also deterministic which repetition operator fires, whenever more than one might apply. The expression “a?, a” is weakly nondeterministic (although wholly unambiguous), while “(a*)*” is weakly deterministic (there is only one symbol to match), but not strongly deterministic (since for the second and all subsequent occurrences of a it is nondeterministic whether the inner or outer * should fire). For more formal definitions of these terms, see BrüggemannKlein [BrüggemannKlein 1993a]. For a formal definition of deterministic (“oneunambiguous”) regular expressions, see BrüggemannKlein and Wood [BrüggemannKlein/Wood 1998]. 

For those who wish to follow the details:


The structure of the initial step is just as before: since E = (F, G) and F is nullable, this is
For the first alternative, we have
For the second alternative, we have


By the definition of derivatives for sequences, we have


In standard regular expressions, every expression which is weakly deterministic but not strongly deterministic has an equivalent which is both weakly and strongly deterministic: expressions like “(e*)*” can be reduced to “e*” without changing the language recognized. In XML Schema content models, similarly, “e{0,10}{1,10}” can be reduced to “e{0,100}”. In some cases, however, strong determinism cannot be so readily achieved. It seems unlikely that expressions like “(e{1,5},b{0,2}){1,5}” can be translated into strongly deterministic equivalents without either loss of weak determinism or exponential explosion in size, but as far as I know this is an open question. 

The XML specification comes close to having the worst of both approaches: the description in the text is mostly handwaving, and to get a precise idea of what is involved, the reader must follow the references to published algorithms for translating regular expressions into automata and then mentally perform the specified translation. 

It would be interesting to see if a precise definition of strong determinism could also be formulated in terms of the expressions and their derivatives, which avoids appeal to the construction of automata. It seems possible, but I do not have such a definition to offer. 

In some respects, the definition of ambiguity in Sam Wilmott's paper [Wilmott 1991] is very attractive: it does not explicitly require generating every characteristic derivative, and it requires checking over all input symbols only in the cases “F  G”, “F, G”, and “F*”. But the definition given there requires that every associative grouping of sequences be examined: “ab is ambiguous if ... for some c and d, where c ≠ a and d ≠ b, ab = cd, and cd is ambiguous.” As formulated, this appears to refer to the language accepted by the expressions ab and cd, but I believe it should be taken to mean only that an expression of the form ((E, F), G) needs also to be examined in the form (E, (F, G)), not that the language recognized by an expression of the form ab (in conventional regular expression notation) needs to be examined to see if it can be recognized by some other expression of the form cd, which would really be expecting a bit much. The example given is “(x(y+λ))y” (in content model notation this would be “(x, y?), y”), which needs to be refactored into “x((y+λ)y)” (i.e. “x, (y?, y)”) for the nondeterminism to be detected. The definition given here avoids the necessity for such qualifications and complications by specifying that the property must hold for all characteristic derivatives. It is unnecessary to refactor E = “(x, y?), y”, because D_{x}E = “(y?, y)”, and the nondeterminism will be detected there. A more serious drawback for markup purposes is that the definition in Wilmott [Wilmott 1991] covers only regular expressions, not content models; this is a drawback since as BrüggemannKlein and Wood [BrüggemannKlein/Wood 1998] point out, translation from content models into conventional regular expressions (i.e. expressions without the +, ?, or & operators) does not preserve oneunambiguity. 

By the rule for idet(F, G)). 

The original expression, plus one derivative each for strings of “e” of length 1 through 1,000,000, plus ∅, which the derivative with respect to any other string. 

In technical terms, tokens in F which are in the follow set of tokens in the final set of F. For definitions of start set, final set, etc., see [BrüggemannKlein 1993a] or [BrüggemannKlein/Wood 1998]. 
[BrüggemannKlein 1993a] BrüggemannKlein, Anne. 1993. “Regular expressions into finite automata.” Theoretical Computer Science 120.2 (1993): 197213.
[BrüggemannKlein 1993b] BrüggemannKlein, Anne. 1993. Formal models in document processing. Habilitationsschrift, Freiburg i.Br., 1993. 110 pp. Available at ftp://ftp.informatik.unifreiburg.de/documents/papers/brueggem/habil.ps (Cover pages archival copy also at http://www.oasisopen.org/cover/bruggDissertps.gz).
[BrüggemannKlein/Wood 1998] BrüggemannKlein, Anne, and Derick Wood. 1998. “Oneunambiguous regular languages.” Information and computation 140 (1998): 229253.
[Brzozowski 1964] Brzozowski, Janusz A. “Derivatives of regular expressions” Journal of the ACM 11.4 (1964): 481494.
[Clark 2002] Clark, James. 2002. “An algorithm for RELAX NG validation.” 13 February 2002. http://www.thaiopensource.com/relaxng/derivative.html
[English 1999] English, Joe. 1999. How to validate XML. http://www.flightlab.com/~joe/sgml/validate.html
[Foster 2003] Foster, Bob. 2003. “Derivative works [Parsing].” Blog entry, 27 January 2003, in Technical difficulties. http://bobfoster.com/plog/index.php?m=200301
[Foster 2004] Foster, Bob. 2004. “Derivatives of Bounded Rep[e]tition.” Blog entry, 17 May 2004, in City Life: Full Speed Ahead. http://jroller.com/comments/bobfoster/FullSpeedAhead/derivatives_of_bounded_repitition
[Fuchs/Brown 2003] Fuchs, Matthew, and Allen Brown. 2003. “Supporting UPA and restriction on an extension of XML Schema.” Extreme Markup Languages, Montréal, 2003. Available online at http://www.idealliance.org/papers/extreme03/html/2003/Fuchs01/EML2003Fuchs01.html
[Glushkov 1961] Glushkov, V. N. 1961. “The abstract theory of automata”, tr. J. M. Jackson, in Russian Mathematical Surveys, A translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk, ed. J. L. B. Cooper, vol. 16 (London: CleaverHume, 1961), pp. 153.
[Hopkins 1994] Hopkins, Mark. 1994. “Regular Expression Software”. Posting to comp.compilers, 17 February 1994. http://compilers.iecc.com/comparch/article/9402109
[ISO 1986] International Organization for Standardization (ISO). 1986. ISO 88791986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.
[Kawaguchi 2003] Kawaguchi, Kohsuke. 2003. Sun MultiSchema XML Validator. Version 1.2, April 2003. Available from http://www.sun.com/software/xml/developers/multischema/
[Schmidt 2002] Schmidt, Martin. 2002. Design and Implementation of a validating XML parser in Haskell. Master's thesis, Fachhochschule Wedel (Germany).
[SperbergMcQueen 2004] SperbergMcQueen, C. M. 2004. “Notes on finite state automata with counters.” http://www.w3.org/XML/2004/05/msmcfa.html
[Thompson/Tobin 2003] Thompson, Henry S., and Richard Tobin. 2003. “Using finite state automata to implement W3C XML Schema content model validation and restriction checking.” XML Europe 2003, London. Available online at http://www.idealliance.org/papers/dx_xmle03/papers/020205/020205.html (imperfectly rendered) or at http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html.
[W3C 2001a] World Wide Web Consortium (W3C). “XML Schema Part 0: Primer”, ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, SophiaAntipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema0/.
[W3C 2001b] World Wide Web Consortium (W3C). 2001. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, SophiaAntipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/RECxmlschema120010502/
[W3C 2001c] World Wide Web Consortium (W3C). 2001. XML Schema Part 2: Datatypes, ed. Biron, Paul V. and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, SophiaAntipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/RECxmlschema220010502/
[W3C 2004] World Wide Web Consortium (W3C). Extensible Markup Language (XML) 1.0 (Third Edition), ed. Tim Bray, Jean Paoli, C. M. SperbergMcQueen, Eve Maler (Second Edition), François Yergeau (Third Edition). W3C Recommendation 4 February 2004. Published by the World Wide Web Consortium at http://www.w3.org/TR/RECxml/.
[W3C 2004] World Wide Web Consortium (W3C). Namespaces in XML 1.1, ed. Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin. W3C Recommendation 4 February 2004. Published by the World Wide Web Consortium at http://www.w3.org/TR/xmlnames11.
[Wilmott 1991] [Wilmott, Sam]. [1991]. Content model algebra. Exoterica white paper. [Dated and attributed from external evidence. Original at http://www.omnimark.com/white/cma/cma.html has disappeared, but copies appear to be available at http://www.oasisopen.org/cover/omnimarkContentModelAlgebra200101.html and http://home.chello.no/~mgrsby/sgmlintr/sgmlcont.htm]