Querying XML documents is a basic task for many XML applications. Typically, matches of queries are located individually, i.e. a match is a single node in a queried document. kary queries, in contrast, simultaneously identify k locations in the document, which are connected via some specified relation. For example, in rulebased transformations (like XSLT transformations), a node to which a rule is applied and another node which is selected by the rule in the context of the first node could be identified together by a binary query defining the relation between these two nodes. Such a binary query would replace the match and the select pattern used to identify the first and the second node respectively. By eliminating the need for select patterns, binary queries would not only reduce the number of patterns in a transformation, but would also allow that all the pattern matching is done statically, i.e. before the transformation actually begins. We use the formalism of regular forest grammars to define queries of arbitrary arities. Pushdown forest automata have been used in the past to efficiently implement unary queries expressed by using forest grammars. The main contribution of this paper is an algorithm based on pushdown forest automata which evaluates binary queries. In the worst theoretical case the algorithm evaluates binary queries in time proportional to the square of the size of the input document. In practice however, the complexity is mostly rather linear in the input size. Queries for which the right context does not need to be checked can be evaluated even along with the parsing of the input document. Rather than using forest grammars, queries can be also expressed in a more intuitive pattern language using a syntax similar to XPath.
XML Source  PDF (for print)  Author Package  Typeset PDF 
Searching in documents for components with specific properties, is one of the most fundamental tasks in XML document processing applications. Various pattern languages have been designed for identifying properties of document components, see e.g. [FSW99][W3C99c][Fxg02]. Besides the basic use of retrieving parts of XML documents, querying has been also used in XML transformation languages for referring to parts of XML documents manipulated by the transformation languages [W3C99a][Fxt01]. Clearly, in order to cover the wide range of possible applications, the language of XML query patterns should be both very expressive and also efficiently implementable.
Perhaps the most widely known language of query patterns is XPath [W3C99c][W3C99d]. The XPath patterns describe unary queries, i.e. they locate individual nodes in the input. The same is also true for the query language supported by fxgrep [Fxg02]. In this paper, we are going to lift this limitation. Generalizing the regular approach to querying of fxgrep, we propose a concept of recognizable kary queries and present techniques for the efficient implementation for the special case of binary queries. Recognizable kary queries simultaneously locate ktuples of nodes in the input document which simultaneously satisfy a specific property.
In particular, the result of a binary query can be interpreted as a tabulation of relative links between parts of documents. This may be exploited in rulebased XML transformation languages. XSLT transformations for example [W3C99a][W3C99b], use unary queries, expressed through XPath patterns, for two purposes. The first purpose is to differentiate components of the input document requiring different processing, i.e. given a node, to identify the applicable rule for it. Patterns used for this purpose are called match patterns. Secondly, within a rule, patterns are used for selecting nodes relative to the node to which the rule applies for further processing. Patterns used for this purpose are called select patterns. The match patterns can be queried once before the actual transformation. In contrast, select patterns must be repeatedly queried during the transformation phase. Note also that the class of select patterns is stronger (and more difficult to implement) as these allow for arbitrary navigation in the document.
Our key application for binary queries is to eliminate the dynamic select queries by combining the unary match pattern of a rule and a select pattern within this rule into a (statically tabulated) binary query.
Example: Consider the following XML input document:
<company> <url>spice.girls</url> <empl><name>Mel A.</name></empl> <empl><name>Mel B.</name></empl> <empl><name>Mel C.</name></empl> </company>
The following XSLT rule produces a homepage element for each employee:
<xsl:template match="company[url]/empl"> <homepage> <body>Under construction. See the company's page:<link><xsl:copyof select="../url"/></link> </body> </homepage> </xsl:template>
A binary match could simultaneously locate an employee and the url of her company. Let us suppose that binary queries were possible in XPath. Let the second element of a binary match be specified by preceding the corresponding node in the pattern with the % symbol, and referred within the rule by using the same symbol. The rule above could then be expressed as follows:
<xsl:template match="company[%url]/empl"> <homepage> <body>Under construction. See the company's page:<link>%</link> </body> </homepage> </xsl:template>
In the reminder of this paper we show that an expressive class of binary queries can be evaluated efficiently. Many select queries relative to the dynamic current context can be collected into one binary query whose evaluation can be performed statically, i.e., preceding the transformation of the input document.
XSLT keys contribute one special case of binary matches. Basically, a key is a pair consisting of the node which has the key and the value of the key (a string). The node is identified using a match pattern, while the value is given by a select pattern evaluated in the context of the node. Thus, the binary matching algorithm presented here can also be used to implement the XSLT keys.
The latest drafts of XPath [W3C99d] and XQuery [W3C99e] provide a for and a FLWR expression, respectively, which allow variables to be bound to nodes which are matches of unary queries. These nodes can be used in the scope of the expressions as context for the evaluation of further unary queries. This use of for expressions also qualifies for an implementation which uses binary queries to subsume the two unary queries.
XML documents are textual representations of sequences of trees which we call forests^{1}. As usual, the trees t and forests f over an alphabet Σ are defined as:
The definition naturally captures XML elements without attributes and having as subtrees only element nodes. The definition captures, though, also realistic XML elements. Text nodes can be represented as special elements containing an empty subelement whose name is the respective text. Processing instructions can be represented as special elements containing two text nodes, one for the processor part and one for the instruction part. Finally, attributes can be collected under a special element as pairs of text nodes containing the name and the value for each attribute.
As in XML, we abbreviate <a></a> to <a/>. Given a tree t = <a>f</a>, we say that a is the label of t or that t is labeled by a. We write this as label(t) = a.
An important task in document processing consists in verifying a structural property of a document tree. For XML, various schema languages have been proposed for this purpose. Besides the documenttype declaration of a document, people have proposed, e.g., XMLSchema [W3C99f], DSD [KMS00], RelaxNG [Oas01]. In essence, all these formalisms specify regular forest languages. For a discussion on this see [MLM01]. Regular forest languages constitute a very expressive and theoretically robust formalism for specifying properties of forests [BW98][Mur95][BW98]. Validating against one of these XML schema languages is therefore a test for membership in a forest regular language. Forest regular languages can be specified using forest grammars.
Definition: A forest grammar over Σ is a tuple G = (X,r_{0},R), where X is a set of variables, r_{0}, a regular expression over variables, is the start expression, and R is a finite set of rules of the form x → <a>r</a> with x ∈ X, a ∈ Σ and r a regular expression over variables.
Example: Consider the grammar G = ({x_{T},x_{1},x_{a},x_{b},x_{c}},(x_{1}x_{a}),R) over {a,b,c} with the following rules:
The first three rules make x_{T} account for trees with arbitrary content. As specified by rule number 5, x_{a} stands for the a element with a b and a c child. Rules 6 and 7 say that these children can have arbitrary content. Finally, rule 4 specifies that the a node specified by 5 can be at arbitrary depth in the input, and all its ancestors must be labeled a.
A grammar is the formal specification of the set of forests which can be derived from a variable string conforming to r_{0}, by using rules from R in a number of steps. In each step, a variable symbol x for which a rule x → <a>r</a> exists is replaced by <a>w</a>, where w is a variable string conforming to r.
In the examples throughout the rest of the paper we will use the grammar G from above and the following input document:
<a> <a> <b/> <c/> </a> <a> <b/> </a> <a> <b/> <c/> </a> </a>
The tree representation of the input document
The possible derivations of the input under G are presented in Figure 2. Note that a derivation annotates each node of the input with a variable x from X.
Patterns for unary queries, conceptually, consist of two parts. The contextual part constrains the surrounding context of the subtrees of interest, whereas the structural part describes properties of the subtrees themselves. In [Neu00], forest grammars have been used to specify both the contextual and the structural part of unary queries. The idea is that a variable x of a grammar is used to characterize the set of trees which can be generated from x. Thus, given a grammar G, we may specify the structural part of a pattern simply through a variable x. In order to capture the context, we use the grammar itself which (given a variable for the whole document) constrains the places where the variable x may occur in derivations.
Thus, a unary query is a pair Q = (G,X_{targets}) consisting of a forest grammar G = (X,r_{0},R) and a set of target variables X_{targets} ⊆ X. A subtree t of an an input forest f is a match for Q if and only if there is a derivation of f with respect to G in which the root of t is labeled with a symbol from X_{targets}.
Example: The query Q_{1} = (G,{x_{a}}) locates the a nodes having only a ancestors and as children, a b followed by a c. In our input document, these are the leftmost and the rightmost children of the root element. One can see that both are indeed labeled with x_{a} by the first and the second derivation of the input document, respectively.
In [NS98], a new class of automata, pushdown forest automata, has been introduced which allows the efficient implementation of such unary queries [Neu00][NS98] (see below).
In this paper we extend the notion of unary queries as presented in the last subsection to kary queries for k ≥ 1. The idea is very simple. Instead of using a set of target variables, we now have a set of ktuples of variables. Thus, a kary query is a tuple Q = (G,T) consisting of a forest grammar G = (X,r_{0},R) and T a set of ktuples of variables (x_{target1},...,x_{targetk}) ∈ X^{k}.
Given a query Q and an input forest f, the matches of the query Q are the ktuples of trees from f, for which a tuple of variables exists in T, such that there is one derivation which labels the root of each tree in the tuple of trees with the corresponding variable in the tuple of variables.
Example: The query Q_{2} = (G,{(x_{b},x_{c})} locates the pairs of b and c nodes having as father the same node a, and only a ancestors. The leftmost b and c nodes of our input form a match, as they are labeled with x_{b} and x_{c}, respectively, by the first derivation of the input (see Figure 2). Similarly, the rightmost b and c form a match, as it can be seen in the second derivation.
Various classes of automata have been introduced for the implementation of forest grammars and querying.
Classically, the bottomup finitestate forest automata have been used for accepting regular forest languages. The processing model of these automata is a bottomup forest traversal as illustrated in Figure 3. A bottomup forest automaton has two types of states, tree states and forest states, and two types of transitions, Side and Up. In order to assign a state to a tree t = <a>t_{1}...t_{n}</a>, a tree state p_{i} is first assigned to each t_{i}. Then an initial forest state q_{1} is chosen from the set of all possible initial forest states. By traversing the word p_{1}...p_{n} from left to right and performing a sidetransition at each step, a forest state q_{n+1} is obtained from q_{1}. An uptransition for q_{n+1} and a yields a tree state p for t.
The processing model of a bottomup forest automaton
Given a forest grammar G, a nondeterministic bottomup automaton can be constructed which accepts the forest language defined by the grammar. Note that the implementation of a forest automaton however, requires that the automaton is made deterministic. Indeed, for every nondeterministic bottomup forest automaton, an equivalent deterministic automaton can be obtained by a subset construction. This construction, however, may lead to an exponential blowup of the number of states of the automaton.
A class of forest automata equally expressive as the bottomup forest automata, but much more concise and efficient to implement are the pushdown forest automata as introduced by Neumann and Seidl [NS98]. These automata combine features of bottomup and topdown automata. Besides Side and Up transitions as in the case of bottomup automata, pushdown automata also have a transition relation Down. The processing model of a pushdown forest automaton is depicted in Figure 4. A pushdown automaton visits the nodes of a document in the same order in which they are encountered by an XML parser. When arriving at some node a with a forest state q, the Down transition is used to select a sensible initial forest state q_{1} for the traversal of the children of the node. q_{1} is chosen only if it can lead to a successful Up transition followed by a successful Side transition from the node a.
The processing model of a pushdown forest automaton
A lefttoright pushdown forest automaton A = (P,Q,I,F,Down,Up,Side) consists of a set of tree states P, a set of forest states Q, a set of start states I ⊆ Q, a set of final states F ⊆ Q, an uprelation Up ⊂ Q×Σ×P, a siderelation Side ⊂ Q×P×Q and a downrelation Down ⊂ Q×Σ×Q.
Based on Down, Up and Side, the behavior of A is described by the relations δ_{F} ⊂ Q×F_{Σ}×Q and δ_{T} ⊂ Q×T_{Σ}×P as follows:
[Neu00] shows that deterministic and nondeterministic pushdown forest automata are equally expressive. Given a grammar G, with the notations from the previous sections, a deterministic pushdown forest automaton A_{G}, accepting the set of forests specified by G can be constructed as follows: Let G = (X,r_{0},R) and {r_{1},...,r_{l}} be the set of regular expressions different from r_{0} occurring on the righthand sides of rules in R. For each j ∈ {0,..,l}, let A_{j} = (Y_{j},y_{0,j},F_{j},δ_{j}) be the nondeterministic finite automaton (NFA) accepting the language of r_{j}, as obtained by a BerrySethi construction [BS86]. Y_{j} is the set of NFA states, y_{0,j} ∈ Y_{j} the initial state, F_{j} ⊆ Y_{j} the set of final states, and δ_{j} ⊆ Y_{j}×Σ×Y_{j} the transition relation of A_{j}. Let Y_{i}∩Y_{j} = {} for i≠j. Let Y = Y_{0}∪...∪Y_{l} and δ=δ_{0}∪...∪δ_{l}. Then A_{G} = (2^{X},2^{Y},{q_{0}},F,Down,Up,Side) where:
Example: Figure 5 depicts the finite automata obtained by BerrySethi constructions for the regular expressions occurring in G. The start state of each automaton is marked with a thick point. The final states are depicted in grey. Note that the BerrySethi construction always yields an automaton in which for every state y, all the incoming transitions are labeled with the same symbol x.
The finite automata for the regular expressions in grammar G
The run of the A_{G} automaton on our input document is depicted in Figure 6.
The forest automata presented in the previous section can be used to test the membership of a forest to a language. Testing membership to a forest language essentially means testing the conformance to a schema or, in other words, verifying the structure of trees. For querying, however, we need to be able to locate subtrees which not only have a specified structure, but also occur in a specified context. Neumann and Seidl have successfully applied pushdown forest automata for answering unary queries [NS98][Neu00][Fxg02].
In order to explain their technique of query evaluation we need to relate the run of a pushdown automaton over a forest with the set of possible derivations of the forest. By comparing the run of the pushdown automaton in Figure 6 and the derivations in Figure 2, the following observation can be made. If a derivation annotates a node with a variable symbol x, then x is contained in the tree state p for that node. The converse however does not hold. There are nodes whose tree state p contains some x, but such that no derivation exists in which the node is annotated with x. Consider for example the node b in the middle of the input document. Its tree state contains x_{b}, but no derivation labels this node with x_{b}. A tree state assigned by A_{G} to a node is thus a superset of the set of symbols x with which any possible derivations labels the node. The reason why these two sets may not be equal is that by the time the pushdown automaton finishes visiting a node, it has not seen all the context of the node. More precisely, it has not seen the right context. In order to account for the right context, a second automaton must proceed from right to left.
A deterministic righttoleft pushdown automaton B_{G}^{←} runs on the input forest in which every node is annotated by the run of the first automaton A_{G} with the tree state p^{→} synthesized for the node and the forest state q^{→} obtained by the side transition at that node. B_{G}^{←} operates on the same sets of NFA states as A_{G}. The processing model of B_{G}^{←} is depicted in Figure 7. Note that we use the superscript ^{→} to refer to the states of the lefttoright automaton A_{G}.
In contrast to A_{G}, when descending to the children of a node, B_{G}^{←} selects final instead of initial NFA states, and follows the NFA transitions in reverse, while traversing the children from right to left. Furthermore, when executing its Down, Side and Up transitions, B_{G}^{←} only takes into account NFA states that were also reached by A_{G}. Up computes a tree state for a node from the forest state q_{1} computed for its children and, additionally as compared to A_{G}, from the forest state q by which B_{G}^{←} reaches the node.
B_{G}^{←} is defined as (P,Q,F_{0},{},Down,Up,Side), where P, Q and F_{0} are as in the definition of A_{G} and:
The righttoleft automaton constructed above assigns to each node in the input a tree state p such that x ∈ p if and only if there is a derivation which labels the node with x. Thus, accordingly to the definition of a unary query (G,X_{targets}), a node is a match if and only if ∃x ∈ p and x ∈ X_{targets}.
Example: Figure 8 illustrates the run of B_{G}^{←} on the input annotated by A_{G}. Note that the tree state assigned by B_{G}^{←} to the middle node b contains now only x_{T}, the only symbol by which this node is labeled in both of the possible derivations.
Let us now consider a binary query (G,T). For simplicity, let us first assume that T equals the Cartesian product X_{targets1}×X_{targets2}. Given a binary match (n_{1},n_{2}), we call n_{1} primary match, and n_{2} secondary match. The tree states of B_{G}^{←} can be used to find all primary and all secondary matches. Primary matches are the nodes annotated with a tree state p such that ∃x ∈ p, x ∈ X_{targets1}. Similarly, secondary matches are the nodes annotated with some p such that ∃x ∈ p and x ∈ X_{targets2}. However, finding all the primary and all the secondary matches is not enough for binary query answering. In order to report a binary match we must be able to tell whether a primary match and a secondary match are defined with respect to the same derivation.
Example: Consider the XML input and the query Q_{2} as above. By looking at the tree states of B_{G}^{←} in Figure 8, one can not tell which primary match b node and which secondary match c node belong together to a binary match. More information is needed in the annotation made by the pushdown automata.
We enhance the second pushdown forest automaton as follows:
Algorithm: Answering binary queries
function match(Q,f) { f1 := annotation_of_f_by_the_lefttoright_pushdown_automaton(Q,f) (q,M) := deltaF(f1,I,{}) return M } function deltaF(f,q,M) { case f of emptyForest => return (q,M)  forest f1 followed by tree <(a,pl,ql)>f2</(a,pl,ql)> => /*pl and ql are the annotations of the lefttoright automaton*/ (p,M) := deltaT (<(a,pl,ql)>f2</(a,pl,ql)>,q,M) foreach (y,l1,l2) in q with y in ql { foreach (x,l1',l2') in p with (y1,x,y) in delta { foreach m1 in l1 { foreach m2 in l2'{ M := M ∪ (m1,m2) } } foreach m1 in l1' { foreach m2 in l2{ M := M ∪ (m1,m2) } } } } q' := Side((a,pl,ql),(q,p)) return deltaF(f1,q',M) } function deltaT(<(a,pl,ql)>f</(a,pl,ql)>,q,M) { qn := Down((a,pl,ql),(a,q)) (q1,M) := deltaF(f,qn,M) p := Up((a,pl,ql),(q,q1)) if ∃ (x,l1,l2) ∈ p and x ∈ Xtargets1∩Xtargets2 then M := M ∪ (<a>f</a>,<a>f</a>) return (p,M) }
Example: Figure 9 shows the run of B_{G}^{2←} on our input. The nodes have been numbered in order to be able to refer them in the list of matches. Elements (x,l_{1},l_{2}) and (y,l_{1},l_{2}) are represented as and , respectively. (x,[],[]) and (y,[],[]) are represented as x and y, respectively.
The proof of the correctness of the algorithm is sketched in the Appendix section.
In order to assess the complexity of our algorithm given some input document t, let t be the size of the input, s the maximum of the number of primary and the number of secondary matches in t, and r the number of binary matches. The annotation of the input by the lefttoright automaton has the complexity O(t). The second automaton visits each node of the document once. For each node, it recomputes for each component of the state the two lists L_{1} and L_{2}. This amounts to time O(s). Overall this contributes time O(s·t). Furthermore, the second pass has to dump the binary matches. As each match pair is considered only a constant number of times, the overall cost for dumping amounts to O(r). Altogether this gives the time complexity O(r+s·t).
Note that r ≤ t^{2} and s ≤ t. The theoretical worst therefore is O(t^{2}). In practice however, it is likely that in most cases the number of primary, secondary and binary matches is unimportant as compared to the input size. Then the complexity of the algorithm is rather linear than quadratic.
In the algorithm from the last section, we assumed that the relation T of the binary query Q = (G,T) is a Cartesian product. Let us briefly sketch how the algorithm can be generalized also to arbitrary binary relations T.
In this general case we simply decompose T into the union of binary Cartesian products T_{1},...,T_{m}. Now we may successively run our algorithm for each of the T_{i} and collect all the matches. Or, more efficiently, we just perform one lefttoright pass and then execute the m righttoleft passes corresponding to the T_{i} strictly in parallel. Note that this extension increases the run time only by a constant factor.
The idea of our algorithm for binary queries can be generalized to kary queries. For that, we modify the second pass by maintaining not just lists for primary and secondary matches. Instead, we maintain separate lists for each nonempty subset A ⊆ {1,...,k} of at most k1 elements. The list for A then contains all tuples of subtrees which form a partial match corresponding to the indices in A.
As the complexity of the resulting algorithm grows exponentially in k, we have refrained from implementing this generalization.
A large class of queries can be answered even more efficiently than by the algorithm presented so far. For many queries all matches can be found already during the first lefttoright pass. In particular, this holds for queries for which the rightcontext of the matches does not have to be verified. This can be checked prior to the query evaluation by analyzing the query [Neu00]. Rightcontextignoring queries, both unary and binary, can therefore be answered in one single pass. Besides saving the second pass, this allows to search a document without constructing the document tree in memory. Moreover, the querying can be done along with the parsing of the document via an eventbased parser.
We have implemented the algorithm presented above and used it to extend the fxgrep querying tool ( http://www.informatik.unitrier.de/~aberlea/Fxgrep/) to allow for XML binary queries.
The formalism of forest grammars as presented in this paper ignores many XML specific aspects. In order to obtain a practical querying tool, various practical aspects must be considered.
XML elements can contain besides other elements also character data. In order to deal with this, each segment of character data is seen as a single node. To allow text patternmatching, the forest automata are enhanced with an external predicate for each text pattern. A special Down relation selects the text patterns to be checked for a text node, as those which might contribute to a succeeding side transition. The tree state for the text nodes are determined from the set of fulfilled predicates using a special Up relation.
The pushdown automata are efficiently implemented by computing their transitions only as they are needed. Transitions which are not required for the traversal of the input are not computed. This avoids the computation of possibly exponentially large transition tables. The number of transitions that are actually computed is at most linear in the size of the input document. In practice only few transitions need to be computed even for large XML documents.
The queries specified using forest grammars can be very expressive, though their power is not easily exploitable by nonexpert users. Therefore, our querying tool fxgrep allows to specify queries also by using a more intuitive pattern language.
The pattern language of fxgrep resembles in its syntax to XPath, but allows more precise specification of paths as well as of the context of nodes. Structural conditions for a node may be expressed by using regular expressions over the children of the node. Structural conditions are given between brackets following the node to which they refer. For example, the pattern a[b*c[d*]] is fulfilled by an a element that has one or more b children followed by a c child that itself has only d children. Contextual conditions for a node may be specified as structural conditions for nodes lying on the path from the root to that node. For example //a[#c]/b identifies b nodes which have only one right sibling c and whose father is an a node. The # here is a placeholder for the tree in which the matches are to be found. Furthermore, paths can be also specified as regular expressions. For example, (a/)+b identifies a b node, where each ancestor (at least one) is an a node. The unary example query Q_{1} from above can be specified as (a/)+a[#c]/b.
In our enhancement of fxgrep, we also extended this base pattern language to allow the specification of binary queries. In order to make this extension as simple and intuitive as possible, we just provide one extra symbol % which may be placed anywhere inside the pattern to indicate the secondary match position. Thus, the binary query Q_{2} can be expressed as (a/)+a[#%c]/b.
Consider the unary query //book[_ (author/"escu$") _]/title (_ is a wildcard matching an arbitrary forest). The query locates all the book titles whose author's names end in "escu". The binary query to simultaneously report the titles as above and their authors is: //book[_ (%author/"escu$") _]/title.
Queries on trees can be also specified as logical formulas. In [NS00] a unary query is specified as a formula in a fragment of the monadic second order logic. It is shown that an algorithm exists that evaluates the query in time linear in the input size. [Sch00] defines an extension of the first order logic in which queries of arbitrary arity can be expressed. It is shown that unary queries can be also theoretically evaluated in linear time. Binary queries however need evaluation time cubic in the input size.
In this paper we showed how regular forest grammars can be used to specify XML queries of arbitrary arities. We presented how unary queries can be efficiently implemented using pushdown forest automata. The main contribution of the paper is an efficient algorithm for evaluating binary queries. Furthermore, we indicated how binary queries can increase expressivity and efficiency of XML transformation and querying languages.
Proof of correctness of the algorithm: (sketch)
Let us consider an NFA transition (y_{1},x,y) tracked by the righttoleft pushdown automaton at one of its Side transitions at a node n. Let L_{1} be the list of primary matches associated with x and L_{2} be the list of secondaries associated with y.
According to the invariants maintained by the list constructions (see the section “Binary Queries”), for every n_{1} ∈ L_{1} there exists a derivation D_{1} such that
From the definition of derivations, (2) and (4) we can construct a derivation D_{3} which is obtained from D_{2} by replacing the derivation steps for n from D_{2} with those from D_{1}.
Because of (1), n_{1} is a primary match with respect to D_{3}. Because of (3), n_{2} is a secondary match with respect to D_{3}. Thus (n_{1},n_{2}) is a binary match with respect to D_{3}.
The document element actually constitutes a single tree which, however, may be preceded or followed by processing instructions. Therefore, in general, an XML document is a forest rather than a tree. 

The BerrySethi construction guarantees that all the incoming transitions to a state y are labeled with the same x. 
[BS86] Gerard Berry and Ravi Sethi. From Regular Expressions to Deterministic Automata. Theoretical Computer Science, 48, pages 117126, 1986
[BW98] AnneBrüggemannKlein and Derick Wood. Regular Tree Languages over NonRanked Alphabets. Unpublished draft, April 1998. Available online at http://www.oasisopen.org/cover/regTreeLanguagesps.gz.
[BW98] AnneBrüggemannKlein, Makoto Murata and Derick Wood. Regular Tree and Regular Hedge Languages over NonRanked Alphabets. Unfinished Research Report HKUST200105, HKUST Theoretical Computer Science Center, April 2001. Available online at http://www.cs.ust.hk/tcsc/RR/200105.ps.gz.
[FSW99] M.Fernandez, J.Siméon and P.Wadler. XML Query Languages: Experiences and exemplars. Available online at http://wwwdb.research.belllabs.com/user/simeon/xquery.html.
[Fxg02] Andreas Neumann, Alexandru Berlea. fxgrep 1.4.5. Source Code, 2002. Available online at http://www.informatik.unitrier.de/~aberlea/Fxgrep.
[Fxp01] Andreas Neumann. fxp 1.4.4. Source Code, 2001. Available online at http://www.informatik.unitrier.de/~aberlea/Fxp
[Fxt01] Alexandru Berlea. fxt 2.0. Online Documentation, 2001. Available online at http://www.informatik.unitrier.de/~aberlea/Fxt.
[Fxt02] Alexandru Berlea. fxt 2.0. Online Documentation, 2002. Available online at http://www.informatik.unitrier.de/~aberlea/Fxt.
[KMS00] N.Klarlund, A.Moller and M.I.Schwatzbach. DSD: A Schema Language for XML. In ACM SIGSOFT Workshop on Formal Methods in Software Practice, Portland, USA, August 2000.
[MLM01] Makoto Murata, Dongwon Lee and Murali Mani. Taxonomy of XML Schema Languages Using Formal Language Theory, August 2001. In Extreme Markup Languages 2001.
[Mur95] Makoto Murata. Forest Regular Languages and Tree Regular Languages. Unpublished manuscript, 1995.
[Neu00] Andreas Neumann. Parsing and Querying XML Documents in SML. Ph.D. Thesis, University of Trier, Trier, 2000. Available online at http://www.informatik.unitrier.de/~neumann/Papers/thesis.ps.gz.
[NS00] Frank Neven and Thomas Schwentick. Expressive and efficient pattern languages for treestructured data. In Proc. 19th Symposium on Principles of Database Systems (PODS 2000), pages 145156, 2000.
[NS98] Andreas Neumann and Helmut Seidl. Locating Matches of Tree Patterns in Forests In V.Arvind and R.Ramamujan ,editors, Foundations of Software Technology and Theoretical Computer Science, (18th FST&TCS), volume 1530 of Lecture Notes in Computer Science, pages 134145. Springer, Heidelberg, 1998
[Oas01] James Clark and Makoto Murata, editors. RelaxNG Specification, December 2001. Available online at http://www.oasisopen.org/committees/relaxng/.
[Sch00] Thomas Schwentick. On diving in trees. In Proc. 25th Symposium on Mathematical Foundations of Computer Science (MFCS 2000), pages 660669, 2000.
[W3C99a] James Clark, editor. XSL Transformations (XSLT) Version 1.0. W3C Recommendation, World Wide Web Consortium, November 1999. Available online at http://www.w3.org/TR/xslt.
[W3C99b] Michael Kay, editor. XSL Transformations (XSLT) Version 2.0. W3C Working Draft, World Wide Web Consortium, December 2001. Available online at http://www.w3.org/TR/xslt20/.
[W3C99c] James Clark and Steve DeRose, editors. XML Path Language (XPath) Version 1.0. W3C Recommendation, World Wide Web Consortium, November 1999. Available online at http://www.w3.org/TR/xpath.
[W3C99d] A.Berglund, S.Boag, D.Chamberlin, M.F.Fernandez, M.Kay, J.Robie, J.Siméon, editors. XML Path Language (XPath) Version 1.0. W3C Recommendation, World Wide Web Consortium, November 1999. Available online at http://www.w3.org/TR/xpath20/.
[W3C99e] S.Boag, D.Chamberlin, M.F.Fernandez, D.Florescu, J.Robie, J.Siméon, M.Stefanescu, editors. XQuery 1.0: An XML Query Language W3C Working Draft, World Wide Web Consortium, December 2001. Available online at http://www.w3.org/TR/xquery/.
[W3C99f] H.S.Thompson, D.Beech, M.Maloney and N.Mendelsohn, editors XML Schema Part 1: Structures. W3C Recommendation, May 2001. Available online at http://www.w3.org/TR/xmlschema1/.