Binary Queries

Alexandru Berlea
aberlea@psi.uni-trier.de
Helmut Seidl
aberlea@psi.uni-trier.de

Abstract

Querying XML documents is a basic task for many XML applications. Typically, matches of queries are located individually, i.e. a match is a single node in a queried document. k-ary queries, in contrast, simultaneously identify k locations in the document, which are connected via some specified relation. For example, in rule-based transformations (like XSLT transformations), a node to which a rule is applied and another node which is selected by the rule in the context of the first node could be identified together by a binary query defining the relation between these two nodes. Such a binary query would replace the match and the select pattern used to identify the first and the second node respectively. By eliminating the need for select patterns, binary queries would not only reduce the number of patterns in a transformation, but would also allow that all the pattern matching is done statically, i.e. before the transformation actually begins. We use the formalism of regular forest grammars to define queries of arbitrary arities. Pushdown forest automata have been used in the past to efficiently implement unary queries expressed by using forest grammars. The main contribution of this paper is an algorithm based on pushdown forest automata which evaluates binary queries. In the worst theoretical case the algorithm evaluates binary queries in time proportional to the square of the size of the input document. In practice however, the complexity is mostly rather linear in the input size. Queries for which the right context does not need to be checked can be evaluated even along with the parsing of the input document. Rather than using forest grammars, queries can be also expressed in a more intuitive pattern language using a syntax similar to XPath.

Keywords: XPath; XQuery; XSLT; Querying

Alexandru Berlea

Alexandru Berlea is a Ph.D. Student working with Helmut Seidl at the University of Trier, Germany. He has graduated in 1999 from the Computer Science and Engineering Department of the Politehnica University of Bucharest, Romania.

Helmut Seidl

Helmut Seidl is a Professor with the University of Trier, Germany.

Binary Queries

Alexandru Berlea [Research Assistant; University of Trier, Department of Computer Science]
Helmut Seidl [Professor; University of Trier, Department of Computer Science]

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 Alexandru Berlea and Helmut Seidl. Reproduced with permission.

Introduction

Searching in documents for components with specific properties, is one of the most fundamental tasks in XML document processing applications. Various pattern languages have been designed for identifying properties of document components, see e.g. [FSW99][W3C99c][Fxg02]. Besides the basic use of retrieving parts of XML documents, querying has been also used in XML transformation languages for referring to parts of XML documents manipulated by the transformation languages [W3C99a][Fxt01]. Clearly, in order to cover the wide range of possible applications, the language of XML query patterns should be both very expressive and also efficiently implementable.

Perhaps the most widely known language of query patterns is XPath [W3C99c][W3C99d]. The XPath patterns describe unary queries, i.e. they locate individual nodes in the input. The same is also true for the query language supported by fxgrep [Fxg02]. In this paper, we are going to lift this limitation. Generalizing the regular approach to querying of fxgrep, we propose a concept of recognizable k-ary queries and present techniques for the efficient implementation for the special case of binary queries. Recognizable k-ary queries simultaneously locate k-tuples of nodes in the input document which simultaneously satisfy a specific property.

In particular, the result of a binary query can be interpreted as a tabulation of relative links between parts of documents. This may be exploited in rule-based XML transformation languages. XSLT transformations for example [W3C99a][W3C99b], use unary queries, expressed through XPath patterns, for two purposes. The first purpose is to differentiate components of the input document requiring different processing, i.e. given a node, to identify the applicable rule for it. Patterns used for this purpose are called match patterns. Secondly, within a rule, patterns are used for selecting nodes relative to the node to which the rule applies for further processing. Patterns used for this purpose are called select patterns. The match patterns can be queried once before the actual transformation. In contrast, select patterns must be repeatedly queried during the transformation phase. Note also that the class of select patterns is stronger (and more difficult to implement) as these allow for arbitrary navigation in the document.

Our key application for binary queries is to eliminate the dynamic select queries by combining the unary match pattern of a rule and a select pattern within this rule into a (statically tabulated) binary query.

Example: Consider the following XML input document:

<company>
  <url>spice.girls</url>
  <empl><name>Mel A.</name></empl>
  <empl><name>Mel B.</name></empl>
  <empl><name>Mel C.</name></empl>
</company>
     

The following XSLT rule produces a homepage element for each employee:

<xsl:template match="company[url]/empl">
  <homepage>
    <body>Under construction. 
       See the company's page:<link><xsl:copy-of select="../url"/></link>
    </body>
  </homepage>
</xsl:template>
     

A binary match could simultaneously locate an employee and the url of her company. Let us suppose that binary queries were possible in XPath. Let the second element of a binary match be specified by preceding the corresponding node in the pattern with the % symbol, and referred within the rule by using the same symbol. The rule above could then be expressed as follows:

<xsl:template match="company[%url]/empl">
  <homepage>
    <body>Under construction. 
       See the company's page:<link>%</link>
    </body>
  </homepage>
</xsl:template>
      

In the reminder of this paper we show that an expressive class of binary queries can be evaluated efficiently. Many select queries relative to the dynamic current context can be collected into one binary query whose evaluation can be performed statically, i.e., preceding the transformation of the input document.

XSLT keys contribute one special case of binary matches. Basically, a key is a pair consisting of the node which has the key and the value of the key (a string). The node is identified using a match pattern, while the value is given by a select pattern evaluated in the context of the node. Thus, the binary matching algorithm presented here can also be used to implement the XSLT keys.

The latest drafts of XPath [W3C99d] and XQuery [W3C99e] provide a for and a FLWR expression, respectively, which allow variables to be bound to nodes which are matches of unary queries. These nodes can be used in the scope of the expressions as context for the evaluation of further unary queries. This use of for expressions also qualifies for an implementation which uses binary queries to subsume the two unary queries.

Preliminaries

XML documents are textual representations of sequences of trees which we call forests1. As usual, the trees t and forests f over an alphabet Σ are defined as:

  • t ::= <a>f</a>, a ∈ Σ
  • f ::= t1...tn, n ≥ 0

The definition naturally captures XML elements without attributes and having as sub-trees only element nodes. The definition captures, though, also realistic XML elements. Text nodes can be represented as special elements containing an empty sub-element whose name is the respective text. Processing instructions can be represented as special elements containing two text nodes, one for the processor part and one for the instruction part. Finally, attributes can be collected under a special element as pairs of text nodes containing the name and the value for each attribute.

As in XML, we abbreviate <a></a> to <a/>. Given a tree t = <a>f</a>, we say that a is the label of t or that t is labeled by a. We write this as label(t) = a.

An important task in document processing consists in verifying a structural property of a document tree. For XML, various schema languages have been proposed for this purpose. Besides the document-type declaration of a document, people have proposed, e.g., XML-Schema [W3C99f], DSD [KMS00], RelaxNG [Oas01]. In essence, all these formalisms specify regular forest languages. For a discussion on this see [MLM01]. Regular forest languages constitute a very expressive and theoretically robust formalism for specifying properties of forests [BW98][Mur95][BW98]. Validating against one of these XML schema languages is therefore a test for membership in a forest regular language. Forest regular languages can be specified using forest grammars.

Definition: A forest grammar over Σ is a tuple G = (X,r0,R), where X is a set of variables, r0, a regular expression over variables, is the start expression, and R is a finite set of rules of the form x → <a>r</a> with x ∈ X, a ∈ Σ and r a regular expression over variables.

Example: Consider the grammar G = ({xT,x1,xa,xb,xc},(x1|xa),R) over {a,b,c} with the following rules:

  1. xT<a>xT*</a>
  2. xT<b>xT*</b>
  3. xT<c>xT*</c>
  4. x1<a>xT*(x1|xa)xT*</a>
  5. xa<a>xbxc</a>
  6. xb<b>xT*</b>
  7. xc<c>xT*</c>
This grammar describes documents in which there is a path from the root to an a node whose children are a b and a c node, and all the nodes on this path are labeled a.

The first three rules make xT account for trees with arbitrary content. As specified by rule number 5, xa stands for the a element with a b and a c child. Rules 6 and 7 say that these children can have arbitrary content. Finally, rule 4 specifies that the a node specified by 5 can be at arbitrary depth in the input, and all its ancestors must be labeled a.

A grammar is the formal specification of the set of forests which can be derived from a variable string conforming to r0, by using rules from R in a number of steps. In each step, a variable symbol x for which a rule x<a>r</a> exists is replaced by <a>w</a>, where w is a variable string conforming to r.

In the examples throughout the rest of the paper we will use the grammar G from above and the following input document:

<a>
  <a>
    <b/>
    <c/>
  </a>
  <a>
    <b/>
  </a>
  <a>
    <b/>
    <c/>
  </a>
</a>
The tree-like representation of the input document is depicted in Figure 1

Figure 1
[Link to open this graphic in a separate page]

The tree representation of the input document

The possible derivations of the input under G are presented in Figure 2. Note that a derivation annotates each node of the input with a variable x from X.

Figure 2
[Link to open this graphic in a separate page]

Derivations of the input document

Queries

Unary queries

Patterns for unary queries, conceptually, consist of two parts. The contextual part constrains the surrounding context of the sub-trees of interest, whereas the structural part describes properties of the sub-trees themselves. In [Neu00], forest grammars have been used to specify both the contextual and the structural part of unary queries. The idea is that a variable x of a grammar is used to characterize the set of trees which can be generated from x. Thus, given a grammar G, we may specify the structural part of a pattern simply through a variable x. In order to capture the context, we use the grammar itself which (given a variable for the whole document) constrains the places where the variable x may occur in derivations.

Thus, a unary query is a pair Q = (G,Xtargets) consisting of a forest grammar G = (X,r0,R) and a set of target variables XtargetsX. A sub-tree t of an an input forest f is a match for Q if and only if there is a derivation of f with respect to G in which the root of t is labeled with a symbol from Xtargets.

Example: The query Q1 = (G,{xa}) locates the a nodes having only a ancestors and as children, a b followed by a c. In our input document, these are the leftmost and the rightmost children of the root element. One can see that both are indeed labeled with xa by the first and the second derivation of the input document, respectively.

In [NS98], a new class of automata, pushdown forest automata, has been introduced which allows the efficient implementation of such unary queries [Neu00][NS98] (see below).

k-ary Queries

In this paper we extend the notion of unary queries as presented in the last sub-section to k-ary queries for k ≥ 1. The idea is very simple. Instead of using a set of target variables, we now have a set of k-tuples of variables. Thus, a k-ary query is a tuple Q = (G,T) consisting of a forest grammar G = (X,r0,R) and T a set of k-tuples of variables (xtarget1,...,xtargetk)Xk.

Given a query Q and an input forest f, the matches of the query Q are the k-tuples of trees from f, for which a tuple of variables exists in T, such that there is one derivation which labels the root of each tree in the tuple of trees with the corresponding variable in the tuple of variables.

Example: The query Q2 = (G,{(xb,xc)} locates the pairs of b and c nodes having as father the same node a, and only a ancestors. The leftmost b and c nodes of our input form a match, as they are labeled with xb and xc, respectively, by the first derivation of the input (see Figure 2). Similarly, the rightmost b and c form a match, as it can be seen in the second derivation.

Accepting Regular Forest Languages - Forest automata

Various classes of automata have been introduced for the implementation of forest grammars and querying.

Bottom-up forest automata

Classically, the bottom-up finite-state forest automata have been used for accepting regular forest languages. The processing model of these automata is a bottom-up forest traversal as illustrated in Figure 3. A bottom-up forest automaton has two types of states, tree states and forest states, and two types of transitions, Side and Up. In order to assign a state to a tree t = <a>t1...tn</a>, a tree state pi is first assigned to each ti. Then an initial forest state q1 is chosen from the set of all possible initial forest states. By traversing the word p1...pn from left to right and performing a side-transition at each step, a forest state qn+1 is obtained from q1. An up-transition for qn+1 and a yields a tree state p for t.

Figure 3
[Link to open this graphic in a separate page]

The processing model of a bottom-up forest automaton

Given a forest grammar G, a non-deterministic bottom-up automaton can be constructed which accepts the forest language defined by the grammar. Note that the implementation of a forest automaton however, requires that the automaton is made deterministic. Indeed, for every non-deterministic bottom-up forest automaton, an equivalent deterministic automaton can be obtained by a subset construction. This construction, however, may lead to an exponential blowup of the number of states of the automaton.

Pushdown Forest Automata

A class of forest automata equally expressive as the bottom-up forest automata, but much more concise and efficient to implement are the pushdown forest automata as introduced by Neumann and Seidl [NS98]. These automata combine features of bottom-up and top-down automata. Besides Side and Up transitions as in the case of bottom-up automata, pushdown automata also have a transition relation Down. The processing model of a pushdown forest automaton is depicted in Figure 4. A pushdown automaton visits the nodes of a document in the same order in which they are encountered by an XML parser. When arriving at some node a with a forest state q, the Down transition is used to select a sensible initial forest state q1 for the traversal of the children of the node. q1 is chosen only if it can lead to a successful Up transition followed by a successful Side transition from the node a.

Figure 4
[Link to open this graphic in a separate page]

The processing model of a pushdown forest automaton

A left-to-right pushdown forest automaton A = (P,Q,I,F,Down,Up,Side) consists of a set of tree states P, a set of forest states Q, a set of start states IQ, a set of final states FQ, an up-relation UpQ×Σ×P, a side-relation SideQ×P×Q and a down-relation DownQ×Σ×Q.

Based on Down, Up and Side, the behavior of A is described by the relations δFQ×FΣ×Q and δTQ×TΣ×P as follows:

  • (q,ε,q) ∈ δF for all qQ
  • (q1,ft,q2) ∈ δF(q1,f,q) ∈ δF, (q,t,p) ∈ δT and (q,p,q2)Side for some qQ, pP
  • (q,a<f>,p) ∈ δT ⇔ (q,a,q1) ∈ Down, (q1,f,q2) ∈ δF and (q2,a,p) ∈ Up for some q1,q2 ∈ Q
The language accepted by an automaton A is LA = {f | (q1,f,q2) ∈ δF for some q1I, q2F}.

[Neu00] shows that deterministic and non-deterministic pushdown forest automata are equally expressive. Given a grammar G, with the notations from the previous sections, a deterministic pushdown forest automaton AG, accepting the set of forests specified by G can be constructed as follows: Let G = (X,r0,R) and {r1,...,rl} be the set of regular expressions different from r0 occurring on the right-hand sides of rules in R. For each j{0,..,l}, let Aj = (Yj,y0,j,Fj,δj) be the non-deterministic finite automaton (NFA) accepting the language of rj, as obtained by a Berry-Sethi construction [BS86]. Yj is the set of NFA states, y0,jYj the initial state, FjYj the set of final states, and δjYj×Σ×Yj the transition relation of Aj. Let YiYj = {} for ij. Let Y = Y0∪...∪Yl and δ=δ0∪...∪δl. Then AG = (2X,2Y,{q0},F,Down,Up,Side) where:

  • q0 = {y0,0}
  • F = {q | q∩F0{}}
  • Down(a,q) = {y0,j | yq, (y,x,y1) ∈ δ, x → <a>rj</a> for some x, y1}
  • Up(a,q) = {x | x → <a>rj</a> and q∩Fj{}}
  • Side(q,p) = {y1 | yq, xp and (y,x,y1) ∈ δ}.

Example: Figure 5 depicts the finite automata obtained by Berry-Sethi constructions for the regular expressions occurring in G. The start state of each automaton is marked with a thick point. The final states are depicted in grey. Note that the Berry-Sethi construction always yields an automaton in which for every state y, all the incoming transitions are labeled with the same symbol x.

Figure 5
[Link to open this graphic in a separate page]

The finite automata for the regular expressions in grammar G

The run of the AG automaton on our input document is depicted in Figure 6.

Figure 6
[Link to open this graphic in a separate page]

The run of AG on the input document

Querying XML Using Pushdown Forest Automata

The forest automata presented in the previous section can be used to test the membership of a forest to a language. Testing membership to a forest language essentially means testing the conformance to a schema or, in other words, verifying the structure of trees. For querying, however, we need to be able to locate sub-trees which not only have a specified structure, but also occur in a specified context. Neumann and Seidl have successfully applied pushdown forest automata for answering unary queries [NS98][Neu00][Fxg02].

In order to explain their technique of query evaluation we need to relate the run of a pushdown automaton over a forest with the set of possible derivations of the forest. By comparing the run of the pushdown automaton in Figure 6 and the derivations in Figure 2, the following observation can be made. If a derivation annotates a node with a variable symbol x, then x is contained in the tree state p for that node. The converse however does not hold. There are nodes whose tree state p contains some x, but such that no derivation exists in which the node is annotated with x. Consider for example the node b in the middle of the input document. Its tree state contains xb, but no derivation labels this node with xb. A tree state assigned by AG to a node is thus a superset of the set of symbols x with which any possible derivations labels the node. The reason why these two sets may not be equal is that by the time the pushdown automaton finishes visiting a node, it has not seen all the context of the node. More precisely, it has not seen the right context. In order to account for the right context, a second automaton must proceed from right to left.

Right-to-left Pushdown Automata

A deterministic right-to-left pushdown automaton BG runs on the input forest in which every node is annotated by the run of the first automaton AG with the tree state p synthesized for the node and the forest state q obtained by the side transition at that node. BG operates on the same sets of NFA states as AG. The processing model of BG is depicted in Figure 7. Note that we use the superscript to refer to the states of the left-to-right automaton AG.

Figure 7
[Link to open this graphic in a separate page]

The processing model of BG

In contrast to AG, when descending to the children of a node, BG selects final instead of initial NFA states, and follows the NFA transitions in reverse, while traversing the children from right to left. Furthermore, when executing its Down, Side and Up transitions, BG only takes into account NFA states that were also reached by AG. Up computes a tree state for a node from the forest state q1 computed for its children and, additionally as compared to AG, from the forest state q by which BG reaches the node.

BG is defined as (P,Q,F0,{},Down,Up,Side), where P, Q and F0 are as in the definition of AG and:

  • Down((a,p,q),q) = {y2 | yqq, (y1,x,y) ∈ δ, x → <a>rj</a> and y2Fj}
  • Up((a,p,q),(q,q1)) = {x | x → <a>rj</a>, yq1, y=y0,j, y1qq, (y2,x,y1) ∈ δ}.
  • Side((a,p,q),(q,p)) = {y1 | (y1,x,y) ∈ δ, y ∈ q∩q, xp}

The right-to-left automaton constructed above assigns to each node in the input a tree state p such that xp if and only if there is a derivation which labels the node with x. Thus, accordingly to the definition of a unary query (G,Xtargets), a node is a match if and only if ∃xp and xXtargets.

Example: Figure 8 illustrates the run of BG on the input annotated by AG. Note that the tree state assigned by BG to the middle node b contains now only xT, the only symbol by which this node is labeled in both of the possible derivations.

Figure 8
[Link to open this graphic in a separate page]

The run of BG on the input document

Binary Queries

Let us now consider a binary query (G,T). For simplicity, let us first assume that T equals the Cartesian product Xtargets1×Xtargets2. Given a binary match (n1,n2), we call n1 primary match, and n2 secondary match. The tree states of BG can be used to find all primary and all secondary matches. Primary matches are the nodes annotated with a tree state p such that ∃xp, xXtargets1. Similarly, secondary matches are the nodes annotated with some p such that ∃xp and xXtargets2. However, finding all the primary and all the secondary matches is not enough for binary query answering. In order to report a binary match we must be able to tell whether a primary match and a secondary match are defined with respect to the same derivation.

Example: Consider the XML input and the query Q2 as above. By looking at the tree states of BG in Figure 8, one can not tell which primary match b node and which secondary match c node belong together to a binary match. More information is needed in the annotation made by the pushdown automata.

We enhance the second pushdown forest automaton as follows:

  1. To each element y of a forest state q, we attach two lists L1 and L2 of primary and secondary matches, respectively. Whenever we reach a node in a forest state q and (y,L1,L2)q, then L1 contains all primary matches seen so far which may occur in a derivation which annotates the present node with x, where x is the label of the incoming NFA transitions to the y state2. Analogously for L2.
  2. Also, to each element x of a tree state p, we attach two lists L1 and L2 of primary and secondary matches, respectively. Whenever we reach a node in a tree state p and (x,L1,L2)p, then L1 contains all primary matches seen below (or at) the node, which may occur in a derivation which annotates the present node with x. Analogously for L2.
The new pushdown automaton BG2 over Σ×P×Q is defined as as follows: Bg2 = (P,Q,I,{},Down,Up,Side), where, with the same notations as in the definition of the ordinary right-to-left automaton Bg:
  • Initially, all lists are empty:
    I = {(y0,[],[]) | y0F0}
  • When going down, we initialize the lists for matches below as empty lists:
    Down((a,p,q),q) = {(y2,[],[]) | (y,L1,L2)q, y ∈ q, (y1,x,y) ∈ δ, x → <a>rj</a>, and y2 ∈ Fj}
  • When going up, we propagate the found matches possibly adding a match of the current node n
    Upn((a,p,q),(q,q1)) = {(x,L1,L2) |
    1. x → <a>rj</a>, (y,l1,l2)q1, y = y0,j, (y1,l1′,l2′)q, y1q, (y2,x,y1) ∈ δ
    2. if xXtargets1 then L1 = l1+[n] else L1 = l1; if xXtargets2 then L2 = l2+[n] else L2 = l2 }
  • At side transitions, we combine the lists from the subtree below with the matches from the already visited part to the right:
    Side((a,p,q),(q,p)) = {(y1,L1,L2) |
    1. (y1,x,y) ∈ δ, yq, (y,l1,l2)q, (x,l1′,l2′)p
    2. L1 = l1+l1; L2 = l2+l2 }
Given the construction above, matches are detected at side transitions of the automaton. For every NFA transition (y1,x,y) considered at a side transition, every primary match associated with x (from list l1) is paired with every match associated with y (from list l2). Every such pair is a binary match. We proceed similarly for secondaries at x (from list l2) and primaries at y (from list l1). Binary matches of the form (n,n) are detected at Up transitions for xXtargets1Xtargets2.

Algorithm: Answering binary queries

  • Input: binary query Q = (G,Xtargets1×Xtargets2), input forest f
  • Output: set of binary matches M defined by the binary query in f

function match(Q,f) {
  f1 := annotation_of_f_by_the_left-to-right_pushdown_automaton(Q,f)
  (q,M) := deltaF(f1,I,{})
  return M
}

function deltaF(f,q,M) {
  case f of
     emptyForest => return (q,M)

   | forest f1 followed by tree <(a,pl,ql)>f2</(a,pl,ql)> =>
      /*pl and ql are the annotations of the left-to-right automaton*/
       (p,M) := deltaT (<(a,pl,ql)>f2</(a,pl,ql)>,q,M)
       foreach (y,l1,l2) in q with y in ql {
         foreach (x,l1',l2') in p with (y1,x,y) in delta {
           foreach m1 in l1 {
             foreach m2 in l2'{
               M := M ∪ (m1,m2)
             }
           }
           foreach m1 in l1' {
             foreach m2 in l2{
               M := M ∪ (m1,m2)
             }
           }
         }
       }
       q' := Side((a,pl,ql),(q,p))
       return deltaF(f1,q',M)
}

function deltaT(<(a,pl,ql)>f</(a,pl,ql)>,q,M) {
  qn := Down((a,pl,ql),(a,q))
  (q1,M) := deltaF(f,qn,M)
  p := Up((a,pl,ql),(q,q1))
  if ∃ (x,l1,l2) ∈ p and x ∈ Xtargets1∩Xtargets2 then
    M := M ∪ (<a>f</a>,<a>f</a>)
  return (p,M)
}
     

Example: Figure 9 shows the run of BG2← on our input. The nodes have been numbered in order to be able to refer them in the list of matches. Elements (x,l1,l2) and (y,l1,l2) are represented as and , respectively. (x,[],[]) and (y,[],[]) are represented as x and y, respectively.

Figure 9
[Link to open this graphic in a separate page]

The run of BG2← on the input document

The proof of the correctness of the algorithm is sketched in the Appendix section.

In order to assess the complexity of our algorithm given some input document t, let |t| be the size of the input, s the maximum of the number of primary and the number of secondary matches in t, and r the number of binary matches. The annotation of the input by the left-to-right automaton has the complexity O(|t|). The second automaton visits each node of the document once. For each node, it recomputes for each component of the state the two lists L1 and L2. This amounts to time O(s). Overall this contributes time O(s·|t|). Furthermore, the second pass has to dump the binary matches. As each match pair is considered only a constant number of times, the overall cost for dumping amounts to O(r). Altogether this gives the time complexity O(r+s·|t|).

Note that r|t|2 and s|t|. The theoretical worst therefore is O(|t|2). In practice however, it is likely that in most cases the number of primary, secondary and binary matches is unimportant as compared to the input size. Then the complexity of the algorithm is rather linear than quadratic.

Extensions

General Binary Queries

In the algorithm from the last section, we assumed that the relation T of the binary query Q = (G,T) is a Cartesian product. Let us briefly sketch how the algorithm can be generalized also to arbitrary binary relations T.

In this general case we simply decompose T into the union of binary Cartesian products T1,...,Tm. Now we may successively run our algorithm for each of the Ti and collect all the matches. Or, more efficiently, we just perform one left-to-right pass and then execute the m right-to-left passes corresponding to the Ti strictly in parallel. Note that this extension increases the run time only by a constant factor.

Extension to Larger k-s

The idea of our algorithm for binary queries can be generalized to k-ary queries. For that, we modify the second pass by maintaining not just lists for primary and secondary matches. Instead, we maintain separate lists for each non-empty subset A{1,...,k} of at most k-1 elements. The list for A then contains all tuples of subtrees which form a partial match corresponding to the indices in A.

As the complexity of the resulting algorithm grows exponentially in k, we have refrained from implementing this generalization.

One-pass Matching

A large class of queries can be answered even more efficiently than by the algorithm presented so far. For many queries all matches can be found already during the first left-to-right pass. In particular, this holds for queries for which the right-context of the matches does not have to be verified. This can be checked prior to the query evaluation by analyzing the query [Neu00]. Right-context-ignoring queries, both unary and binary, can therefore be answered in one single pass. Besides saving the second pass, this allows to search a document without constructing the document tree in memory. Moreover, the querying can be done along with the parsing of the document via an event-based parser.

Practical Aspects

We have implemented the algorithm presented above and used it to extend the fxgrep querying tool ( http://www.informatik.uni-trier.de/~aberlea/Fxgrep/) to allow for XML binary queries.

The formalism of forest grammars as presented in this paper ignores many XML specific aspects. In order to obtain a practical querying tool, various practical aspects must be considered.

Text Nodes

XML elements can contain besides other elements also character data. In order to deal with this, each segment of character data is seen as a single node. To allow text pattern-matching, the forest automata are enhanced with an external predicate for each text pattern. A special Down relation selects the text patterns to be checked for a text node, as those which might contribute to a succeeding side transition. The tree state for the text nodes are determined from the set of fulfilled predicates using a special Up relation.

Avoiding State Explosion

The pushdown automata are efficiently implemented by computing their transitions only as they are needed. Transitions which are not required for the traversal of the input are not computed. This avoids the computation of possibly exponentially large transition tables. The number of transitions that are actually computed is at most linear in the size of the input document. In practice only few transitions need to be computed even for large XML documents.

The Pattern Language

The queries specified using forest grammars can be very expressive, though their power is not easily exploitable by non-expert users. Therefore, our querying tool fxgrep allows to specify queries also by using a more intuitive pattern language.

The pattern language of fxgrep resembles in its syntax to XPath, but allows more precise specification of paths as well as of the context of nodes. Structural conditions for a node may be expressed by using regular expressions over the children of the node. Structural conditions are given between brackets following the node to which they refer. For example, the pattern a[b*c[d*]] is fulfilled by an a element that has one or more b children followed by a c child that itself has only d children. Contextual conditions for a node may be specified as structural conditions for nodes lying on the path from the root to that node. For example //a[#c]/b identifies b nodes which have only one right sibling c and whose father is an a node. The # here is a placeholder for the tree in which the matches are to be found. Furthermore, paths can be also specified as regular expressions. For example, (a/)+b identifies a b node, where each ancestor (at least one) is an a node. The unary example query Q1 from above can be specified as (a/)+a[#c]/b.

In our enhancement of fxgrep, we also extended this base pattern language to allow the specification of binary queries. In order to make this extension as simple and intuitive as possible, we just provide one extra symbol % which may be placed anywhere inside the pattern to indicate the secondary match position. Thus, the binary query Q2 can be expressed as (a/)+a[#%c]/b.

Consider the unary query //book[_ (author/"escu$") _]/title (_ is a wild-card matching an arbitrary forest). The query locates all the book titles whose author's names end in "escu". The binary query to simultaneously report the titles as above and their authors is: //book[_ (%author/"escu$") _]/title.

Final Remarks

Queries on trees can be also specified as logical formulas. In [NS00] a unary query is specified as a formula in a fragment of the monadic second order logic. It is shown that an algorithm exists that evaluates the query in time linear in the input size. [Sch00] defines an extension of the first order logic in which queries of arbitrary arity can be expressed. It is shown that unary queries can be also theoretically evaluated in linear time. Binary queries however need evaluation time cubic in the input size.

In this paper we showed how regular forest grammars can be used to specify XML queries of arbitrary arities. We presented how unary queries can be efficiently implemented using pushdown forest automata. The main contribution of the paper is an efficient algorithm for evaluating binary queries. Furthermore, we indicated how binary queries can increase expressivity and efficiency of XML transformation and querying languages.

Appendix

Proof of correctness of the algorithm: (sketch)

Let us consider an NFA transition (y1,x,y) tracked by the right-to-left pushdown automaton at one of its Side transitions at a node n. Let L1 be the list of primary matches associated with x and L2 be the list of secondaries associated with y.

According to the invariants maintained by the list constructions (see the section “Binary Queries”), for every n1L1 there exists a derivation D1 such that

  • (1): n1 is a primary match with respect to D1
  • (2): D1 labels n with x
Similarly, for every n2L2 there exists a derivation D2 such that
  • (3): n2 is a secondary match with respect to D2
  • (4): D2 labels n with x

From the definition of derivations, (2) and (4) we can construct a derivation D3 which is obtained from D2 by replacing the derivation steps for n from D2 with those from D1.

Because of (1), n1 is a primary match with respect to D3. Because of (3), n2 is a secondary match with respect to D3. Thus (n1,n2) is a binary match with respect to D3.

Notes

1.

The document element actually constitutes a single tree which, however, may be preceded or followed by processing instructions. Therefore, in general, an XML document is a forest rather than a tree.

2.

The Berry-Sethi construction guarantees that all the incoming transitions to a state y are labeled with the same x.


Bibliography

[BS86] Gerard Berry and Ravi Sethi. From Regular Expressions to Deterministic Automata. Theoretical Computer Science, 48, pages 117-126, 1986

[BW98] Anne-Brüggemann-Klein and Derick Wood. Regular Tree Languages over Non-Ranked Alphabets. Unpublished draft, April 1998. Available online at http://www.oasis-open.org/cover/regTreeLanguages-ps.gz.

[BW98] Anne-Brüggemann-Klein, Makoto Murata and Derick Wood. Regular Tree and Regular Hedge Languages over Non-Ranked Alphabets. Unfinished Research Report HKUST-2001-05, HKUST Theoretical Computer Science Center, April 2001. Available online at http://www.cs.ust.hk/tcsc/RR/2001-05.ps.gz.

[FSW99] M.Fernandez, J.Siméon and P.Wadler. XML Query Languages: Experiences and exemplars. Available online at http://www-db.research.bell-labs.com/user/simeon/xquery.html.

[Fxg02] Andreas Neumann, Alexandru Berlea. fxgrep 1.4.5. Source Code, 2002. Available online at http://www.informatik.uni-trier.de/~aberlea/Fxgrep.

[Fxp01] Andreas Neumann. fxp 1.4.4. Source Code, 2001. Available online at http://www.informatik.uni-trier.de/~aberlea/Fxp

[Fxt01] Alexandru Berlea. fxt 2.0. Online Documentation, 2001. Available online at http://www.informatik.uni-trier.de/~aberlea/Fxt.

[Fxt02] Alexandru Berlea. fxt 2.0. Online Documentation, 2002. Available online at http://www.informatik.uni-trier.de/~aberlea/Fxt.

[KMS00] N.Klarlund, A.Moller and M.I.Schwatzbach. DSD: A Schema Language for XML. In ACM SIGSOFT Workshop on Formal Methods in Software Practice, Portland, USA, August 2000.

[MLM01] Makoto Murata, Dongwon Lee and Murali Mani. Taxonomy of XML Schema Languages Using Formal Language Theory, August 2001. In Extreme Markup Languages 2001.

[Mur95] Makoto Murata. Forest Regular Languages and Tree Regular Languages. Unpublished manuscript, 1995.

[Neu00] Andreas Neumann. Parsing and Querying XML Documents in SML. Ph.D. Thesis, University of Trier, Trier, 2000. Available online at http://www.informatik.uni-trier.de/~neumann/Papers/thesis.ps.gz.

[NS00] Frank Neven and Thomas Schwentick. Expressive and efficient pattern languages for tree-structured data. In Proc. 19th Symposium on Principles of Database Systems (PODS 2000), pages 145-156, 2000.

[NS98] Andreas Neumann and Helmut Seidl. Locating Matches of Tree Patterns in Forests In V.Arvind and R.Ramamujan ,editors, Foundations of Software Technology and Theoretical Computer Science, (18th FST&TCS), volume 1530 of Lecture Notes in Computer Science, pages 134-145. Springer, Heidelberg, 1998

[Oas01] James Clark and Makoto Murata, editors. RelaxNG Specification, December 2001. Available online at http://www.oasis-open.org/committees/relax-ng/.

[Sch00] Thomas Schwentick. On diving in trees. In Proc. 25th Symposium on Mathematical Foundations of Computer Science (MFCS 2000), pages 660-669, 2000.

[W3C99a] James Clark, editor. XSL Transformations (XSLT) Version 1.0. W3C Recommendation, World Wide Web Consortium, November 1999. Available online at http://www.w3.org/TR/xslt.

[W3C99b] Michael Kay, editor. XSL Transformations (XSLT) Version 2.0. W3C Working Draft, World Wide Web Consortium, December 2001. Available online at http://www.w3.org/TR/xslt20/.

[W3C99c] James Clark and Steve DeRose, editors. XML Path Language (XPath) Version 1.0. W3C Recommendation, World Wide Web Consortium, November 1999. Available online at http://www.w3.org/TR/xpath.

[W3C99d] A.Berglund, S.Boag, D.Chamberlin, M.F.Fernandez, M.Kay, J.Robie, J.Siméon, editors. XML Path Language (XPath) Version 1.0. W3C Recommendation, World Wide Web Consortium, November 1999. Available online at http://www.w3.org/TR/xpath20/.

[W3C99e] S.Boag, D.Chamberlin, M.F.Fernandez, D.Florescu, J.Robie, J.Siméon, M.Stefanescu, editors. XQuery 1.0: An XML Query Language W3C Working Draft, World Wide Web Consortium, December 2001. Available online at http://www.w3.org/TR/xquery/.

[W3C99f] H.S.Thompson, D.Beech, M.Maloney and N.Mendelsohn, editors XML Schema Part 1: Structures. W3C Recommendation, May 2001. Available online at http://www.w3.org/TR/xmlschema-1/.



Binary Queries

Alexandru Berlea [Research Assistant, University of Trier, Department of Computer Science]
aberlea@psi.uni-trier.de
Helmut Seidl [Professor, University of Trier, Department of Computer Science]
aberlea@psi.uni-trier.de