Containment of XPath expressions: an inference and rewriting based approach

Jean-Yves Vion-Dury
Jean-Yves.Vion-Dury@xrce.xerox.com
Nabil Layaïda
Nabil.Layaida@inria.fr

Abstract

XPath is a simple query language for XML documents that allows navigating XML trees and returning sets of matching nodes. It is used in XML Schema to define keys, and in XLink and XPointer to reference portions of documents. XPath is a fundamental part of the XSLT and XQuery languages, as it allows the definition of matching expressions for patterns and provides node selectors to filter elements in transformations.

We propose to study the containment and equivalence of XPath expressions using an inference system combined with a rewriting system. The inference system allows asserting and proving properties on a class of expressions. In order to keep the proof system compact, we propose a rewriting architecture which transforms the remaining expressions into a disjunctive normal form compatible with this class. In contrast with model-based approaches, the inference and rewriting systems are applied to the XPath language directly. We believe this will help in understanding the underlying issues of deciding containment on the language itself.

Keywords: XPath

Jean-Yves Vion-Dury

Jean-Yves holds an engineering degree from the Conservatoire National des Arts et Métiers (1993) and a Ph.D. in Computer Science from the Université Joseph Fourier (1999). As an invited researcher, he recently joined the WAM project at INRIA, in charge of the scientific development of Omega, a new transformation language strongly based on XPath.

His previous research at Xerox Research Centre Europe (XRCE, Grenoble) focused on the theoretical and practical design of Circus-DTE, a specialized programming language created around the paradigm of generalized structural pattern matching and typed transformations, as applied to XML document processing; as a result of his research, three international patents have been filed by the Xerox Innovation Group and are still pending.

Nabil Layaïda

Dr. Nabil Layaïda received a Ph.D. degree in computer science from the University of Grenoble in 1997. Since 1998, he has been working as a research officer at INRIA [Institut National de Recherche en Informatique et en Automatique]. He manages presentation services for the multimedia systems group within the WAM project. His research interests are in the fields of interactive structured and multimedia document processing, adaptive documents, multimedia applications for mobile devices, structure transformations, temporal scheduling and synchronization, and their applications on the Web. He is also a member of the SYMM working group and co-author of the SMIL 1.0 and SMIL 2.0 recommendations (Synchronized Multimedia Integration Language) of the World Wide Web Consortium.

Containment of XPath expressions: an inference and rewriting based approach

Jean-Yves Vion-Dury [INRIA Rhône-Alpes, Xerox Research Centre Europe http://www.xrce.xerox.com]
Nabil Layaïda [INRIA Rhône-Alpes]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 Jean-Yves Vion-Dury & Nabil Layaïda. Reproduced with permission.

XPath containment and equivalence

XPath is a simple yet powerful query language for XML documents, which allows navigating XML trees and returning a set of matching nodes. It is used in XML Schema to define keys and in XLink and XPointer to reference portions of documents. More notably, it represents a fundamental part of the XSLT and XQuery languages, as it allows the definition of matching expressions for patterns and provides node selectors to filter elements in transformations.

For a given XPath expression p and an XML input tree t, p(t) denotes the set of nodes of t returned by the evaluation of p. More precisely, an expression p is evaluated relative to a particular node of t, called the context node (noted x). Thus, the selection of a node set by p in a tree t with respect to a context node x of t is written px(t). The containment relation between two XPath expressions p1 and p2 (denoted p1 ⊆ p2) holds when, for any XML tree t and any context node x, the set of nodes selected by p1x(t) is included in the set selected by p2x(t).

[Link to open this graphic in a separate page]

It is worth noting that the equivalence relation between two expressions (denoted p1 ≡ p2) can be expressed using two containment relations (i.e., p1 ⊆ p2 and p2 ⊆ p1), and given a suitable algorithm, equivalence is reducible in polynomial time to containment [Miklau/Suciu].

The containment and equivalence relations for XPath expressions are essential in each of the previously mentioned areas. In addition, several fundamental problems reduce directly to containment or equivalence, such as expression optimization and key inference. For example, if an expression p2 represents a key for a given schema, then all expressions p such that p ⊆ p2 are also keys.

Containment is also a key component of the static analysis of XSLT transformations. It can reveal two aspects of expressions valuable for transformation designers and query programmers: consistency and performance. In practice, complex XPath expressions turn out to be difficult to interpret; therefore, errors are easily introduced. The consistency of an expression p can be verified by checking whether it is contained in the empty path1. Performance issues can be understood with the following example: given a path p = p1 | p2, if p1 ⊆ p2, then evaluating p1 is redundant, since all of its nodes are already selected by the evaluation of p2. Therefore, reformulating p as the equivalent expression p2 is an optimization of the evaluation of p. Query optimization aims at reducing unnecessary operations, such as tree navigation, during evaluation, and can be more complex to determine than in the previous example.
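
Since containment is defined by node-set inclusion for every tree and every context node, a quick test on sample documents can refute a containment claim, although it can never prove one. The sketch below is only an illustration of the definition, not a decision procedure; it assumes the third-party lxml library and uses a made-up sample document and expressions.

    # Refutation-only check of "p1 is contained in p2" on one sample document:
    # containment requires p1x(t) to be included in p2x(t) for every tree t and
    # context node x, so a single counter-example disproves it, while passing
    # this test proves nothing in general.
    from lxml import etree

    def included_on(doc, p1, p2):
        """True if p1 selects a subset of p2 from every element of doc."""
        tree = etree.fromstring(doc)
        for context in tree.iter():              # every element taken as context node x
            if not set(context.xpath(p1)) <= set(context.xpath(p2)):
                return False                     # counter-example found
        return True

    sample = "<r><a><b/><c><b/></c></a><b/></r>"               # hypothetical test document
    print(included_on(sample, "child::b", "descendant::*"))    # True: consistent with containment
    print(included_on(sample, "descendant::*", "child::b"))    # False: containment refuted

Equivalence can be refuted the same way by testing both directions, and a redundant union such as child::b | descendant::* can be flagged by checking the first operand against the second.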

Related work

In [Miklau/Suciu], the authors focus on a complexity analysis of the containment problem, based on a rudimentary fragment of XPath. The considered subset is built from four constructs, namely: the child axis (/), the descendant axis (//), wildcards (*), and qualifiers ([ ]). XPath expressions are then modeled as tree patterns, and the containment problem is reformulated in terms of a search for tree-pattern homomorphisms. This search is used to determine the complexity of containment for a range of languages built from subsets of the four constructs.

In [Deutsch/Tannen], containment is studied using DTD information translated into SXIC [Simple XPath Integrity Constraints]. The paper shows that for particular SXIC constraints, namely unbounded SXICs, the containment problem is undecidable. For the general problem, it indicates that the decidability of containment for full-fledged XPath remains an open question. Neven and Schwentick [Neven/Schwentick] recently established that containment under DTD constraints is in fact undecidable. While this is certainly a significant outcome, the proof relies on a somewhat complex reduction to the PCP [Post Correspondence Problem] [Hopcroft et al.], where the basic tree linearization function yield plays a key role; unfortunately, in our opinion it brings insufficient help in understanding the deep nature of the containment problem.

In [Olteanu et al.], the work examines the symmetry of most XPath axes in order to optimize XML database queries. An equivalence between location paths involving “reverse axes” is first established. This equivalence relation is used to rewrite XPath expressions into equivalents free of reverse axes, which reduces the cost of the evaluation process. The paper shows that most of the “duplicate” axes can be removed and that for some others an equivalent formulation in the XQuery language is possible.

The work found in these studies can be split into two broad categories:

  1. Model-based approaches. These rely on modeling the expressions as tree patterns or tree automata. Containment is then handled in terms of low-level operations on these structures.
  2. Syntax-oriented approaches. In contrast, these apply syntactic (rewriting) operations directly on expressions or containment assertions.
Syntax-oriented approaches have several advantages. First, rewritten expressions are XPath expressions and can be interpreted according to the same common semantics. This makes the different steps used to establish containment more readable.

Second, since containment is established on XPath expressions directly, it does not introduce extraneous properties, such as those introduced by the models. For instance, tree patterns do not capture the ordering of nodes in trees, even though this feature is important for XPath expressions.

Finally, reasoning on automata for containment is achieved by applying automaton transformations (complement, intersection, determinization). This approach suffers from the lack of reversible methods that would allow translating the results of these operations back from automata to path expressions. This precludes, for instance, applications such as optimizing an XSLT stylesheet by static analysis and relevant replacement of XPath expressions.

This paper is also an attempt to address most of the issues related to the syntax-oriented approach: mastering the combinatorial complexity and preserving the scalability of the formal tools.

Goal of the paper

The goal of this paper is twofold. First, it defines a formal method which aims to analyze and reason about XPath expressions thanks to (i) an inference system, combined with (ii) a rewriting architecture.

We believe that the formalization we propose is applicable to the entire XPath language and represents a generic tool for the study of related problems. Additionally, this mathematical tool is relevant to the study of other aspects, such as optimization and path simplification, since it helps in exploring, asserting, and proving general properties of the XPath language. To our knowledge, this is the first attempt at defining such a logical architecture dedicated to the XPath language.

Building a tractable inference system is not easy when dealing with such a powerful language, especially considering the combinatorial number of ways to express equivalent paths. The rewriting system’s goal is to produce a normal form for XPath expressions that eases establishing containment while keeping the inference system reasonably small.

The second goal of the paper is to analyze XPath in order to prove containment assertions on a large sample of the language. In the light of this analysis, we look forward to better understanding which language constructs are sources of high computational complexity and decidability problems, as reported in the previously mentioned studies. In the longer run, we are paving the way to the study of important properties, such as soundness and completeness, using the containment approach we propose.

The rest of the paper is organized as follows. In Section “XPath syntax and semantics”, we present a formal syntax and semantics for XPath, which we think captures faithfully a significant fragment of XPath 2.0 [W3C], including all axes and some of the newly proposed operators such as the for and the if constructs. Section “Inferring containment” presents a proof system for containment and equivalence assertions. This inference system is then associated with a transformation architecture, presented in Section “Transformation architecture”. After concluding, we draw some perspectives on this ongoing work, including indications on the mathematical characterization of the proposed approach.

XPath syntax and semantics

We propose a restricted but faithful version of XPath 2.0 [W3C], a subset and/or a combination of the ones proposed by Wadler [Wadler01] [Wadler02] and Olteanu et al. [Olteanu et al.], plus some extensions that ease formal analysis, such as the explicit root node and the void path (proposed and defined in [Olteanu et al.]). One of our future goals is to extend this subset toward even more realistic coverage, notably including important functions, such as position() in qualifiers, as in [Wadler02].

Our underlying goal is to remain close to the engineering reality that many developers have to deal with and, thus, to give the present work the widest possible field of application.

Syntax

The syntax is defined in two stages: (i) the core language, and (ii) the syntactic sugars, which just rewrite into expressions of the core.

[Link to open this graphic in a separate page]

Note that, according to the XPath 2.0 specification and as proposed in [Olteanu et al.], but in contrast with the proposal of [Wadler01], the axes a::N cannot contain other paths (e.g., a composite path in place of N). A node test N is either an unqualified name n, the wildcard symbol *, or a test function among text(), node(), comment(), processing-instruction(), and element(); when N is of the form ns:n, ns is considered a namespace prefix in accordance with the specification and processed accordingly. More precisely, it is rewritten into a namespace attribute (see rule r5f in Figure 2).

The important extension we propose with respect to qualifiers is the node-set inclusion constraint p1 ⊆ p2, which brings extra expressive power and interesting possibilities for containment inference.

[Link to open this graphic in a separate page]

A (partial) reflexive ordering of N elements is defined in Figure 1.

Figure 1: Partial ordering of node tests [graphic not reproduced]
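
Since the graphic for Figure 1 is not reproduced here, the following sketch only illustrates the intent of such a reflexive partial order on node tests; the concrete pairs encoded below (a name test below *, every test below node(), every test below itself) are our assumption, consistent with XPath semantics rather than a transcription of the figure.

    # A plausible encoding of a reflexive partial order on node tests.
    KIND_TESTS = {"text()", "comment()", "processing-instruction()", "element()", "node()"}

    def is_name_test(n):
        return n != "*" and n not in KIND_TESTS

    def nodetest_le(n1, n2):
        """True when every node matched by test n1 is also matched by test n2."""
        if n1 == n2 or n2 == "node()":     # reflexivity; node() accepts any node
            return True
        if n2 == "*":                      # * accepts any element, so any name test is below it
            return is_name_test(n1)
        return False

    assert nodetest_le("b", "*") and nodetest_le("text()", "node()")
    assert not nodetest_le("*", "b")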

The a symbol denotes axes, ranging over the whole set defined in the W3C specification.

[Link to open this graphic in a separate page]

We define the syntactic sugars (see Figure 2 for their translation into core expressions)

[Link to open this graphic in a separate page]

and do likewise for qualifiers (see Figure 3).

[Link to open this graphic in a separate page]
Figure 2: Syntactic sugars (axis) N5 [graphic not reproduced]

Figure 3: Syntactic sugars (qualifiers) N6 [graphic not reproduced]

We consider that the / and | operators are fully associative, and that the precedence ordering is (from tightest to loosest):

[Link to open this graphic in a separate page]

so that for instance

[Link to open this graphic in a separate page]

is syntactically understood as

[Link to open this graphic in a separate page]

The precedence of the boolean operators is the standard one: not binds tighter than and, which binds tighter than or.

Denotational semantics

We reuse the denotational definition of [Olteanu et al.], itself inspired by [Wadler01], and extend it with an explicit document root and an execution context that uniquely associates variable names with tree nodes.

Document model

XPath semantics rely on a document-as-a-tree model. A linear (“flat”) document is seen as a well-formed tree after a successful parse. A tree is modeled as a set of “typed” nodes (element, text, comment, processing-instruction, attribute, namespace, and root; the type of a node can be checked by a corresponding unary predicate).

A well-formed tree contains exactly one root node, which has no parent, no attributes, and no namespaces, but may have any other kind of node as children. Moreover, only elements can have children. Nodes, identified by x and xi in the sequel, are fully connected so as to form a tree2.

This structural property relies on the parent/child relation that characterizes the tree, and also on its transitive closure.

Moreover, a strict ordering, the document ordering, is defined on the nodes x of a tree t and reflects the order of tag occurrence in the linear document. The ordering relation between x1 and x2 holds if the opening tag of x1 appears strictly before the opening tag of x2 in the document, and is false otherwise (thus, the ancestors of a node x are considered to be before x; that is, nodes are totally ordered).

Semantics of selection

The selection is defined relative to a context node x and the execution context, which maps variable names to nodes: if the context contains the binding v=x, then looking up v yields x. The selection function is inductively defined, i.e., it uses itself in its own definition. The induction is in fact mutual, since it relies on the qualifier function (presented in the next paragraph), which itself uses the selection function. This is, however, quite common in denotational semantics, and simply reflects the syntactic structure of the XPath language faithfully.

[Link to open this graphic in a separate page]

The notation fa refers to the functions defined in the table below:

[Link to open this graphic in a separate page]

As noted in the XPath 2.0 specification, the axes self, ancestor, descendant, following, and preceding constitute a (non-overlapping) partition of the tree. The node-test function performs a node test, according to the table below:

[Link to open this graphic in a separate page]

The for expression is the only one that introduces a new (variable, node value) pair into the context.

[Link to open this graphic in a separate page]

Semantics of qualifiers

The originality here is in the inclusion test (last line), directly expressed through a set-theoretic inclusion of selection sets.

[Link to open this graphic in a separate page]
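
To make the mutual induction concrete, here is a minimal evaluator for a toy fragment of these semantics (child, descendant, and self axes, name tests, /, |, and qualifiers including the node-set inclusion test), written in Python. The tree model, the tuple encoding of paths, and the function names select and holds are ours, introduced only for illustration.

    # Mutually inductive selection and qualifier functions on a toy fragment.
    class Node:
        def __init__(self, tag, children=()):
            self.tag, self.children = tag, list(children)

    def descendants(x):
        for c in x.children:
            yield c
            yield from descendants(c)

    def axis_nodes(axis, x):
        if axis == "child":
            return list(x.children)
        if axis == "descendant":
            return list(descendants(x))
        if axis == "self":
            return [x]
        raise ValueError(axis)

    def select(p, x):
        """Selection: the set of nodes chosen by path p from context node x."""
        kind = p[0]
        if kind == "step":                       # ("step", axis, name-test, qualifier-or-None)
            _, axis, test, qual = p
            nodes = [n for n in axis_nodes(axis, x) if test in ("*", n.tag)]
            return {n for n in nodes if qual is None or holds(qual, n)}
        if kind == "seq":                        # p1/p2: evaluate p2 from every node selected by p1
            return {z for y in select(p[1], x) for z in select(p[2], y)}
        if kind == "union":                      # p1 | p2
            return select(p[1], x) | select(p[2], x)
        raise ValueError(kind)

    def holds(q, x):
        """Qualifier: truth of q at context node x (uses select in turn)."""
        kind = q[0]
        if kind == "path":                       # [p]: p selects at least one node
            return bool(select(q[1], x))
        if kind == "incl":                       # node-set inclusion constraint between two paths
            return select(q[1], x) <= select(q[2], x)
        if kind == "and":
            return holds(q[1], x) and holds(q[2], x)
        raise ValueError(kind)

    # On <r><a><b><c/></b></a></r>, the path a[descendant::c] selects the a element.
    root = Node("r", [Node("a", [Node("b", [Node("c")])])])
    p = ("step", "child", "a", ("path", ("step", "descendant", "c", None)))
    print([n.tag for n in select(p, root)])      # ['a']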

Inferring containment

Judgments, context, and logical rules

The inference system we present in this paper allows asserting and proving containment properties by means of judgments and rules, such as:

[Link to open this graphic in a separate page]

where the Ai and B are judgments, and r is the rule name. A rule like this means that if all the judgments Ai are true, then B is true. This well-known notation allows proofs to be built as trees, such as:

[Link to open this graphic in a separate page]

provided the user is able to unify expressions with the logical constituents found in the rules. A judgment means that the path p1 is contained in the path p2, given the context. This context contains pairs of variable names and values, noted v:p, where variable names v are uniquely defined. The context is also a mapping: looking up a variable v returns p, provided v is defined in the context. Sometimes, the context may be omitted from the proof tree, when it is invariant or understood as not significant. The assertion “the path p1 is contained in the path p2” means intuitively that the set of nodes selected by p1 is included3 in the set of nodes selected by p2.
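
The following small sketch shows one way such judgments, rule applications, and proof trees could be represented in code; the data layout and the rule names used in the example are our own illustration, not the authors' prototype.

    # Our own illustrative encoding of judgments, rules, and proof trees.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Judgment:
        context: tuple       # pairs (variable name, path), e.g. (("$v", "c | a"),)
        left: str            # path p1
        right: str           # path p2; the judgment asserts that p1 is contained in p2

    @dataclass
    class ProofTree:
        rule: str            # name of the rule applied at this node (illustrative names)
        conclusion: Judgment
        premises: list = field(default_factory=list)   # sub-proofs of the judgments above the bar

        def pretty(self, depth=0):
            line = "  " * depth + f"[{self.rule}] {self.conclusion.left} <= {self.conclusion.right}"
            return "\n".join([line] + [p.pretty(depth + 1) for p in self.premises])

    # a two-node proof tree for a hypothetical assertion, in the empty context
    leaf = ProofTree("reflexivity", Judgment((), "a", "a"))
    print(ProofTree("union-right", Judgment((), "a", "a | b"), [leaf]).pretty())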

Proof trees can also be developed by using an equivalence relation on XPath expressions or judgments. If, for instance, we define a path by an equivalence, it is legal by construction4 to derive a proof step by simple substitution5, such as:

[Link to open this graphic in a separate page]

Figures 4 and 5 propose useful equivalences, which are provable using the formal semantics proposed in Section “Denotational semantics” (see [Olteanu et al.] for similar demonstrations).

Figure 4: XPath equivalences [graphic not reproduced]

Figure 5: Qualifier equivalences [graphic not reproduced]

We note that the inference system, defined in the coming subsections, is used together with this equivalence relation.

Basic containments

We first introduce two basic rules expressing that, in any context, it is always true that (i) the void path is contained in any other path p, and (ii) any path p is contained in itself.

[Link to open this graphic in a separate page]

The union operator can appear either on the left-hand side or on the right-hand side of a containment assertion. The latter case offers two possible choices for further containments:

[Link to open this graphic in a separate page]

while, when located on the left-hand side, it requires both left sub-terms to be contained in the right term, as expressed by:

[Link to open this graphic in a separate page]
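
Taken together, the basic rules just given (void path, reflexivity, and union on either side) translate almost literally into a recursive check over path syntax trees. The sketch below, with a tuple encoding of paths that is ours, rejects everything else, just as the inference system would simply fail to build a proof.

    # Basic containment rules as a naive recursive check over tuple-encoded paths.
    VOID = ("void",)

    def contained(p1, p2):
        """True when containment of p1 in p2 follows from the basic rules alone."""
        if p1 == VOID or p1 == p2:                          # void path below everything; reflexivity
            return True
        if isinstance(p1, tuple) and p1[0] == "union":      # union on the left: both branches required
            return contained(p1[1], p2) and contained(p1[2], p2)
        if isinstance(p2, tuple) and p2[0] == "union":      # union on the right: one branch suffices
            return contained(p1, p2[1]) or contained(p1, p2[2])
        return False                                        # no basic rule applies

    # e.g. a | b is contained in b | (c | a) by the union rules and reflexivity
    assert contained(("union", "a", "b"), ("union", "b", ("union", "c", "a")))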

Now we can prove an assertion in the empty context thanks to the following proof tree:

[Link to open this graphic in a separate page]

Such a proof relies on a very basic mechanism, unification, which allows identifying the term with the term at the bottom of rule c2a, thanks to a substitution.

[Link to open this graphic in a separate page]

Applying this substitution to the top of the rule allows the further development of the tree. Finer comparisons require handling the / composition operator, which is quite fundamental in XPath. This is achieved by the following very generic rule, which conducts symmetrical comparisons

[Link to open this graphic in a separate page]

and using [d2], we can prove that:

[Link to open this graphic in a separate page]

The context is useful to handle paths defined with variables. Using the rules below:

[Link to open this graphic in a separate page]

one can prove that, in a context containing v:(c | a),

[Link to open this graphic in a separate page]

Note that t1 is not a basic rule but the following theorem (the context, invariant, is omitted):

[Link to open this graphic in a separate page]

We need more rules in order to compare steps. Let us consider the following rule, where a,a' denote axis names and N,N' node tests:

[Link to open this graphic in a separate page]

Considering that axes are partially ordered through the reflexive relation shown in Figure 6, and that node tests are similarly ordered by the relation shown in Figure 1, we can prove that child::b is contained in descendant::*, since

[Link to open this graphic in a separate page]
Figure 6: Partial ordering of axes [graphic not reproduced]
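
The step-comparison rule combines the two orderings: a::N is contained in a'::N' whenever a is below a' and N is below N'. Since Figure 6 is not reproduced, the axis pairs listed in the sketch below are our own assumption; each listed pair is a genuine node-set inclusion in XPath (for instance, the child axis always selects a subset of the descendant axis).

    # Sketch of the step rule over assumed fragments of the two partial orders.
    AXIS_LE = {
        ("child", "descendant"), ("child", "descendant-or-self"),
        ("descendant", "descendant-or-self"), ("self", "descendant-or-self"),
        ("parent", "ancestor"), ("parent", "ancestor-or-self"),
        ("ancestor", "ancestor-or-self"), ("self", "ancestor-or-self"),
    }

    def axis_le(a1, a2):
        return a1 == a2 or (a1, a2) in AXIS_LE

    def nodetest_le(n1, n2):       # simplified: name tests below *, everything below node()
        return n1 == n2 or n2 == "node()" or (n2 == "*" and not n1.endswith("()"))

    def step_le(step1, step2):
        """Compare the single steps a::N and a'::N'."""
        (a1, n1), (a2, n2) = step1, step2
        return axis_le(a1, a2) and nodetest_le(n1, n2)

    assert step_le(("child", "b"), ("descendant", "*"))    # child::b is contained in descendant::*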

Qualifiers

Comparing qualified paths such as p[q] requires a new judgment, the implication. This is captured by the following rule:

[Link to open this graphic in a separate page]

Consider, for instance, a containment between two qualified paths which should be provable, since both expressions conditionally select a elements, but with a more restrictive condition for the former one. In our approach, it is captured by a qualifier implication, which is equivalent to it by definition. The rules for reasoning on this kind of relation are developed in a coming subsection.

We now focus on applying d3 in combination with the equivalences shown in Figure 5 in order to illustrate how to assert various containment properties involving qualifiers. For instance, [d3] allows proving the following, if we admit for now the sub-proof related to the implication:

[Link to open this graphic in a separate page]

It is similarly possible to compare a qualified XPath expression with an unqualified one, e.g.,

[Link to open this graphic in a separate page]

The for construct

The for construct must also be compared, and we consider first the left-hand case.

[Link to open this graphic in a separate page]

Note that the content of the for construct is evaluated in a new, enriched context, where the variable $v has been bound to the value of p1. Here, the variable name must not already be defined in the context.

The right-hand case is quite similar, except that an additional condition must be verified in order to preserve soundness. If p1 is empty in the expression for $v in p1 return p2, the variable $v will not take any value and, thus, the inner expression p2 will never be evaluated. In conformance with the formal semantics, the whole construct will then select nothing and, thus, can only contain the void path. For instance, it is easy to prove that a/b is equivalent to for $v in a return $v/b, but this is not the case for b and for $v in a return b. Indeed, this latter path is only equivalent to b[a].

[Link to open this graphic in a separate page]

Name conflicts occur when comparing two different for constructs that define the same variable, since the inference system uses only one context. Let us consider the following example:

[Link to open this graphic in a separate page]

In such a case, it is possible to use an α-substitution, i.e., a renaming operation propagated throughout the whole term, very much as in the lambda calculus framework. The substitution of $v by $w in a path p is noted p[$w / $v], and we have p ≡ p[$w / $v] as long as $w is not used in p. This latter condition is required before any α-substitution. Thus, the proof comes in four parts, only three of which are developed here (the fourth one, the C judgment (an implication), will be shown later).

[Link to open this graphic in a separate page]
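
α-substitution is a plain structural traversal with a freshness side condition. The sketch below works over a toy tuple encoding of paths which is ours: variables are ("var", name) and for-expressions are ("for", name, p1, p2).

    # Alpha-renaming of a for-variable, refusing to rename when the fresh name is already used.
    def variables(p):
        """All variable names occurring in a path term."""
        if not isinstance(p, tuple):
            return set()
        if p[0] == "var":
            return {p[1]}
        if p[0] == "for":
            return {p[1]} | variables(p[2]) | variables(p[3])
        return set().union(*(variables(sub) for sub in p[1:]))

    def rename(p, old, new):
        """p[new / old]: substitute the variable old by new throughout p."""
        if new in variables(p):
            raise ValueError("the new variable must be fresh for an alpha-substitution")
        def go(t):
            if not isinstance(t, tuple):
                return t
            if t[0] == "var" and t[1] == old:
                return ("var", new)
            if t[0] == "for" and t[1] == old:
                return ("for", new) + tuple(go(s) for s in t[2:])
            return (t[0],) + tuple(go(s) for s in t[1:])
        return go(p)

    # rename $v to $w in:  for $v in a return $v/b
    p = ("for", "$v", ("step", "child", "a"), ("seq", ("var", "$v"), ("step", "child", "b")))
    print(rename(p, "$v", "$w"))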

Qualifier implication

We now present the second piece of our inference system, based on an implication judgment, which means that if q1 is true in a given context, then q2 is true in the same context. We propose first the most general set of rules (the context, invariant, is omitted for clarity).

[Link to open this graphic in a separate page]

This first rule permits many trivial (but often useful) inferences, as well as less obvious ones.

All other ei rules defined hereafter are directly logically derived from the standard boolean implication.

[Link to open this graphic in a separate page]

The following subset deals with conjunction

[Link to open this graphic in a separate page]

while this one addresses disjunction.

[Link to open this graphic in a separate page]
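
The boolean part of these rules can be phrased as a small recursive procedure. The sketch below, using our own toy encoding of qualifiers as nested tuples, only applies the obvious conjunction and disjunction decompositions over a syntactic-equality base case; the path-level reasoning, which is where the real content of the ei rules lies, is deliberately left out.

    # Structural qualifier implication: boolean decomposition plus syntactic equality.
    def implies(q1, q2):
        if q1 == q2:                                       # trivial implication
            return True
        if isinstance(q1, tuple) and q1[0] == "or":        # (a or b) => c : both branches must imply c
            return implies(q1[1], q2) and implies(q1[2], q2)
        if isinstance(q2, tuple) and q2[0] == "and":       # c => (a and b) : both conjuncts must follow
            return implies(q1, q2[1]) and implies(q1, q2[2])
        if isinstance(q1, tuple) and q1[0] == "and":       # (a and b) => c if either conjunct implies c
            return implies(q1[1], q2) or implies(q1[2], q2)
        if isinstance(q2, tuple) and q2[0] == "or":        # c => (a or b) if c implies either disjunct
            return implies(q1, q2[1]) or implies(q1, q2[2])
        return False

    assert implies(("and", "a", "b"), "a")                 # a and b  implies  a
    assert implies("a", ("or", "a", "b"))                  # a        implies  a or b
    assert not implies(("or", "a", "b"), "a")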

Inferring emptiness

XPath emptiness appears quite often in qualifiers. It turns out that such relations represent a particular containment case that we have to deal with. An important qualifier equivalence, defined below, characterizes this particularity.

[Link to open this graphic in a separate page]

Used in conjunction with f1 or f2 below, it opens many possibilities for building proofs

[Link to open this graphic in a separate page]

It can also be used in order to prove the previous C assertion.

[Link to open this graphic in a separate page]

Rules ei, f2, and f3 still cannot address two particular cases. The first one is captured by the following rule:

[Link to open this graphic in a separate page]

Intuitively, descendant::N for instance can be understood as:

[Link to open this graphic in a separate page]

On such a path, one could apply rules f2 and e1, if we were allowed to handle infinite terms. The other similar case to be considered is:

[Link to open this graphic in a separate page]

Other “multi-step” axes, such as following, have no single-step equivalent; thus, they need not be considered.

Multi-step axis

At this point, our inference system does not yet capture important containment cases, such as:

[Link to open this graphic in a separate page]

The reader may verify that no rule can be successfully applied in order to complete the tree:

[Link to open this graphic in a separate page]

We now propose a quite generic rule, which addresses all “multi-step” axes:

[Link to open this graphic in a separate page]

With g1, it’s possible to prove:

[Link to open this graphic in a separate page]

with:

[Link to open this graphic in a separate page]

Pushing context in expressions

Now come much more difficult cases. Our inference system cannot establish two important cases, captured by the examples below:

[Link to open this graphic in a separate page]

The problem expressed by the E judgment is that the right context b/c is missing when comparing the a node of the a/b/c expression with the a node of the second expression. Using d2, the proof construction attempt is stopped after applying d3:

[Link to open this graphic in a separate page]

However, since c is present as a second-level child of a in the left-hand expression, the containment should be deducible, since the right-hand qualifier condition is actually satisfied.

A similar situation holds for the F judgment; the problem comes from the left context a/b of the c element comparison. It should be possible to deduce from this context that c indeed has an a ancestor.

We propose to tackle this problem by adding to the system a variant of the d2 rule:

[Link to open this graphic in a separate page]

Intuitively, the idea is to push left and right contexts into qualifiers, in order to propagate this information to the inference process. The notation corresponds to the application of a small inductive transformation function that computes a reversed XPath expression. It has the following property for all paths p:

[Link to open this graphic in a separate page]

It is formally (and explicitly) defined by the function in Figure 7, and is based on the fact that each axis a has its own reverse axis6, as shown below (see also [Olteanu et al.] for another work using this symmetry property).

Figure 7: The reverse XPath computation [graphic not reproduced]

The reader can also take a closer look at the reversed for expression, the most complex one. Notice that it involves a fresh variable (a new one, not present in the reversed sub-term).

[Link to open this graphic in a separate page]
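
The axis table below lists the standard XPath symmetries (attribute and namespace have no reverse and are omitted, in line with note 6), and the small function reverses a plain sequence of steps; qualifiers and for-expressions, which are the delicate part of Figure 7, are not handled in this sketch of ours.

    # Reversing a plain step sequence: the reversed path selects x from y exactly
    # when the original path selects y from x.
    REVERSE_AXIS = {
        "child": "parent", "parent": "child",
        "descendant": "ancestor", "ancestor": "descendant",
        "descendant-or-self": "ancestor-or-self", "ancestor-or-self": "descendant-or-self",
        "following": "preceding", "preceding": "following",
        "following-sibling": "preceding-sibling", "preceding-sibling": "following-sibling",
        "self": "self",
    }

    def reverse_path(steps):
        """Reverse a list of (axis, node-test) steps."""
        rev = [("self", steps[-1][1])]                             # the last node test now constrains the new context
        tests = [n for _, n in reversed(steps[:-1])] + ["node()"]  # the other tests shift by one reversed step
        for (axis, _), test in zip(reversed(steps), tests):
            rev.append((REVERSE_AXIS[axis], test))
        return rev

    # the reverse of child::a/child::b is self::b/parent::a/parent::node()
    print(reverse_path([("child", "a"), ("child", "b")]))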

Capturing contradictions

Another category of difficult containment cases is illustrated by:

[Link to open this graphic in a separate page]

where the qualifiers, being contradictory, force the left-hand path to always evaluate to the empty set. We propose now two rules g2a and g2b that address these cases:

[Link to open this graphic in a separate page]

One can prove the example above:

[Link to open this graphic in a separate page]

Note that we used a theorem t2, which is easy to prove using ei rules:

[Link to open this graphic in a separate page]

Comparing relative and absolute paths

So far, this paper has not addressed mixed containments, e.g., between a relative and an absolute path, which should obviously be established in this particular case. Note that, for the XPath language, a relative path is not just a path that does not start with the root. Indeed, a path that reaches the explicit root node in one of its steps is equivalent to, and thus is, an absolute path. Thus, a relative path must not contain the root node, except in qualifiers, and we use a dedicated notation for such paths. The following rule, as a first attempt, seems appropriate to address mixed containment:

[Link to open this graphic in a separate page]

A rule like g3 indeed allows proving

[Link to open this graphic in a separate page]

However, it cannot be used for proving more subtle cases, such as:

[Link to open this graphic in a separate page]

So we propose a more powerful variant that covers these cases:

[Link to open this graphic in a separate page]

Synthesis

Considering increasingly complex cases, we have tried to guide the reader through the difficulties of XPath containment, and also through the difficulty of building a powerful yet simple inference system. Still, at this point, the system is far from complete. The (formal) soundness is being checked, and is partially achieved. The next stage will be to achieve a (more difficult) completeness proof, or to understand where the remaining difficulties lie, because we actually do not know whether the whole system, including the rewriting extension proposed in the next section, is complete.

Before reaching this stage, some simple cases are simply not covered. We would like to show, for instance, that:

[Link to open this graphic in a separate page]

or that:

[Link to open this graphic in a separate page]

We believe that such cases should not be addressed by extending the inference system. Indeed, the latter must be kept as small as possible and, thus, mathematically tractable. Another possibility could be to enrich the equivalence relation, but in order to remain useful and clear, the equivalence must stay quite intuitive and should not involve overly complex or convoluted transformations. Such operations must be clearly decomposed and justified in order to be used for inferences.

We propose to address the remaining containment cases by transforming XPath expressions into a normal form better suited to comparisons. Such a transformation can be achieved through a rewriting process, which is often mathematically more tractable and easier to understand than an ad hoc algorithm. We now describe the transformation architecture, which allows us to go one step further.

Transformation architecture

Overview

We propose a three-stage transformation architecture that preserves the semantics of XPath expressions. The first stage is a rewriting that brings any XPath term into a disjunctive normal form by applying, as much as possible, rules from a set directly derived from the equivalence relation. This stage strongly reduces the syntactic complexity, so that fewer rules have to be defined in the subsequent stages. Applying these rules as much as possible yields a rewriting closure (or relation).

The second transformation is a mix of three rewritings:

  1. The first performs axis rewriting in order to simplify and eliminate steps as much as possible.
  2. The second performs qualifier simplification and elimination.
  3. The third, as defined above, normalizes the expressions reintroduced by the right-hand sides of the previous rules.

The combination of the three rewriting stages is their standard functional composition. This scheme expresses that we first apply all rules of the first set as much as possible, then do the same for the second, then for the third; the whole process is itself iterated as much as possible, until a terminal form is reached.

The whole picture is captured by the following diagram:

[Link to open this graphic in a separate page]
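
Operationally, the diagram amounts to a standard rewriting driver: each rule set is applied to a fixpoint, the stages are composed, and the composition itself is iterated until the term is stable. The sketch below is generic, with placeholder rule sets; whether the iteration terminates of course depends on the actual rules, which is precisely one of the properties under investigation.

    # Generic driver for the multi-stage rewriting pipeline.
    def exhaust(rules, term):
        """Apply rewrite rules (term -> new term, or None if not applicable) until none fires."""
        changed = True
        while changed:
            changed = False
            for rule in rules:
                new = rule(term)
                if new is not None and new != term:
                    term, changed = new, True
        return term

    def normalize(term, n_rules, a_rules, q_rules):
        """Apply the normal-form, axis, and qualifier rule sets, each to a fixpoint,
        and iterate the whole pipeline until the term no longer changes."""
        while True:
            before = term
            for stage in (n_rules, a_rules, q_rules):
                term = exhaust(stage, term)
            if term == before:
                return term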

Disjunctive normal form

The rewriting relation reduces any term p to a normal form that is semantically equivalent but syntactically simplified. The rule set is the union of several subsets:

[Link to open this graphic in a separate page]

Rules of the first subset (Figure 8) are used to bring XPath expressions into a disjunctive form, i.e., one where the disjunction symbol is always at the top of the term, while qualifiers are brought into conjunctive form through the rules of the next subsets (see Figures 9 and 10).

Figure 8: Distributing | and / (N1) [graphic not reproduced]

Figure 9: Nesting qualifiers with N2 [graphic not reproduced]

Moreover, many expressions equivalent to the void path can be detected at this early stage and propagated to the top level thanks to the subset N4 (see Figure 11).

Figure 11: Detecting void paths with N4 [graphic not reproduced]

Subsets N5 and N6 (see Figures 2 and 3) are dedicated to syntactic sugar expansion.

The rewriting can be applied, for instance, as follows:

[Link to open this graphic in a separate page]

A subset of the rules is dedicated to the expansion of syntactic sugars so that, for instance:

[Link to open this graphic in a separate page]

Moreover, qualifiers are inserted so that each step is finally associated with at least one “true” qualifier:

[Link to open this graphic in a separate page]
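
As a concrete illustration of the distribution rules, the following sketch pushes | above / over a tuple-encoded syntax (the encoding is ours, and only two rules in the spirit of the N1 subset are shown); it is meant to be driven to a fixpoint by the pipeline described earlier.

    # Distributing / over | so that the disjunction rises to the top of the term.
    def distribute(p):
        """One bottom-up pass; iterate it to reach the disjunctive form."""
        if not isinstance(p, tuple):
            return p
        p = tuple(distribute(sub) for sub in p)
        if p[0] == "seq":
            _, left, right = p
            if isinstance(left, tuple) and left[0] == "union":     # (p1|p2)/p3 -> p1/p3 | p2/p3
                return ("union", distribute(("seq", left[1], right)),
                                 distribute(("seq", left[2], right)))
            if isinstance(right, tuple) and right[0] == "union":   # p1/(p2|p3) -> p1/p2 | p1/p3
                return ("union", distribute(("seq", left, right[1])),
                                 distribute(("seq", left, right[2])))
        return p

    # (a | b)/c  becomes  a/c | b/c
    print(distribute(("seq", ("union", "a", "b"), "c")))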

Rewriting axis and qualifiers: and

The idea here is to apply rules that suppress steps as much as possible, while preserving semantic equivalence. This operation is applied both to the principal axes and to axes embedded in qualifiers. For instance, a/.. can be transformed into .[a], which now makes it possible to use the inference system on qualifiers in order to prove containments involving .[a].

This is captured by the rules:

[Link to open this graphic in a separate page]

As an application example, we have:

[Link to open this graphic in a separate page]

with:

[Link to open this graphic in a separate page]

In this expression, α represents the concatenation of all rule names (in application order) that have been applied to derive the normal form. The derivation can thus be checked in order to verify the assertion.

Simplifying principal axis

In the example above, we used the rule:

[Link to open this graphic in a separate page]

Actually, many rules are required in order to achieve complete axis simplification. A methodical exploration requires considering all possible pairs of axes, ignoring the redundant ones, i.e., those which can be expressed through other axes.

To this purpose, the first rule set is dedicated to axis name suppression, as shown in Figure 12, and applies to the descendant-or-self, ancestor-or-self, following, and preceding axes.

Figure 12: Suppression of axes (A1) [graphic not reproduced]

The second rule set (see Figure 13) derives the empty expressions induced by axis composition, such as attribute::N/child::N, which always evaluates to an empty node set since an attribute node has no children.

Figure 13: Detection of empty paths (A2) [graphic not reproduced]
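
A few of these empty compositions can be tabulated directly from the data model (attribute and namespace nodes have neither children, attributes, nor namespaces of their own); the pairs below are examples of ours and only approximate the A2 rule set of Figure 13.

    # Axis compositions that always produce an empty node set, whatever the node tests.
    EMPTY_COMPOSITIONS = {
        (a1, a2)
        for a1 in ("attribute", "namespace")
        for a2 in ("child", "descendant", "attribute", "namespace")
    }

    def composition_is_empty(axis1, axis2):
        """True if axis1::N1/axis2::N2 can never select a node."""
        return (axis1, axis2) in EMPTY_COMPOSITIONS

    assert composition_is_empty("attribute", "child")
    assert not composition_is_empty("child", "child")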

Then come nine rule sets, one dedicated to each of the nine remaining axes.

Figure 14: Axis simplification (A3) [graphic not reproduced]

Rules are proposed only when it is possible to gain structural simplifications. For instance:

[Link to open this graphic in a separate page]

clearly cannot be simplified in this sense, whereas:

[Link to open this graphic in a separate page]

brings a structural simplification, by pushing information into qualifiers. Some rules require the powerful for construct, which captures the current context and enables useful rewritings, as already shown in [Olteanu et al.], section 5. Let us consider the following (wrong) rewriting:

[Link to open this graphic in a separate page]

which is not sound (more precisely, it does not preserve the exact semantics, since the new expression selects more nodes than the old one). Indeed, the exact rewriting has to use the for construct:

[Link to open this graphic in a separate page]

The rules of this set must be applied with an implicit left-to-right strategy in order to be sound, because this is the natural orientation of the / operation (which is, of course, not commutative).

Qualifier simplification

The purpose of the axis rewriting is to push information into qualifiers in order to simplify axes. We now propose to simplify the qualifiers themselves through another rewriting.

Figure 16: Qualifier simplification (Q) [graphic not reproduced]

Let us consider the following case as an illustration of this process:

[Link to open this graphic in a separate page]

Note that the complexity of the principal axis is unchanged, whereas the complexity of the inner path (embedded in the qualifier) is decreased.

Conclusion and perspectives

In this paper, we have proposed a new approach to the study of the containment problem, based on both a logical and a rewriting approach. We first introduced an inference system which allows asserting and proving properties of XPath expressions with the help of an equivalence relation. In order to keep this inference system reasonably small, we proposed a multi-stage rewriting architecture. The first stage produces a disjunctive normal form, on which it is simpler to define the other stages: roughly, simplification rules that bring terms p into a form more suitable for comparisons. Notably, it should be possible to infer containment on such terms without the help of the equivalence relation. This may help in tackling problems introduced, for instance, by commutative laws, which potentially involve infinite proof tree developments.

We intentionally tried to address the XPath language itself rather than a model or a too-restrictive subset, even though this is a challenging task. This allows studying properties of realistic path expressions and also checking the feasibility of rewriting-based applications.

We are currently investigating mathematical proofs of properties such as the soundness and completeness of the inference system (which does not use the equivalence relation but works on reduced terms), as well as the termination and soundness of the rewriting architecture. We formulated soundness as follows.

Figure 17: Proposition 1. Soundness [formal statement not reproduced]

The conformance expression means that the contexts involved agree with respect to the definitions they contain. We are exploring a proof by double structural induction based on the following induction hypothesis (using a suitable structural complexity measure):

[Link to open this graphic in a separate page]

The completeness theorem is certainly harder, and we do not yet know whether it holds; however, we plan to use a similar approach in order to explore the proof.

Figure 18: Proposition 2. Completeness [formal statement not reproduced]

Recent results on containment undecidability suggest that completeness cannot be reached. On the other hand, it is hard to evaluate the exact consequences of using explicit node-set constraints in qualifiers, which constitutes an original feature of our approach. In any case, we expect to learn more when addressing the completeness proof: to identify more precisely (i) the combinations of XPath features responsible for higher computational complexity and (ii) the sources of incompleteness or undecidability.

Even though high computational complexity means low performance in practice, this may not be a problem for most containment applications, such as static analysis, because most XPath expressions used in the real world seem to be simple and small.

In parallel with this formal work, we have started the development of algorithms embedded in an experimental demonstrator, written in the Python language [van Rossum]. This prototype implements the inference system together with the rewriting system presented in this paper. We plan to release the demonstrator as soon as the formal properties are established and their limits clearly understood.

Last but not least, we believe the rewriting architecture could be completed with an XPath optimization stage. This stage could use rules dedicated to redundancy elimination or to qualifier decomposition into computationally more efficient forms. Moreover, the normalization stage could use a real execution context, including schema information, in order to exploit additional equivalences. For instance, schema constraints may permit rewriting a/b into a/*, which does not hold in the general case.

Besides pure optimization, containment under schema validity assumptions can also be addressed with the inference system unchanged, provided that the normalization stage is extended to exploit schema information.

Notes

1.

The empty path always selects an empty node set. It was first introduced in [Olteanu et al.] — to our knowledge — in order to ease formal treatments.

2.

Every node x of the tree is reachable from the root node.

3.

Following the set-theoretic definition of inclusion

4.

Mathematically, this equivalence is necessarily fully congruent with respect to a sound containment relation.

5.

In a proof tree, substitution steps are usually not shown for the sake of compactness.

6.

Note also that, due to the intrinsic asymmetry of the attribute and namespace axes [W3C], the reverse-axis property does not hold for them.


Bibliography

[Deutsch/Tannen] Deutsch, Alin, and Val Tannen, Containment of Regular Path Expressions under Integrity Constraints, Knowledge Representation Meets Databases, 2001.

[Hopcroft et al.] Hopcroft, John E., Rajeev Motwani, and Jeffrey D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd ed., Addison Wesley, 2000.

[Miklau/Suciu] Miklau, Gerome, and Dan Suciu, Containment and Equivalence for an XPath Fragment (Extended Abstract), Symposium on Principles of Databases Systems, 2002.

[Neven/Schwentick] Neven, Frank, and Thomas Schwentick, XPath containment in the presence of disjunction, DTDs, and variables, International Conference on Database Theory, 2003.

[Olteanu et al.] Olteanu, Dan, Holger Meuss, Tim Furche, and Francois Bry, Symmetry in XPath, technical report, Computer Science Institute, Munich, Germany, October 2001.

[van Rossum] van Rossum, Guido, The Python language home page, http://www.python.org.

[W3C] W3C, XML Path Language (XPath) 2.0, W3C working draft, 15 November 2002.

[Wadler01] Wadler, Phil, A formal semantics of patterns in XSLT, In Proceedings of Markup Technologies 1999, 1999.

[Wadler02] Wadler, Phil, Two semantics for XPath, 1999, http://www.cs.bell-labs.com/who/wadler/topics/xml.html.


