The success of XML has come with an explosion in the number of vocabularies defined to represent similar information. In many domains, ranging from life sciences to e-commerce, many mappings have been defined between these vocabularies as a strategy for information integration. However, as the vocabularies evolve, maintaining these complex mappings and all artifacts that rely on them remains one of the major issues in information integration. A critical aspect of schema (or vocabulary) evolution is the ability to perform an impact analysis across all artifacts that directly or indirectly depend on the changing vocabularies. When an evolving vocabulary is not directly used, but rather mapped to another one through a complex mapping expressed in languages such as XSLT or XQuery, an impact analysis can be performed at a fine level of granularity only if the dependencies between the source and the target vocabulary are also known at the same level of granularity.
In this paper, we present an architecture for the static analysis of both XSLT and XQuery, together with a conservative and sound constraint-based analysis of these two languages that effectively uncovers the input/output dependency information necessary for a fine-grained impact analysis. For each element or attribute declaration d in the output schema, the analysis determines all the attribute or element declarations in the input schemas whose instances may influence the evaluation of an instance of d. We show results of an early experimental evaluation illustrating the accuracy and scalability of the approach.
Keywords: XSLT; XQuery; Mapping
The successful adoption of XML [XML] as a universal standard exchange format has come with an explosion in the number of vocabularies defined to represent similar information. The naive assumption that, in each domain, a single vocabulary would emerge as the single standard did not hold. In many application domains, ranging from life sciences to eCommerce, building and maintaining mappings between various vocabularies remain key issues in information integration across a variety of sources.
Declarative transformation languages such as XSLT [XSLT2] and XQuery [XQuery1] were designed to address the anticipated need for integration and interoperability between various related vocabularies. They are Turing-complete languages allowing the specification of how instances of source (or input) schemas are transformed into corresponding instances of target (or output) schemas. However, in scientific domains, such as life sciences, where large and complex data structures, reflecting the complexity of scientific models, need to be mapped between different representations, a growing trend has been to assist users with tools (e.g. Clio [Clio] [ClioMapping]) where mappings can be built semi-automatically and as simply as drawing lines between related elements. These tools typically build intermediary models in which elementary mappings are conceptually expressed by providing the following information:
Besides the obvious usability advantage of the Clio approach over directly specifying the mapping in XSLT or XQuery, the intermediary representation used by Clio greatly simplifies the mapping maintenance problem even when input or output schemas evolve. It directly provides, at the finest level of granularity (the element and attribute declaration level), the input and output dependency information needed for impact analysis, a major aspect of maintaining various artifacts when schemas evolve.
However, in the Clio approach, the expressiveness of the intermediary language is limited to the kinds of mappings that can be expressed through the tool. Turing-complete languages like XSLT are still required to express more complex mappings, but the fine-grained input/output dependency information needed for impact analysis is then not directly available.
In this paper, we present a conservative and sound constraint-based static analysis of both XSLT and XQuery that effectively uncovers the input/output dependency information necessary for a fine-grained impact analysis. For each element or attribute declaration d in the output schema, the analysis determines all the attribute or element declarations in the input schemas whose instances may influence the evaluation of an instance of d.
In the XML and database communities, various static analyses of XQuery have been proposed, mainly for optimization purposes [XMLProjection], but also in the context of XML security [XMLAccessControl]. A. Marian and J. Simeon [XMLProjection] have proposed a static analysis that infers, for a given XQuery, all the paths in the input document that may influence its evaluation. It could straightforwardly be extended to extract from an XQuery all the input/output dependency information. Unfortunately, most XQuery static analysis and optimization efforts have not focused on queries involving recursive user-defined functions. Indeed, to the best of our knowledge, previous work has simply ignored recursive functions or has provided gross conservative approximations, because it was essentially dealing with "simple" expressions that merely extract some information from a document or a repository of documents. In contrast to these simple expressions, most XSLT stylesheets are intrinsically recursive (they have at least the recursive default template for element and document nodes) and, should XQuery be used as a full-fledged programming language for web services [XQueryWS], many recursive XQueries would be written.
In the programming language community, many techniques [DataFlow] [ProgramAnalysis] [Compilers] [DataFlowFrameworks] have been applied to the static analysis of recursive declarative and imperative languages. Context-sensitive approaches, which distinguish between different call sites of the same function, have always provided better results (in terms of accuracy of the analysis) than context-insensitive approaches. But the success of context-sensitive approaches comes with a high price tag on performance, which, in general, is exponential in the size of the context. This performance issue has been a major limitation on the scalability of context-sensitive approaches to real-world programs.
The main contributions of this paper are threefold:
The remainder of this paper is organized as follows. In section “A simple example”, we introduce a simple example used for illustrative purposes throughout the paper. Section “Architecture” presents the architecture of the input/output dependency analyzer. The constraint-based static analysis of Core XQuery is formally described in section “Constraint-based static analysis of Core XQuery”. In section “XSLT specific optimization techniques”, we show how the analysis can be optimized for XSLT stylesheets. Finally, section “Experimental evaluation” presents the results of an early experimental evaluation of our approach.
Before describing the architecture of our solution to the input/output dependency extraction problem, we present here a simple example to illustrate our approach and give a first sense of the challenges we have to address.
The XSLT stylesheet presented in Figure 3 transforms documents conforming to the pseudo-schema shown in Figure 1 into documents valid against the pseudo-schema illustrated in Figure 2. The transformation (1) renames the 'bookStore' element to 'bib', (2) produces, for each 'author' element in the input, a corresponding 'author' element whose content is the concatenation of the 'fname' and 'lname' elements separated by a space, and (3) drops the 'price' and 'rating' elements of the input, which are not used in the output.
element bookStore {
  element company? { String }
  element book+ {
    element title { String }
    element author+ {
      element title? { String }
      element fname { String }
      element lname { String }
    }
    element pub { String }
    element price { Decimal }
    element rating { Integer }
  }
}
element bib {
  element company? { String }
  element book+ {
    element title? { String }
    element author+ { String }
    element pub { String }
  }
}
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="/">
    <xsl:apply-templates select="*"/>
  </xsl:template>
  <xsl:template match="bookStore">
    <xsl:element name="bib">
      <xsl:apply-templates select="*"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="author">
    <xsl:element name="author">
      <xsl:value-of select="fname"/>
      <xsl:text> </xsl:text>
      <xsl:value-of select="lname"/>
    </xsl:element>
  </xsl:template>
  <!-- 'price' and 'rating' information -->
  <!-- is not reported in the output -->
  <xsl:template match="price"/>
  <xsl:template match="rating"/>
  <!-- "Catch-all" template -->
  <!-- This template causes any element, not otherwise -->
  <!-- processed, to be copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="*|text()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
Figure 4 shows the input/output dependencies uncovered by our dependency extractor. The result element consists of (1) a 'schemas' element, which identifies all the input schemas ('source' subelements) and output schemas ('target' subelements), and (2) the list of dependencies that were discovered. Each dependency is described by a 'dependency' element, which specifies the list of input components ('source' elements) whose instances may influence the value of instances of a given output schema component ('target' element). Note that absolute simplified XPath expressions (only the child or attribute axis, and only a QName test or wildcard as node test) are used to uniquely identify each schema component (element or attribute declaration). For example, '/bookStore/book/title' represents the 'title' element declaration that constrains the first child element of any 'book', whereas '/bookStore/book/author/title' refers to the 'title' element declaration against which the first child element of any 'author' is valid.
<result>
  <schemas>
    <source rootName="bookStore" schemaLocation="input.xsd"/>
    <target rootName="bib" schemaLocation="output.xsd"/>
  </schemas>
  <dependencies>
    <dependency>
      <source value="/bookStore/book/title"/>
      <target value="/bib/book/title"/>
    </dependency>
    <dependency>
      <source value="/bookStore/book/author/lname"/>
      <source value="/bookStore/book/author/fname"/>
      <target value="/bib/book/author"/>
    </dependency>
    <dependency>
      <source value="/bookStore/book/pub"/>
      <target value="/bib/book/pub"/>
    </dependency>
    <dependency>
      <source value="/bookStore/book"/>
      <target value="/bib/book"/>
    </dependency>
    <dependency>
      <source value="/bookStore/company"/>
      <target value="/bib/company"/>
    </dependency>
    <dependency>
      <source value="/bookStore/book"/>
      <target value="/bib"/>
    </dependency>
  </dependencies>
</result>
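To give a concrete sense of how such a report supports impact analysis, the following sketch inverts an abridged Figure 4 report into an index from input-schema components to the output components they may affect. It is illustrative only: the element names follow Figure 4, but the helper names ('component', 'impact_index') are our own, and the code tolerates the path being carried either as a 'value' attribute or as element content.

```python
# Sketch: invert a dependency report into an impact index, i.e. for each
# input-schema component, the set of output components it may affect.
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abridged from the Figure 4 report.
REPORT = """
<result>
  <dependencies>
    <dependency>
      <source value="/bookStore/book/author/lname"/>
      <source value="/bookStore/book/author/fname"/>
      <target value="/bib/book/author"/>
    </dependency>
    <dependency>
      <source>/bookStore/book</source>
      <target>/bib/book</target>
    </dependency>
  </dependencies>
</result>
"""

def component(elem):
    # The path may be carried as a 'value' attribute or as element text.
    return elem.get("value") or (elem.text or "").strip()

def impact_index(report_xml):
    index = defaultdict(set)
    for dep in ET.fromstring(report_xml).iter("dependency"):
        targets = [component(t) for t in dep.findall("target")]
        for src in dep.findall("source"):
            index[component(src)].update(targets)
    return index

idx = impact_index(REPORT)
# Changing 'lname' in the input schema may impact the output 'author' element.
print(sorted(idx["/bookStore/book/author/lname"]))
```

A schema-evolution tool would consult this index when a change to an input declaration is proposed, flagging every output declaration (and any artifact built on it) that may be impacted.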
As mentioned in the introduction, the analysis described in this paper is sound: for a given element or attribute declaration d in the output schema, the analysis cannot miss an element or attribute declaration on which d depends. For example, an analysis that would return the set {element 'fname'} as the set of input components that may affect the evaluation of element 'author' in the output schema would not be sound, because element 'lname', whose instances obviously influence the value of instances of element 'author' in the output schema, is missing from the set. The soundness of our analysis is formally established in section “Correctness”.
The analysis is also conservative: the set of dependent input components is not guaranteed to be minimal. In other words, the computed set of dependent input components may contain some elements that could safely be removed, i.e. removed without making the analysis unsound; such elements are called false positives. For example, an analysis that simply states that each output schema component depends on all input schema components is obviously sound and inexpensive to compute, but far too conservative to be useful. The more false positives are present in the result, the less accurate the analysis is.
A major challenge to any sound and conservative static analysis consists of finding the right balance between accuracy and performance. Throughout the remainder of the paper we focus on approaches that allow accurate results while minimizing the performance impact.
We now present the main components of the architecture (see Figure 5) of an input/output dependency extractor. The dependency extractor takes as input an XSLT stylesheet or an XQuery query, one or more input schemas and zero or more output schemas^{1}. It outputs an XML document that describes all the discovered dependencies. Figure 4 shows an example of its output.
One key challenge posed to the design and implementation of an input/output dependency extractor for both XSLT and XQuery is the combined syntactic and semantic complexity of the two languages. In both languages, there are several ways to express the same transformation and a good deal of syntactic sugar, which makes the formal analysis of these languages rather cumbersome. As shown in Figure 5, this issue is addressed by first transforming an XSLT stylesheet or an XQuery query into a Core XQuery expression. Core XQuery has the advantage of being a well-defined subset of XQuery, without the syntactic sugar of XQuery and XSLT or the implicit rules of XSLT (e.g. built-in templates, template instantiation rules, conflict resolution rules, etc.). Furthermore, its formal semantics is clearly defined in the XQuery Formal Semantics Specification [XQueryFS], and the compilation to Core XQuery provides a common framework for analyzing both XSLT and XQuery.
In Figure 5, the normalizer component implements the transformation from XQuery to Core XQuery formally defined in the XQuery Formal Semantics Specification [XQueryFS], whereas the XSLT2XQuery component applies the approach introduced by L. Villard et al. [XSLT2XQuery] to compile XSLT to XQuery. As explained in section “XSLT specific optimization techniques”, in order to significantly reduce the cost of performing an accurate context-sensitive analysis of XSLT, we need to keep in the generated XQuery some information that was directly available in the original XSLT. Special annotations in the generated XQuery are used for that purpose. The XQuery generated from our XSLT example is shown in Figure 6.
(: due to space constraints, namespace and variable declarations are not shown :)
declare function t2q:template0($t2q:dot as node(), $t2q:pos as xs:integer,
    $t2q:last as xs:integer, $t2q:mode as xs:string) as item()*
(: XSLT Annotation: [match = /] :)
{
  (let $t2q:sequence := (($t2q:dot/child::*)) return
   let $t2q:last := count($t2q:sequence) return
   for $t2q:dot at $t2q:pos in $t2q:sequence return
     t2q:applyTemplates($t2q:dot, $t2q:pos, $t2q:last, '#default'))
};
declare function t2q:template2($t2q:dot as node(), $t2q:pos as xs:integer,
    $t2q:last as xs:integer, $t2q:mode as xs:string) as item()*
(: XSLT Annotation: [match = author] :)
{
  (element author {
     text { string-join(
       for $t2q:d in subsequence(data(((($t2q:dot/child::fname)))), 1.0, 1.0)
       return ($t2q:d cast as xs:string), ' ') },
     text { ' ' },
     text { string-join(
       for $t2q:d in subsequence(data(((($t2q:dot/child::lname)))), 1.0, 1.0)
       return ($t2q:d cast as xs:string), ' ') }
   })
};
(: due to page-count constraints, the XQuery functions for the bookStore, :)
(: price, and rating templates are not shown :)
declare function t2q:template5($t2q:dot as node(), $t2q:pos as xs:integer,
    $t2q:last as xs:integer, $t2q:mode as xs:string) as item()*
(: XSLT Annotation: [match = *] :)
{
  (element { string-join(
     for $t2q:d in subsequence(data((name(($t2q:dot treat as node())))), 1.0, 1.0)
     return ($t2q:d cast as xs:string), ' ') }
   {
     let $t2q:sequence := (($t2q:dot/attribute::*)) return
     let $t2q:last := count($t2q:sequence) return
     for $t2q:dot at $t2q:pos in $t2q:sequence return
       t2q:applyTemplates($t2q:dot, $t2q:pos, $t2q:last, '#default'),
     let $t2q:sequence := ((($t2q:dot/child::*), ($t2q:dot/child::text()))) return
     let $t2q:last := count($t2q:sequence) return
     for $t2q:dot at $t2q:pos in $t2q:sequence return
       t2q:applyTemplates($t2q:dot, $t2q:pos, $t2q:last, '#default')
   })
};
declare function t2q:applyTemplates($t2q:dot as node(), $t2q:pos as xs:integer,
    $t2q:last as xs:integer, $t2q:mode as xs:string) as item()*
{
  (let $t2q:dot5 := (($t2q:dot/self::*)) return
   let $t2q:dot4 := (($t2q:dot/self::bookStore)) return
   let $t2q:dot3 := (($t2q:dot/self::author)) return
   let $t2q:dot2 := (($t2q:dot/self::price)) return
   let $t2q:dot1 := (($t2q:dot/self::rating)) return
   let $t2q:dot0 := (($t2q:dot/self::document-node())) return
   ( if ((($t2q:mode = '#default') and exists($t2q:dot0)))
     then t2q:template0($t2q:dot0, $t2q:pos, $t2q:last, $t2q:mode)
     else (if ((($t2q:mode = '#default') and exists($t2q:dot1)))
     then t2q:template4($t2q:dot1, $t2q:pos, $t2q:last, $t2q:mode)
     else (if ((($t2q:mode = '#default') and exists($t2q:dot2)))
     then t2q:template3($t2q:dot2, $t2q:pos, $t2q:last, $t2q:mode)
     else (if ((($t2q:mode = '#default') and exists($t2q:dot3)))
     then t2q:template2($t2q:dot3, $t2q:pos, $t2q:last, $t2q:mode)
     else (if ((($t2q:mode = '#default') and exists($t2q:dot4)))
     then t2q:template1($t2q:dot4, $t2q:pos, $t2q:last, $t2q:mode)
     else (if ((($t2q:mode = '#default') and exists($t2q:dot5)))
     then t2q:template5($t2q:dot5, $t2q:pos, $t2q:last, $t2q:mode)
     else t2q:builtInApplyTemplates($t2q:dot, $t2q:pos, $t2q:last, $t2q:mode))))))))
};
declare function t2q:builtInApplyTemplates($t2q:dot as node(), $t2q:pos as xs:integer,
    $t2q:last as xs:integer, $t2q:mode as xs:string) as item()*
{
  (( if (exists((($t2q:dot/self::text()),
                 ($t2q:dot/(.)[(. instance of attribute())]))))
     then string-join(
            for $t2q:d in subsequence(data($t2q:dot), 1.0, 1.0)
            return ($t2q:d cast as xs:string), ' ')
     else ( if (exists((($t2q:dot/self::processing-instruction()),
                        ($t2q:dot/self::comment()))))
            then ()
            else let $t2q:sequence := (($t2q:dot/child::node())) return
                 let $t2q:last := count($t2q:sequence) return
                 for $t2q:dot at $t2q:pos in $t2q:sequence return
                   t2q:applyTemplates($t2q:dot, $t2q:pos, $t2q:last, $t2q:mode)))))
};
declare function t2q:identity($t2q:node as node()) as node() { $t2q:node };
declare function t2q:convert-operand($t2q:actual as item(), $t2q:expected as item()) as item()
{ (: body not shown due to space constraints :) };
declare function t2q:isNumeric($t2q:value as item()) as xs:boolean
{ (: body not shown due to space constraints :) };
t2q:applyTemplates($t2q:initial-context-node, 1, 1, '#default')
The first goal of our static analysis is to associate with each Core XQuery subexpression a result abstract value that represents, as precisely as we can afford ^{2}, the result of any concrete evaluation of the subexpression. Informally, for our analysis, an abstract value is a set consisting of element and attribute declarations (e.g. element 'bib'), simple types (e.g. xs:string) and constant atomic values (e.g. 5). For example, consider the following expression
(element book {
   if (exists($t2q:dot/pub))
   then element pub { ' A publisher ' }
   else ('5', '6')
 }, '2', '3')
A sound static analysis could associate the result abstract value "{element 'book' from the output schema, xs:string('2'), xs:string('3')}+" to the expression. This means that the evaluation of the expression is guaranteed to yield a sequence of one or more items such that each item is either valid against the element declaration 'book' or is the string '2' or '3'. Thus, the result abstract value tells us that the expression can build an element (element 'book') of the target schema.
For dependency extraction purposes, we also require that the analysis computes, for each Core XQuery subexpression e, a used abstract value that represents all the attributes or elements in the input that may influence the value of output nodes. In general, for a given expression, this used abstract value is computed by (1) aggregating the result and used abstract values of its subexpressions, and (2) removing all element and attribute declarations that come from the output schemas, as well as simple types and constant atomic values ^{3}. In the previous simple expression, the only inner subexpression whose used abstract value contains schema components from the input schema is 'exists($t2q:dot/pub)'. Intuitively, the used abstract value of the subexpression 'exists($t2q:dot/pub)' depends on the result and used abstract values of the argument of the function call. In order to find the result and used abstract values of $t2q:dot/pub, we need to keep track of all the result abstract values representing all the sequence items that can be bound to the variable $t2q:dot, as well as the used abstract values associated with them. Such a variable can be bound to values passed as parameters of function calls; therefore the analysis must take into account the complex and recursive nature of the function call graph. Assuming that the result abstract value of the variable $t2q:dot is {element 'book' from the input schema}* and its used abstract value is {}0, the used abstract value of the expression is {element 'pub' from the input schema}. Now, we know that the content of element 'book' in the output schema may depend on element 'pub' in the input schema.
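The bookkeeping described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the AbsVal class, the schema table and the child_step function are hypothetical names, and the input schema of Figure 1 is abstracted as a map from element declarations (identified by their simplified absolute paths) to their child declarations.

```python
# Sketch of abstract values: a base set of schema components / types / constants
# plus an occurrence indicator.  child_step maps the abstract value of $dot to
# the abstract value of '$dot/child::name' by consulting the input schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class AbsVal:
    base: frozenset      # element declarations, simple types, constant values
    occurrence: str      # one of '0', '1', '?', '+', '*'

# Toy abstraction of the Figure 1 input schema: declaration -> child declarations.
INPUT_SCHEMA = {
    "/bookStore": ["/bookStore/company", "/bookStore/book"],
    "/bookStore/book": ["/bookStore/book/title", "/bookStore/book/author",
                        "/bookStore/book/pub", "/bookStore/book/price",
                        "/bookStore/book/rating"],
}

def child_step(ctx: AbsVal, name: str) -> AbsVal:
    """Result abstract value of '$dot/child::name' given $dot's abstract value."""
    kids = {c for d in ctx.base for c in INPUT_SCHEMA.get(d, ())
            if c.rsplit("/", 1)[-1] == name}
    # '*' (zero or more) is always a sound occurrence indicator for a step.
    return AbsVal(frozenset(kids), "*")

dot = AbsVal(frozenset({"/bookStore/book"}), "1")
print(child_step(dot, "pub").base)   # the 'pub' declaration under 'book'
```

Under these assumptions, the step '$t2q:dot/pub' evaluated with $t2q:dot abstracted by {element 'book'} yields the abstract value containing exactly the 'pub' declaration, mirroring the reasoning in the paragraph above.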
For the simple example presented in section “Goals of static analysis of Core XQuery”, we have shown how an analysis that associates result and used abstract values with each subexpression, and keeps track of these abstract values for each variable, allows us to find input/output dependencies. In our architecture, the Dependency Extractor component is responsible for reading the result and used abstract values provided as annotations to build the dependency result file.
An analysis satisfying the requirements of section “Goals of static analysis of Core XQuery” was specified and implemented using a constraint-based approach. The first step of this approach consists of generating local constraints that must be satisfied by the result and used abstract values of each subexpression and variable. In the second step, the generated recursive system of constraints is solved, and the solution yields the result and used abstract values of each subexpression. In our architecture, the first step is performed by the constraint builder component, while the second is the responsibility of the constraint solver component. A major advantage of this approach is the separation of the specification of the analysis (constraint generation) from its implementation (constraint solving). On the one hand, the specification of the analysis can be done without worrying too much about how the constraint system will be solved; since constraints are local, they are easily understandable, structural proofs can easily be established, and new constraints can easily be added to make the analysis more precise. On the other hand, standard optimized constraint solvers can be reused.
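The second step can be realized with a standard worklist fixpoint algorithm over set-inclusion constraints. The sketch below is a generic illustration under assumed names (a 'solve' function over constraints of the shape (lhs, rhs-variables, constant-set)), not the exact constraint language emitted by the constraint builder.

```python
# Worklist fixpoint solver for constraints of the form:
#   sol[lhs] ⊇ sol[rhs1] ∪ ... ∪ sol[rhsn] ∪ const
# over sets ordered by inclusion.  Returns the least solution.

def solve(constraints):
    """constraints: list of (lhs, rhs_vars, const_set) triples."""
    sol, deps = {}, {}
    for i, (lhs, rhs, const) in enumerate(constraints):
        sol.setdefault(lhs, set())
        for v in rhs:
            sol.setdefault(v, set())
            deps.setdefault(v, []).append(i)   # constraints to revisit when v grows
    work = list(range(len(constraints)))
    while work:
        lhs, rhs, const = constraints[work.pop()]
        new = set(const)
        for v in rhs:
            new |= sol[v]
        if not new <= sol[lhs]:                # value grew: re-examine dependents
            sol[lhs] |= new
            work.extend(deps.get(lhs, []))
    return sol

# Toy constraints for the 'author' template of the running example:
# used(author) ⊇ used(fname-step) ∪ used(lname-step).
cs = [("used:fname", [], {"in:/bookStore/book/author/fname"}),
      ("used:lname", [], {"in:/bookStore/book/author/lname"}),
      ("used:author", ["used:fname", "used:lname"], set())]
print(sorted(solve(cs)["used:author"]))
```

Because every constraint is monotone over a finite lattice of sets, the worklist loop terminates with the least fixpoint, which is exactly the property the constraint solver component relies on.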
The formal specification of the constraints generated by the constraint builder component is presented in section “Acceptability relation and constraints”.
We now formally specify the static analyses of Core XQuery needed for input/output dependency extraction.
Although the approach described in this paper has been implemented for the whole Core XQuery language, in the remainder of the paper, for simplicity of presentation, we only consider the subset of Core XQuery described in Figure 7.
Expr ::= Literal
       | ()
       | Expr, Expr
       | Var
       | for Var in Expr return Expr
       | let Var := Expr return Expr
       | Axis NodeTest
       | if ( Expr ) then Expr else Expr
       | Expr ( '=' | '!=' | '>' | '<' | 'or' | 'and' ) Expr
       | element { Expr } { Expr }
       | QName (Expr?)
       | declare QName(x?) { Expr }
Var  ::= $QName
This subset of Core XQuery is sufficient to illustrate the main challenges of the static analysis. It consists of literal values (e.g., strings, decimals), the empty sequence ("()"), sequence construction, variables, for and let expressions, XPath steps, conditionals, comparisons, logical operators, element constructors, function calls and function declarations. Also, for simplicity of the presentation, we do not consider namespaces.
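For illustration, such a subset could be encoded as an abstract syntax tree on which a constraint builder pattern-matches. The sketch below uses hypothetical class names and also shows a free-variable computation of the kind used to seed the initial environment.

```python
# A possible AST encoding of a few Figure 7 productions (names illustrative).
from dataclasses import dataclass

@dataclass
class Literal:
    value: object

@dataclass
class Var:
    name: str

@dataclass
class For:          # for var in seq return body
    var: str
    seq: object
    body: object

@dataclass
class If:           # if (cond) then then_ else else_
    cond: object
    then_: object
    else_: object

def free_vars(e):
    """Free variables of an expression, e.g. to seed the initial environment."""
    if isinstance(e, Var):
        return {e.name}
    if isinstance(e, For):
        # The 'for' variable is bound in the body, not in the sequence.
        return free_vars(e.seq) | (free_vars(e.body) - {e.var})
    if isinstance(e, If):
        return free_vars(e.cond) | free_vars(e.then_) | free_vars(e.else_)
    return set()

# for $b in $dot return if ($b) then $b else 0 : $dot is free, $b is bound.
expr = For("$b", Var("$dot"), If(Var("$b"), Var("$b"), Literal(0)))
print(free_vars(expr))
```

Recall from the previous section that the known input nodes are those bound to the free variables of the analyzed expression, which is why a computation of this shape is needed before the analysis proper starts.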
In section “Goals of static analysis of Core XQuery”, we have already pointed out the need to statically obtain an abstract representation of all input nodes that may affect the evaluation of a given expression. In this section, we formally define the meaning of "attribute and element nodes that influence the evaluation of an expression".
First, when the evaluation of an expression starts, the set of input documents that will be accessed is not always known, since documents can be opened dynamically using the doc function. Since we are mainly interested in XQueries and XSLT stylesheets that perform transformations between known schemas, in the remainder of this paper we assume that the schemas of all input documents are statically known.
Before the evaluation of an expression e, the set of known input nodes is defined as the set of nodes bound, in the initial environment, to free variables of e and all the nodes in the same XML tree as any of those input nodes. We introduce the following judgment to track input element and attribute nodes that are accessed (or used) during the evaluation:
DynEnv |- expr => v using US
This judgment means that, in the dynamic environment DynEnv, the expression expr evaluates to the value v and, during the evaluation of expr, the subset US of input attribute and element nodes was accessed (or used).
Figure 8 illustrates how the evaluation inference rules defined in the XQuery Formal Semantics Specification [XQueryFS] are modified to keep track of all input element and attribute nodes accessed during the evaluation of an expression.
--------------------------------------------------------------- (EvalLiteral)
DynEnv |- Literal => Literal using {}

--------------------------------------------------------------- (EvalEmptySeq)
DynEnv |- () => () using {}

DynEnv |- expr1 => Value1 using US1
DynEnv |- expr2 => Value2 using US2
--------------------------------------------------------------- (EvalComma)
DynEnv |- expr1, expr2 => (Value1, Value2) using US1 U US2

DynEnv.varValue(var) = value
--------------------------------------------------------------- (EvalVar)
DynEnv |- var => value using {}

DynEnv |- expr1 => () using US1
--------------------------------------------------------------- (EvalFor1)
DynEnv |- for var in expr1 return expr2 => () using US1

DynEnv |- expr1 => Item1, ..., Itemn using US'
DynEnv + varValue(var => Item1) |- expr2 => Value1 using US1
...
DynEnv + varValue(var => Itemn) |- expr2 => Valuen using USn
--------------------------------------------------------------- (EvalFor2)
DynEnv |- for var in expr1 return expr2 => Value1, ..., Valuen
          using US1 U ... U USn U US'

DynEnv |- expr1 => Value1 using US'
DynEnv + varValue(var => Value1) |- expr2 => Value2 using US''
--------------------------------------------------------------- (EvalLet)
DynEnv |- let var := expr1 return expr2 => Value2 using US'' U US'

DynEnv.varValue($fs:dot) = Value1
Value1 matches node
DynEnv |- axis Axis of Value1 => Value2
Axis principal PrincipalNodeKind
DynEnv |- test NodeTest with PrincipalNodeKind of Value2 => Value3
Value3 does not contain text nodes
--------------------------------------------------------------- (AxisNodeTest1)
DynEnv |- Axis NodeTest => fs:distinct-docorder(Value3)
          using toSet(Value3) intersect InputElt&AttrNodes
(where the toSet function maps a sequence to a set containing exactly the same
elements)

DynEnv.varValue($fs:dot) = Value1
Value1 matches node
DynEnv |- axis Axis of Value1 => Value2
Axis principal PrincipalNodeKind
DynEnv |- test NodeTest with PrincipalNodeKind of Value2 => Value3
Value3 does contain text nodes
--------------------------------------------------------------- (AxisNodeTest2)
DynEnv |- Axis NodeTest => fs:distinct-docorder(Value3)
          using { x | (x is input node) and (x is element) and (x parent of y)
                      and (y in Value3) and (y is text node) }
                U (toSet(Value3) intersect InputElt&AttrNodes)

DynEnv |- expr1 => true using US'
DynEnv |- expr2 => Value2 using US''
--------------------------------------------------------------- (EvalCond1)
DynEnv |- if expr1 then expr2 else expr3 => Value2 using US' U US''

DynEnv |- expr1 => false using US'
DynEnv |- expr3 => Value3 using US''
--------------------------------------------------------------- (EvalCond2)
DynEnv |- if expr1 then expr2 else expr3 => Value3 using US' U US''

DynEnv |- expr1 => v1 using US'
DynEnv |- expr2 => v2 using US''
DynEnv |- v1 op v2 => v3 using {}
--------------------------------------------------------------- (EvalOp)
DynEnv |- expr1 op expr2 => v3 using US' U US''
(where op is in { =, !=, <, >, or, and })

DynEnv |- expr1 => Value1 using US'
Value1 matches xs:QName
DynEnv |- fs:item-sequence-to-node-sequence(expr2) => Value2 using US''
Value2 matches (attribute*, (element | text | processing-instruction | comment)*)
--------------------------------------------------------------- (EvalEltConst)
DynEnv |- element { expr1 } { expr2 } => element Value1 { Value2 }
          using US' U US'' U (toSet(Value2) intersect InputElt&AttrNodes)

declare func(x) { expr2 }
DynEnv |- expr => value1 using US1
DynEnv + varValue(x => value1) |- expr2 => value2 using US2
--------------------------------------------------------------- (EvalUserDefFunc)
DynEnv |- func(expr) => value2 using US2 U US1
One important point about the refined evaluation rules is worth mentioning: for a step expression evaluating to a sequence containing text nodes, we consider that the parents of the returned text nodes that are in the input documents influence the evaluation. This decision, which may seem arbitrary, is motivated by the fact that we are tracking dependencies on input attribute and element nodes, not on text nodes. Furthermore, we cannot simply ignore these text nodes because, as the example "element name { /a//text() }" illustrates, we would then fail to account for some dependencies on input nodes.
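The effect of rule (AxisNodeTest2) can be illustrated with a small sketch: when a step result contains input text nodes, the dependency is charged to their parent elements. The node classes below are illustrative stand-ins for a real XDM data model, not the paper's implementation.

```python
# Sketch of the text-node rule: dependencies are tracked on element and
# attribute nodes only, so a returned text node is charged to its parent.
class Elem:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)
        for c in self.children:
            c.parent = self

class Text:
    def __init__(self, value):
        self.value, self.parent = value, None

def used_nodes(step_result):
    """Input element nodes charged for a step result, per (AxisNodeTest2)."""
    used = set()
    for n in step_result:
        if isinstance(n, Text):
            used.add(n.parent)      # charge the parent element, not the text
        elif isinstance(n, Elem):
            used.add(n)
    return used

fname = Elem("fname", [Text("Kurt")])
# A step like '/a//text()' over this tree returns the text node; the
# dependency is recorded against its 'fname' parent element.
result = fname.children
print({e.name for e in used_nodes(result)})
```

This is why "element name { /a//text() }" ends up depending on the elements that contain the selected text, even though no element node appears in the step result itself.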
Nodes accessed (or used) during the evaluation of an expression e do not necessarily correspond to nodes on which the result of e depends, because unnecessary evaluations of some subexpressions may have been performed. For example, the result of the XQuery expression "if (xs:boolean(true) or (/bookStore/book[1]/price < 100)) then 1 else 0" does not depend on any input nodes, although its evaluation can access the price of the first book (another similar example: "let $x := //book return 1"). In this example, the evaluation of a subexpression does not affect the value of the result regardless of the schema to which the input document conforms. If the input document is known to be valid against a given schema, this can entail restrictions that make the evaluation of some subexpressions irrelevant. For example, if the input document is known to be valid against the schema shown in Figure 1, the result of "if (/bookStore/book[1]/price != 'a string') then 1 else 0" is always 1, because "/bookStore/book[1]/price" must be of type xs:decimal and therefore cannot be equal to 'a string'.
In the previous examples, we were able to discard some accessed nodes (i.e. not consider them as nodes that influence the result) because we could find an equivalent expression for which these nodes were not accessed during the evaluation. For these expressions, the equivalent expression was the constant "1".
In the remainder of this section, we formally define the equivalence relation between expressions and the notion of influenced nodes.
Definition. An environment constraint envConst is a complete function from the set of QNames to the set of XQuery Types^{4}.
Definition. We define the partial order relation <= on the set of environment constraints as follows: envConst1 <= envConst2 iff, for any QName q, envConst1(q) <:= envConst2(q), where <:= denotes the partial order relation on XQuery Types defined in the XQuery Formal Semantics Specification [XQueryFS] .
Definition. A dynamic environment dynEnv is consistent with an environment constraint envConst iff, for any QName q in dynEnv, dynEnv.varValue(q) is valid against envConst(q). Stating that the input schemas are known really means that there is an environment constraint envConst such that all evaluations are done in dynamic environments consistent with envConst.
Definition. Two expressions exp1 and exp2 are equivalent under a given environment constraint envConst, denoted "envConst |- exp1 <=> exp2", iff for any dynamic environment dynEnv consistent with envConst and for any sequence value v, (there is a set X such that dynEnv |- exp1 => v using X) iff (there is a set Y such that dynEnv |- exp2 => v using Y). In other words, two expressions are equivalent under a given environment constraint envConst iff, under the same dynamic environment consistent with envConst, their evaluations either both raise an error or both yield the same value (note that, in general, this value will be different for different dynamic environments). The set equiv[envConst](exp) denotes the set of all expressions equivalent to an expression exp given the environment constraint envConst.
Definition. Let exp be a Core XQuery expression, let envConst be an environment constraint, and let dynEnv be a dynamic environment consistent with envConst such that dynEnv ⊢ exp => v using US. The set infl[envConst](dynEnv, exp) of input attribute and element nodes that influence the evaluation of exp under the dynamic environment dynEnv consistent with envConst is defined as follows:
infl[envConst](dynEnv, exp) = { n | for all exp' in equiv[envConst](exp), (dynEnv ⊢ exp' => v using US') implies n is in US' }
As previously mentioned, the result of our analysis is to assign to each subexpression an abstract value that approximates it. Choosing the right abstract domain is crucial because it will determine in part the accuracy of the analysis, its performance and the usefulness of its result (i.e. the potential applications that can leverage the output of the analysis). Many factors were considered in the design of the abstract domain. They can be grouped into two categories:
The abstract domain AbsDom[c] we chose satisfies the two constraints mentioned earlier. An element of AbsDom[c] consists of:
An abstract value represents a set of sequences of items. The occurrence indicator of an abstract value av bounds the number of items in the sequences represented by av. The base set of av restricts the items that may appear in a sequence represented by av to either some specific atomic values, to atomic values that are instances of some specific atomic types, or to attribute and element nodes valid against some given attribute and element declarations. For example, the abstract value { xs:decimal(5), xs:string, element bookStore }+ represents all the sequences containing one or more items, where each item in the sequence is either the decimal value 5, a string, or an element valid against the element declaration bookStore.
Notation. For an abstract value av: occurrenceIndicator(av) denotes its occurrence indicator, base(av) its base set, baseConstant(av) the greatest subset of base(av) containing only constant atomic values, baseAtomicType(av) the greatest subset of base(av) consisting of only atomic simple types, and baseNode(av) the greatest subset of base(av) consisting of element and attribute declarations.
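The structure of an abstract value and the notation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tagging of base-set members as constants, atomic types, or element/attribute declarations is our own encoding.

```python
from dataclasses import dataclass

# Hypothetical encoding of an abstract value: an occurrence indicator
# ('1', '?', '+' or '*') bounding the length of the represented sequences,
# and a base set whose members are tagged as constants, atomic types, or
# element/attribute declarations.
@dataclass(frozen=True)
class AbsValue:
    occurrence: str   # '1', '?', '+' or '*'
    base: frozenset   # e.g. {('const', 5), ('type', 'xs:string'), ('elem', 'bookStore')}

    def length_ok(self, n):
        """Can a sequence of n items satisfy the occurrence indicator?"""
        return {'1': n == 1, '?': n <= 1, '+': n >= 1, '*': True}[self.occurrence]

def base_constant(av):
    """baseConstant(av): the constant atomic values in the base set."""
    return {v for kind, v in av.base if kind == 'const'}

def base_node(av):
    """baseNode(av): the element and attribute declarations in the base set."""
    return {v for kind, v in av.base if kind in ('elem', 'attr')}
```

For instance, `AbsValue('+', frozenset({('const', 5), ('type', 'xs:string'), ('elem', 'bookStore')}))` encodes the example abstract value { xs:decimal(5), xs:string, element bookStore }+ discussed above.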
Given an expression e and its input and output schemas, element and attribute declarations used in the abstract domain AbsDom[c] of the analysis of e are limited to those declared in the schemas plus element and attribute declarations for temporary constructed nodes. These temporary nodes are constructed during the evaluation of e but are not added to the result, and therefore there may not be declarations in the output schemas that constrain them. However, we need to represent them in the abstract domain since some expressions may use them (e.g. "let x := element myElt { "a string value" } return $x//text()"). The static type inference algorithm described in the XQuery Formal Semantics Specification [XQueryFS] provides the type of temporary constructed nodes used in the abstract domain.
The partial order relation <:= defined in the XQuery Formal Semantics Specification [XQueryFS] can be straightforwardly extended to a partial order relation on AbsDom[c]. Let Top = S* be the abstract value whose base set S consists of all atomic simple types and all element and attribute declarations allowed in AbsDom[c] (in the previous paragraph we have shown that this is a finite set) and whose occurrence indicator is "*"; Top is the greatest element of AbsDom[c] with regard to the partial order. We include in AbsDom[c] the element "None" and extend the order relation so that "None" is the smallest element. AbsDom[c] with the extended order relation is a complete lattice and, since it is finite, it satisfies the ascending chain condition.
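The extended order can be pictured with a small join (least upper bound) sketch, assuming a simplified representation of abstract values as (occurrence, base-set) pairs; this is only an illustration, not the paper's implementation. Both components range over finite sets, so every ascending chain stabilizes, which is exactly the ascending chain condition.

```python
NONE = None  # stands for the least element "None" of AbsDom[c]

def occ_join(a, b):
    # least upper bound on occurrence indicators: '1' is below '?' and '+',
    # and '?', '+' are below '*'
    if a == b:
        return a
    return {frozenset('1?'): '?', frozenset('1+'): '+'}.get(frozenset(a + b), '*')

def join(av1, av2):
    """Join of two abstract values: NONE acts as bottom; otherwise take the
    weaker occurrence indicator and the union of the base sets."""
    if av1 is NONE:
        return av2
    if av2 is NONE:
        return av1
    (occ1, base1), (occ2, base2) = av1, av2
    return (occ_join(occ1, occ2), base1 | base2)
```

For example, joining an abstract value for exactly one book element with one for at most one title element yields an at-most-one value over both declarations.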
Having formally defined the abstract domain of the static analysis, we now specify the constraint system whose resolution yields, for each subexpression e, (1) an abstract value approximating the result of any evaluation and (2) an abstract value representing all the input nodes that may influence the evaluation of e.
For simplicity of the presentation, we assume that all declared variables (including function parameters) have distinct names. We also assume that each expression is uniquely identified by a label, an integer corresponding to its location. The result of the analysis can be expressed as the pair of functions (D, AE) such that
where Var is the set of variables in the query and Label is the set of labels. Dr maps a label l (i.e. an expression e) to an abstract value approximating the result of any evaluation of e. Ds maps a label l (i.e. an expression e) to an abstract value approximating the input elements and attributes that may influence the evaluation of e. AE corresponds to the abstract environment: it keeps track of the abstractions bound to variables by the static analysis. AEr maps a variable (including variables corresponding to parameters of user-defined functions) to an abstract value approximating all its possible values at runtime. AEs maps a variable v (including variables corresponding to parameters of user-defined functions) to an abstract value approximating the input elements and attributes that may influence the evaluation of expressions whose values may be assigned to v.
We use the same notation as in [ProgramAnalysis] to indicate that a tuple (D, AE) is acceptable for an expression expr: "(D, AE) ⊨ expr". The acceptability relation tells us whether a given pair (D, AE) is indeed an acceptable solution to the analysis problem.
For our first analysis, we do not take into account the context in which functions are called. Figure 10 shows the set of constraints that must hold for (D, AE) to be acceptable for different kinds of expressions for the context-insensitive analysis.
Literal. (literal) states that a solution (D, AE) is acceptable for a literal expression Literal with label l iff the abstract value Dr(l) approximating its result contains the singleton sequence consisting of the typed value of Literal (the tv function maps a literal lit to its corresponding typed value; for example, tv("a string literal") = xs:string("a string literal")). In other words, the sequence (tv(Literal)) must be in the set of concrete values represented by Dr(l). On the other hand, since a literal expression does not depend on any input nodes, the set of concrete values corresponding to Ds(l) must at least contain the empty sequence.
Empty Sequence. (EmptySeq) simply states the obvious: the result and used abstract values associated with the empty sequence expression must at least include (or be an approximation of) the abstract value for the empty sequence.
Comma operator. (Comma) states that the result (Dr) and used (Ds) abstract values of a sequence expression is obtained by composing the result and used values of its subexpressions. We also require (D, AE) to be acceptable for the subexpressions.
Variable. For a variable reference var, the result and used abstract values must at least be an approximation of the result and used abstract values of var stored in the abstract environment AE.
Let expression. First, (D, AE) has to be acceptable for the subexpressions of the let expression considered. The abstract values associated with the variable v in the abstract environment AE must at least be approximations of the abstract values associated with exp1. It is important to note that the result and used values of exp1 can influence the result and used abstractions of the let expression only if the variable v is referenced in exp2.
For expression. The treatment of for expressions is very similar to the treatment of let expressions. The differences account for the fact that, at runtime, if exp1 evaluates to an empty sequence then the for expression yields an empty sequence; otherwise, the variable v is always bound to a sequence of exactly one item taken from those in the sequence obtained by evaluating exp1. As a consequence of these runtime differences, the static analysis of for expressions differs from that of let expressions in two aspects:
Step expression. For a step, the result and used values of the context node $fs:dot are first retrieved from the abstract environment AE. The abstract evaluation of a step is similar to the static typing of step expressions defined in the XQuery Formal Semantics Specification [XQueryFS]. For example, given the example input schema of Figure 1 and an abstract context item set to "{element bookStore}1", 'book/title' evaluates to "{element title}+", and './/author/*' yields "{ element title, element fname, element lname }+". The result abstract value of a step expression must be an approximation of the result of performing the abstract evaluation of the step. The last constraint makes sure that parents of text nodes in the result of the evaluation of the step are represented by the used abstract value. The src function selects only input element and attribute declarations.
If-Then-Else expression exp. The analysis result depends on the analysis result of the test subexpression exp0. If the "then" branch can be taken (i.e. if the result abstract value of exp0 contains the abstraction of xs:boolean('true')) then (D, AE) must also be acceptable for exp1, and the result abstract value (resp. the used abstract value) of exp must at least contain the result abstract value (resp. the used abstract value) of exp1. Similar constraints must hold if the "else" branch can be taken. Finally, the used abstract value of the test expression exp0 is taken into account only if both branches can be reached; otherwise, the conditional expression can safely be rewritten into either exp1 or exp2, and thus does not depend on the evaluation of exp0.
Element constructor. The most interesting aspect of the constructor is the use of the eltConst function in the computation of the result abstract value. The abstract value of exp0 represents a sequence of QNames. When eltConst is applied to this abstraction of sequences of QNames, it yields an abstraction of sequences of temporary and output element nodes. Formally, for an abstract value av:
Comparison expression. The analysis basically composes the analysis results of the subexpressions (the src function selects only element and attribute declarations from input schemas).
Function call expression exp. There are two points worth mentioning:
The analysis is performed assuming that the input documents conform to statically known schemas. As explained in section “Definition of input nodes influencing the evaluation of an expression”, this basically means that the analysis is done under an environment constraint envConst mapping QNames of variables to their XQuery type. These XQuery types need to be approximated by elements of the abstract domain. The XQuery Formal Semantics Specification [XQueryFS] defines prime and quantifier functions on types such that, for a type t, "(prime(t)) quantifier(t)" is a supertype of (i.e. an approximation of) t. The type function prime(t) extracts all item types from the type t and combines them into a choice. For example, for t = (xsd:decimal, xsd:float, xsd:integer), prime(t) = xsd:decimal | xsd:float | xsd:integer and quantifier(t) = '+'. An XQuery type t is approximated in our abstract domain by adom(t) = items(prime(t)) quantifier(t), where the items function returns the item types of which a union type is made. For example, adom(t) = {xsd:decimal, xsd:float, xsd:integer}+.
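The prime/quantifier approximation can be sketched for the simple case of a tuple-style sequence type (t1, ..., tn). The function names mirror those of the Formal Semantics, but the encoding of a sequence type as a Python list of item-type names is purely illustrative, and the treatment of the empty type is our own simplifying assumption.

```python
def prime(seq_type):
    """Collect the item types of the sequence type into a choice (a set)."""
    return set(seq_type)

def quantifier(seq_type):
    """Summarize the possible lengths: '1' for exactly one item,
    '?' when only the empty sequence is possible (our assumption),
    '+' for a fixed tuple of two or more items."""
    n = len(seq_type)
    if n == 0:
        return '?'
    return '1' if n == 1 else '+'

def adom(seq_type):
    """Approximate a sequence type in the abstract domain as
    (items of prime(t), quantifier(t))."""
    return (prime(seq_type), quantifier(seq_type))
```

On the paper's example, `adom(['xsd:decimal', 'xsd:float', 'xsd:integer'])` yields the abstraction corresponding to {xsd:decimal, xsd:float, xsd:integer}+.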
In order to take into account the environment constraint envConst, in addition to constraints defined in Figure 10, the constraint presented in Figure 11 must also be satisfied.
For an expression e, the constraints imposed by the acceptability relation ⊨ (as shown on Figure 10) on e and recursively on all its subexpressions, as well as the constraints shown on Figure 11, define a system S of constraints. Each constraint has the form "f(D, AE) ⊑ X" where f is a monotone function of (D, AE) and X is one of Dr(l), Ds(l), AEr(l) or AEs(l) with l a label. By the construction of S, it follows that an analysis (D, AE) is acceptable and satisfies the constraints shown on Figure 11 iff it is a solution of S.
For illustration purposes, we consider the simple labeled conditional expression shown on Figure 12.
For the example on Figure 12, an analysis (D, AE) is acceptable (per Figures 10 and 11, assuming that the free variable $b is of type xs:boolean) iff it satisfies the conjunction of constraints shown on Figure 13, which can be expressed as the system S of constraints appearing on Figure 14.
On Figure 14, "=>" does not represent the logical implication as it does everywhere else in the paper. Instead, it denotes a binary operator whose first operand is a boolean value and whose second operand and result are abstract values. If its first operand is true then its result is the same as its second operand; otherwise (i.e. the first operand is false), its result is None (the least element of the abstract domain). Formally, for an abstract value av, ((true => av) = av) and ((false => av) = None).
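The guard operator of Figure 14 is small enough to state as a function; the representation of abstract values here is arbitrary, since the operator only inspects its boolean operand.

```python
NONE = None  # stands for the least element "None" of AbsDom[c]

def guard(cond, av):
    """The "=>" operator of Figure 14: a true condition passes the abstract
    value through; a false condition collapses it to the least element."""
    return av if cond else NONE
```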
As illustrated in the previous section, from the acceptability relation, a system S of monotone constraints can be generated. The function "(D, AE) : l -> (Top, Top, Top, Top)", which maps each label to the greatest element of the abstract domain, is obviously a solution of S. To get the most accurate analysis, the least solution of S is of interest. For monotone constraint systems on a complete lattice satisfying the ascending chain condition, such as S, the least solution exists, and efficient algorithms to compute it are described by F. Nielson et al. [ProgramAnalysis].
Theorem: The analysis is sound. The following holds:
In other words, any acceptable solution (D, AE) under an environment constraint envConst is such that, for any evaluation in a dynamic environment dynEnv consistent with envConst: (1) there is at least one sequence represented by either the result or the used abstract value whose items are exactly the influenced input nodes for the considered evaluation, and (2) the result of the evaluation belongs to the set of sequences represented by the result abstract value.
Proof. First, for an acceptable solution (D, AE), we prove by structural induction on the expression exp that there is an equivalent expression exp' under envConst (i.e. envConst ⊢ exp <=> exp'; exp' may be the same as exp) such that for any dynamic environment dynEnv consistent with envConst, "dynEnv ⊢ exp' => v using US" implies that (1) the elements of US are all contained in at least one sequence represented by either the result or the used abstract value associated with exp', and (2) v is in the set of sequences represented by the result abstract value. Once this result is established, the proof of the theorem follows from the definition of the infl function.
As mentioned in the experimental section “Experimental evaluation”, the context-insensitive approach is less expensive to compute. However, it has two important shortcomings:
To address the limitation of the context-insensitive approach with respect to the analysis of function calls, we associate with each subexpression e different result and used abstract values depending on the context that may reach e. A context c records the last n possible invocation sites of functions that have not yet returned (n is less than or equal to a maximum context size k, which is a configurable parameter of the analysis). Now, the result of the analysis is expressed as the pair of functions (D, AE) such that:
We use the notation "(D, AE) ⊨_c exp" to denote that the pair (D, AE) is acceptable for the expression exp with context c.
Constraint (funCall) in Figure 10 is then changed as shown in Figure 17 to produce a context sensitive analysis.
Unfortunately, the worst-case complexity of the context-sensitive approach is exponential in the maximum context size. We will present in section “XSLT specific optimization techniques” various techniques to avoid this worst-case behavior for the analysis of XSLT stylesheets.
The analysis presented in section “Context-insensitive analysis” is inaccurate in the treatment of element constructors because it does not keep track of the possible parents of constructed elements. To address this limitation, we add a component Dp to D to represent the possible parents of an element constructed by an element constructor at a given label.
The acceptability relation defined in Figure 10 is modified as shown in Figure 18 to produce a "parent" sensitive analysis.
Constraints not modified are not shown in Figure 18. "..." means that nothing else has been changed.
We assume that the static type inference defined in the XQuery Formal Semantics Specification [XQueryFS] has been performed, and for each subexpression e whose label is l, type(l) yields the static XQuery type. The function tempdoc maps a label l to the following element "document { adom(type(l)) }" of the abstract domain ^{7}. It is used when elements constructed by the subexpression corresponding to l are not directly added to the result.
All constraints except (eltConst) simply propagate parent information top-down (from an expression to its subexpressions). Constraint (eltConst) indicates that the result abstract value of an element constructor e is computed by taking into account the abstract value representing its possible parent (Dp(l)) in addition to the abstract value representing its possible QNames. Thus, we can now differentiate between element declarations having the same QName in the output schema provided that they are declared in the content models of distinct element declarations.
A direct application of the analyses presented in section “Constraint-based static analysis of Core XQuery” to Core XQueries generated from XSLT stylesheets is expensive and yields very inaccurate results. Two peculiarities of XSLT stylesheets account for these failures of the basic Core XQuery analyses:
<result>
  <schemas>
    <source rootName="bookStore" schemaLocation="input.xsd"/>
    <target rootName="bib" schemaLocation="output.xsd"/>
  </schemas>
  <dependencies>
    <!-- ... -->
    <dependency>
      <source>/bookStore/book</source>
      <source>/bookStore/company</source>
      <target>/bib/book</target>
    </dependency>
    <dependency>
      <source>/bookStore/book</source>
      <source>/bookStore/company</source>
      <target>/bib/company</target>
    </dependency>
    <!-- ... -->
  </dependencies>
</result>
In order to make the analysis more precise in the presence of "catch-all" templates, before compiling the stylesheet to XQuery, a "catch-all" template t whose 'match' attribute is '*' is replaced by specialized templates with the same body as t. The 'match' attribute values of the new specialized templates correspond to the QNames of element declarations in the input schemas. In our example, the "catch-all" template is replaced by two specialized templates with the same body but whose 'match' attribute values are 'book' and 'company'.
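The specialization step can be sketched as a simple rewrite over a list of templates; the dict-based representation of a template is hypothetical and stands in for the tool's actual data structures.

```python
def specialize_catch_all(templates, schema_element_names):
    """Replace each template matching '*' by one specialized copy per
    element declaration known from the input schemas, sharing the body."""
    specialized = []
    for t in templates:
        if t['match'] == '*':
            specialized.extend(
                {'match': name, 'body': t['body'], 'specialized': True}
                for name in schema_element_names)
        else:
            specialized.append(t)
    return specialized
```

On the running example, a single catch-all template is replaced by two templates matching 'book' and 'company' with the same body.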
Special care is needed for input schemas containing attribute or element wildcards because, in that case, "catch-all" templates must not be removed. In fact, at runtime, they may be instantiated for elements and attributes constrained by named attribute and element declarations unknown statically. However, the analysis should not consider that statically known named element or attribute declarations could match the match pattern of the "catch-all" templates (i.e. "*" or "@*"). Instead, it should consider that only element and attribute wildcards can match patterns specified by "catch-all" templates, whereas only statically known named element and attribute declarations can match patterns of the generated specialized templates. We use special annotations to allow the analyzer to recognize generated specialized templates and "catch-all" templates.
As shown in the experimental evaluation section “Experimental evaluation”, the worst-case complexity of the basic context-sensitive evaluation of Core XQueries generated from XSLT stylesheets is realized in practice: the number of constraints generated is exponential in the maximum context size. The worst-case complexity is essentially due to the fact that, for any xsl:apply-templates invocation, all the templates are considered. However, for an xsl:apply-templates appearing in an unnamed template, the list of templates that may be instantiated at runtime can almost always be significantly reduced statically (this reduction of the number of templates to consider is especially important because, in the previous section “Template specialization”, we have specialized all the templates). For the example presented in section “A simple example” on Figure 3, it is obvious that the xsl:apply-templates appearing in the first template can only instantiate the template whose 'match' attribute is 'bookStore', and the xsl:apply-templates appearing in this 'bookStore' template can only instantiate the specialized template whose 'match' attribute is 'book'.
The reduction of the list of templates to consider is achieved by a simple, very conservative context-insensitive analysis of both the match pattern specified by the template t in which an xsl:apply-templates instruction ats appears and the select XPath attribute of ats. This analysis can be complex if the select XPath or the match pattern contains variables, because a flow analysis of the whole stylesheet may be required to find an approximation of all the possible values of the variables. We avoid that complexity by simply approximating the abstract value of variables by the greatest element of the abstract domain.
In the generated XQuery, we annotate each t2q:applyTemplates function call corresponding to the translation of an xsl:apply-templates with the subset of templates that can actually be instantiated at runtime. Thus, when the constraint builder analyzes these t2q:applyTemplates function calls, it will not generate constraints using the default body of t2q:applyTemplates, where all templates are considered; instead, constraints will be generated from a rewritten version of its body where only the subset of templates in the XSLT annotation are considered. As shown in the experimental evaluation section, this specialization of the body of t2q:applyTemplates based on its invocation context can significantly reduce the number of generated constraints and make the analysis scale.
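The conservative template-reduction step can be sketched as follows, under strong simplifying assumptions: only a child-axis name test in the select attribute is analyzed, anything else falls back to "all templates", and the schema_children map is a hypothetical stand-in for the input schema.

```python
def reachable_templates(schema_children, enclosing_match, select, templates):
    """Conservatively keep only the templates an xsl:apply-templates call
    can instantiate: those whose match equals a simple name-test select
    that is a schema child of the enclosing template's matched element.
    Any select we cannot analyze keeps all templates (safe fallback)."""
    simple_name_test = bool(select) and select.isidentifier()
    if not simple_name_test:
        return [t['match'] for t in templates]
    candidates = schema_children.get(enclosing_match, set())
    return [t['match'] for t in templates
            if t['match'] == select and select in candidates]
```

On the running example, an apply-templates with select 'book' inside the 'bookStore' template keeps only the specialized 'book' template, while an unanalyzable select keeps every template.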
All the analyses presented earlier have been implemented in Java according to the architecture described in section “Architecture”. The purpose of this section is to report on the first experimental evaluation of our approach and implementation.
We have performed two types of experiments: on correctness and accuracy and on performance and scalability. They were done with three stylesheets:
Although the theoretical soundness of our analyses has been established in section “ Correctness”, the first kind of experiments was necessary to test the correctness of our implementation. These experiments also allow a better understanding of the limitations and strengths of all the analyses described in the paper:
The second kind of experiments evaluates the performance and scalability of the context-sensitive approach. The number of constraints the constraint solver must solve represents the major factor influencing performance and scalability. As long as all the generated constraints fit in memory, a generic iterative constraint solver sorting constraints in their topological order [ProgramAnalysis] can efficiently solve the system. As mentioned in section “Context-sensitive analysis”, the worst-case exponential behavior of context-sensitive analyses represents their most important drawback and scalability limitation. In order to show that this worst case is not realized in practice when the XSLT-specific techniques of section “XSLT specific optimization techniques” are applied, we have performed two context-sensitive analyses with parent information of the three stylesheets for different values of the maximum context size: the first was only based on the Core XQuery analysis described in section “Constraint-based static analysis of Core XQuery”; the second applied the XSLT-specific techniques presented in section “XSLT specific optimization techniques”. Figure 19 shows how, in these two experiments, the number of constraints generated varies depending on the maximum context size. Curves obtained from the first experiment have labels of the form "CoreX <Name>" whereas those obtained from the second experiment (with XSLT-specific optimizations) have labels of the form "XSLT <Name>". Figure 19 shows that the XSLT-specific optimizations make the analysis of XSLT (1) significantly less expensive than the generic analysis of Core XQuery and (2) more scalable as the maximum context size increases. The result of the first experiment is not shown for PubMed because, even for a maximum context size of 2, the number of generated constraints cannot fit in 1 GB of main memory.
In this paper, we have presented a common architecture for the static analysis of both XQuery and XSLT, particularly tailored to handle recursive XQueries and stylesheets. It has been applied to solve the input/output dependency extraction problem. Our initial experiments have shown that, by taking into account some peculiarities of XSLT stylesheets, our context-sensitive analysis can be both precise and scalable.
However, the same experiments have also demonstrated that the worst-case exponential behavior of the context-sensitive analysis is realized when XSLT specificities are ignored. Therefore, its performance and scalability may be limited when it is applied to XQueries containing many calls to recursive functions. Addressing this possible scalability issue for XQuery constitutes an interesting challenge, which will require more experiments with XQueries and work on XQuery-specific techniques to reduce the analysis cost (e.g. the possible use of the result of XQuery type inference to perform the kind of specializations applied to XSLT).
If no output schemas are specified, the static type inference defined in the XQuery Formal Semantics Specification [XQueryFS] can be used to infer them. 

The more precise the abstract values are, the more expensive the analysis will be. 

Used abstract values consist only of element and attribute declarations in input schemas. 

An environment constraint maps a QName to a XQuery type conforming to the type system defined in the XQuery Formal Semantics Specification [XQueryFS] . 

The ascending chain condition is satisfied iff any ascending sequence of elements of the lattice eventually stabilizes. An ascending sequence (A1, ..., An, ...) is a sequence such that for all i, j, i <= j implies Ai <= Aj. The ascending chain condition says that, for any such sequence, there is an integer k such that, for all i >= k, Ai = Ak. 

In our implementation, the base set also contains special abstractions to represent comment and processing instruction nodes, but for simplicity we do not consider them in the remainder of the paper. Simple atomic types are used to represent text nodes. 

Note that we had to slightly modify the abstract domain so that the base set of an abstract value can now contain elements of the form document{ adom(type(l)) }. The abstract domain remains finite and is still a complete lattice. 

PubMed is a service of the National Library of Medicine. It includes over 15 million citations for biomedical articles back to the 1950's. 

The Entrez Nucleotides database is a collection of genomic sequences from several sources, including GenBank, RefSeq, and PDB. 

Schema declarations that can safely be removed without making the analysis unsound. 
[Clio] The Clio project: managing heterogeneity. ACM SIGMOD Record, volume 30, pages 78-83, 2001. citeseer.ist.psu.edu/miller01clio.html
[ClioMapping] L. Popa, M. Hernandez, Y. Velegrakis, R. Miller, F. Naumann, and H. Ho. Mapping XML and relational schemas with Clio, 2002.
[Compilers] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.
[DataFlow] C. Chambers, J. Dean, and D. Grove. Frameworks for intra- and interprocedural dataflow analysis. Technical report, 1996.
[DataFlowFrameworks] C. Chambers, J. Dean, and D. Grove. Frameworks for intra- and interprocedural dataflow analysis. Technical report, 1996.
[DBLP] Digital Bibliography & Library Project: http://dblp.uni-trier.de/
[Nucleotide] NCBI Entrez Nucleotides database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide&itool=toolbar
[ProgramAnalysis] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag, 1999.
[PubMed] NCBI Entrez PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
[XML] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible Markup Language (XML) 1.0 (third edition). W3C Recommendation, World Wide Web Consortium, February 2004.
[XMLAccessControl] Makoto Murata, Akihiko Tozawa, Michiharu Kudo, and Satoshi Hada. XML access control using static analysis. In CCS '03: Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 73-84, New York, NY, USA, 2003. ACM Press.
[XMLProjection] Amelie Marian and Jerome Simeon. Projecting XML documents. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.
[XQuery1] Scott Boag, Don Chamberlin, Mary F. Fernandez, Daniela Florescu, Jonathan Robie, and Jerome Simeon. XQuery 1.0: An XML Query Language. W3C Working Draft, World Wide Web Consortium, July 2004. http://www.w3.org/TR/2004/WD-xquery-20040723
[XQueryFS] XQuery 1.0 Formal Semantics. W3C Working Draft, March 2002.
[XQueryWS] Nicola Onose and Jerome Simeon. XQuery at your Web Service. In WWW '04: Proceedings of the 13th International Conference on World Wide Web, pages 603-611, New York, NY, USA, 2004. ACM Press.
[XSLT2] Michael Kay. XSL Transformations (XSLT) Version 2.0. W3C Working Draft, World Wide Web Consortium, November 2003. http://www.w3.org/TR/2003/WD-xslt20-20031112
[XSLT2XQuery] Achille Fokoue, Kristoffer Rose, Jerome Simeon, and Lionel Villard. Compiling XSLT into XQuery. To appear in Proceedings of the World Wide Web Conference 2005. Currently available at www.research.ibm.com/XML/CompilingXSLT2toXQueryFokoueWWW2005.pdf