Extracting Input/Output dependencies from XSLT 2.0 and XQuery 1.0

Achille Fokoue
achille@us.ibm.com

Abstract

The success of XML has come with the explosion of the number of vocabularies defined to represent similar information. In many domains, ranging from life sciences to e-commerce, many mappings have been defined between these vocabularies as a strategy for information integration. However, as the vocabularies evolve, maintaining these complex mappings and all artifacts that rely on them remains one of the major issues in information integration. A critical aspect of schema (or vocabulary) evolution is the ability to perform an impact analysis across all artifacts that directly or indirectly depend on the changing vocabularies. When an evolving vocabulary is not directly used, but rather mapped to another one through a complex mapping expressed in languages such as XSLT or XQuery, an impact analysis can be performed at a fine level of granularity only if the dependencies between the source and the target vocabulary are also known with the same level of granularity.

In this paper, we present an architecture for the static analysis of both XSLT and XQuery and a conservative and sound constraint-based analysis of these two languages that effectively uncovers the input/output dependency information necessary for a fine-grained impact analysis. For each element or attribute declaration d in the output schema, the analysis determines all the attribute or element declarations in the input schemas whose instances may influence the evaluation of an instance of d. We show results of an early experimental evaluation illustrating the accuracy and scalability of the approach.

Keywords: XSLT; XQuery; Mapping

Achille Fokoue

Achille Fokoue has worked on XML related technologies and business rules at IBM's T.J. Watson Research Center and at IBM's Zurich Research Lab (Switzerland) since 1999.

At the Watson lab, he was the primary architect and developer of the IBM XML Schema Quality Checker. His work on XML technologies involved static analyses of XPath, XQuery and XSLT for both program understanding and optimization purposes. In his work on business rules, Fokoue has contributed to the design and implementation of the Fusion system, a set of tools and a runtime designed to bridge the gap between business users and application developers.

In the Zurich lab's pervasive computing department, he was involved in the early development of the peer-to-peer replication infrastructure of the Fluid Project.

Extracting Input/Output dependencies from XSLT 2.0 and XQuery 1.0

Achille Fokoue [IBM T. J. Watson Research Center]

Extreme Markup Languages 2005® (Montréal, Québec)

Copyright © 2005 Achille Fokoue. Reproduced with permission.

Introduction

The successful adoption of XML [XML] as a universal standard exchange format has come with the explosion of the number of vocabularies defined to represent similar information. The naive assumption that, in each domain, a single vocabulary would emerge as the single standard did not hold. In many application domains, ranging from life sciences to e-Commerce, building and maintaining mappings between various vocabularies remain key issues in information integration across a variety of sources.

Declarative transformation languages such as XSLT [XSLT2] and XQuery [XQuery1] were designed to address the anticipated need for integration and interoperability between various related vocabularies. They are Turing-complete languages allowing the specification of how instances of source (or input) schemas are transformed into corresponding instances of target (or output) schemas. However, in scientific domains, such as life sciences, where large and complex data structures, reflecting the complexity of scientific models, need to be mapped between different representations, a growing trend has been to assist users with tools (e.g. Clio [Clio] [ClioMapping]) where mappings can be built semi-automatically and as simply as drawing lines between related elements. These tools typically build intermediary models in which elementary mappings are conceptually expressed by providing the following information:

  • elements of the source schemas corresponding to the input of the mapping
  • the element of the target schemas to which input elements are mapped
  • an expression specifying how source elements are transformed into the target element

Besides the obvious usability advantage of the Clio approach over directly specifying the mapping in XSLT or XQuery, the intermediary representation used by Clio greatly simplifies the mapping maintenance problem even when input or output schemas evolve. It directly provides at the finest level of granularity (at the element and attribute declaration level) input and output dependency information needed for impact analysis, a major aspect of the maintenance of various artifacts in case of schema evolution.

However, in the Clio approach, the expressiveness of the intermediary language is limited to the kinds of mappings that can be expressed through the tool. Turing-complete languages like XSLT are still required to express more complex mappings, but input/output dependency information at a fine granularity level needed for impact analysis is not directly available.

In this paper, we present a conservative and sound constraint-based static analysis of both XSLT and XQuery that effectively uncovers the input/output dependency information necessary for a fine-grained impact analysis. For each element or attribute declaration d in the output schema, the analysis determines all the attribute or element declarations in the input schemas whose instances may influence the evaluation of an instance of d.

Related Work

In the XML and database communities, various static analyses of XQuery have been proposed, mainly for optimization purposes [XMLProjection], but also in the context of XML security [XMLAccessControl]. A. Marian and J. Simeon [XMLProjection] have proposed a static analysis that infers, for a given XQuery, all the paths in the input document that may influence its evaluation. It could straightforwardly be extended to extract all the input/output dependency information from an XQuery. Unfortunately, most XQuery static analysis and optimization efforts have not focused on queries involving recursive user-defined functions. Indeed, to the best of our knowledge, previous work has simply ignored recursive functions or has provided gross conservative approximations, because it was essentially dealing with "simple" expressions that merely extract some information from a document or a repository of documents. In contrast to these simple expressions, most XSLT stylesheets are intrinsically recursive (they have at least the recursive default template for element and document nodes) and, should XQuery be used as a full-fledged programming language for web services [XQueryWS], many recursive XQueries would be written.

In the programming language community, many techniques [DataFlow] [ProgramAnalysis] [Compilers] [DataFlowFrameworks] have been applied for the static analysis of recursive declarative and imperative languages. Context sensitive approaches that distinguish between different call sites of the same function have always provided better results (in terms of accuracy of the analysis) over context-insensitive approaches. But, the success of context sensitive approaches comes with a high price tag on performance which, in general, is exponential in the size of the context. This performance issue has been a major limitation of the scalability of context sensitive approaches to real world programs.

Contributions

The main contributions of this paper are threefold:

  1. We apply a context sensitive constraint-based analysis of both XSLT and XQuery that effectively handles recursive functions. In addition to its application to impact analysis in the context of schema evolution, the analysis presented in this paper can be used to extend the XQuery optimization approach presented by A. Marian and J. Simeon in [XMLProjection] to handle recursion.
  2. We show how the result of the analysis can be used to extract input/output dependency information from any XSLT or XQuery.
  3. We show how, by taking into account some peculiarities of XSLT (e.g. connections between apply-templates and templates), our context sensitive approach can scale to real XSLTs. Results of early experimental evaluation illustrating the accuracy and scalability of the approach are presented.

Outline

The remainder of this paper is organized as follows. In section “A simple example”, we introduce a simple example used for illustrative purposes throughout the paper. Section “Architecture” presents the architecture of the input/output dependency analyzer. The constraint-based static analysis of Core XQuery is formally described in section “Constraint-based static analysis of Core XQuery”. In section “XSLT specific optimization techniques”, we show how the analysis can be optimized for XSLT stylesheets. Finally, section “Experimental evaluation” presents the results of an early experimental evaluation of our approach.

A simple example

Before describing the architecture of our solution to the input/output dependency extraction problem, we present here a simple example to illustrate our approach and give a first sense of the challenges we have to address.

The XSLT stylesheet presented in Figure 3 transforms documents conforming to the pseudo-schema shown in Figure 1 into documents valid against the pseudo-schema shown in Figure 2. The transformation consists of (1) renaming the 'bookStore' element to 'bib', (2) producing, for each 'author' element in the input, a corresponding 'author' element in the output whose content is the concatenation of the 'fname' and 'lname' elements separated by a space, and (3) omitting the 'price' and 'rating' elements of the input from the output.

Figure 1: Input schema
element bookStore
{
	element company ? { String}
	element book + 
	{
		element title { String }
		element author +
		{
			element title? { String }
			element fname { String }
			element lname { String }
		}
		element pub { String }
		element price { Decimal }
		element rating { Integer }
	}
}
Figure 2: Output schema
element bib
{
	element company ? { String}
	element book + 
	{
		element title? { String }
		element author + { String }
		element pub { String }
	}
}
Figure 3: XSLT Stylesheet
<xsl:stylesheet
		xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		version="2.0">

		<xsl:template match="/">
				<xsl:apply-templates select="*"/>
 		</xsl:template>	
	
 		<xsl:template match="bookStore">
  			<xsl:element name="bib">
    				<xsl:apply-templates select="*"/>
  			</xsl:element>
		</xsl:template>

		<xsl:template match="author">
				<xsl:element name="author">
						<xsl:value-of select="fname"/>
						<xsl:text> </xsl:text>
						<xsl:value-of select="lname"/>
				</xsl:element>
		</xsl:template>
		
		<!-- 'price' and 'rating' information -->
		<!-- is not reported in the output 			  -->
		<xsl:template match="price"/>
		
		<xsl:template match="rating"/>
		
		<!-- "Catchall" template                             -->
		<!-- This template causes any element, not otherwise -->
		<!--  processed to be copied to the output           -->
		<xsl:template match="*">
  			<xsl:element name="{name()}">
    				<xsl:apply-templates select="@*"/>
    				<xsl:apply-templates select="*|text()"/>
  			</xsl:element>
		</xsl:template>
</xsl:stylesheet>

Figure 4 shows the input/output dependencies uncovered by our dependency extractor. The structure of the result element consists of (1) a 'schemas' element which identifies all the input schemas ('source' sub-elements) and output schemas ('target' sub-elements), and (2) the list of dependencies that were discovered. Each dependency is described by a 'dependency' element, which specifies the list of input components ('source' elements) whose instances may influence the value of instances of a given output schema component ('target' element). Note that absolute simplified XPath expressions (only the child and attribute axes, and only QName or wildcard node tests) are used to uniquely identify each schema component (element or attribute declaration). For example, '/bookStore/book/title' represents the 'title' element declaration that constrains the first child element of any 'book', whereas '/bookStore/book/author/title' refers to the 'title' element declaration against which the first child element of any 'author' is valid.

Figure 4: Input/Output dependencies
<result>
   <schemas>
      <source rootName="bookStore" 
       schemaLocation="input.xsd"/>
      <target rootName="bib" 
       schemaLocation="output.xsd"/>
   </schemas>
   <dependencies>
      <dependency>
         <source value="/bookStore/book/title"/>
         <target value="/bib/book/title"/>
      </dependency>
      <dependency>
         <source value="/bookStore/book/author/lname"/>
         <source value="/bookStore/book/author/fname"/>
         <target value="/bib/book/author"/>
      </dependency>
      <dependency>
         <source value="/bookStore/book/pub"/>
         <target value="/bib/book/pub"/>
      </dependency>
      <dependency>
         <source value="/bookStore/book"/>
         <target value="/bib/book"/>
      </dependency>
      <dependency>
         <source value="/bookStore/company"/>
         <target value="/bib/company"/>
      </dependency>
      <dependency>
         <source value="/bookStore/book"/>
         <target value="/bib"/>
      </dependency>
   </dependencies>
</result>

As mentioned in the introduction, the analysis described in this paper is sound: for a given element or attribute declaration d in the output schema, the analysis cannot miss an element or attribute declaration on which d depends. For example, an analysis that would return the set {element 'fname'} as the set of input components that may affect the evaluation of element 'author' in the output schema would not be sound, because element 'lname', whose instances obviously influence the value of instances of element 'author' in the output schema, is missing from the set. The soundness of our analysis is formally established in section “Correctness”.

The analysis is also conservative: the set of dependent input components is not guaranteed to be minimal. In other words, the computed set of dependent input components may contain some elements which could safely be removed, i.e. removed without making the analysis unsound; such elements are called false positives. For example, an analysis that simply states that each output schema component depends on all input schema components is obviously sound and inexpensive to compute, but too inaccurate or conservative. The more false positives are present in the result, the less accurate the analysis is.

A major challenge to any sound and conservative static analysis consists of finding the right balance between accuracy and performance. Throughout the remainder of the paper we focus on approaches that allow accurate results while minimizing the performance impact.

Architecture

We now present the main components of the architecture (see Figure 5) for an input/output dependency extractor. The dependency extractor takes as input an XSLT or an XQuery, one or more input schemas, and zero or more output schemas [1]. It outputs an XML document which describes all the discovered dependencies. Figure 4 shows an example of its output.

Figure 5: Architecture of the input/output dependency extractor

Transformation to Core XQuery

One key challenge posed to the design and implementation of an input/output dependency extractor for both XSLT and XQuery is the combined syntactic and semantic complexity of the two languages. In both languages, there are several ways to express the same transformation and a great deal of syntactic sugar, which makes the formal analysis of these languages rather cumbersome. As shown in Figure 5, this issue is addressed by first transforming an XSLT stylesheet or an XQuery query into a Core XQuery expression. Core XQuery has the advantage of being a well-defined subset of XQuery, without the syntactic sugar of XQuery and XSLT or the implicit rules of XSLT (e.g. built-in templates, template instantiation rules, conflict resolution rules, etc.). Furthermore, its formal semantics is clearly defined in the XQuery Formal Semantics Specification [XQueryFS], and the compilation to Core XQuery provides a common framework for analyzing both XSLT and XQuery.

In Figure 5, the normalizer component implements the transformation from XQuery to Core XQuery formally defined in the XQuery Formal Semantics Specification [XQueryFS], whereas the XSLT2XQuery component applies the approach introduced by L. Villard et al. [XSLT2XQuery] to compile XSLT to XQuery. As explained in section “XSLT specific optimization techniques”, in order to significantly reduce the cost of performing an accurate context-sensitive analysis of XSLT, we need to keep in the generated XQuery some information that was directly available in the original XSLT. Special annotations in the generated XQuery are used for that purpose. The XQuery generated from our XSLT example is shown in Figure 6.

Figure 6: XQuery generated from XSLT
(: due to space constraints, namespace and variable declarations are not shown :)
declare function t2q:template0($t2q:dot as node(),$t2q:pos as xs:integer,
$t2q:last as xs:integer,$t2q:mode as xs:string) as item()*
(: XSLT Annotation: [match = / ] :)
{ (let $t2q:sequence := (($t2q:dot/child::*))
  return let $t2q:last := count($t2q:sequence)
   return for $t2q:dot at $t2q:pos in $t2q:sequence
    return t2q:applyTemplates($t2q:dot,$t2q:pos,$t2q:last,'#default'))
} ;
declare function t2q:template2($t2q:dot as node(),$t2q:pos as xs:integer,
$t2q:last as xs:integer,$t2q:mode as xs:string) as item()*
(: XSLT Annotation: [match = author ] :)
{(element author {text {string-join(
     for $t2q:d in subsequence(data(((($t2q:dot/child::fname)))),1.0,1.0)
     return ($t2q:d cast as xs:string),' ')},text {' '},text {string-join(
     for $t2q:d in subsequence(data(((($t2q:dot/child::lname)))),1.0,1.0)
     return ($t2q:d cast as xs:string),' ')}})
} ;
(: due to page count constraints, XQuery functions for the bookStore, :)
(: price, rating templates are not shown :)
declare function t2q:template5($t2q:dot as node(),$t2q:pos as xs:integer,
$t2q:last as xs:integer,$t2q:mode as xs:string) as item()*
(: XSLT Annotation: [match = * ] :){
 (element {string-join(
    for $t2q:d in subsequence(data((name(($t2q:dot treat as node())))),1.0,1.0)
    return ($t2q:d cast as xs:string),' ')} {
   let $t2q:sequence := (($t2q:dot/attribute::*))
   return  let $t2q:last := count($t2q:sequence)
    return for $t2q:dot at $t2q:pos in $t2q:sequence
     return t2q:applyTemplates($t2q:dot,$t2q:pos,$t2q:last,'#default'),
   let $t2q:sequence := ((($t2q:dot/child::*)|($t2q:dot/child::text())))
   return let $t2q:last := count($t2q:sequence)
    return for $t2q:dot at $t2q:pos in $t2q:sequence
     return t2q:applyTemplates($t2q:dot,$t2q:pos,$t2q:last,'#default')})
} ;
declare function t2q:applyTemplates($t2q:dot as node(),$t2q:pos as xs:integer
,$t2q:last as xs:integer,$t2q:mode as xs:string) as item()*
{ (let $t2q:dot5 := (($t2q:dot/self::*))
  return let $t2q:dot4 := (($t2q:dot/self::bookStore))
   return let $t2q:dot3 := (($t2q:dot/self::author))
    return let $t2q:dot2 := (($t2q:dot/self::price))
     return let $t2q:dot1 := (($t2q:dot/self::rating))
      return let $t2q:dot0 := (($t2q:dot/self::document-node()))
       return (
        if ((($t2q:mode='#default') and exists($t2q:dot0)))
        then t2q:template0($t2q:dot0,$t2q:pos,$t2q:last,$t2q:mode)
        else (if ((($t2q:mode='#default') and exists($t2q:dot1)))
         then t2q:template4($t2q:dot1,$t2q:pos,$t2q:last,$t2q:mode)
         else ( if ((($t2q:mode='#default') and exists($t2q:dot2)))
          then t2q:template3($t2q:dot2,$t2q:pos,$t2q:last,$t2q:mode)
          else ( if ((($t2q:mode='#default') and exists($t2q:dot3)))
           then t2q:template2($t2q:dot3,$t2q:pos,$t2q:last,$t2q:mode)
           else ( if ((($t2q:mode='#default') and exists($t2q:dot4)))
            then t2q:template1($t2q:dot4,$t2q:pos,$t2q:last,$t2q:mode)
            else ( if ((($t2q:mode='#default') and exists($t2q:dot5)))
             then t2q:template5($t2q:dot5,$t2q:pos,$t2q:last,$t2q:mode)
             else t2q:builtInApplyTemplates($t2q:dot,$t2q:pos,$t2q:last,$t2q:mode))))))))
} ;
declare function t2q:builtInApplyTemplates($t2q:dot as node(),$t2q:pos as xs:integer,
$t2q:last as xs:integer,$t2q:mode as xs:string) as item()*
{ (( if (exists(((($t2q:dot/self::text()))|(($t2q:dot/(.)[(. instance of attribute())])))))
  then string-join(
    for $t2q:d in subsequence(data($t2q:dot),1.0,1.0)
    return ($t2q:d cast as xs:string),' ')
  else ( if (exists(((($t2q:dot/self::processing-instruction()))|(($t2q:dot/self::comment())))))
   then ()
   else let $t2q:sequence := (($t2q:dot/child::node()))
    return let $t2q:last := count($t2q:sequence)
     return for $t2q:dot at $t2q:pos in $t2q:sequence
      return t2q:applyTemplates($t2q:dot,$t2q:pos,$t2q:last,$t2q:mode))))
} ;
declare function t2q:identity($t2q:node as node()) as node()
{$t2q:node} ;
declare function t2q:convert-operand($t2q:actual as item(),$t2q:expected as item()) as item()
{ (: body not shown due to space constraints :) } ;
declare function t2q:isNumeric($t2q:value as item()) as xs:boolean
{ (: body not shown due to space constraints :) } ;
t2q:applyTemplates($t2q:initial-context-node,1,1,'#default')

Goals of static analysis of Core XQuery

The first goal of our static analysis is to associate with each Core XQuery sub-expression a result abstract value that represents, as precisely as we can afford [2], the result of any concrete evaluation of the sub-expression. Informally, for our analysis, an abstract value is a set consisting of element and attribute declarations (e.g. element 'bib'), simple types (e.g. xs:string) and constant atomic values (e.g. 5). For example, consider the following expression

(element book 
{
	if (exists($t2q:dot/pub)) then
		element pub { 'A publisher' }
	else
		('5', '6')
}, '2', '3')

A sound static analysis could associate the result abstract value "{element 'book' from the output schema, xs:string('2'), xs:string('3')}+" with the expression. This means that the evaluation of the expression is guaranteed to yield a sequence of one or more items such that each item is either valid against the element declaration 'book' or is the string '2' or '3'. Thus, the result abstract value tells us that the expression can build an element (element 'book') of the target schema.

For dependency extraction purposes, we also require that the analysis computes, for each Core XQuery sub-expression e, a used abstract value that represents all the attributes or elements in the input that may influence the value of output nodes. In general, for a given expression, this used abstract value is computed by (1) aggregating the result and used abstract values of its sub-expressions, and (2) removing all element and attribute declarations that come from the output schemas, as well as simple types and constant atomic values [3]. In the previous simple expression, the only inner sub-expression whose used abstract value contains schema components from the input schema is 'exists($t2q:dot/pub)'. Intuitively, the used abstract value of the sub-expression 'exists($t2q:dot/pub)' depends on the result and used abstract values of the argument of the function call. In order to find the result and used abstract values of $t2q:dot/pub, we need to keep track of all result abstract values representing all sequence items that can be bound to the variable $t2q:dot as well as the used abstract value associated with them. Such a variable can be bound to values passed as parameters of function calls; therefore the analysis must take into account the complex and recursive nature of the function call graph. Assuming that the result abstract value of the variable $t2q:dot is {element 'book' from the input schema}* and its used abstract value is {}0, the used abstract value of the expression is {element 'pub' from the input schema}. Now, we know that the content of element 'book' in the output schema may depend on element 'pub' in the input schema.
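
To make steps (1) and (2) concrete, here is a minimal Python sketch of the aggregation and filtering just described. All names are hypothetical (the paper does not prescribe an implementation), and abstract values are simplified to plain sets of schema components, ignoring occurrence indicators:

# Hypothetical sketch: the used abstract value of an expression aggregates
# the result and used values of its sub-expressions, then keeps only
# components that come from the input schemas.
def used_value(sub_results, sub_used, input_components):
    aggregated = set().union(*sub_results, *sub_used)
    # Output-schema declarations, simple types and constants are dropped.
    return {c for c in aggregated if c in input_components}

# Example: for exists($t2q:dot/pub), the step sub-expression contributes
# {element 'pub' (input)}, which survives the filtering step.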

Dependency Extractor: exploiting the result of static analysis

For the simple example presented in section “Goals of static analysis of Core XQuery”, we have shown how an analysis that associates result and used abstract values to each sub-expression and keeps track of these abstract values for each variable allows us to find input/output dependencies. In our architecture, the Dependency Extractor component is responsible for reading the result and used abstract values, provided as annotations, to build the dependency result file.

Constraint-based static analysis

An analysis satisfying the requirements of section “Goals of static analysis of Core XQuery” was specified and implemented using a constraint-based approach. The first step of this approach consists of generating local constraints that must be satisfied by the result and used abstract values of each sub-expression and variable. Then, in the second step, the generated recursive system of constraints is solved, and the solution yields the result and used abstract values for each sub-expression. In our architecture, the first step is performed by the constraint builder component, while the second is the responsibility of the constraint solver component. A major advantage of this approach is the separation of the specification of the analysis (constraint generation) from its implementation (constraint solving). On the one hand, the specification of the analysis can be done without worrying too much about how the constraint system will be solved. Since constraints are local, they are easily understandable and structural proofs can easily be established. New constraints can easily be added to make the analysis more precise. On the other hand, standard optimized constraint solvers can be reused.
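
As an illustration of this separation, a generated constraint can be modeled as a monotone function of the current assignment together with the analysis component it constrains. The following Python sketch is hypothetical and only meant to fix ideas for the solving step discussed in section “Finding the best solution satisfying the acceptability relation”:

from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass(frozen=True)
class Constraint:
    # lhs: a monotone function of the current assignment, where the
    # assignment flattens (D, AE) into one dictionary keyed by pairs
    # such as ('Dr', label) or ('AEs', variable).
    lhs: Callable[[Dict[Tuple[str, Any], Any]], Any]
    # target: the component being constrained, e.g. ('Dr', 2).
    target: Tuple[str, Any]

# Hypothetical example: "Dr(2) must include Dr(1)", as generated for a
# variable reference labeled 2 whose binding expression is labeled 1.
c = Constraint(lhs=lambda a: a.get(('Dr', 1), frozenset()), target=('Dr', 2))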

The formal specification of the constraints generated by the constraint builder component is presented in section “Acceptability relation and constraints”.

Constraint-based static analysis of Core XQuery

We now formally specify the static analyses of Core XQuery needed for input/output dependency extraction.

Subset of Core XQuery considered

Although the approach described in this paper has been implemented for the whole Core XQuery language, in the remainder of the paper, for simplicity of presentation, we only consider the subset of Core XQuery described in Figure 7.

Figure 7: Core XQuery Subset
Expr ::=  Literal
	| ()
	| Expr, Expr
	| Var
	| for Var in Expr return Expr
	| let Var := Expr return Expr
	| Axis NodeTest
	| if ( Expr ) then Expr else Expr
	| Expr ( '=' | '!=' | '>' | '<' | 'or' | 'and' ) Expr
	| element  { Expr }  { Expr } 
	| QName (Expr?)
	| declare QName(x?) { Expr }

Var ::=  $QName
		
		

This subset of Core XQuery is sufficient to illustrate the main challenges of the static analysis. It consists of literal values (e.g., strings, decimals), the empty sequence ("()"), sequence construction, variables, for and let expressions, XPath steps, conditionals, comparisons, logical operators, element constructors, function calls and function declarations. Also, for simplicity of the presentation, we do not consider namespaces.
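
For readers who want to follow the analysis with code in hand, the subset of Figure 7 can be represented as a small labeled abstract syntax tree. The Python sketch below is hypothetical (the paper's implementation is not shown); only a few representative productions are spelled out, the rest being analogous:

from dataclasses import dataclass, field
import itertools

_labels = itertools.count()

@dataclass
class Expr:
    # Each node carries a unique integer label; the analysis results
    # (result and used abstract values) are indexed by these labels.
    label: int = field(default_factory=lambda: next(_labels), init=False)

@dataclass
class Literal(Expr):
    value: object

@dataclass
class VarRef(Expr):
    name: str

@dataclass
class ForExpr(Expr):          # for $v in seq return body
    var: str
    seq: Expr
    body: Expr

@dataclass
class IfExpr(Expr):           # if (test) then then_ else else_
    test: Expr
    then_: Expr
    else_: Expr

@dataclass
class FunCall(Expr):          # QName(arg?)
    name: str
    arg: Expr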

Definition of input nodes influencing the evaluation of an expression

In section “Goals of static analysis of Core XQuery”, we have already pointed out the need to statically obtain an abstract representation of all input nodes that may affect the evaluation of a given expression. In this section, we formally define the meaning of "attribute and element nodes that influence the evaluation of an expression".

First, when the evaluation of an expression starts, the set of input documents that will be accessed is not always known, since documents can be opened dynamically using the doc function. Since we are mainly interested in XQueries and XSLTs that perform transformations between known schemas, in the remainder of this paper we assume that all the schemas of all input documents are statically known.

Before the evaluation of an expression e, the set of known input nodes is defined as the set of nodes bound, in the initial environment, to free variables of e, together with all the nodes in the same XML tree as any of those nodes. We introduce the following judgment to track input element and attribute nodes that are accessed (or used) during the evaluation:

		DynEnv |- expr =>  v using US 
	

This judgment means that, in the dynamic environment DynEnv, the expression expr evaluates to the value v and, during the evaluation of expr, the subset US of input attribute and element nodes was accessed (or used).

Figure 8 illustrates how the evaluation inference rules defined in the XQuery Formal Semantics Specification [XQueryFS] are modified to keep track of all input element and attribute nodes accessed during the evaluation of an expression.

Figure 8: Evaluation inference rules
		
-------------------------------------------------------------   (EvalLiteral)
DynEnv |- Literal  => Literal  using {}	

		
-------------------------------------------------------------  (EvalEmptySeq)
DynEnv |- () =>  ()  using {}	


DynEnv |- expr1 => Value1 using US1
DynEnv |- expr2 => Value2 using US2
-------------------------------------------------------------- (EvalComma)
DynEnv |- expr1, expr2 => (Value1, Value2) using US1 U US2


DynEnv.varValue(var) = value 
--------------------------------------------------------------  (EvalVar)
DynEnv |- var   =>  value using {}


DynEnv |- expr1 => ()  using US1
-------------------------------------------------------------- (EvalFor1)
DynEnv |- for var in expr1 return expr2  => () using US1

		
DynEnv |-  expr1 =>  Item1 ,..., Itemn using US'
DynEnv + varValue( var => Item1) |- expr2 =>  Value1 using US1
			...
DynEnv + varValue(var => Itemn) |- expr2 => Valuen using USn
----------------------------------------------------------------  (EvalFor2)
DynEnv |- for var in expr1 return expr2 => Value1 ,..., Valuen
		       using  US1 U ... U USn U US'


DynEnv |- expr1  =>  Value1 using US'
		
DynEnv + varValue( var => Value1) |-  expr2 => Value2 using US'' 
------------------------------------------------------------------ (EvalLet)
DynEnv |-  let var := expr1 return expr2 => Value2 using US'' U US'
		

DynEnv.varValue($fs:dot) = Value1 
Value1 matches node 
DynEnv |- axis Axis of Value1 => Value2 
Axis principal PrincipalNodeKind 
DynEnv |- test NodeTest with PrincipalNodeKind of Value2 => Value3 
Value3 does not contain text  nodes
-------------------------------------------------------------- (AxisNodeTest1)
DynEnv |- Axis NodeTest => fs:distinct-doc-order(Value3) 
			using toSet(Value3) intersect InputElt&AttrNodes
(where the toSet function maps a sequence to a set containing exactly the same
elements)

DynEnv.varValue($fs:dot) = Value1 
Value1 matches node 
DynEnv |- axis Axis of Value1 => Value2 
Axis principal PrincipalNodeKind 
DynEnv |- test NodeTest with PrincipalNodeKind of Value2 => Value3 
Value3 does contain text  nodes
---------------------------------------------------------------- (AxisNodeTest2)
DynEnv |- Axis NodeTest => fs:distinct-doc-order(Value3) 
	using { x | (x is input node) and (x is element)
	and (x parent of y) and (y in Value3) and (y is text node)} 
	U toSet(Value3) intersect InputElt&AttrNodes

		
		
DynEnv |- expr1 => true   using US'  
DynEnv |- expr2 => Value2 using US''
-----------------------------------------------------------------  (EvalCond1)
DynEnv |- if Expr1 then Expr2 else Expr3 => Value2 
		using US' U US'' 


DynEnv |- expr1 => false  using US'     
DynEnv |- expr3 => Value3 using US'' 
------------------------------------------------------------------  (EvalCond2)
DynEnv |- if expr1 then expr2 else expr3 => Value3
		using US' U US''


DynEnv |- expr1 => v1 using US'
DynEnv |- expr2 => v2 using US''
DynEnv |- v1 op v2 => v3 using {}
------------------------------------------------------------------- (EvalOp)
DynEnv |- expr1 op expr2 => v3  using US' U US''
	where op is in { =, !=, <, >, or, and}

		
DynEnv |- expr1 => Value1  using US'
Value1 matches xs:QName
DynEnv |- fs:item-sequence-to-node-sequence (expr2)  => Value2 using US''
Value2 matches (attribute*, (element | text | processing-instruction | comment)*) 
--------------------------------------------------------------------- (EvalEltConst)
DynEnv |- element { expr1 } { expr2 } =>  element Value1 {Value2}
		using US' U US'' U (toSet(Value2) intersect InputElt&AttrNodes)

		
declare func( x ) { expr2 }
DynEnv |- expr => value1 using US1
DynEnv + varValue( x => value1) |- expr2 => value2 using US2
------------------------------------------------------------------ (EvalUserDefFunc)
DynEnv |- func(expr) => value2 using US2 U US1		

	

One important point is worth mentioning about the refined evaluation rules: for a step expression evaluating to a sequence containing text nodes, we consider that the parents of the returned text nodes that are in the input documents influence the evaluation. This decision, which may seem arbitrary, is motivated by the fact that we are tracking dependencies on input attribute and element nodes, not on text nodes. Furthermore, we cannot simply ignore these text nodes because, as illustrated by the example "element name { /a//text() }", we would then fail to account for some dependencies on input nodes.

Nodes accessed (or used) during the evaluation of an expression e do not necessarily correspond to nodes on which the result of e depends, because unnecessary evaluations of some sub-expressions may have been performed. For example, the result of the XQuery expression "if (xs:boolean(true) or (/bookStore/book[1]/price < 100)) then 1 else 0" does not depend on any input nodes although its evaluation can access the price of the first book (a similar example is "let $x := //book return 1"). In this example, the evaluation of a sub-expression does not affect the value of the result regardless of the schema to which the input document conforms. If the input document is known to be valid against a given schema, this can entail restrictions that make the evaluation of some sub-expressions irrelevant. For example, if the input document is known to be valid against the schema shown in Figure 1, the result of "if (/bookStore/book[1]/price != 'a string') then 1 else 0" is always 1 because "/bookStore/book[1]/price" must be of type xs:decimal and therefore cannot be equal to 'a string'.

In the previous examples, we were able to discard some accessed nodes (i.e. not consider them as nodes that influence the result) because we could find an equivalent expression for which these nodes were not accessed during the evaluation. For these expressions, the equivalent expression was the constant "1".

In the remainder of this section, we formally define the equivalence relation between expressions and the notion of influenced nodes.

Definition. An environment constraint envConst is a complete function from the set of QNames to the set of XQuery Types [4].

Definition. We define the partial order relation <= on the set of environment constraints as follows: envConst1 <= envConst2 iff, for any QName q, envConst1(q) <:= envConst2(q), where <:= denotes the partial order relation on XQuery Types defined in the XQuery Formal Semantics Specification [XQueryFS] .

Definition. A dynamic environment dynEnv is consistent with an environment constraint envConst iff, for any QName q in dynEnv, dynEnv.varValue(q) is valid against envConst(q). Stating that the input schemas are known really means that there is an environment constraint envConst such that all evaluations are done in dynamic environments consistent with envConst.

Definition. Two expressions exp1 and exp2 are equivalent under a given environment constraint envConst, denoted "envConst |- exp1 <=> exp2", iff for any dynamic environment dynEnv consistent with envConst and for any sequence value v, (there is a set X such that dynEnv |- exp1 => v using X) iff (there is a set Y such that dynEnv |- exp2 => v using Y). In other words, two expressions are equivalent under a given environment constraint envConst iff, under the same dynamic environment consistent with envConst, their evaluations either both raise an error or both yield the same value (note that, in general, this value will be different for different dynamic environments). The set equiv[envConst](exp) denotes the set of all expressions equivalent to an expression exp given the environment constraint envConst.

Definition. Let exp be a Core XQuery expression, let envConst be an environment constraint and dynEnv a dynamic environment consistent with envConst such that dynEnv |- exp => v using US. The set infl[envConst](dynEnv, exp) of input attribute and element nodes that influence the evaluation of exp under the dynamic environment dynEnv consistent with envConst is defined as follows:

infl[envConst](dynEnv, exp) = { n | for all exp' in equiv[envConst](exp), 
			(dynEnv |- exp' => v using US') implies n is in US'} 
	
In other words, the influencing input element and attribute nodes are the ones which must be accessed during the evaluation of any expression exp' equivalent to exp.

Abstract Domain

As previously mentioned, the result of our analysis is to assign to each sub-expression an abstract value that approximates it. Choosing the right abstract domain is crucial because it will determine in part the accuracy of the analysis, its performance and the usefulness of its result (i.e. the potential applications that can leverage the output of the analysis). Many factors were considered in the design of the abstract domain. They can be grouped into two categories:

  1. Analysis framework constraints. In order to fit nicely into the standard program analysis framework [ProgramAnalysis], the abstract domain has to be a complete lattice satisfying the ascending chain condition [5]. This constraint guarantees that iterative algorithms used to solve recursive constraint systems always terminate.
  2. Implementation constraints. The abstract domain was also shaped by some practical considerations aimed at an efficient implementation. For example, the ability to control, through some parameters, the trade-off between accuracy and performance of the analysis.

The abstract domain AbsDom[c] we chose satisfies the two kinds of constraints mentioned above. An element of AbsDom[c] consists of:

  1. An occurrence indicator: "0", "1", "?", "+" or "*".
  2. A base set consisting of:
    • constant atomic values (e.g. double values, decimal values, boolean values, string values, etc.). The number of atomic values in the set is less than or equal to c.
    • XML Schema built-in simple types (e.g. double type, decimal type, string type, etc.)
    • a finite number of element and attribute declarations [6].

An abstract value represents a set of item sequences. The occurrence indicator of an abstract value av bounds the number of items in the sequences represented by av. The base set of av restricts the items that may appear in a sequence represented by av to specific atomic values, to atomic values that are instances of specific atomic types, or to attribute and element nodes valid against given attribute and element declarations. For example, the abstract value { xs:decimal(5), xs:string, element bookStore }+ represents all the sequences containing one or more items, where each item is either the decimal value 5, a string, or an element valid against the element declaration bookStore.

Notation. For an abstract value av: occurrenceIndicator(av) denotes its occurrence indicator, base(av) its base set, baseConstant(av) the greatest subset of base(av) containing only constant atomic values, baseAtomicType(av) the greatest subset of base(av) consisting of only atomic simple types, and baseNode(av) the greatest subset of base(av) consisting of element and attribute declarations.

Given an expression e and its input and output schemas, the element and attribute declarations used in the abstract domain AbsDom[c] of the analysis of e are limited to those declared in the schemas plus element and attribute declarations for temporary constructed nodes. These temporary nodes are constructed during the evaluation of e but are not added to the result, and therefore there may be no declarations in the output schemas that constrain them. However, we need to represent them in the abstract domain since some expressions may use them (e.g. "let $x := element myElt { 'a string value' } return $x//text()"). The static type inference algorithm described in the XQuery Formal Semantics Specification [XQueryFS] provides the type of the temporary constructed nodes used in the abstract domain.

The partial order relation <:= defined in the XQuery Formal Semantics Specification [XQueryFS] can be straightforwardly extended to a partial order relation on AbsDom[c]. Let Top = S* be the abstract value whose base set S consists of all atomic simple types and all element and attribute declarations allowed in AbsDom[c] (in the previous paragraph we have shown that this is a finite set) and whose occurrence indicator is "*"; Top is the greatest element of AbsDom[c] with regard to the partial order. We include in AbsDom[c] the element "None" and extend the order relation so that "None" is the smallest element. AbsDom[c] with the extended order relation is a complete lattice, and, since it is finite, it satisfies the ascending chain condition.
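
The following Python sketch shows one possible encoding of this lattice. It is our own reconstruction, not the paper's implementation: abstract values pair an occurrence indicator with a finite base set, Python's None plays the role of the least element, and the occurrence-indicator join table is derived from the usual ordering (0 and 1 below ?, 1 below +, and ? and + below *); widening of constant sets beyond the parameter c is omitted for brevity:

from dataclasses import dataclass

@dataclass(frozen=True)
class AbsValue:
    occ: str          # one of '0', '1', '?', '+', '*'
    base: frozenset   # constants, simple types, element/attribute declarations

BOTTOM = None         # plays the role of the least element "None"

OCC_JOIN = {          # least upper bound of two occurrence indicators
    ('0', '0'): '0', ('0', '1'): '?', ('0', '?'): '?', ('0', '+'): '*',
    ('0', '*'): '*', ('1', '1'): '1', ('1', '?'): '?', ('1', '+'): '+',
    ('1', '*'): '*', ('?', '?'): '?', ('?', '+'): '*', ('?', '*'): '*',
    ('+', '+'): '+', ('+', '*'): '*', ('*', '*'): '*',
}

def join(a, b):
    # Least upper bound in the AbsDom[c] encoding above.
    if a is BOTTOM:
        return b
    if b is BOTTOM:
        return a
    occ = OCC_JOIN.get((a.occ, b.occ)) or OCC_JOIN[(b.occ, a.occ)]
    return AbsValue(occ, a.base | b.base)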

Context-insensitive analysis

Having formally defined the abstract domain of the static analysis, we now specify the constraint system whose resolution yields, for each sub-expression e, (1) an abstract value approximating the result of any evaluation and (2) an abstract value representing all the input nodes that may influence the evaluation of e.

For simplicity of the presentation, we assume that all declared variables (including function parameters) have distinct names. We also assume that each expression is uniquely identified by a label, an integer corresponding to its location. The result of the analysis can be expressed as the pair of functions (D, AE) such that

	D = (Dr, Ds)   with   Dr : Label -> AbsDom[c]   and   Ds : Label -> AbsDom[c]
	AE = (AEr, AEs)   with   AEr : Var -> AbsDom[c]   and   AEs : Var -> AbsDom[c]

where Var is the set of variables in the query and Label is the set of labels. Dr maps a label (i.e. an expression e) to an abstract value approximating the result of any evaluation of e. Ds maps a label (i.e. an expression e) to an abstract value approximating the input elements and attributes that may influence the evaluation of e. AE corresponds to the abstract environment: it keeps track of the abstractions bound to variables by the static analysis. AEr maps a variable (including variables corresponding to parameters of user-defined functions) to an abstract value approximating all its possible values at runtime. AEs maps a variable v (including variables corresponding to parameters of user-defined functions) to an abstract value approximating the input elements and attributes that may influence the evaluation of expressions whose values may be assigned to v.

Acceptability relation and constraints

We use the same notation as in [ProgramAnalysis] to indicate that a tuple (D,AE) is acceptable for an expression expr: "(D,AE) |= expr". The acceptability relation tells us whether a given pair (D, AE) is indeed an acceptable solution to the analysis problem.

For our first analysis, we do not take into account the context in which functions are called. Figure 10 shows the set of constraints that must hold for (D, AE) to be acceptable for different kinds of expressions for the context-insensitive analysis.

Figure 10: Context insensitive acceptability relation

Literal. (literal) states that a solution (D, AE) is acceptable for a literal expression Literal with label l iff the abstract value Dr(l) approximating its result contains the singleton consisting of the typed value of Literal (the tv function maps a literal lit to its corresponding typed value; for example, tv("a string literal") = xs:string("a string literal")). In other words, the sequence (tv(Literal)) must be in the set of concrete values represented by Dr(l). On the other hand, since a literal expression does not depend on any input nodes, the set of concrete values corresponding to Ds(l) must at least contain the empty sequence.

Empty Sequence. (EmptySeq) simply states the obvious: the result and used abstract values associated with the empty sequence expression must at least include (or be an approximation of) the abstract value for the empty sequence.

Comma operator. (Comma) states that the result (Dr) and used (Ds) abstract values of a sequence expression are obtained by composing the result and used values of its sub-expressions. We also require (D, AE) to be acceptable for the sub-expressions.

Variable. For a variable reference var, the result and used abstract values must at least be an approximation of the result and used abstract values of var stored in the abstract environment AE.

Let expression. First, (D, AE) has to be acceptable for the sub-expressions of the let expression under consideration. The abstract values associated with variable v in the abstract environment AE must at least be approximations of the abstract values associated with exp1. It is important to note that the result and used values of exp1 can influence the result and used abstractions of the let expression iff the variable v is referenced in exp2.

For expression. The treatment of for expressions is very similar to that of let expressions. The differences account for the fact that, at runtime, if exp1 evaluates to an empty sequence then the for expression yields an empty sequence; otherwise, the variable v is always bound to a sequence of exactly one item taken from those in the sequence obtained by evaluating exp1. As a consequence of these runtime differences, the static analysis of for expressions differs from that of let expressions in two respects:

  • The fourth constraint of (for) states that if the result abstract value associated with exp1 is an approximation of the empty sequence, then the result abstract value of the whole for expression must also be an approximation of the empty sequence.
  • The third constraint of (for) uses the map function (see the sketch after this list) to ensure that the result abstract value to which the variable v is bound has an occurrence indicator of "1", unless exp1 is never reached or is guaranteed to always evaluate to the empty sequence. Formally, for an abstract value av = base(av)occurrenceIndicator(av):
    • if av = None then map(av) = None
    • if occurrenceIndicator(av) is one of '1', '?', '+', '*' then map(av) = base(av)1
    • otherwise (i.e. av = {}0), map(av) = None
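
Building on the hypothetical AbsValue encoding sketched in section “Abstract Domain”, map can be written as:

def map_for_binding(av):
    # The map function: a 'for' variable is bound to exactly one item at a
    # time, so the occurrence indicator becomes '1'; if exp1 is unreachable
    # or can only yield the empty sequence, the body is never evaluated.
    if av is BOTTOM or av.occ == '0':
        return BOTTOM
    return AbsValue('1', av.base)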

Step expression. For a step expression, the result and used values of the context node $fs:dot are first retrieved from the abstract environment AE. The abstract evaluation of a step is similar to the static typing of step expressions defined in the XQuery Formal Semantics Specification [XQueryFS]. For example, given the example input schema of Figure 1 and an abstract context item set to "{element bookStore}1", 'book/title' evaluates to "{element title}+", and './/author/*' yields "{ element title, element fname, element lname }+". The result abstract value of a step expression must be an approximation of the result of performing the abstract evaluation of the step. The last constraint makes sure that the parents of text nodes in the result of the evaluation of the step are represented by the used abstract value. The src function selects only input element and attribute declarations.

IfThenElse expression exp. The analysis result depends on the analysis result of the test sub-expression exp0. If the "then" branch can be taken (i.e. if the result abstract value of exp0 contains the abstraction of xs:boolean('true')), then (D, AE) must also be acceptable for exp1, and the result abstract value (resp. the used abstract value) of exp must at least contain the result abstract value (resp. the used abstract value) of exp1. Similar constraints must hold if the "else" branch can be taken. Finally, the used abstract value of the test expression exp0 is taken into account only if both branches can be reached; otherwise, the conditional expression can safely be rewritten into either exp1 or exp2, and thus does not depend on the evaluation of exp0.

Element constructor. The most interesting aspect of (eltConst) is the use of the eltConst function in the computation of the result abstract value. The abstract value of exp0 represents a sequence of QNames. Applied to such an abstraction of sequences of QNames, eltConst yields an abstraction of sequences of temporary and output element nodes (a sketch follows the definition below). Formally, for an abstract value av:

  • if the type xs:QName is an element of base(av), eltConst(av) = the set of all temporary element declarations and element declarations in output schemas
  • otherwise, base(av) consists only of constant QNames, and eltConst(av) = { element q | q is in base(av) }, with the occurrence indicator of av unchanged
Note that this analysis is too conservative for output schemas in which element declarations do not have distinct QNames. The analysis will not be able to distinguish them. We will see in section “A better analysis of element constructors” how the analysis can be made more precise.
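
Again building on the hypothetical AbsValue encoding, eltConst might look as follows (output_and_temp_decls stands for the set of all output-schema and temporary element declarations):

def elt_const(av, output_and_temp_decls):
    # Sketch of eltConst on the AbsValue encoding of AbsDom[c].
    if av is BOTTOM:
        return BOTTOM
    if 'xs:QName' in av.base:
        # The element name is not statically known: any output or
        # temporary element declaration may be constructed.
        return AbsValue(av.occ, frozenset(output_and_temp_decls))
    # Otherwise base(av) holds only constant QNames q: yield 'element q'.
    return AbsValue(av.occ, frozenset(('element', q) for q in av.base))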

Comparison expression. The analysis basically composes the analysis results of the sub-expressions (the src function selects only element and attribute declarations from the input schemas).

Function call expression exp. There are two points worth mentioning:

  1. The analysis result of an argument of the function call may be taken into account in the computation of the abstract value of the function call only if the parameter to which it is bound is actually referenced in the body of the function.
  2. The analysis is context insensitive: a function parameter is bound in the abstract environment AE to the aggregation of the analysis results of the arguments of all the function invocations to the same function. As a result, the abstract values associated with all function calls to the same function are the same regardless of the context of the call.

The analysis is performed assuming that the input documents conform to statically known schemas. As explained in section “Definition of input nodes influencing the evaluation of an expression”, this basically means that the analysis is done under an environment constraint envConst mapping QNames of variables to their XQuery type. These XQuery types need to be approximated by elements of the abstract domain. The XQuery Formal Semantics Specification [XQueryFS] defines prime and quantifier functions on types such that, for a type t, "(prime(t)) quantifier(t)" is a supertype of (i.e. an approximation of) t. The type function prime(t) extracts all item types from the type t and combines them into a choice. For example, for t = (xsd:decimal, xsd:float, xsd:integer), prime(t) = xsd:decimal | xsd:float | xsd:integer and quantifier(t) = '+'. An XQuery type t is approximated in our abstract domain by adom(t) = items(prime(t)) quantifier(t), where the items function returns the item types of which a union type is made. For example, adom(t) = {xsd:decimal, xsd:float, xsd:integer}+.
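
On the AbsValue encoding, adom is a one-liner once a type has been factored into its item types and quantifier (the factored representation below is an assumption of this sketch, not the paper's data structure):

def adom(item_types, quantifier):
    # adom(t) = items(prime(t)) quantifier(t): the item types of t become
    # the base set, and t's quantifier becomes the occurrence indicator.
    return AbsValue(quantifier, frozenset(item_types))

# Example from the text:
# adom({'xsd:decimal', 'xsd:float', 'xsd:integer'}, '+')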

In order to take into account the environment constraint envConst, in addition to constraints defined in Figure 10, the constraint presented in Figure 11 must also be satisfied.

Figure 11: Additional constraints taking into account the environment constraint

Generating the Constraint System

For an expression e, the constraints imposed by the acceptability relation |= (as shown in Figure 10) on e and recursively on all its sub-expressions, together with the constraints shown in Figure 11, define a system S of constraints. Each constraint has the form "f(D, AE) <= X", where f is a monotone function of (D, AE) and X is one of Dr(l) or Ds(l) for a label l, or AEr(v) or AEs(v) for a variable v. By the construction of S, it follows that an analysis (D, AE) is acceptable and satisfies the constraints shown in Figure 11 iff it is a solution of S.

For illustration purposes, we consider the simple labeled conditional expression shown in Figure 12.

Figure 12: Example of a simple labeled expression

For the example of Figure 12, an analysis (D, AE) is acceptable (by Figures 10 and 11, assuming that the free variable $b is of type xs:boolean) iff it satisfies the conjunction of constraints shown in Figure 13, which can be expressed as the system S of constraints appearing in Figure 14.

Figure 13: Constraints on (D,AE) for the previous simple example.

In Figure 14, "=>" does not denote logical implication, as it does everywhere else in the paper. There, "=>" denotes a binary operator whose first operand is a boolean value and whose second operand and result are abstract values. If its first operand is true, then its result is the same as its second operand; otherwise (i.e. its first operand is false), its result is None (the least element of the abstract domain). Formally, for an abstract value av, ((true => av) = av) and ((false => av) = None).
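
In the Python sketches above, this guard operator is simply:

def guard(cond, av):
    # The "=>" operator of Figure 14: pass the abstract value through when
    # the boolean guard holds, otherwise contribute the least element.
    return av if cond else BOTTOM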

Figure 14: Refined system S of constraints generated for the previous simple example.

Finding the best solution satisfying the acceptability relation

As illustrated in the previous section, from the acceptability relation a system S of monotone constraints can be generated. The function "(D, AE) : l -> (Top, Top, Top, Top)", which maps each label to the greatest element of the abstract domain, is obviously a solution of S. To get the most accurate analysis, the least solution of S is of interest. For monotonic constraint systems on a complete lattice satisfying the ascending chain condition, such as S, the least solution exists and some efficient algorithms to compute it are described by F. Nielson et al. [ProgramAnalysis].
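
For concreteness, here is a naive chaotic-iteration solver over the Constraint objects sketched in section “Constraint-based static analysis”. It is a teaching sketch only; [ProgramAnalysis] describes worklist algorithms that avoid re-evaluating constraints whose inputs have not changed:

def solve(constraints, join, bottom):
    # Computes the least solution of a monotone constraint system over a
    # lattice given by (join, bottom). Termination follows from the
    # ascending chain condition satisfied by AbsDom[c].
    assignment = {}
    changed = True
    while changed:
        changed = False
        for c in constraints:
            old = assignment.get(c.target, bottom)
            new = join(old, c.lhs(assignment))
            if new != old:
                assignment[c.target] = new
                changed = True
    return assignment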

Correctness

Theorem: The analysis is sound. The following holds:

Figure 15: Correctness

In other words, any acceptable solution (D, AE) under an environment constraint envConst is such that, for any evaluation in a dynamic environment dynEnv consistent with envConst: (1) there is at least one sequence represented by either the result or the used abstract value whose items are exactly the input nodes that influence the considered evaluation, and (2) the result of the evaluation belongs to the set of sequences represented by the result abstract value.

Proof. First, for an acceptable solution (D, AE), we prove by structural induction on the expression exp that there is an expression exp' equivalent to exp under envConst (i.e. envConst |- exp <=> exp'; exp' may be the same as exp) such that, for any dynamic environment dynEnv consistent with envConst, "dynEnv |- exp' => v using US" implies that (1) the elements of US are all contained in at least one sequence represented by either the result or the used abstract value associated with exp', and (2) v is in the set of sequences represented by the result abstract value. Once this result is established, the proof of the theorem follows from the definition of the infl function.

Limitation of the context insensitive approach

As shown in section “Experimental evaluation”, the context-insensitive approach is less expensive to compute. However, it has two important shortcomings:

  1. The analysis of element constructors does not differentiate between output element and attribute declarations having the same QName. This issue is addressed in section “A better analysis of element constructors”
  2. The analysis of function calls to the same function yields the same result. This may not be a major issue for queries that do not contain many invocations of the same function. However, queries generated from XSLT stylesheets typically invoke the recursive t2q:applyTemplates function so many times that a context insensitive analysis is too imprecise.

Context sensitive analysis

To overcome the limitation of the context-insensitive approach with respect to the analysis of function calls, we associate with each sub-expression e different result and used abstract values depending on the context that may reach e. A context c records the last n possible invocation sites of functions that have not returned (n is less than or equal to a maximum context size k, which is a configurable parameter of the analysis). Now, the result of the analysis is expressed as the pair of functions (D, AE) such that

	Dr : Label x Context -> AbsDom[c]   and   Ds : Label x Context -> AbsDom[c]
	AEr : Var x Context -> AbsDom[c]   and   AEs : Var x Context -> AbsDom[c]

where Context denotes the set of contexts, i.e. sequences of at most k invocation sites.

We use the notation " (D, AE) |=c exp" to denote that the pair (D, AE) is acceptable for the expression exp with context c.
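Concretely, such a context is a bounded call string. The following minimal Java sketch (names hypothetical) shows the only operation the analysis needs: entering a function at a call site appends that site to the context and keeps at most the k most recent sites.

    import java.util.*;

    // A context records the last k (at most) invocation sites of functions that
    // have not yet returned. Call sites are identified here by integer labels.
    final class Context {
        private final List<Integer> sites; // most recent invocation site last

        private Context(List<Integer> sites) { this.sites = sites; }

        static Context empty() { return new Context(Collections.emptyList()); }

        // Entering a function at call site l: append l, then truncate the call
        // string to the k most recent sites (k is the maximum context size).
        Context push(int l, int k) {
            List<Integer> next = new ArrayList<>(sites);
            next.add(l);
            int from = Math.max(0, next.size() - k);
            return new Context(Collections.unmodifiableList(
                    new ArrayList<>(next.subList(from, next.size()))));
        }

        @Override public boolean equals(Object o) {
            return (o instanceof Context) && sites.equals(((Context) o).sites);
        }
        @Override public int hashCode() { return sites.hashCode(); }
        @Override public String toString() { return sites.toString(); }
    }

The number of distinct contexts is bounded by the number of call strings of length at most k, which is what drives the worst-case cost discussed below.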

Constraint (funCall) in Figure 10 is then changed as shown in Figure 17 to produce a context sensitive analysis.

Figure 17: Context-sensitive analysis of function calls

Unfortunately, the worst-case complexity of the context-sensitive approach is exponential in the maximum context size. In section “XSLT specific optimization techniques” we present various techniques that avoid this worst-case behavior for the analysis of XSLT stylesheets.

A better analysis of element constructors

The analysis presented in section “Context-insensitive analysis” is inaccurate in its treatment of element constructors because it does not keep track of the possible parents of constructed elements. To address this limitation, we add to D a component Dp that represents the possible parents of an element constructed by an element constructor at a given label.

The acceptability relation defined in Figure 10 is modified as shown in Figure 18 to produce a "parent"-sensitive analysis.

Figure 18: Improved element constructor analysis

Constraints that are not modified are not shown in Figure 18; "..." indicates that nothing else has changed.

We assume that the static type inference defined in the XQuery Formal Semantics specification [XQueryFS] has been performed and that, for each sub-expression e whose label is l, type(l) yields its static XQuery type. The function tempdoc maps a label l to the element "document { adom(type(l)) }" of the abstract domain (see note 7). It is used when elements constructed by the sub-expression corresponding to l are not directly added to the result.

All constraints except (eltConst) simply propagate parent information top-down (from an expression to its sub-expressions). Constraint (eltConst) states that the result abstract value of an element constructor e is computed by taking into account the abstract value representing its possible parents (Dp(l)) in addition to the abstract value representing its possible QNames. Thus, we can now differentiate between element declarations having the same QName in the output schema, provided that they are declared in the content models of distinct element declarations.
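For instance, consider a hypothetical output schema fragment (not the running example) in which two distinct element declarations share the QName title:

    <!-- Hypothetical fragment: two distinct declarations of "title" -->
    <xs:element name="bib">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="book">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="title" type="xs:string"/> <!-- book/title -->
              </xs:sequence>
            </xs:complexType>
          </xs:element>
          <xs:element name="journal">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="title" type="xs:string"/> <!-- journal/title -->
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

Without parent information, a title element constructor would be mapped to both declarations; with Dp, a constructor whose possible parents are represented by book is mapped only to the book/title declaration.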

XSLT specific optimization techniques

A direct application of the analyses presented in section “Constraint-based static analysis of Core XQuery” to Core XQueries generated from XSLT stylesheets is expensive and yields very inaccurate results. Two peculiarities of XSLT stylesheets account for these failures of the basic Core XQuery analyses:

  1. First, most stylesheets performing transformations between two vocabularies use a "catchall" template, which performs some default processing of the elements and attributes not otherwise processed. In the example of section “A simple example”, the last template of the stylesheet displayed in Figure 3 is a "catchall" template. Since, for a given invocation context of the t2q:applyTemplates function, all element abstract values representing the current context node match the pattern "*", even a context-sensitive analysis with a large maximum context size will assign the same used value to all the matching element abstract values. For our example, the result would indicate that book and company elements in the output document may depend on both book and company elements from the input document:
     <result>
        <schemas>
           <source rootName="bookStore"
                   schemaLocation="input.xsd"/>
           <target rootName="bib"
                   schemaLocation="output.xsd"/>
        </schemas>
        <dependencies>
           <!-- ... -->
           <dependency>
              <source>/bookStore/book</source>
              <source>/bookStore/company</source>
              <target>/bib/book</target>
           </dependency>
           <dependency>
              <source>/bookStore/book</source>
              <source>/bookStore/company</source>
              <target>/bib/company</target>
           </dependency>
           <!-- ... -->
        </dependencies>
     </result>
    
  2. The second peculiarity of Core XQueries generated from XSLT stylesheets is that the t2q:applyTemplates function is invoked from many different locations. A context-insensitive analysis almost always simply states that any element or attribute in the output document may depend on the whole input document; a context-sensitive analysis is therefore necessary. However, it is extremely expensive because the call graph is such that almost all edges point to the t2q:applyTemplates function, which indirectly points back to the sources of all its incoming edges.

Template specialization

In order to make the analysis more precise in the presence of "catchall" templates, before compiling the stylesheet to XQuery, a "catchall" template t whose 'match' attribute is '*' is replaced by specialized templates with the same body as t. The 'match' attribute values of the new specialized templates correspond to the QNames of the element declarations in the input schemas. In our example, the "catchall" template is replaced by two specialized templates with the same body, whose 'match' attribute values are 'book' and 'company'; the sketch below illustrates the shape of this rewriting.
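The template body shown here is illustrative only, standing in for the actual body of the "catchall" template of Figure 3:

    <!-- Before: "catchall" template (body illustrative) -->
    <xsl:template match="*">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- After: one specialized copy per element declaration in the input schema -->
    <xsl:template match="book">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>

    <xsl:template match="company">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>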

Special care is needed for input schemas containing attribute or element wildcards because, in that case, "catchall" templates must not be removed: at runtime, they may be instantiated for elements and attributes constrained by named attribute and element declarations that are unknown statically. However, the analysis should not consider that statically known named element or attribute declarations could match the match pattern of the "catchall" templates (i.e. "*" or "@*"). Instead, it should consider that only element and attribute wildcards can match patterns specified by "catchall" templates, whereas only statically known named element and attribute declarations can match patterns of the generated specialized templates. We use special annotations to allow the analyzer to recognize generated specialized templates and "catchall" templates.

Context-based specialization of t2q:applyTemplates body

As shown in section “Experimental evaluation”, the worst-case complexity of the basic context-sensitive evaluation of Core XQueries generated from XSLT stylesheets is realized in practice: the number of constraints generated is exponential in the maximum context size. This worst-case behavior is essentially due to the fact that, for any xsl:apply-templates invocation, all the templates are considered. However, for an xsl:apply-templates appearing in an unnamed template, the list of templates that may be instantiated at runtime can almost always be significantly reduced statically (this reduction of the number of templates to consider is especially important because, in the previous section “Template specialization”, we have specialized all the templates). For the example presented in section “A simple example” in Figure 3, it is obvious that the xsl:apply-templates appearing in the first template can only instantiate the template whose 'match' attribute is 'bookStore', and the xsl:apply-templates appearing in this 'bookStore' template can only instantiate the specialized template whose 'match' attribute is 'book'.

This reduction of the list of templates to consider is achieved by a simple, very conservative, context-insensitive analysis of both the match pattern specified by the template t in which an xsl:apply-templates instruction ats appears and the select XPath expression of ats. This analysis can become complex if the select XPath expression or the match pattern contains variables, because a flow analysis of the whole stylesheet may then be required to approximate all the possible values of the variables. We avoid that complexity by simply approximating the abstract value of variables by the greatest element of the abstract domain.

In the generated XQuery, we annotate each t2q:applyTemplates function call corresponding to the translation of an xsl:apply-templates with the subset of templates that can actually be instantiated at runtime. Thus, when the constraint builder analyzes such a t2q:applyTemplates function call, it does not generate constraints from the default body of t2q:applyTemplates, where all templates are considered; instead, constraints are generated from a rewritten version of the body in which only the subset of templates in the XSLT annotation is considered. As shown in the experimental evaluation section, this specialization of the body of t2q:applyTemplates based on its invocation context can significantly reduce the number of generated constraints and make the analysis scale.
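The following minimal Java sketch (names hypothetical) illustrates the conservative reduction step: a template is kept as a candidate for a given xsl:apply-templates unless its match pattern provably cannot match any of the QNames that the select expression may statically select.

    import java.util.*;

    // Hypothetical sketch of the reduction of the candidate template set for one
    // xsl:apply-templates instruction. "possibleSelected" is a conservative
    // over-approximation of the QNames selected by the select expression from
    // the statically possible context nodes; when variables force the analysis
    // to fall back to the greatest element, selectedIsTop is true.
    final class TemplateSetReducer {
        static final class Template {
            final String match; // e.g. "book", "company", or "*" for a catchall
            Template(String match) { this.match = match; }
        }

        static List<Template> candidates(Set<String> possibleSelected,
                                         boolean selectedIsTop,
                                         List<Template> allTemplates) {
            List<Template> kept = new ArrayList<>();
            for (Template t : allTemplates) {
                // Keep t unless its pattern provably cannot be matched. When the
                // selected set is Top (unknown), every template must be kept.
                if (selectedIsTop || t.match.equals("*")
                        || possibleSelected.contains(t.match)) {
                    kept.add(t);
                }
            }
            return kept;
        }
    }

The t2q:applyTemplates call is then annotated with the returned subset, and the constraint builder generates constraints from a body of t2q:applyTemplates restricted to that subset.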

Experimental evaluation

All the analyses presented earlier have been implemented in Java according to the architecture described in section “Architecture”. The purpose of this section is to report on the first experimental evaluation of our approach and implementation.

We have performed two types of experiments: on correctness and accuracy, and on performance and scalability. They were carried out on three stylesheets:

  • two stylesheets from a commercial product for information integration, used to mediate requests to NCBI Entrez PubMed (note 8) [PubMed] and NCBI Entrez Nucleotides (note 9) [Nucleotide].
  • one stylesheet generated by Clio [Clio] to map the DBLP [DBLP] XML schema to a schema containing information about all authors in the DBLP database. This non-recursive stylesheet contains a single template for the root element. It is written in pull mode: it constructs the elements and attributes of the output schema and explicitly pulls the information it requires from the source; no xsl:apply-templates instructions are used.

Although the theoretical soundness of our analyses has been established in section “Correctness”, the first kind of experiment was necessary to test the correctness of our implementation. These experiments also allow a better understanding of the limitations and strengths of the analyses described in the paper:

  • The context-insensitive analysis yields very accurate results for the non-recursive DBLP stylesheet, but very inaccurate results for the two recursive life sciences stylesheets (it basically states that each element or attribute of the output depends on the whole input document).
  • The context-sensitive analysis without parent information, applied to the PubMed transformation, produces false positives (note 10) in the influenced input node sets of output element declarations with the same names (e.g. CopyrightInformation, PMID).
  • The context-sensitive analysis with parent information produces no false positives for a large enough maximum context size: 4 for the PubMed transformation and 3 for the Nucleotide transformation.

The second kind of experiment evaluates the performance and scalability of the context-sensitive approach. The number of constraints to be solved by the constraint solver is the major factor influencing performance and scalability. As long as all the generated constraints fit in memory, a generic iterative constraint solver processing constraints in topological order [ProgramAnalysis] can solve the system efficiently. As mentioned in section “Context sensitive analysis”, the worst-case exponential behavior of context-sensitive analyses is their most important drawback and scalability limitation. In order to show that this worst case is not realized in practice when the XSLT specific techniques of section “XSLT specific optimization techniques” are applied, we have performed two context-sensitive analyses with parent information of the three stylesheets for different values of the maximum context size: the first was based only on the Core XQuery analysis described in section “Constraint-based static analysis of Core XQuery”; the second applied the XSLT specific techniques presented in section “XSLT specific optimization techniques”. Figure 19 shows how, in these two experiments, the number of constraints generated varies with the maximum context size. Curves obtained from the first experiment have labels of the form "CoreX <Name>", whereas those obtained from the second experiment (with XSLT specific optimizations) have labels of the form "XSLT <Name>". Figure 19 shows that the XSLT specific optimizations make the analysis of XSLT (1) significantly less expensive than the generic analysis of Core XQuery and (2) more scalable as the maximum context size increases. The result of the first experiment is not shown for PubMed because, even for a maximum context size of 2, the generated constraints could not fit in 1 GB of main memory.

Figure 19: Number of generated constraints for various context sizes

Conclusion

In this paper, we have presented a common architecture for the static analysis of both XQuery and XSLT, particularly tailored to handle recursive XQueries and stylesheets, and we have applied it to the input/output dependency extraction problem. Our initial experiments have shown that, by taking into account some peculiarities of XSLT stylesheets, our context-sensitive analysis can be both precise and scalable.

However, the same experiments have also demonstrated that the worst-case exponential behavior of the context-sensitive analysis is realized when these XSLT peculiarities are ignored. Its performance and scalability may therefore be limited when it is applied to XQueries containing many calls to recursive functions. Addressing this possible scalability issue for XQuery constitutes an interesting challenge that will require more experiments with XQueries and work on XQuery specific techniques to reduce the analysis cost (e.g. the possible use of the results of XQuery type inference to perform the kind of specializations applied to XSLT).

Notes

1.

If no output schemas are specified, the static type inference defined in the XQuery Formal Semantics Specification [XQueryFS] can be used to infer them.

2.

The more precise the abstract values are, the more expensive the analysis will be.

3.

Used abstract values consist only of element and attribute declarations in input schemas.

4.

An environment constraint maps a QName to an XQuery type conforming to the type system defined in the XQuery Formal Semantics specification [XQueryFS].

5.

The ascending chain condition is satisfied iff any ascending sequence of elements of the lattice eventually stabilizes. An ascending sequence (A1, ..., An, ...) is a sequence such that, for all i and j, i <= j implies Ai <= Aj. The ascending chain condition says that, for any such sequence, there is an integer k such that, for all i >= k, Ai = Ak.

6.

In our implementation, the base set also contains special abstractions to represent comment and processing-instruction nodes, but for simplicity we do not consider them in the remainder of the paper. Simple atomic types are used to represent text nodes.

7.

Note that we had to slightly modify the abstract domain so that the base set of an abstract value can now contain elements of the form document{ adom(type(l)) }. The abstract domain remains finite and is still a complete lattice.

8.

PubMed is a service of the National Library of Medicine. It includes over 15 million citations for biomedical articles dating back to the 1950s.

9.

The Entrez Nucleotides database is a collection of genomic sequences from several sources, including GenBank, RefSeq, and PDB.

10.

False positives are schema declarations that can safely be removed from the computed dependency sets without making the analysis unsound.


Bibliography

[Clio] The Clio project: managing heterogeneity. ACM SIGMOD Record, volume 30, pages 78-83, 2001. citeseer.ist.psu.edu/miller01clio.html

[ClioMapping] L. Popa, M. Hernandez, Y. Velegrakis, R. Miller, F. Naumann, and H. Ho. Mapping XML and relational schemas with Clio, 2002.

[Compilers] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques and Tools. Addison Wesley, 1986.

[DataFlow] C. Chambers, J. Dean, and D. Grove. Frameworks for intra- and interprocedural dataflow analysis. Technical report, 1996.

[DataFlowFrameworks] C. Chambers, J. Dean, and D. Grove. Frameworks for intra- and interprocedural dataflow analysis. Technical report, 1996.

[DBLP] Digital Bibliography & Library Project: http://dblp.uni-trier.de/

[Nucleotide] NCBI Entrez Nucleotides database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide&itool=toolbar

[ProgramAnalysis] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag, 1999.

[PubMed] NCBI Entrez PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

[XML] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible Markup Language (XML) 1.0 (third edition). W3C Recommendation, World Wide Web Consortium, February 2004.

[XMLAccessControl] Makoto Murata, Akihiko Tozawa, Michiharu Kudo, and Satoshi Hada. XML access control using static analysis. In CCS '03: Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 73-84, New York, NY, USA, 2003. ACM Press.

[XMLProjection] Amelie Marian and Jerome Simeon. Projecting XML documents. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.

[XQuery1] Scott Boag, Don Chamberlin, Mary F. Fernandez, Daniela Florescu, Jonathan Robie, and Jerome Simeon. XQuery 1.0: An XML Query Language. W3C Working Draft, World Wide Web Consortium, July 2004. http://www.w3.org/TR/2004/WD-xquery-20040723.

[XQueryFS] XQuery 1.0 Formal Semantics. W3C Working Draft, World Wide Web Consortium, March 2002.

[XQueryWS] Nicola Onose and Jerome Simeon. XQuery at your Web Service. In WWW '04: Proceedings of the 13th International Conference on World Wide Web, pages 603-611, New York, NY, USA, 2004. ACM Press.

[XSLT2] Michael Kay. XSL Transformations (XSLT) Version 2.0. W3C Working Draft, World Wide Web Consortium, November 2003. http://www.w3.org/TR/2003/WD-xslt20-20031112.

[XSLT2XQuery] Achille Fokoue, Kristoffer Rose, Jerome Simeon, and Lionel Villard. Compiling XSLT into XQuery. To appear in Proceedings of World Wide Web Conference 2005. Currently available at www.research.ibm.com/XML/CompilingXSLT2toXQuery-Fokoue-WWW2005.pdf.


