Generalizing XPath for directed graphs

Steve Cassidy
Steve.Cassidy@mq.edu.au

Abstract

XPath is a very natural and powerful way to specify locations in XML documents. This paper examines possible generalizations of XPath to allow both locations and paths through generalized labeled directed graphs to be specified. The need for such a path language is driven by work in querying Linguistic Annotations which are in general more complex in structure than XML documents. The result of this exercise is a powerful path language which reduces to XPath as a special case and which could potentially be useful in a range of query applications.

Keywords: XPath; XQuery; Trees/Graphs; Querying

Steve Cassidy

Steve Cassidy is a Computer Scientist who has worked in various areas relating to language and cognition over the last 15 years. He completed a Ph.D. in Wellington, New Zealand on computer models of reading development and then moved to Macquarie University, Sydney to work at the SHLRC [Speech Hearing and Language Research Centre]. At SHLRC he worked on applying statistical models to acoustic phonetics problems and on the development of the Emu Speech Database System. His work on Emu has led to an involvement with groups in the US and Europe who are aiming to define standards for Linguistic annotation. Steve is now working at the Centre for Language Technology, part of the Computing Department at Macquarie where he is persuing research in annotation and speech and language technology.

Generalizing XPath for directed graphs

Steve Cassidy [Centre for Language Technology]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 Steve Cassidy. Reproduced with permission.

Background

This paper describes a generalization of the XPath query language motivated by the design of a query language for Annotated language corpora.

Annotated corpora

Research in Linguistics and Language Technology relies heavily on collections of language data called corpora [Church & Mercer 93]. These can consist of text documents (written texts or transcriptions of speech), audio, video or other time series data (for example, physiological measurements of speech articulation). A common requirement in most fields of research is to apply some kind of annotation to the corpus data. This annotation is often derived manually and encodes additional information about the linguistic data ranging from simple part of speech tags for words to complex rhetorical structures in dialogue.

A particular feature of linguistic annotations is that they often involve more than one hierarchy over the same set of tokens. For example, Figure 1 shows part of a sentence which has been annotated syntactically and intonationally resulting in two hierarchies dominating a single set of word nodes.

Figure 1
[Link to open this graphic in a separate page]

An example Emu annotation showing an utterance from the TIMIT database [TIMIT] which has been augmented with both a syntactic and intonational annotation. The names on the left show the types of each set of nodes. The Word nodes have been duplicated to show the links to both the syntactic and intonational hierarchies. For clarity, sequence relationships are implicit in the left to right ordering of nodes in this diagram.

Recent work has attempted to develop a uniform data model for manipulation and interchange of annotated corpus data. In one model, annotations are viewed as directed graphs [Cassidy 01] where nodes represents tokens in the corpus. Nodes may be labeled with various attributes including a type (phoneme, word, paragraph), start and end times, and one or more labels. The edges in this graph can have one of three types: a parent/child relationship defining one or more hierarchies, an explicit sequence relationship, and a general associative relationship. This data model is capable of representing annotations such as the one in Figure 1, as well as many other styles of annotation. The optionality of timing data means that the model is also a good fit to pure textual data.

An alternate data model, called Annotation Graphs, has been proposed [Bird & Liberman 01] which again represents annotations as a directed graph but foregrounds the sequential structure of annotations leaving hierarchical and associative relations implicit. In an annotation graph (Figure 2), nodes are anchors which may have associated timestamps and edges correspond to tokens in the corpus and have one or more labels. Hierarchical relations in annotation graphs are implicitly encoded by edges which span multiple smaller edges. Associative relations can be encoded by common labels in the same way as ID/IDREF is used in XML. While Annotation graphs might seem very different from the Emu hierarchical model illustrated above, we believe that they are isomorphic [Cassidy 01], and in fact the next version of the Emu toolkit will use annotation graphs internally.

Figure 2
[Link to open this graphic in a separate page]

A simple Annotation graph where nodes represent anchors (in this case, with associated times), and edges represent tokens in the corpus (e.g., words and phrases). The hierarchical relationship between the NP edge and the word edges is implicit in this model.

Toolsets in various specialist domains exist to create and make use of annotated corpora. Examples are the Emu system http://emu.sourceforge.net for acoustic phonetic analysis, GATE http://gate.ac.uk, and MATE [McKelvie et al. 01] for text based data. As part of some of these tools, a query language is provided to find parts of the corpus for study or analysis. For example, the Emu system allows queries which retrieve speech segments according to various criteria allowing subsequent numerical analysis of the speech data. Following the development of the more general data models described above, the need for a more general query language becomes apparent; this paper describes some work aimed at developing such a language.

Relation to XML

Clearly the annotation data model shows a close resemblance to XML, and indeed if only one hierarchy needs to be encoded in an annotation, it can be naturally serialized as XML. Many existing Linguistic text corpora are encoded in XML, and there have been large scale efforts to develop standardized DTDs for corpus encoding [TEI]. The problem of intersecting hierarchies has been addressed using, for example, standoff markup [Thompson & McKelvie 97] where each hierarchy is stored in a separate XML document, each referring to regions in the original text using XPointer references.

XML has been less frequently used for annotation of speech data. It is used as an interchange format in some tools, such as the Annotation Graph toolkit; however, in these cases the structure of the XML document does not reflect the hierarchical structure of the annotations.

Given this close relationship, an obvious question is whether any of the various XML query languages are appropriate for the task of querying annotations.

Annotation query use cases

In an earlier paper [Cassidy 02], we presented a number of annotation query use cases and used them to evaluate XQuery as an annotation query language. This showed that while a number of the queries could be evaluated within XQuery, some interesting cases were very difficult to express. Here we present two examples which illustrate differences between this application and the XML query use-cases.

Find L- Intermediate phrases
  containing words which form an NP Syntax phrase.
An example answer to this query is the L- Intermediate level node containing the words “your dark suit” in Figure 1. This query highlights the need to be able to cope with multiple hierarchies since it requires looking at both Intermediate and Syntax parents of the Word nodes. While XML won’t allow multiple parents, there’s nothing in XQuery (or in particular, XPath) which precludes a traversal from parent to child to a different parent. If we allow this, the query can be written in XQuery as shown in Figure 3 (of course, the result here would be an XML fragment which might not be what is required; we’ll ignore this factor for the moment).

Figure 3
FOR $np in //syntax[@label="NP"]
LET $iparent ::= $np//word[1]/parent::inter
WHERE
    EVERY $int IN 
      $np//word/parent::inter
    SATISFIES ($int == $iparent) AND 
              ($int/@label = "L-")
RETURN
    $iparent

XQuery code to find L- phrases dominating words which form an NP.

Find sequences of any number of consonant phonemes
  followed by a single vowel
  followed by any number of consonants: C+VC+.
This is a significant use case since it illustrates an important class of query for language corpora — finding non-annotated patterns based on ‘grammatical’ structure. At the same time, this kind of query is very difficult to express in XQuery (although not impossible, see [Cassidy 02] for a solution using functions).

The evaluation of XQuery with respect to these and other use-cases provides two interesting insights. Firstly, the document or data centric views provided for by XQuery (and other XML query languages) are almost, but not quite, appropriate for annotation query. Secondly, XPath is a very clear and natural way to specify locations of elements within annotations, but would need to be extended to achieve completeness.

Outline

This paper reports on some work which explores the idea of extending XPath in directions which make it more useful for annotation query. In particular, it describes an attempt to generalize XPath to address queries like the second one presented above (C+VC+). The result of this exercise is a generalized path-based query language which applies not only to annotations, but also to generalized directed graphs, and which reduces to XPath as a special case.

For the purposes of this paper, I’ll talk about a particular kind of directed graph which is a generalization of the graphs described for annotations. These graphs can be described as edge labeled directed graphs. An example is the XML graph structure which contains edges labeled “child”, “attribute”, and a few others. The nodes in these graphs have two labels: a type (cf. the XML element name) and a textual label. This is an interesting class of graphs which subsumes, for example, XML, RDF, and our annotation data.

The remainder of this paper reviews the literature on path-based query languages and then describes the generalization of XPath and evaluates this in the annotation domain and in more general graph-based data.

Query language review

Annotation query

For the most part, the query languages proposed and developed for annotation data have been of ad-hoc design and are able to cope only with a subset of the kinds of annotation data seen in practice. For example, the Emu query language [Cassidy 01] provides a means of finding nodes based on sequential and hierarchical relations but lacks many basic query language features, such as disjunction and compositionality. Most other annotation query tools have made use of general purpose structural pattern matching tools, such as Sgrep http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html, which allow for regular expression like searches in structured textual data.

One important proposal that has been made recently follows from the work on annotation structures by Bird and his colleagues [Bird & Liberman 01]. [Bird et al. 00] propose a query language based on path expressions in an annotation graphs. Queries are made up of a set of path expressions which describe paths through the annotation graph. For example, in the query term X.[:word, label:cat].Y, X and Y are query variables and stand for nodes in the annotation graph, and the terms in brackets specify the type (:word) and attribute values associated with the edge between these nodes.

While this query language proposal has a number of interesting features, it is awkward to express queries in. Since it focuses on traversing only the sequential relations in an annotation graph, expressing hierarchical queries is very unnatural. Evaluation against a number of annotation query use cases has shown [Cassidy et al. 00] that while it is capable of expressing all of the use cases, the constructs needed to do so are highly complex and would be difficult for naive users to work with. Part of the effort in this paper is to work with a very natural query language component (XPath) and see if it can be extended for annotation query wile retaining the naturalness of expression.

Path-based query languages

There is a considerable literature on path-based query languages both as alternatives to relational query languages and in more specialized domains. In particular, path expressions have often been proposed in the context of querying semi-structured data. Abiteboul [Abiteboul 97] describes path expressions as sequences of object attributes (e.g., guide.restaurant.address.zipcode) and extended path expressions as those which allow for wildcards and path variables (e.g., myreport.#.title to match any title attribute of an object reachable from myreport). These are implemented in, for example, OQL-doc [Christophides et al. 94], which extends an object-oriented database query language with path expression.

The G+ path language [Mendelzon & Wood] and its successor Graphlog [Consens & Mendelzon] provides a means of specifying regular path expressions to match against paths in edge labeled graphs. Queries in Graphlog are expressed graphically as a graph where the edges can be labeled with path regular expressions which match sequences of edges in the graph being queried. Graphlog queries can include closures over edges (e.g., follow zero or more child edges) and negation (e.g., there’s no child edge between a pair of nodes).

One feature missing from Graphlog is the ability to specify conditions on the nodes traversed in a path. Since Graphlog path regular expressions relate to sequences of edges, they are able to describe arbitrary sequences of edge traversals. The nodes traversed are not constrained in any way, so one can’t, for example, constrain each one to have a certain property. It is possible to express this kind of query by defining new relations via sub-queries (one would define a relation sequence of C and then look for the closure over that). This is one required feature that can perhaps be addressed more concisely in our generalization of XPath.

Graphlog provides a powerful an expressive query language which is supported by a thorough analysis of the power of the language and the complexity of query processing. As such it may provide some theoretical basis to relate to the work described here on generalizing XPath.

XPath

XPath (http://www.w3.org/TR/xpath) is a component of a number of XML standards and is used to select specific nodes in an XML document for subsequent processing. An XPath expression is a mapping from a node (the context node) to the set of all nodes reachable by the specified path. A path expression is written as a series of steps where each step defines the axis used to reach new nodes and a node test used to restrict the set of nodes reached along the axis. In XPath, axes are defined as a finite set of node-to-node mappings in the graph and reflect the structure of XML documents; example axes are child, ancestor, attribute, and self. Node tests consist of two parts: a restriction on the element name and an optional predicate expression. The XPath standard also contains additional features such as built in functions.

XPath was built as a pragmatic solution to the problem of locating nodes in an XML graph. There is no underlying path algebra that has been used to guide the development of the language (although see [Wadler 00] for a post-hoc analysis of the formal semantics of the XPath 1.0 and http://www.w3.org/TR/xquery-semantics/ for the official version for the XPath 2.0 proposal). Features have been added to make it work in the context of a number of W3C recommendations. In particular, the inclusion of function definitions in the XPath 2.0 (http://www.w3.org/TR/xpath20) proposal provides arbitrary computational power to the language but arguably muddies the idea of pure path expressions. The goal of this paper is to try to add some expressive power while keeping to the core concept of path expressions.

Generalizing XPath

XPath limitations

The widespread adoption of XPath is in part due to the naturalness of expressing node locations as paths from a known node in the graph. The abbreviated XPath syntax is very familiar to any web developer familiar with the URI syntax or unix pathnames. There are, however, a number of arbitrary limitations inherent in XPath which, if relaxed, might make result in a path language applicable to a wider range of data, in particular to annotated corpora.

From the use case analysis of XQuery, there are a number of restrictions in XPath which could be relaxed in order to make it a more general query language component. These are listed here with some commentary on the motivation for each.

  1. The result of evaluating a path expression is a node set. In some applications, it would be useful to be able to retrieve the path that was traversed. Hence, the result of a path expression would be a path set.
  2. Limited, non-extensible set of axes.
  3. Inability to state conditions on paths as opposed to nodes. This extends to the notion of defining regular expressions on paths.

Returning path sets

While locating nodes in a graph is the sole use of XPath, there are cases in annotation query and perhaps in other applications where selecting paths through the graph is important. The prototypical example from annotation query is find sequences of phonemes with property X. Note that no changes to the path language are needed to achieve this, but that the query processor needs to be augmented to maintain path sets rather than node sets.

Defining axes

The set of axes defined in XPath is clearly designed to allow the set of graph traversal operations that are seen to be atomic in XML document trees. In a more general graph structure, it is clearly more difficult to foresee what axes will be needed, and in general it is more useful to define what an axis is than to prescribe a particular set of axes.

An XPath axis is fundamentally a mapping from nodes to node sets that defines a way of traversing the underlying graph. Each axis encapsulates two things: a type of edge to follow (e.g., child vs. attribute) and whether to follow it transitively (e.g., child vs. descendant).

One approach to extending the axis set is to allow arbitrary definitions of axes as mappings from nodes to node sets. If we assume that the graph structure being searched is the edge labeled directed graph described earlier, then any edge label might be associated with an axis. Axes might also be defined for more complex traversal — an example might be an uncle axis which returns siblings of a node’s parent using a combination of parent and child edges. Similarly, for an XML document one might define an IDREF axis which followed IDREF links.

This approach would allow more control of transitivity in the axis definition as is the case with the descendant axis in XPath. However, a more general solution is to factor this out as a modifiable parameter in a path expression. The current binary single step/transitive distinction can be generalized to a how many steps parameter which controls how many steps along an axis are allowable. For example, I could find my grandparents by stepping along the parent axis exactly twice; I could find all of my children and grandchildren by stepping along the child axis up to two times.

The proposal here is to define an axis as a function mapping nodes to pathsets; the function takes a parameter denoting how many steps to take along the axis in generating new paths.

Path conditions

In XPath, conditions apply only to the last node on the path extended from the context node; for example, the expression /descendant:para[@style='important'] traverses along any number of child edges to find paragraphs with a particular attribute. No conditions are placed on any of the intermediate nodes along the axis.

Rather than limiting conditions to the final element of a path, the approach taken here is to allow conditions on any subset of the nodes traversed while extending the path. For example, a condition might apply to all nodes, the first node, or the last two nodes.

Introducing this generalization allows the use-case described earlier to be addressed. A sequence of nodes labeled C can be located by walking along the following axis an unlimited number of steps applying the condition to every node.

Extend programs

In order to generalize XPath, it is useful to first reduce it to a minimal form; one way is to view an XPath expression as a sequence of function applications each mapping a node set to a node set. Each call finds new nodes along some axis, possibly applying some node tests. Note that this abstraction ignores XPath features, such as sequence expressions and function calls; these fall outside the core path language and might well be added back in when the generalized path language is defined.

The generalized language can be defined by defining similar function calls in the following manner. Each path step is a call to an extend function which maps path sets to path sets. The signature of extend is as follows:

extend [axis] [step] [condition] [where]
where [axis] is the name of the axis to traverse, [step] is a restriction on the number of steps taken along the axis, [condition] is a boolean condition on nodes and [where] is a restriction on where the condition must hold. These are discussed in detail below.

Axis

The [axis] argument names an axis which must define a mapping from nodes to nodesets in the graph. Each axis need only define what it means to take a single step from a node. We foresee a framework providing for definition of new axes on particular graphs in terms of a base set of axes or via a general purpose programming language.

Step

The [step] argument determines how man steps should be taken along the axis and takes the form of a list of positive integers or the special value inf. The interpretation of this list is that a path length is allowed if included in the list. The special value of inf means that any length of path is valid. If the list includes integers and inf, then any path of length greater than the highest integer is valid. For example, [2,3,4] would allow paths of length 2, 3 or 4 but not of length 1 or 5, [1,10,inf] would allow paths of length 1 or longer than 10. It is expected that syntactic sugar would be introduced to specify ranges of integers, etc. Note that when the only integer specified is zero, the path will not be extended at all; this only makes sense in combination with where = 0, see below for a discussion.

Condition

The [condition] argument is intended to be a general boolean condition on nodes in the same sense as XPath conditions. This would include recursive calls to the path language and any conditions on node labels that are appropriate to the application. The exact form of these conditions is not specified here; in the trial implementation, we have implemented a general callout to the host programming language (Tcl). In the examples below, the conditions are expressed as a set of tests on the value of labels. As in XPath, the condition operates on a context node; unlike XPath, the condition may be applied to many nodes along a path.

Where

The [where] argument specifies where on the path the condition must hold. As with [step], this is a list of positive integers plus the special values of inf and end. An integer in the list means that the condition is applied to that node on the path. The special value inf means that the condition must hold for all nodes in the path. The special value end means that the condition applies to the end of the path only. For example, the list [1,2,end] would find paths where the first, second and last nodes satisfied the condition.

An important special case is when the where argument is includes a zero. Path positions are indexed from one so that the first node added to a path has an index one. The zero index refers to the context node — that is the start node for the path extension. If where includes zero, then the condition must hold on the context node with the result that all existing paths whose last node does not meet the condition will be pruned. In XPath, the same effect is achieved via a special axis called self; the mechanism proposed here removes the need for this special axis. In combination with a zero step argument, this can result in a pure prune operation where no new paths are extended, but any existing paths not fulfilling the condition are pruned.

Initial conditions

In XPath, there is a unique starting node from which an XPath expression is evaluated. This is either a node defined by external context or the root of the XML document. Extend programs could maintain the idea of an externally provided node (or node set or path set), but since the underlying data model is a generalized directed graph, there is no equivalent to the document root. Two possible starting conditions seem plausible: start with the empty node set, or start with the set of all nodes. Starting from the empty set requires some magic behavior for each axis: walking from the empty set along any axis leads to the set of all nodes. Given this, it seems best to assume the full node set as a starting point for an extend program which isn’t given external context.

Example extend programs

To illustrate the ideas described above, here are two example extend programs which implement XPath and annotation query use-cases. In these examples, a global path set is assumed, but in practice this would be passed between calls to the extend function.

//book/chapter/following-sibling::chapter. Using the child and following-sibling axes and assuming that each node has a element label corresponding to the element name:

extend child             [inf] element='book'    [end]
extend child             [1]   element='chapter' [end]
extend following-sibling [inf] element='chapter' [end] 
In this example, the first instruction finds book nodes along the child axis and is allowed to make any number of steps; only the final node in the path needs to be a book. The second instruction finds chapter nodes one step away. The third instruction makes any number of steps along the following-sibling axis to find another chapter node; again, only the end node is tested. The result of this program would be a set of paths; to generate the equivalent set of nodes as the XPath expression, the end nodes of each path can be taken.

Find sequences of any number of consonant phonemes followed by a single vowel followed by any number of consonants: C+VC+. Assuming a following axis which connects nodes in sequence, the extend program is:

extend following [0,inf] 'type=phoneme&label=C' [0,all]
extend following [1]     'type=phoneme&label=V' [all]
extend following [1,inf] 'type=phoneme&label=C' [all]
Here, the first instruction extends the start set of all nodes by between zero and infinity steps ensuring that all nodes, including the start node, are labeled C. The second instruction extends by exactly one step to a node labeled V. The third instruction is similar to the first, but doesn’t constrain the zeroth node (which we know is a V).

Syntax

While extend programs might provide the required expressive power, it is clearly not suitable for end users, and a more natural query syntax is desirable. An obvious choice is to stay close to the XPath syntax and add new elements for the step and where parameters. One possible syntax would be:

/axis{step}::condition::{where}
The step and where components would default to their XPath values (step = 1 and where = end). The condition component would be a literal name (which would match against some application defined property of the node) with an optional node predicate in square brackets as per XPath. Using this syntax, the second example above would be written:
/following{0,inf}::phoneme[label=C]::{0,all}
/following::phoneme[label=C]
/following{0,inf}::phoneme[label=C]::{0,all}
where path elements are broken for readability. Here the type label has been used as the default node test.

Some short forms could be defined for step and where; for example, + for {1,inf}, and* for {0,inf}:

/following*::phoneme[label=C]::+
/following::phoneme[label=C]
/following*::phoneme[label=C]::+

In practice, it is likely that an abbreviated syntax might be defined in particular application areas; for example, the most common axis (such as child in XML) might be omitted, and short forms might be introduced for common node tests.

Applications of generalized XPath

This section illustrates potential applications of the generalized path language in a number of domains. These examples are necessarily contrived since the focus of this work has been on annotation query. However, they serve to illustrate the properties of the path language in a more familiar domain for most readers. The point of the examples is not “here’s something I can do that no other QL can” but merely to show that in some possibly interesting use cases, the generalized path language provides a clear way of expressing the desired result.

RDF

The RDF [Resource Description Framework] http://www.w3.org/RDF/ defines a data model for metadata consisting of a directed graph of nodes representing resources or values and edges representing relationships. As such, RDF graphs provide an interesting test case for the new path language.

RDF graphs are constructed from a set of triples of the form [subject, predicate, object], for example, [http://www.ics.mq.edu.au/~cassidy/, creator, Steve]. The subject is also referred to as a resource and is usually written as a URI. The object can be either a literal value or a resource URI which itself might be the subject of another triple.

The primary feature of the path language of use in RDF is the ability to define new axes based on the properties of the graph. While there are a number of ways of doing this, one obvious mapping is to use predicate names (edge labels) as axes names. This gives a potentially limitless set of axes, but an application might restrict the set via a schema and, hence, allow more efficient evaluation of path expressions. In addition, an application might define compound axes based on traversal of multiple edges.

Other features in the path language — the extraction of paths and application of tests to whole paths — are less obviously applicable here. Returning whole paths would allow one to use the path language to generate subgraphs of the original RDF graph which might be useful as a query language component.

Node tests for RDF graphs could be simple comparisons with the node label — that is the resource name or literal value. As an alternative, the special status of the rdf:type might be recognized in the path language to allow selection of nodes based on type. So, the proposed path syntax for locating nodes in RDF graphs is:

predicate{step}::TypeOrLiteral[predicate]::{where}
where TypeOrLiteral is treated as a disjunctive comparison with the node label or the rdf:type of the node. predicate is a path expression which is evaluated with the context node set to the current node being matched.

RDF Calendar Query

Figure 4
[Link to open this graphic in a separate page]

A fragment of an RDF graph corresponding to an event with one attendee.

Libby Miller (http://swordfish.rdfweb.org/rdfquery/) provides a number of RDF query use cases illustrating the capabilities of the Squish RDF query language. One interesting case involves querying a collection of events (RDF icalendar) and associated information about individuals (RDF foaf). The graph in Figure 4 shows a small fragment of this kind of data. One example query (http://swordfish.rdfweb.org/discovery/2001/07/swws/) is "Get me details of all the events that Hendler is attending".

The key to this query is to locate just those events which have someone named Hendler as an attendee. This can be expressed as a path expression:

rdf:type{0}::ical:event[ical:attendee::event/foaf:name::*[match(hendler)]
Having located the event, the items of interest can be identified by relative path expressions:
ical:description::*/
ical:dtstart::*/util::hour:*
ical:dtstart::*/util::minute:*
Clearly, these components would need to be integrated into some processing framework, such as that provided by XQuery or XSLT. It would not be too much of a stretch to imagine an XSLT-like stylesheet language for deriving XML serializations of RDF graphs.

Finding friends

As a simple example of how the step feature might be used with RDF, consider the problem of finding friends within a FOAF [Friend of a Friend] RDF datastore. For example, if I want to throw a party and invite all my friends and their friends and their friends I could use the path expression:

ical:type::foaf:person[foaf:name::match(cassidy)]/
foaf:knows{1,3}::person/foaf:name::*
to find the names of all of these people. The step restriction {1,3} will allow traversal of between 1 and three foaf:knows edges. The names of the people we want to invite are the final nodes on all the paths returned from the above path expression.

XML query

While XPath does a good job of selecting elements from XML documents, there are a few use cases that might benefit from the additional features described here.

Building annotations

When dealing with legacy annotations, it is often useful to be able to add new structure to a document marked up under an older schema. In some cases, this means adding new parent elements to group a sequence of elements. An example here might be a book which has been marked up as a series of chapters, each with a type attribute indicating its status in the book:

<book>
  <chapter type='preface'> </chapter>
  <chapter type='body'>  </chapter>
  <chapter type='body'>  </chapter>
  <chapter type='appendix'>  </chapter>
  <chapter type='body'>  </chapter>
  <chapter type='appendix'>  </chapter>
  <chapter type='index'>  </chapter>
</book>
We might like to select the sequence of one or more body chapters followed by one or more appendix chapters and group them into a part. This could be done with the path expression:
following-sibling*::chapter[@type='body']/
following-sibling*::chapter[@type='appendix']
A document processor could then build new structure around the element sequences returned by the path expression.

Implementation

We have a trial implementation of an extend program interpreter written in Tcl which can operate on arbitrary graph data. Demonstrations have been build for XML and generalized graphs, and we are working on an implementation for Annotation Graphs. Currently, no parser for the syntax shown above has been written as there are still many questions about how this will work in general.

In our implementation, a new graph type can be added to the system by defining a single class which provides an interface to the underlying graph. This class defines methods corresponding to the axes defined on the graph, each of which takes a node and returns the node set found by traversing the axis in the underlying graph. It is relatively straightforward then to implement a class as a wrapper around an existing graph data model such as XML.

The algorithm used to process extend programs is a simple naive one which doesn’t try to optimize the process in any way. The main process is to build up the path set by calling the axis methods of the underlying graph following the step and where parameters passed to extend. The Tcl extend function takes an additional argument of a path set so that subsequent calls to extend can build on earlier ones. Here is a sample extend program which implements the C+VC+ query from earlier in the paper:

set p [extend following {0 inf} "[has type phoneme] && [has name c]" 1 {}]
set p [extend following 1 "[has type phoneme] && [has name v]" all $p]
set p [extend following all "[has type phoneme] && [has name c]" all $p]

It is worthwhile acknowledging that an efficient implementation of this language might be very difficult given its generality. However, we’ve not yet looked at this aspect of the problem. It may be that restrictions need to be placed on the language to enable efficient implementation.

Discussion

In conclusion, XPath can be generalized to a new path language for generalized edge labeled directed graphs, which provides a very consistent set of semantics for axes and path extension. This new language reduces to XPath as a special case where only single steps are taken along an axis and conditions are only applied to nodes at the end point of the path. We believe that this is an interesting generalization which might form the basis of a powerful and natural query language for linguistic annotations and which might find application in other areas.


Acknowledgments

I’d like to thank Dan Steffan for the implementation of the prototype extend program interpreter and discussions on the graph searching problem.


Bibliography

[Abiteboul 97] Abiteboul, S. Querying semi-structured data, in Proceedings of the 6th International Conference on Database Theory, 1997.

[Bird & Liberman 01] Bird, Steven, and Liberman, Mark. “A formal framework for linguistic annotation”. Speech Communication 33(1, 2), pp 23–60, 2001.

[Bird et al. 00] Bird, Steven, Buneman, Peter, and Tan, Wang-Chiew. Towards a query language for annotation graphs, in Proceedings of the Second International Conference on Language Resources and Evaluation, pp 807-814, 2000.

[Cassidy 01] Cassidy, S., and Harrington, J. “Multi-level annotation in the Emu speech database management system”. Speech Communication 33, 61–77, January 2001.

[Cassidy 02] Cassidy, S. XQuery as an Annotation Query Language: a Use Case Analysis, in Proceedings of the Language Resources and Evaluation Conference. Las Palmas, Spain, 2002.

[Cassidy et al. 00] Cassidy, Steve, Welby, Pauline, McGory, Julie, and Beckman, Mary. Testing the Adequacy of Query Languages Against Annotated Spoken Dialog, in Proceedings of the Speech Science and Technology Conference, Canberra, December 2000.

[Christophides et al. 94] Christophides, V., Abiteboul, S., Cluet, S., and Scholl, M. “From structured documents to novel query facilities”. SIGMOD RECORD, 23(2):313–324, June 1994. http://citeseer.nj.nec.com/christophides96from.html.

[Church & Mercer 93] Church, K.W., and Mercer, R.L. “Introduction to the special issue on computational linguistics using large corpora”. Computational Linguistics 19(1):1–24, 1993.

[Consens & Mendelzon] Consens, M.P., and Mendelzon, A.O. Graphlog: A Visual Formalism for Real Life Recursion, in Proc. of the ACM Symp. on Principles of Database Systems. 1990. http://citeseer.nj.nec.com/consens90graphlog.html.

[McKelvie et al. 01] McKelvie, D., Isard, A., Mengel, A., Grosse, M., and Klien, M. “The MATE Workbench — an annotation tool for XML coded speech corpora”. Speech Communication 33, January 2001.

[Mendelzon & Wood] Mendelzon, Alberto O., and Wood, Peter T. “Finding Regular Simple Paths In Graph Databases”. SIAM Journal of Computing, vol. 24, no. 6, Dec. 1995. http://citeseer.nj.nec.com/mendelzon89finding.html.

[TEI] Sperberg-McQueen, C.M., and Burnard, L. (eds.). TEI P4: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium. XML Version: Oxford, Providence, Charlottesville, Bergen. http://www.tei-c.org/.

[Thompson & McKelvie 97] Thompson, H.S., and McKelvie, D. Hyperlink semantics for standoff markup of read-only documents, in SGML Europe 1997.http://www.ltg.ed.ac.uk/~ht/sgmleu97.html.

[TIMIT] Garofolo, John S. et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Available from the LDC. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.

[Wadler 00] Wadler, Philip. A formal semantics of patterns in XSLT, in Markup Technologies ’99. Philadelphia, December 1999.http://citeseer.nj.nec.com/wadler00formal.html.



Generalizing XPath for directed graphs

Steve Cassidy [Centre for Language Technology]
Steve.Cassidy@mq.edu.au