There have been a number of formal models proposed for XML. Almost all of these models focus on treating XML as semistructured data, typically as edgelabeled graphs, or as a tree of nodes. While this approach is fine for more traditional database applications, or situations where structure is of paramount importance, it fails to deal with the issues found in the use of XML for text. In particular, unless special functions are introduced, queries that involve text spanning node boundaries are not closed over a set of nodes, and likewise, the returned sets of nodes are not necessarily wellformed XML documents. This paper presents an algebra for data that, when applied to XML, is not only closed over the entire set of operations permissible in more traditional XML query languages, but over the operations that involve text and XML fragments that are not in themselves wellformed.
XML Source  PDF (for print)  Author Package  Typeset PDF 
There are many situations where a sequence^{1} of data is given, and some structure is to be inferred such that the data might be further manipulated. Examples of this include traditional flat file data formats, DNA Sequences, and plain textual data. As an example, take the following piece of text.
This is a piece of italic text.
This is a sequence of characters, but there is some internal structure in the italicized portion that could possibly be manipulated. In languages such as XML, the structure of the data is made explicit via embedded markup, as shown below.
<doc>This is a piece of <i>italic</i> text.</doc>
This is a convenient way to make the implicit structure of the data explicit, and hence simplifies many forms of processing. Such markup typically needs to be nested with no overlap, though it is acknowledged that real text is often not so conveniently structured, as evidenced by papers dealing[durandMarkup][sperbergDesign1] with the issues of using SGML or XML to markup texts in the humanities.
Given the typical nesting structure, the markup lends itself to interpretation as a tree, so it is hardly surprising to see the tools and APIs supporting embedded markup to model it as such[w3cdom][w3cxsl], though there are some exceptions[adler]. One of the questions raised by using embedded markup is whether the markup is part of the text, or whether it is simply metadata about the text. In other words, is the tree the real document, or is the sequence of characters all there is? For the XMLsavvy, the answer is generally that the tree is the canonical form because it represents the inherent structure of the document. An alternate, and possibly older view of documents, called parallel markup [nelsonLit] [nelsonEmbed], asserts that the markup is not part of the text, but rather an interpretation imposed over a range of characters.
This paper defines an algebra called Core Range Algebra that provides a formalism for the use of ranges as a means of dealing with the structure in sequences of data. The motivation behind Core Range Algebra is to provide a basis for a complete formalism bridging the worlds of structured data, such as XML, and semistructured data such as memos.
This paper introduces Core Range Algebra, an algebra of ranges. Core Range Algebra alone is not powerful enough to describe the markup structures possible with XML, or indeed of any nested embedded markup structure. That said, it is capable of modeling overlapping ranges of data, which XML and most typical markup languages disallow.
Core Range Algebra is defined as an algebra over a sequence(S) of items(I) from a finite set forming an alphabet (A) and is closed over all operations on A. At the heart of CRA are the abstract data types sequence, and range described below. In the descriptions, a period is used to denote access to an attribute of a given object, so foo.bar means “the bar attribute of foo”.
The sequence data type is one of the fundamental abstract data types in computer science: it is a completely ordered finite set of items. We use the following definitions:
itemI 
A member of the alphabet(A), or . There is no restriction on the format or structure of an item. 
alphabetA 
A finite set of items(I). 
sequenceS 
An ordered list of items(I) from an alphabet(A) such that . 
lengthL 
The number of items held by the sequence. 
positionP 
An index into the sequence which identifies an item. The position will be in 0..n1 where n is the length of the sequence. 
Empty: sequence IsEmpty: sequence → boolean Contains: sequence item → boolean Equals: sequence sequence → boolean Diff: sequence sequence → sequence Union: sequence sequence → sequence Intersect: sequence sequence → sequence Concat: sequence sequence → sequence Insert: sequence position item → sequence Delete: sequence position → sequence ItemAt: sequence position → item or null Last: sequence → item or null First: sequence → item or null
IsEmpty(S) 
Tests whether S contains any items, or S.length > 0. 
Contains(S,I) 
Tests whether S contains an item equal to I. 
Equals(S,T) 
Tests whether S and T are each of the same length, and that each item at a given position in S is equal to the item at the same position in T. 
Diff(S,T) 
Returns a sequence containing the items in S that are not contained in T. 
Union(S,S) 
Returns a sequence containing the union of S and T. 
Intersect(S,T) 
Returns a new sequence holding the intersection of S and T. 
Concat(S,T) 
Concatenate S and T and return a new sequence comprised of the members of S and T in the order (S1...Sn,T1...Tn). 
Insert(S,P,I) 
Insert I into S at the position P and return a new sequence. If P is out of bounds no change occurs (this could also be raised as an error condition). This operation could be defined in terms of insertion (copying) into an empty sequence. 
Delete(S,P) 
Delete the item at position P from the sequence S thereby creating a new sequence. If P is out of bounds no change occurs (this could also be raised as an error condition). Note that this could also be defined in terms of insertion (copying) into an empty sequence. 
ItemAt(S,P) 
Return the item at position P within the sequence S, or null if P is out of bounds (this could also be raised as an error condition). 
Last(S) 
Return the last item from S, or if S is empty, null. 
First(S) 
Return the first item from S, or if S is empty, null. 
Perhaps the most important operator over sequences is the concatenation operator. Given 2 sequences, a third sequence is created by joining the 2 sequences together. There is a monoid relationship in this operator given that the empty sequence represents the unity item, and it also provides us with the means to derive the Kleen closure of a subset, which is critical to regular expressions.
The range data type is defined in terms of a sequence S. A range is essentially a marker for a subsequence of the sequence S, and consists of a position within S, and the length of the subsequence. We use the following definitions:
range 
A {start, length} pair, where start is a position and length is the number of items included in the range.These can be accessed using the dot operator. For example, if we have a range R, we would write R.start or R.length. 
StartsBefore: sequence sequence → boolean StartsAfter: sequence sequence → boolean StartsWithin: sequence sequence → boolean EndsBefore: sequence sequence → boolean EndsAfter: sequence sequence → boolean EndsWithin: sequence sequence → boolean Within: sequence sequence → boolean Same: sequence sequence → boolean Create^{2} : position length → range Flatten: range sequence → sequence EndOf: range → position Move: range position → range Resize: range length → range
The following predicates are defined for ranges, where each takes two ranges, a and b. Note that all predicates are defined in terms of relationships between the start and length of ranges, hence the operations should be very efficient.
StartsBefore(a, b) 
Tests whether a starts before b, or . 
StartsAfter(a, b) 
Tests whether a starts after b, or . 
StartsWithin(a, b) 
Tests whether a starts within b, or . 
EndsBefore(a, b) 
Tests whether a ends before b, or . 
EndsAfter(a, b) 
Tests whether a ends after b, or . 
EndsWithin(a, b) 
Tests whether a ends within b, or . 
Within(a, b) 
Tests whether a contains b, or . Note that this is equivalent to saying and can be derived from that. 
Same(a, b) 
Tests whether the start and length of a are the same as the start and length of b, or . 
Create(P,L) 
Given a position P and a length L, create a new range. Here both P and L are positive integers. 
Flatten(R,S) 
Given a sequence S, and a range R, create a new sequence holding the subsequence of S identified by the position and length of R. This is essentially the same as copying all the items in the subsequence into a newly created sequence using Insert. 
EndOf(R) 
Calculate the last position identified by the range R, or . 
Move(R, n) 
Move the range R forward of backward by n, or . 
Resize(R, n) 
Make the range R larger or smaller by n items, or . 
In addition to the fundamental data types sequence and range, Core Range Algebra makes use of sequences of ranges. Functions defined in terms of sequences of ranges can be used to infer structure based on the patterns of containment and other such relationships between ranges.
StartOrder: sequence → sequence EndOrder: sequence → sequence Unique: sequence → sequence Ancestor: range sequence → sequence Descendant: range sequence → sequence Parent: range sequence → range Child: range sequence → range
StartOrder(S) 
Given a sequence of ranges S, return a new sequence such that the ranges are ordered in increasing order based on the start of the ranges. If more than one range has the same start, these ranges are sorted by length.

EndOrder(S) 
Given a sequence of ranges S, create a new sequence such that the ranges are ordered in increasing order based on the EndOf function on the ranges. If more than one range has the same EndOf value, these ranges are sorted in increasing order of length.

Unique(S) 
Given a sequence of ranges S, create a new sequence such that no two ranges have the same start and length. 
Ancestor(R, S) 
Given a sequence of ranges S, and a range R, return all ranges in S that are ancestors of R, or in other words, all ranges that completely contain R.

Descendant(R, S) 
Given a sequence of ranges S, and a range R, return all ranges in S that are descendants of R, or in other words, all ranges that are completely contained by R.

Parent(R, S) 
Given a sequence of ranges S, and a range R, determine which, if any, ranges in S directly contain R such that their start is closest to R.start.

Child(R, S) 
Given a sequence of ranges S, and a range R, determine which, if any, ranges in S are contained by R, and by no other range whose start is greater than the start of R. In other words, determine which, if any, ranges are contained by R, and whose parent is R.
Note that the list of children can be ordered by applying order to the result of this function. 
The following grammar defines the expression language or Core Range Algebra; a very typical functional language. Expressions are evaluated in a dynamic environment with lexical scoping.
name  Name  
function name  Function  
variable  Var  
constant  Const  
comparison operator  Op _{eq} ::= =  !=  <  >  <=  >=  
arithmetic operator  Op _{bin} ::= +    *  /  %  
boolean operator  Op _{bool} ::= &&    
binary operator  BinaryOp ::= Op _{eq}  Op _{bin}  Op _{bool}  
unary operator  UnaryOp ::= +    not  
expression  Expr ::= Const  constant expression 
 Expr BinaryOp Expr  binary expression  
 UnaryOp Expr  unary expression  
 if Exp then Exp else Exp  conditional expression  
 let Var = Exp do Exp  local binding  
 Function(Expr; ... Expr)  function application  
 for Var in Expr do Expr  iteration expression  
 error  error expression 
In addition to the core expression languages, the functions defined above for the sequence and range data types can be used in expressions, as can the dot notion.
Core Range Algebra, as defined in the paper, is a simple algebra that operates over sequences, ranges, and sequences of ranges. The intent of Core Range Algebra is to provide a basis for manipulating sequences of data, especially text, as a sequence of items, and to provide means for layering higherlevel interpretations over that substrate such that operations at the higher level can be defined in terms of the underlying model.
What is woefully missing in the existing specification is a means for constructing sequences of ranges from an underlying sequence. Given the Kleen closure, we have a formal basis for creating ranges based on regular expressions. This will not only provide a powerful means for creating ranges over unstructured texts, but also gives a means for inferring structurally recursive data structures, or to infer higher level structures over previously inferred structures.
One obvious further extension is to expand the Core Range Algebra to fully cover XML. An approach much like REX or Regular Fragmentation can be used to build the structures of XML, from the underlying sequences, and with additional functions for range construction such that elements, attributes, etc., can be constructed, a full query language can be defined over the Core Range Algebra defined above.
There is a fairly large body of work available to support aspects of Core Range Algebra. In terms of data model Alignment Calculus[grahneSafety], a declarative string database query language is in many ways similar, as is SRS[coupayeSRS] a very widely used system in the gene research field. Of course, the basic idea of parallel markup came from Ted Nelson [nelsonLit].
For querying and algebra for semistructured data or XML, there is an large and growing amount of literature. In Buneman[unQL] a complete query language and algebra based on structural recursion is presented. In Fernandez[xmlMonad] a proposal for a formal algebra for XML Query is presented which might become the standard for nodecentric processing. There are many other papers related to retrieval of semistructured data, including SAL[beeriSAL], XIRQL[fuhrXIRQL] and, of course, XQuery[wc3xquery] itself.
For the notion of parsing XML using regular expressions, regular fragmentations from Simon St. Laurent[regularFrag] and REX[cameron98] are especially applicable. Core Range Algebra provides a formal basis for both approaches.
Here “sequence” is not limited to textual data, but is closer to the mathematical sense. 

We will expand upon the notion of range creation later in this paper. 
[adler] Adler, Sharon C, DSSSL Document Style Semantics and Specification Language, 1989
[beeriSAL] Beeri, Catriel, and Yariv, Tzaban, SAL, An Algebra for Seminstructured Data and XML
[cameron98] Cameron, Robert D., REX: XML Shallow Parsing Using Regular Expressions, 1998
[coupayeSRS] Coupaye, Thierry, Extracting and Converting Data from Semistructured Biological Databanks with SRS, 1996
[durandMarkup] Durand, David G., DeRose, Steven J., and Mylonas, Elli, 1996
[fuhrXIRQL] Fuhr, Norbert, and Grobjohann, Kai, XIRQL: A query language for information retrieval in XML documents
[grahneSafety] Grahne, Gosta and Nykanen, Matti, 1997
[nelsonEmbed] Nelson, Theodor Holm, Embedded Markup Considered Harmful, Fall 1997
[nelsonLit] Ted Nelson, Literary Machines, 1998
[regularFrag] St. Laurent, Simon, Regular Fragmentations, 2001
[sperbergDesign1] SperbergMcQueen, C. M., and Burnard, Lou, The Design of the TEI Encoding Scheme, 1995
[unQL] Buneman, Peter, Fernandez, Mary, and Suciu, Dan, A Query Language and Algebra for Semistructured Data Based on Structural Recursion
[w3cdom] W3C DOM Working Group, The Document Object Model (DOM) Level 1, 1998
[w3cxsl] W3C XSL Working Group, Extensible Stylesheet Language
[wc3xquery] W3C XQuery Working Group, XQuery 1.0: An XML Query Language
[xmlMonad] Fernandez, Mary, Simeon, Jerome, and Wadler, Philip, Query Language for Information Retreival in XML Documents