Core Range Algebra: Toward a Formal Model of Markup

Gavin Nicol

Abstract

There have been a number of formal models proposed for XML. Almost all of these models focus on treating XML as semistructured data, typically as edge-labeled graphs, or as a tree of nodes. While this approach is fine for more traditional database applications, or situations where structure is of paramount importance, it fails to deal with the issues found in the use of XML for text. In particular, unless special functions are introduced, queries that involve text spanning node boundaries are not closed over a set of nodes, and likewise, the returned sets of nodes are not necessarily well-formed XML documents. This paper presents an algebra for data that, when applied to XML, is not only closed over the entire set of operations permissible in more traditional XML query languages, but over the operations that involve text and XML fragments that are not in themselves well-formed.

Keywords: Modeling; Querying

Gavin Nicol

Gavin Nicol is a Chief Technology Officer, and co-founder of Red Bridge Interactive, Inc. of Providence, RI. Gavin is one of the pioneers in WWW I18N, and has been heavily involved in the XML, XSL, XLL, HTML, HTTP and DOM standards. Gavin lived in Japan for over 10 years, and wrote the award-winning DynaWeb product, the first commercial server to support runtime conversion from SGML to HTML using stylesheets. At RBII, Gavin steers the companies technical vision as it relates to their XML-based repository and content management products.

Core Range Algebra

Toward a Formal Model of Markup

Gavin Nicol [Red Bridge Interactive, Inc.]

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 Gavin Nicol. Reproduced with permission.

Introduction

There are many situations where a sequence1 of data is given, and some structure is to be inferred such that the data might be further manipulated. Examples of this include traditional flat file data formats, DNA Sequences, and plain textual data. As an example, take the following piece of text.

This is a piece of italic text.

This is a sequence of characters, but there is some internal structure in the italicized portion that could possibly be manipulated. In languages such as XML, the structure of the data is made explicit via embedded markup, as shown below.

<doc>This is a piece of <i>italic</i> text.</doc>

This is a convenient way to make the implicit structure of the data explicit, and hence simplifies many forms of processing. Such markup typically needs to be nested with no overlap, though it is acknowledged that real text is often not so conveniently structured, as evidenced by papers dealing[durandMarkup][sperbergDesign1] with the issues of using SGML or XML to markup texts in the humanities.

Given the typical nesting structure, the markup lends itself to interpretation as a tree, so it is hardly surprising to see the tools and APIs supporting embedded markup to model it as such[w3cdom][w3cxsl], though there are some exceptions[adler]. One of the questions raised by using embedded markup is whether the markup is part of the text, or whether it is simply metadata about the text. In other words, is the tree the real document, or is the sequence of characters all there is? For the XML-savvy, the answer is generally that the tree is the canonical form because it represents the inherent structure of the document. An alternate, and possibly older view of documents, called parallel markup [nelsonLit] [nelsonEmbed], asserts that the markup is not part of the text, but rather an interpretation imposed over a range of characters.

This paper defines an algebra called Core Range Algebra that provides a formalism for the use of ranges as a means of dealing with the structure in sequences of data. The motivation behind Core Range Algebra is to provide a basis for a complete formalism bridging the worlds of structured data, such as XML, and semistructured data such as memos.

Core Range Algebra

This paper introduces Core Range Algebra, an algebra of ranges. Core Range Algebra alone is not powerful enough to describe the markup structures possible with XML, or indeed of any nested embedded markup structure. That said, it is capable of modeling overlapping ranges of data, which XML and most typical markup languages disallow.

Core Range Algebra is defined as an algebra over a sequence(S) of items(I) from a finite set forming an alphabet (A) and is closed over all operations on A. At the heart of CRA are the abstract data types sequence, and range described below. In the descriptions, a period is used to denote access to an attribute of a given object, so foo.bar means “the bar attribute of foo”.

Sequence Data Type

The sequence data type is one of the fundamental abstract data types in computer science: it is a completely ordered finite set of items. We use the following definitions:

Definitions

itemI

A member of the alphabet(A), or . There is no restriction on the format or structure of an item.

alphabetA

A finite set of items(I).

sequenceS

An ordered list of items(I) from an alphabet(A) such that .

lengthL

The number of items held by the sequence.

positionP

An index into the sequence which identifies an item. The position will be in 0..n-1 where n is the length of the sequence.

Syntax

              Empty:     sequence
              IsEmpty:   sequence → boolean
              Contains:  sequence
              item → boolean
              Equals:    sequence
              sequence → boolean
              Diff:      sequence
              sequence → sequence
              Union:     sequence
              sequence → sequence
              Intersect: sequence
              sequence → sequence
              Concat:    sequence
              sequence → sequence
              Insert:    sequence
              position
              item → sequence
              Delete:    sequence
              position → sequence
              ItemAt:    sequence
              position → item or null
              Last:      sequence → item or null
              First:     sequence → item or null
            

Predicates

IsEmpty(S)

Tests whether S contains any items, or S.length > 0.

Contains(S,I)

Tests whether S contains an item equal to I.

Equals(S,T)

Tests whether S and T are each of the same length, and that each item at a given position in S is equal to the item at the same position in T.

Operators

Diff(S,T)

Returns a sequence containing the items in S that are not contained in T.

Union(S,S)

Returns a sequence containing the union of S and T.

Intersect(S,T)

Returns a new sequence holding the intersection of S and T.

Concat(S,T)

Concatenate S and T and return a new sequence comprised of the members of S and T in the order (S1...Sn,T1...Tn).

Operations

Insert(S,P,I)

Insert I into S at the position P and return a new sequence. If P is out of bounds no change occurs (this could also be raised as an error condition). This operation could be defined in terms of insertion (copying) into an empty sequence.

Delete(S,P)

Delete the item at position P from the sequence S thereby creating a new sequence. If P is out of bounds no change occurs (this could also be raised as an error condition). Note that this could also be defined in terms of insertion (copying) into an empty sequence.

ItemAt(S,P)

Return the item at position P within the sequence S, or null if P is out of bounds (this could also be raised as an error condition).

Last(S)

Return the last item from S, or if S is empty, null.

First(S)

Return the first item from S, or if S is empty, null.

Discussion

Perhaps the most important operator over sequences is the concatenation operator. Given 2 sequences, a third sequence is created by joining the 2 sequences together. There is a monoid relationship in this operator given that the empty sequence represents the unity item, and it also provides us with the means to derive the Kleen closure of a subset, which is critical to regular expressions.

Range Data Type

The range data type is defined in terms of a sequence S. A range is essentially a marker for a sub-sequence of the sequence S, and consists of a position within S, and the length of the sub-sequence. We use the following definitions:

Definitions

range

A {start, length} pair, where start is a position and length is the number of items included in the range.These can be accessed using the dot operator. For example, if we have a range R, we would write R.start or R.length.

Syntax

              StartsBefore:   sequence sequence → boolean
        StartsAfter:    sequence sequence → boolean
        StartsWithin:   sequence sequence → boolean
        EndsBefore:     sequence sequence → boolean
        EndsAfter:      sequence sequence → boolean
        EndsWithin:     sequence sequence → boolean
        Within:         sequence sequence → boolean
        Same:           sequence sequence → boolean

        Create2
              :           position
              length → range
              Flatten:        range
              sequence → sequence
              EndOf:          range → position
              Move:           range
              position → range
              Resize:         range
              length →  range
            

Predicates

The following predicates are defined for ranges, where each takes two ranges, a and b. Note that all predicates are defined in terms of relationships between the start and length of ranges, hence the operations should be very efficient.

StartsBefore(a, b)

Tests whether a starts before b, or .

StartsAfter(a, b)

Tests whether a starts after b, or .

StartsWithin(a, b)

Tests whether a starts within b, or .

EndsBefore(a, b)

Tests whether a ends before b, or .

EndsAfter(a, b)

Tests whether a ends after b, or .

EndsWithin(a, b)

Tests whether a ends within b, or .

Within(a, b)

Tests whether a contains b, or . Note that this is equivalent to saying and can be derived from that.

Same(a, b)

Tests whether the start and length of a are the same as the start and length of b, or .

Operations

Create(P,L)

Given a position P and a length L, create a new range. Here both P and L are positive integers.

Flatten(R,S)

Given a sequence S, and a range R, create a new sequence holding the sub-sequence of S identified by the position and length of R. This is essentially the same as copying all the items in the sub-sequence into a newly created sequence using Insert.

EndOf(R)

Calculate the last position identified by the range R, or .

Move(R, n)

Move the range R forward of backward by n, or .

Resize(R, n)

Make the range R larger or smaller by n items, or .

Sequences of Ranges

In addition to the fundamental data types sequence and range, Core Range Algebra makes use of sequences of ranges. Functions defined in terms of sequences of ranges can be used to infer structure based on the patterns of containment and other such relationships between ranges.

Syntax

              StartOrder:   sequence → sequence
              EndOrder:     sequence → sequence
              Unique:       sequence → sequence
              Ancestor:     range
              sequence → sequence
              Descendant:   range
              sequence → sequence
              Parent:       range
              sequence → range
              Child:        range
              sequence → range
            

Operations

StartOrder(S)

Given a sequence of ranges S, return a new sequence such that the ranges are ordered in increasing order based on the start of the ranges. If more than one range has the same start, these ranges are sorted by length.

EndOrder(S)

Given a sequence of ranges S, create a new sequence such that the ranges are ordered in increasing order based on the EndOf function on the ranges. If more than one range has the same EndOf value, these ranges are sorted in increasing order of length.

Unique(S)

Given a sequence of ranges S, create a new sequence such that no two ranges have the same start and length.

Ancestor(R, S)

Given a sequence of ranges S, and a range R, return all ranges in S that are ancestors of R, or in other words, all ranges that completely contain R.

Descendant(R, S)

Given a sequence of ranges S, and a range R, return all ranges in S that are descendants of R, or in other words, all ranges that are completely contained by R.

Parent(R, S)

Given a sequence of ranges S, and a range R, determine which, if any, ranges in S directly contain R such that their start is closest to R.start.

Child(R, S)

Given a sequence of ranges S, and a range R, determine which, if any, ranges in S are contained by R, and by no other range whose start is greater than the start of R. In other words, determine which, if any, ranges are contained by R, and whose parent is R.

Note that the list of children can be ordered by applying order to the result of this function.

Expressions

The following grammar defines the expression language or Core Range Algebra; a very typical functional language. Expressions are evaluated in a dynamic environment with lexical scoping.

Table 1
name Name  
function name Function  
variable Var  
constant Const  
comparison operator Op eq ::= = | != | < | > | <= | >=  
arithmetic operator Op bin ::= + | - | * | / | %  
boolean operator Op bool ::= && | ||  
binary operator BinaryOp ::= Op eq | Op bin | Op bool  
unary operator UnaryOp ::= + | - | not  
expression Expr ::= Const constant expression
  | Expr BinaryOp Expr binary expression
  | UnaryOp Expr unary expression
  | if Exp then Exp else Exp conditional expression
  | let Var = Exp do Exp local binding
  | Function(Expr; ... Expr) function application
  | for Var in Expr do Expr iteration expression
  | error error expression

In addition to the core expression languages, the functions defined above for the sequence and range data types can be used in expressions, as can the dot notion.

Conclusions

Core Range Algebra, as defined in the paper, is a simple algebra that operates over sequences, ranges, and sequences of ranges. The intent of Core Range Algebra is to provide a basis for manipulating sequences of data, especially text, as a sequence of items, and to provide means for layering higher-level interpretations over that substrate such that operations at the higher level can be defined in terms of the underlying model.

Future Work

What is woefully missing in the existing specification is a means for constructing sequences of ranges from an underlying sequence. Given the Kleen closure, we have a formal basis for creating ranges based on regular expressions. This will not only provide a powerful means for creating ranges over unstructured texts, but also gives a means for inferring structurally recursive data structures, or to infer higher level structures over previously inferred structures.

One obvious further extension is to expand the Core Range Algebra to fully cover XML. An approach much like REX or Regular Fragmentation can be used to build the structures of XML, from the underlying sequences, and with additional functions for range construction such that elements, attributes, etc., can be constructed, a full query language can be defined over the Core Range Algebra defined above.

Related Work

There is a fairly large body of work available to support aspects of Core Range Algebra. In terms of data model Alignment Calculus[grahneSafety], a declarative string database query language is in many ways similar, as is SRS[coupayeSRS] a very widely used system in the gene research field. Of course, the basic idea of parallel markup came from Ted Nelson [nelsonLit].

For querying and algebra for semistructured data or XML, there is an large and growing amount of literature. In Buneman[unQL] a complete query language and algebra based on structural recursion is presented. In Fernandez[xmlMonad] a proposal for a formal algebra for XML Query is presented which might become the standard for node-centric processing. There are many other papers related to retrieval of semi-structured data, including SAL[beeriSAL], XIRQL[fuhrXIRQL] and, of course, XQuery[wc3xquery] itself.

For the notion of parsing XML using regular expressions, regular fragmentations from Simon St. Laurent[regularFrag] and REX[cameron98] are especially applicable. Core Range Algebra provides a formal basis for both approaches.

Notes

1.

Here “sequence” is not limited to textual data, but is closer to the mathematical sense.

2.

We will expand upon the notion of range creation later in this paper.


Bibliography

[adler] Adler, Sharon C, DSSSL- Document Style Semantics and Specification Language, 1989

[beeriSAL] Beeri, Catriel, and Yariv, Tzaban, SAL, An Algebra for Seminstructured Data and XML

[cameron98] Cameron, Robert D., REX: XML Shallow Parsing Using Regular Expressions, 1998

[coupayeSRS] Coupaye, Thierry, Extracting and Converting Data from Semistructured Biological Databanks with SRS, 1996

[durandMarkup] Durand, David G., DeRose, Steven J., and Mylonas, Elli, 1996

[fuhrXIRQL] Fuhr, Norbert, and Grobjohann, Kai, XIRQL: A query language for information retrieval in XML documents

[grahneSafety] Grahne, Gosta and Nykanen, Matti, 1997

[nelsonEmbed] Nelson, Theodor Holm, Embedded Markup Considered Harmful, Fall 1997

[nelsonLit] Ted Nelson, Literary Machines, 1998

[regularFrag] St. Laurent, Simon, Regular Fragmentations, 2001

[sperbergDesign1] Sperberg-McQueen, C. M., and Burnard, Lou, The Design of the TEI Encoding Scheme, 1995

[unQL] Buneman, Peter, Fernandez, Mary, and Suciu, Dan, A Query Language and Algebra for Semistructured Data Based on Structural Recursion

[w3cdom] W3C DOM Working Group, The Document Object Model (DOM) Level 1, 1998

[w3cxsl] W3C XSL Working Group, Extensible Stylesheet Language

[wc3xquery] W3C XQuery Working Group, XQuery 1.0: An XML Query Language

[xmlMonad] Fernandez, Mary, Simeon, Jerome, and Wadler, Philip, Query Language for Information Retreival in XML Documents



Core Range Algebra

Gavin Nicol [Red Bridge Interactive, Inc.]