Architecture and Speed of Common XML Operations

Steven DeRose
sderose@acm.org

Abstract

XML documents are stored in a variety of forms, ranging from the XML character sequence itself as a plain text file, to very complicated binaries optimized for various operations, to relational database schemas also intended for certain operations. Each of these forms is better for some operations than for others; one cannot optimize for everything at once.

This paper reviews the core operations required of an XML retrieval application, and some basic approaches to fulfilling them. For each operation, it considers how costly it is under various architectural assumptions, and how such architecture tradeoffs can affect system usability. The key operations are chosen based on DOM, XPath, and the author's experience with XML systems and development.

These findings are significant for implementors of XML systems, particularly ones that have a DOM-like level either internally or in their public interface. However, they also have important effects on XSL users, who can easily create XSL programs that run afoul of implementation problems. This commonly shows up as a seemingly simple program that sometimes (or possibly always) takes an outrageously long time to finish (or to crash).

One of the less obvious findings is that although the organization of XML storage makes a huge difference, the form of identifiers used for XML locations does too. Thus, in addition to choosing an effective infrastructure (such as RDBs, text files, or hand-crafted binaries) and efficient use of that infrastructure's strengths, it behooves developers to choose effective location identifiers. Judicious use of XPath/XPointer child sequences can massively improve the speed of many common XML operations.

Keywords: Processing; XSLT; XPath; XPointer

Steven DeRose

Steven DeRose has been working with electronic books and hypertext since 1977 when he joined the team at Brown University working on FRESS [DeRo], the second hypertext system. His doctoral work is in natural language processing, specializing in computer analysis and modeling of large data collections. His 1987 article with Coombs and Renear [Coom] is a primary resource on the theory of descriptive markup.

DeRose has served, often as editor, on standards committees including XML, XLink, XPath, XPointer, TEI, OSIS, and others. He also co-founded Electronic Book Technologies where he architected the first SGML browser and search engine, DynaText, which was used for very large-scale documents such as Sun, SGI, Novell, and Boeing documentation, and resulted in 11 patents. After selling EBT he became Chief Scientist of Brown's Scholarly Technology Group, and now is an independent consultant and Chairman of the Bible Technology Group. He is a frequent speaker on markup, linguistics, literary hypertext, and other topics, and has published many articles, as well as books on SGML and HyTime.

Architecture and Speed of Common XML Operations

Steven DeRose [Bible Technologies Group]

Extreme Markup Languages 2005® (Montréal, Québec)

Copyright © 2005 Steven DeRose. Reproduced with permission.

Introduction

Given:

  • XML documents,
  • a certain manner of representing them, and
  • certain desired operations,

the very basic questions arise, "How fast is it?" and "How much space does it take?" This paper attempts to provide accessible yet specific answers to these questions.

Algorithm and data structure analysis is a well-established field in Computer Science. For some situations (and for precise analysis in many), the mathematics required is daunting. However, an approximate analysis is often relatively easy, and can be of much value in selecting technologies, or in analyzing existing systems. This paper presents approximate analyses for many XML operations, and discusses the architecture issues that affect them. It breaks no new ground in Computer Science. Rather, I hope to make that tilled ground accessible to many in the XML community who might otherwise find it rough going, and to provide a ready map relating the XML world to that known terrain.

Measurements such as "1/2 hour" or even "1/2 hour on a 1GHz G5" become obsolete almost immediately, especially because programs do not generally run twice as fast on a machine whose nominal "speed" is twice as high -- it doesn't work that simply. Therefore in analysis, the speed (or size) of particular hardware is factored out, and the result is stated in terms of what "order" of function must be used to get from the size of the input to the time and/or space required.

The classic example is sorting a list of N items. A person may alphabetize random folders by taking each in turn, and scanning the filing cabinet from A toward Z until the right place is reached. Then the new item is inserted, and the process is repeated. On average, each search goes halfway through the sequence of items that are already there; and on average, half the items are already there (at the beginning none; at the end, all but one). Thus, the average time to insert an item is very close to N/4.

Because this N/4 work must be done N times, this method of sorting takes time N*N/4. Since the factor of 1/4 will soon be overtaken by Moore's Law, better programming, or in this case a more dexterous file clerk, analysts discard it and state the speed as merely "O(N**2)". The "O" is read as "order", and the double asterisk indicates an exponent: N squared in this case.
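To make the arithmetic concrete, here is a minimal sketch of the clerk's method (insertion sort) in Python; all names are illustrative rather than taken from any particular system. The inner scan is the N/4 term; running it once per item gives the O(N**2) total.

   def clerk_sort(items):
       """Insertion sort: the file clerk's method."""
       sorted_items = []
       for item in items:                    # done N times
           i = 0
           while i < len(sorted_items) and sorted_items[i] < item:
               i += 1                        # on average about N/4 comparisons
           sorted_items.insert(i, item)      # file it in the right place
       return sorted_items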

Why does such analysis matter? For small cases it may not; as with most things, if the job is small enough you can use most any method to finish it. But try this small experiment: take any HTML file, and copy the body of it enough times so the file ends up being about a Megabyte long. Save it, then open it in any browser, or any current XML tool that shows the document fully rendered. Scroll to the bottom, resize the window horizontally, and go to lunch. The extreme delay happens because re-rendering is at best O(N) -- proportional to N, which is the length of the document -- and this already is prohibitive. An automatically formatted table with N rows is likely at least O(N**2) in most browsers, and so costs much more for its size.

This is tolerated because we as users seldom request the extremely slow operations. Horizontal resizing is fairly uncommon, but so are large documents on the Web. Internet capacity leads us to break up large documents; but so does this very slowness of rendering. So HTML authors are affected by low-level architecture choices that browser implementers and standards mavens made. Returning directly to XML operations, the XSLT programmer who asks for "preceding::*[1]" might be asking for an instantaneous operation, or one that takes time proportional to half the length of the document, or possibly worse. Thus, even for non-implementors some knowledge of these arcana is useful.

The key fact to notice about the order expressions is that no matter how fast a computer you eventually get, O(N**2) is always far worse than O(N) -- for large enough N. If you have a computer 1,000 times faster than mine (that is, about a decade newer), but I can do something in O(N) while you take O(N**2), then you take time N**2 while I take time 1000*N. I still win if we're sorting more than a thousand items. The same rule applies if your programmer writes 1,000 times faster code to execute the O(N**2) method. In the end, the O() value swamps advantages of computer speed or programming details.

Enumerating nodes on the XPath "preceding" axis directly from the XML text is O(N**2) if you don't have enough memory to keep track of them all in one pass (as for an aircraft manual or the entire XQuery spec). This is because as in the sort example above, a program would have to re-process the entire preceding portion for each step. Of course, the more nodes you can keep in memory at once the better you can do; but for a big enough document it won't matter how much memory you have. Thinking of it another way, for a small enough device (like a cell phone or a tiny sensor array) it will matter even for small documents.

Of course, O(N**3) is worse yet, and so on. Even worse is O(2**n), which will eventually pass up any square, cube, or other fixed exponent (for large N). Yet worse than that is a factorial function, as for example the famous "Blort Sort" algorithm: write each item on a card, then toss the cards in the air. Pick them up and check if they're in the correct order. If not, repeat.

There are far better sorts, such as Quicksort, which takes only time O(N*log(N)) on average, with a variation that is even faster (though the analysis took a dissertation [Sedg]). Fastest of all, for operations where it is achievable at all (not sorting), is a method that completes in constant time regardless of the size of the input: O(k). All of these cases can arise in XML processing; this is why if you ask for something the developers did not anticipate, you may find yourself waiting a long time.

Parameters of an XML document

The controlling parameters (the "N") for an XML document are primarily these:

First, the total number of elements. For many operations, this controls the space and time required. For almost all operations on a raw XML document, this is at least a factor. One can construct documents for which this is not true: for example, a one-element document with huge amounts of text. However, this is very unusual, and like some other possibilities mentioned below, we will ignore it here so the paper remains shorter than the OED.

Second, the fanout, or the number of children any given element has. The average fanout dominates some operations, such as concatenating all the text of a node, or finding all the descendants. Average fanout is commonly low in XML, but large cases arise: long series of paragraphs, or dictionaries structured with all entries as siblings.

Third, the depth, or the number of containing nodes from the root down to any given leaf. This mainly affects hierarchical operations such as examining ancestors, while fanout affects flat operations such as scanning siblings.

The maxima of these parameters tend to be unimportant to run-time, but can hit hard limits in some systems. For example, the extremely fast model discussed in [Shin] fails for fanout over 255, and for some smaller depth (both limits can be increased at moderate cost). Nearly all models fail at some hard limit: once 255 or 65,535, now more commonly 4 billion or (on 64-bit architectures) a mere 2**64, about 18 quintillion. This paper will ignore such limits.

Typically, node-types other than elements and text have little effect. PIs and comments act much like empty element nodes. In most implementations, attribute and namespace nodes are instantly available from the element bearing them, and seldom accessed in any other way, and so accessing them (besides being relatively rare) reduces to accessing their element nodes.

One example operation: getPrecedingSibling(node)

The operations considered below are gleaned primarily from XPath and DOM. There are many operations besides these; but the ones here are included because they are particularly frequent, or particularly fundamental in the sense that other operations can be built from them. Both criteria matter. It is possible to build an XML system with very few operations, from which all other operations can be built. However, building the other operations becomes more and more expensive as the number of primitive operations decreases. This, in fact, is a key principle in the analysis that follows: The more information your system keeps or builds ahead of time, the less it will have to figure out on the fly -- that often determines the overall costs.

To show how this fits together, we will first analyze some approaches to building the getPrecedingSibling(node) function. This could be used to implement XPath's preceding-sibling axis, or as part of various DOM functions.

If you only have getFollowingSibling(node), implementing getPrecedingSibling(node) can be very expensive -- depending on what else may be available. For example, if you can quickly retrieve the parent and the first-child of any node, you can construct getPrecedingSibling like this:

   p = getParent(node)
   currentNode = getFirstChild(p)
   while currentNode != null
      nextNode = getFollowingSibling(currentNode)
      if nextNode == node then return currentNode
      else currentNode = nextNode
   end
   return null

This is not so bad -- as long as getParent, getFirstChild, and (especially) getFollowingSibling are all quite fast. If their cost is negligible, the cost of the algorithm above is O(fanout) -- on average you will have to walk halfway through all the siblings. Since fanout in XML is often small (say, 10), this is probably acceptable.

Still, even with a fairly good method like this, practical realities can intrude, for example if each getFollowingSibling requires a disk seek (perhaps 250,000 times slower than in-memory operations). A program can afford only a few dozen disk seeks before an interactive user becomes bored (running the monthly payroll may garner a bit more user patience).

If getPreceding is available, one can traverse straight back with a simpler algorithm, walking through all the descendants of the precedingSibling, and then hitting the precedingSibling itself. At each step, the node would be tested to see if its followingSibling is the starting node. This method is typically slower than the previous, depending on the number of descendants, which will on average depend on the fanout and depth.

If your implementation lacks getParent and getPreceding, constructing getPrecedingSibling can be extremely expensive. It involves starting at the start of the document, which means a much longer walk forward: getting the previous sibling then becomes an operation of O(N) where N is the unpleasantly large total number of elements in an entire document.

As should now be clear, it matters little whether the named methods you can access in some API or product "provide" a given operation. For example, a DOM or other implementation could provide a getPrecedingSibling method, but construct it as shown above out of other things. This could be very slow, because whatever you build on top of it is compounded by the slowness below.

A classic example of this danger was sorting in HyperCard™. Early versions had no built-in sort for the lines of a field (for example). A sort could be built, but every time the sort algorithm needed to access, say, the 24th line, it had to parse the field to locate the 24th line-break. This took time proportional (on average) to the number of lines being sorted. Likewise for moving a line as needed. Thus, a sort changed from the typical O(N*log(N)), to at least O(N*N*log(N)) -- making it unusable for lists longer than 20-50 items. Splitting up a field first required O(N**2) time all by itself. However, when a built-in sort was added, which could directly manipulate the data it was sorting, the time dropped to O(N*log(N)), acceptable for up to thousands of items even on a slow machine.

The lesson here for XML is that it is important to know what the most basic underlying operations are in a given system and just how fast they are, because if they are not fast enough you cannot build anything fast on top.

With this background, we will now enumerate the basic operations needed, and then turn to the architectures used to store XML data. The two basic ways that developers optimize certain processes are to move data from disk to RAM, and (much better) to change the underlying access structures, usually by adding more pointers to the representation of each node. For example, if every node contains a direct reference to its preceding sibling, getting the preceding sibling need not be simulated by a combination of other operations.

What are the fundamental operations?

Typical processing of XML documents involves certain operations. We are not concerned with all possible operations, but with the fundamental operations needed to access, scan, and otherwise manipulate an XML document as XML.

Locate a particular node

Get node with ID/key X

Having an ID or other key on an element suggests that references to that element are likely. Yet searching for IDs is fundamentally unlike most other operations, because there is no necessary correlation between the text of an ID and its location. There are only two real alternatives for finding one: serial search (time proportional to document length), or consulting a pre-built list.

When working from XML source, serial search is the only option, and on average half the file must be searched. However, it is very easy to do better than this. Even if the source cannot be modified, the first time an ID is requested the search can be done, but at the same time a list can be made and saved for later.

A pre-built list occupies space proportional to the number of IDs, which is manageable for all but the largest documents. Most XML systems seem to keep such a list. The JavaScript DOM makes this available via getElementById(). But looking up an ID in a list is itself an operation of varying cost: If the list is completely unorganized it must be serially searched, so the only savings is that an ID list is typically much shorter than the whole document. However, if IDs are extremely dense this may not gain enough. Usually the list will be organized as a tree or "binary-search" list, so the entry for an ID can be found in time O(log(#IDs)); or even better as a hash table, where time should be constant and very fast. There is minimal space cost to using a hash table for this, and the gain can be huge.
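As a rough sketch of the pre-built list idea in Python, assuming a simple in-memory node object with (hypothetical) attributes and children members: one pass over the tree builds a hash table, after which lookups are constant-time on average.

   def build_id_index(root, id_attr="id"):
       """Walk the whole tree once, mapping each ID value to its node.
       Assumes nodes expose .attributes (a dict) and .children (a list)."""
       index = {}
       stack = [root]
       while stack:
           node = stack.pop()
           id_value = node.attributes.get(id_attr)
           if id_value is not None:
               index[id_value] = node
           stack.extend(node.children)
       return index

   # After the one-time build, each request is a single dictionary lookup:
   #    target = index.get("chap12.sec3")    # hypothetical ID value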

Hash tables, however, can be prone to subtle problems. One early system used only the first 16 characters when sorting IDs into a hash table, in an effort to save time. This worked extremely well until a document arrived that had over 2M IDs, all carefully assigned with long, mnemonic names to make them easy for users to manage. Unfortunately, the first 16 characters of all 2M IDs were the same, making them indistinguishable to the hash table, whose typical response went from far under a second, to nearly a minute. Because finding IDs is often a response to a user action such as following a link, time over about one second is problematic.

Get child N of node

When working from raw XML source, this operation is easily accomplished by parsing the originating node. The time is proportional to the length of the originating node (since on average half of the node will be parsed).

The usual enhancement is to maintain a firstChild and a nextSibling reference in every node, so the system can go to the first child and count across. This works fine because the time is proportional to fanout, and fanout is almost always quite small.
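For concreteness, a sketch of that enhancement in Python, assuming each node carries (hypothetical) firstChild and nextSibling references; the walk is O(fanout).

   def get_child_n(node, n):
       """Return the nth child (1-based), or None if there are fewer children."""
       child = node.firstChild          # assumed direct reference
       count = 1
       while child is not None and count < n:
           child = child.nextSibling    # assumed direct reference
           count += 1
       return child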

The fastest method is for every node to contain an array or other list of references to all of its children. This method is unusual, however, because it means that element nodes are of varying (and unlimited) size, which is inconvenient both in databases and in program code.

An excellent compromise can be to use data structures called "skip lists" [Pugh], which employ a clever set of forward references, some of them short and some long. This method, however, is only important when fanout is large, typically for documents with very long lists, such as dictionaries and other reference works with a very long, flat level (perhaps with many other levels too).

Compound hazards

The implementation for getChildN can have unexpected consequences. Although fanout is usually small enough that retrieving a single numbered child works fine regardless of method, it is important to notice that this very basic operation may be used repeatedly. For example, one might accumulate preceding siblings like this:

i = 1
while (i<startingChildNumber) 
   node = p.getChildN(i)
   myNodeList.add(node)
   i = i+1
end

This seems innocuous, but if getChildN() is being built from getFirstChild and getNextSibling (as it usually is) then every pass through the while-loop wastefully steps through the same siblings the prior pass did. This hidden cost raises the time to O(N**2) where N is the child number you are starting at -- very expensive. It would be far faster to ignore getChildN completely and enumerate the nodes directly using getNextSibling -- even though your system's documentation may not mention any distinction at all between getChildN and getNextSibling.
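A sketch of the faster alternative, under the same assumed firstChild/nextSibling references as above: walk the sibling chain once, so the whole accumulation is O(startingChildNumber) rather than O(N**2).

   def preceding_siblings(parent, starting_child_number):
       """Collect children 1 .. startingChildNumber-1 in a single pass."""
       nodes = []
       child = parent.firstChild
       i = 1
       while child is not None and i < starting_child_number:
           nodes.append(child)
           child = child.nextSibling
           i += 1
       return nodes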

Get node with XPointer child sequence CS

An XPointer child sequence is merely a list of child numbers separated by slashes. For example, /3/2/27 refers to the 27th sub-element of the 2nd sub-element of the 3rd sub-element of the root node. Child sequences provide a very handy way of getting to any element quickly: because getChildN(N) typically has a cost proportional to fanout, interpreting a child sequence only takes time proportional to the depth of the tree times the fanout. A slower getChildN would of course be very costly here.

Note: A precise analysis is much more complex. Deeper levels normally have far more nodes, and fanout also varies with level. Also, if the child sequences are used to represent XLink or other link targets (say, in an annotation system), far more links may be attached to paragraphs and smaller areas than to entire chapters and sections.
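For illustration, a sketch of resolving a child sequence such as /3/2/27 against an in-memory tree, again assuming (hypothetical) firstChild and nextSibling references; the cost is one O(fanout) walk per step, hence roughly depth times fanout overall.

   def resolve_child_sequence(root, child_sequence):
       """Resolve an XPointer-style child sequence such as "/3/2/27"."""
       node = root
       for step in child_sequence.strip("/").split("/"):
           n = int(step)
           child = node.firstChild           # assumed direct reference
           while child is not None and n > 1:
               child = child.nextSibling     # one O(fanout) walk per step
               n -= 1
           if child is None:
               return None                   # no such element
           node = child
       return node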

Get preceding/following sibling

Following sibling is almost always a quick operation, implemented by some sort of direct reference. Working directly from an XML stream, of course, the only alternative is to wait for the sibling to appear, raising the cost to involve the size of all the descendants of the starting node. The first example above showed how precedingSibling can be simulated using only getFirstChild and getFollowingSibling.

PrecedingSibling can be made just as fast as followingSibling, simply by storing the analogous direct reference. This is particularly useful to enhance backward traversal or scrolling operations, which otherwise may involve very many redundant calls to followingSibling.

This is commonly adequate, but XSLT programmers must be aware of the added cost. An implementation that uses this simulation may be extremely slow to enumerate the preceding-siblings axis in (reverse document) order. Even if the implementation does well, a manual loop to do the same thing may not.

Either forward or backward searching can be enhanced by the skip lists mentioned earlier, which facilitate getting the Nth preceding or following sibling. However, this is almost never done. In my opinion, the reasons for this are:

  • Skip lists are not as widely known as they deserve
  • For most documents, even the maximum fanout is fairly low
  • Most such requests count only element nodes, and building skip-lists that correctly side-step intervening comment, PI, and other nodes is non-obvious
  • Most requests also place conditions, as in an XPath request for the 3rd following sibling matching some particular characteristics. In such cases you have to test every intervening node for those characteristics, making skip lists or other optimizations for longer jumps useless.

Get preceding/following node

The issues for these operations are very much like those for preceding and following sibling operations, except that these axes are far longer, so the stakes are higher. The preceding operation may be rare, but the following operation is used to traverse any portion of the tree. For example, in a browser, every screen refresh may involve:

node = firstNodeOnScreen
while (screen is not full)
	send node to renderer
	node = following(node)
end

Traversing following nodes works very well from XML source: this is essentially what an XML parser provides. Because an XML parser (not a validator) can start easily at any element boundary, an implementation could even implement following via a parser: position it to the starting point, and count events. This is not possible in SGML unless extra information equivalent to a fragment declaration is available (see [Frag]).

These operations may seem to be odd axes in XPath, since they largely ignore the important hierarchical structure of an XML document. However, there are some essential tasks that cannot be modeled otherwise. For example, locating the preceding element of a given type, regardless of its context; this is commonly needed for elements like lists, figures, or footnotes, that must be numbered. Likewise, executing a link between bibliographies or tables of contents, and their referents, can be done by calculating the link target using these axes -- but not without them.

In practice, I have observed that these axes are most commonly qualified by element type: all preceding footnotes, etc. Similar goals are often approached using XPath's descendant axis, which shares with these axes the characteristic of being very large before filtering.

The JavaScript DOM operation getElementsByTagName() fits this use exactly. It can best be implemented separately, by collecting the appropriate lists or threads when a document is imported or loaded, rather than by resorting to complete node axes.

Assemble/scan a set of nodes

Assemble all CDATA of subtree / descendant

From XML source, this task is trivial: parse the node and grab all the text events that occur. This is essentially the task used in rendering a subtree as well, except that non-text events are discarded.

When building a binary representation from XML, there is value in not storing the text content inline with the structure. Fine architectures can be developed, in which all nodes' representations are the same size; this has advantages in memory management and retrieval, but is not practical with text included. If text is separate, however, it can easily be stored sequentially, so retrieval or search of any subtree's text content is trivial.

As a trivial example, the text can be stored in a single file in the order it occurred, while the node structure is stored elsewhere. The text file may or may not include spaces or some other separator demarcating text nodes (for some schemas, elements imply word-boundaries). This text can then be used for proximity and other text searching, while text nodes (and perhaps others) in the structure representation are used for traversal, node search, and the like. Each node can refer to the range of the text file it contains. To get from text to structure, either the tree can be traversed searching for the innermost node whose reference includes the desired text range, range-start, or range-end; or a table of text nodes mapping text offsets directly to nodes can be maintained (the latter method is easier to update).
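As a sketch of the offset-to-node mapping in Python, assume each structure node records the (hypothetical) textStart and textEnd offsets of the range of the text file it covers; the innermost node containing a given offset can then be found by descending from the root.

   def innermost_node_at(root, offset):
       """Find the deepest node whose text range [textStart, textEnd)
       contains the given offset into the separately stored text."""
       node = root
       while True:
           for child in node.children:
               if child.textStart <= offset < child.textEnd:
                   node = child              # descend into the containing child
                   break
           else:
               return node                   # no child contains the offset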

For most any representation, however, this operation should not be difficult.

Find parent/ancestor/ancestors

GetParent appeared earlier as part of simulating other operations. It is also important as the basis for the ancestor, ancestor-or-self and parent axes of XPath, and the DOM getParent operation. Any task involving the tree structure, such as making an outline, selecting successively larger container nodes, or inheriting formatting properties, will make heavy use of this operation.

getParent itself is very expensive to simulate using other typical properties: without a direct reference, the parent can only be found by scanning back through the preceding nodes (or re-walking the tree from the root), and either search is highly wasteful. This operation is also difficult from an XML stream.

Because of its prevalence and the difficulty of simulating it, getParent is usually supported by a direct reference in every node. Given that and the limited depth of most XML documents (I evaluated a wide range of documents and found typical depths range from 8 to 13), gathering or searching among the ancestors of a node is straightforward and quick.

Ancestor operations do, however, reveal a subtler problem, known as locality of reference. Most sibling and nearby-node operations deal with nodes that are quite close to the starting point (at least in terms of the XML stream, and commonly in terms of the structural representation as well). In part this is because most nodes are near the bottom of the tree: there is only one root, and typically few direct children of it. Most requests for the parent of a node involve similar proximity. However, every request for all ancestors involves the root node, which can be quite distant from the starting point (as can be other nodes on the path back to the root). If retrieving these nodes involves any access to slower storage such as the disk, the time can add up quickly. Because the topmost levels of the document hierarchy affect so many lower nodes, they should be given priority for storage, kept in fast memory as much as possible.

Find value of attribute A on nearest ancestor that has it

Attributes are generally associated only with the (element) node that bears them. There is seldom need or means to access them other than by name from that node, so providing direct reference from bearing nodes to their attributes usually suffices. The most common exceptions are:

  • Collecting all attributes of a node during export, copy, etc.
  • Searching over all attributes in the document, or more commonly all attributes of a given name or on a given element type. The most common example of this is locating IDs, treated earlier. Other examples include applying styles, or perhaps the worst case, "effectivity" attributes in technical documentation. Aircraft manuals, in particular, often have enormous numbers of sections that are switched on or off depending on which options are installed on each individual aircraft. Locating and organizing sections based on these applicability attributes requires special methods, but fortunately does not arise in most domains.

Test relationships

IsDescendantOf/isChildOf/isSiblingOf

IsDescendantOf and isChildOf are best performed upside-down, because getParent is already necessary: begin at the potential lower node, and run through its ancestors hoping to find the starting node. IsSiblingOf can be done most easily by comparing the parents of the two nodes for equality. These operations are not problematic.
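A sketch of these tests in Python, assuming each node stores a (hypothetical) parent reference as discussed under getParent above:

   def is_descendant_of(node, candidate_ancestor):
       p = node.parent
       while p is not None:
           if p is candidate_ancestor:
               return True
           p = p.parent
       return False

   def is_child_of(node, candidate_parent):
       return node.parent is candidate_parent

   def is_sibling_of(node, other):
       return node is not other and node.parent is other.parent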

Node equality

Whatever the identifiers for nodes (see below), they must be comparable in themselves: two distinct nodes must never have identical identifiers. This poses a small problem for relational databases, in which identical records are deemed indistinguishable. As with many other databases, records with too much (or all) data in common are disambiguated by an extra key field: John Smith of New York #1, vs. #2.

I have seldom encountered a user need to compare separate subtrees of XML documents for whether they have identical structure and content. However, this can arise with systems that make heavy use of entities. For example, in some industries warning messages are regulated by law, and XML users tend to place them in entities so that each warning can be re-used without risk of corruption or divergent editing. If an XML system forgets that the instances originate in the same entity, this purpose could be defeated, with possibly litigious or fatal results.

Does node N precede/follow/contain/occupy N2?

These four relationships are all the possible ordering relationships of nodes in ordered trees. They can be very important because the textual organization and order of XML information is significant. In many databases there is no point in asking which of two records is first, except in a particular report; but this is not generally true for XML data.

Containment and its inverse, occupation, are easily tested by enumerating one node's ancestors, and checking that list for the other node.

Precedence, however, is much harder when node identifiers do not incorporate some ordering information (that is, when they are opaque). Perhaps the most obvious method is to search the following (or preceding) nodes from one of the nodes, until the other is found (or the end of the document is reached). However, a much better solution is possible.

The ultimate ancestor of both nodes to be compared is the same: the root. So collecting all the ancestors for each node will yield two lists, at least the first nodes of which are equal. Scanning down both lists until the first place where they differ, the differing nodes are necessarily siblings. At that point comparing order only requires scanning through the set of siblings they are both in, rather than a much larger set of nodes.
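A sketch of that comparison in Python, again assuming parent, firstChild, and nextSibling references; it behaves like a conventional comparator, with an ancestor treated as preceding its descendants.

   def path_to_root(node):
       """Ancestors from the root down to the node itself (assumes .parent)."""
       path = []
       while node is not None:
           path.append(node)
           node = node.parent
       path.reverse()
       return path

   def compare_document_order(a, b):
       """Negative if a precedes b, positive if a follows b, 0 if the same node."""
       pa, pb = path_to_root(a), path_to_root(b)
       for x, y in zip(pa, pb):
           if x is not y:
               child = x.parent.firstChild   # x and y are siblings: scan them
               while child is not None:
                   if child is x:
                       return -1
                   if child is y:
                       return 1
                   child = child.nextSibling
       return len(pa) - len(pb)              # prefix case: the ancestor comes first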

A more detailed comparison algorithm, which works not only for nodes but for points within text node content, and for ranges extending between any two nodes or points, is available in the XPointer xpointer() scheme Working Draft [XPtr2].

Modify the structure

Insert node

Inserting a node requires time that depends mainly on the number of other nodes whose references must be updated, and the number of identifiers (if any) that must be modified.

For example, inserting a node will likely require setting the followingSibling reference in the node it goes after, and perhaps the precedingSibling reference in the followingSibling node (as well as the symmetrical references in the new node itself). The parent node may also need to be updated, as may any lists organized by element type, ID, and so on. If nodes contain their own child-numbers, then all following siblings must increment their child-numbers.
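As a sketch of the reference re-wiring alone (ignoring ID tables, type lists, and stored child numbers), assuming doubly linked siblings with hypothetical precedingSibling, nextSibling, and parent fields:

   def insert_after(existing, new_node):
       """Splice new_node into the sibling chain just after existing."""
       follower = existing.nextSibling
       new_node.parent = existing.parent
       new_node.precedingSibling = existing
       new_node.nextSibling = follower
       existing.nextSibling = new_node
       if follower is not None:
           follower.precedingSibling = new_node
       # ID tables, element-type lists, and any stored child numbers
       # would also need updating, as described in the text.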

If reference identifiers for nodes are opaque, then introducing a new one requires no changes to existing ones. However, if identifiers include any information about location, they may need to be updated as well. For example, if nodes are identified by their child-sequence numbers, the child sequence numbers of all following siblings must be incremented.

Any references to descendants of those following siblings must be updated as well. For example, inserting a node after node /1/5/2 means not only that the old /1/5/3 is now /1/5/4 (the new node becoming /1/5/3), but that the old /1/5/3/1 is now /1/5/4/1. Because of this, it may be wise not to store a complete child sequence in each node, but only the node's child number at its own level.

If the newly inserted node has descendants, they can be iteratively inserted, or in some architectures that will come for free because of their ancestor being inserted.

Delete node

Deleting a node involves almost the identical operations as inserting. The adjacent nodes must be updated, and if child numbers (or child sequences) are stored in nodes, following siblings (and their descendants) must be updated as well.

If the deleted node has descendants, it may or may not be intended that they be deleted as well. If they are deleted, under most architectures it is simple because they will not be directly referenced from outside the subtree they are in (except perhaps for appearance in lists by ID, element type, etc.). If they are not deleted, the direct children of the deleted node conventionally become children of the parent of the deleted node, being inserted as just described.

Some systems speed up deletions by merely marking nodes as deleted rather than actually deleting them, and cleaning such nodes up en masse later. In that case operations such as counting node position or retrieving the nth node must account for any pseudo-deletions they encounter.

Move node

Finally, the operation of moving a node reduces to deleting it from its original position and inserting it at the new position.

Implementation characteristics and their effects

Ideally, each of these operations would be performed in constant time (O(k)), or constant time per item returned. An implementation can come closer to this ideal mainly by trading space for time: that is, it can keep more direct references around so that the often-required information is readily available.

Several main approaches are used for managing XML documents:

  • Working straight from raw XML source -- not discussed further here
  • Creating an XML-specific binary format
  • Loading XML into a relational database (in one of several ways)

XML-specific binaries

[Smit] and [DeRo] both took this approach, and implemented structures with rapid node retrieval and several references in each node: parent, first child, preceding and following siblings, text content of a text node, subordinate attributes. Nodes might also contain their own child number, their global number among nodes of the same type, and their element type. With this set of information (and each reference resolvable in constant time, whether in RAM, via a unique indexed database key, or via a hash table or similar construct to disk), many needed XML operations can be well optimized.

In addition, architectures commonly include tables of IDs, and of all elements of (some or all) particular types. This is not a prohibitive amount of data; using 4-byte node identifiers (typical these days, though inadequate for a few extremely large documents), this amounts to about 40 bytes per node with no compression tricks.

Operations are coded individually to make best use of the information compiled during a "build" process. If a needed operation is too slow, the preferred solution is to add information to the nodes. Thus, this approach can achieve the desired performance for operations because they are optimized by the architect. However, there is a lot of data per node.

A subtle but useful space savings can be achieved by noticing that the following node is by definition equal to the first-child node if there is one, and otherwise is the right-sibling if there is one. Thus, if node-size is at an extreme premium the following-node reference can be deleted in favor of two flags, and put in the first-child or right-sibling space if there is neither a first-child nor right-sibling. Likewise, only text nodes can have text content but only element nodes can have attributes, a fact that allows further savings.

The clearest weakness of this general approach is accessing the N-th child of a node when fanout is very large, such as in dictionaries. To optimize this, a time-honored solution is the use of skip-lists [Pugh]. Optimizing this also facilitates getting the N-th following or preceding sibling, if each node knows its own sibling sequence number (step to the parent, and then request the (S+T)th child, where S is the starting node's own child number and T is the number of siblings forward or backward to the target).

A less obvious but more serious weakness of custom binary approaches is that it may be more difficult to update them, and especially to integrate them with version control and management. Some approaches gain much performance by clever assignment of node identifiers (see below); these may be quite difficult to update in place. Other methods that provide easier update lose some efficiency because nodes cannot remain neatly sequentially numbered while constantly changing.

RDB approaches

Relational database implementations of XML are attractive because much infrastructure comes for free: Transaction processing, audit control, rollback, and other features come with the database engine. Also, the better RDBs can scale to very large datasets and are so heavily used that they have achieved high reliability, not to mention the ready availability of developers, training, and support. Storing XML in RDBs usually is done using one of three basic approaches:

1: create a custom RDB schema for each XML schema, commonly with one table for each element and attribute type, where the tables reflect element-type-specific semantics.

2: create a generic RDB schema, with an element table and an attribute table, where the tables reflect XML's generic structural semantics rather than the semantics of particular items from a particular schema.

3: design as in (2), but encode structure into the node identifiers, rather than using arbitrary/opaque keys.

Using custom schemas corresponding to different XML schemas allows writing smarter code to handle the details of each XML situation. However, it is not clear that this matters often enough to outweigh the cost of developing new database code every time a new or changed XML schema arrives. Also, the possibility of writing different optimized database code for each element type does not mean such code will in fact be written, or be written well. Using generic schemas reduces the design work and may eliminate the per-XML-schema work; but makes it more difficult to optimize operations for particular element types.

In any case, relational tables for elements typically will assign each element or other node a unique key, and the record for a node will refer to other nodes via their keys. Such tables are easy to create for XML elements which must contain a fixed sequence of nonrepeatable elements, such as

   <!ELEMENT address (street, city, state, zip)>

However, typical XML schemas (other than those used to model data originating in RDBs to start with) have very few elements that are so constrained. Most elements have repeatable child element types, which normally results in another table and additional keys to be managed for connecting each element to its own sub-elements in that other table. A ubiquitous example is sections, which commonly contain a title followed by a mix of free paragraphs and embedded subsections, for example:

   <!ELEMENT section (title, (subsection | p)+)>

The repeatable group cannot readily be accommodated within a section record, because it is of unlimited size. Even limiting sections' permitted children does not really help, because it wastes enormous space (most entries in most records will be nil) and because the code to retrieve and assemble an element's children (much less an entire subtree) becomes absurdly tedious. Instead, there is typically a "children" table, in which each record contains:

  • the parent's unique key, so that just the right children can be selected given a particular parent;
  • the sequence number of the child within the parent, so one can sort the selected subsections and paragraphs into proper order;
  • the actual data for the child element, or a reference to it in its own table.

The schema-specific approach

When (as is typical) many element types permit a repeating mixture of child types, it is hard to separate elements into a table for each type. In the example above, having a table of paragraphs and a table of sections could be done, but in that case all the children of a section cannot be in the same table. Since they are in different tables, their child sequence numbers within the parent section pose a problem. If the numbers are kept with the child nodes, update becomes much more complex. For example, inserting a new paragraph requires selecting all the sections and paragraphs with child sequence numbers greater than the insertion point; the results must then be re-numbered.

Even if there happen to be no subsections of the parent node involved, a select on the subsection table is still needed, because the parent node does not know whether it has any -- unless it stores data about each kind of child it has, which itself involves a great deal of space and processing.

For a few element types (and thus a few tables) this is manageable. But for elements that can contain a wide variety of children, this quickly becomes absurd. For example, many elements in HTML have content "%inline;", which expands to:

(#PCDATA | TT | I | B | BIG | SMALL | EM | STRONG | DFN | CODE | SAMP | KBD | VAR | CITE | ABBR | ACRONYM | A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO | INPUT | SELECT | TEXTAREA | LABEL | BUTTON)

Many HTML elements may contain all of these elements, including most of the %inline elements themselves. Keeping each element type in a separate table means that gathering the children of an element with this model involves a select operation in each of these tables. I find this a strong argument against schema-specific modeling, except perhaps for very specialized XML schemas. A clear exception would be XML data that originates in a relational database to begin with; in that case it seems obviously correct to move the XML form back into an RDB with the same or similar schema as its origin.

The number of tables (and consequent selects) can be reduced by combining similar element types -- for HTML, perhaps putting all the font-style elements in a single table. This does help, but the question remains, why separate the elements in the first place? There is in many cases no clear argument for doing so.

The generic approach

The alternative method of storing XML in an RDB is to directly model XML's generic structures, and have a schema that works for any XML document. The schema in this case will define records that look remarkably like the nodes of a custom XML binary: having a unique key for the node itself, and key references to the parent node and perhaps siblings and/or at least the first child. Such references require the same maintenance as described earlier for various operations.

One advantage of RDBs over the usual custom binary approaches is that many of these references can be eliminated, with a consequent savings for their maintenance. For example, having the parent's key in each node means that all children can be retrieved by a single select: select all records whose "parent" field is a particular value. All siblings of a node can be retrieved the same way (except for discarding the node itself at the end). So long as such selects are optimized by thorough indexing in the database, this can be quite effective.

Retrieving all ancestors requires iteration, but if each node references its parent, this is merely following keys, which should be faster than selecting nodes by any field value. Get the parent record, then grab its parent field and retrieve that record, and so on to the root.

Perhaps the biggest problem with this approach is that gathering or traversing a whole subtree is quite difficult, and these are very common operations. To gather all the descendants of a node, select all of its children; then do another select, finding all nodes whose parent field matches any of the previously-selected nodes' keys. Keep repeating like this until no more records are being added. This is a notoriously slow RDB operation known as a "fixed point join". Worse, as or after all the nodes' records are retrieved, they must be arranged into a tree by some process anyway, and the database qua database provides no such process.
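A sketch of that iteration using Python's built-in sqlite3 module, against an assumed generic table nodes(id, parent) -- the schema and column names are hypothetical. Each pass re-selects until no new rows appear, i.e. the fixed-point join the text describes; sorting the result into document order remains a separate step.

   import sqlite3

   def descendants(conn, node_id):
       """Gather all descendant ids of node_id by repeated selects."""
       found = set()
       frontier = [node_id]
       while frontier:
           placeholders = ",".join("?" for _ in frontier)
           rows = conn.execute(
               "SELECT id FROM nodes WHERE parent IN (" + placeholders + ")",
               frontier,
           ).fetchall()
           frontier = [r[0] for r in rows if r[0] not in found]
           found.update(frontier)
       return found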

Another difficulty is dealing with the text content as a contiguous whole. For example, locating the string "very good" when "very" may or may not be in a separate <emph> element can be quite involved.

Comparing two nodes for order requires some information about ordering to be kept in additional fields, such as each node's child number within its parent. This information is (repeatedly) used for sorting, for example after running a fixed-point join as just described. The child-numbers of all following siblings must also be updated whenever a node is inserted, and this requires another select (though no joins).

The Shin method

Dongwook Shin [Shin] describes a method essentially like the generic method just described, but where the key or identifier for each node is a child sequence with one step per column (the one key difference, as compared with child sequences described in [XPtr1], is that the child sequence numbers must count all node types, such as text nodes, not just element nodes):

GI     ROOT   LEVEL1 LEVEL2 LEVEL3 LEVEL4 LEVEL5 ...
html    1      1      0      0      0      0
front   1      1      1      0      0      0
body    1      1      2      0      0      0
h1      1      1      2      1      0      0
p       1      1      2      2      0      0
ol      1      1      2      3      0      0
li      1      1      2      3      1      0

This method has the excellent property that selecting genealogical relatives of a node is very rapid. In particular, it avoids the costly fixed-point join for retrieving all of a node's descendants. For example, to retrieve all the descendants of the body element above, simply select in this table for all records with ROOT=1 and LEVEL1=1 and LEVEL2=2, regardless of what is in the later level fields. To select the body element itself, additionally require that all the later level fields be 0. To select all its children, require that the one next level field (LEVEL3) be non-0, and all later ones 0.
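As a sketch of how such selections can be composed for the table above (the column names ROOT, LEVEL1, LEVEL2, ... are as shown; everything else is illustrative): a prefix of equality tests selects a whole subtree, and pinning the remaining level fields to zero selects just the node itself.

   def subtree_where(levels, max_depth=6):
       """WHERE clause selecting a node and its whole subtree, given its
       child sequence as a tuple, e.g. (1, 1, 2) for the body element."""
       cols = ["ROOT"] + ["LEVEL%d" % i for i in range(1, max_depth)]
       return " AND ".join("%s = %d" % (col, val)
                           for col, val in zip(cols, levels))

   def node_only_where(levels, max_depth=6):
       """Additionally pin the deeper level fields to 0 to get just the node."""
       cols = ["ROOT"] + ["LEVEL%d" % i for i in range(1, max_depth)]
       zeros = " AND ".join("%s = 0" % col for col in cols[len(levels):])
       where = subtree_where(levels, max_depth)
       return where + (" AND " + zeros if zeros else "")

   # subtree_where((1, 1, 2)) -> "ROOT = 1 AND LEVEL1 = 1 AND LEVEL2 = 2"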

Because of this characteristic, fewer references need to be stored in records. For example, there is no need to store a parent reference, because the parent's identifier can be constructed when needed. Sibling and child references can also be omitted. Considering these savings and the possibility of using smaller fields for the level numbers than for a unique single-field key, the use of so many fields for the node identifier is less costly than it may seem.

This method is very strong for supporting DOM or comparable XML interfaces in a relational database, because many of the most common operations become very quick. On the other hand, node insertion and deletion are slower because of the need to update the identifiers of all later siblings and all their descendants. This method also puts hard limits on documents' depth and fanout: The RDB schema must provide as many LEVEL fields as the maximum XML tree depth to be supported. A rebuild to extend this is simple but slow. Fanout is limited by whatever integer size the level fields use. 256 (one byte fields) is an obvious choice, but many documents will have large sections with more than 256 sub-elements. 65,536 is the next obvious size, and adequate for most documents -- however, a huge reference work such as a dictionary or an inventory management list could have more children at some point.

Summary of RDB methods

The author generally prefers the second and third methods for practical reasons: they work for any XML schema without ongoing design and maintenance. However, all these methods have some very costly operations unless designed and indexed carefully. For example, gathering the descendants of a node typically involves retrieving all nodes having that node in their parent field; then retrieving all nodes that have any of those nodes as their parents, and so on until no more nodes are being added (a so-called "fixed-point join"). Sorting all those nodes into document order still remains, and requires additional information or processing. Shin's method avoids this, but imposes greater update costs.

There are countless additional details to each of these approaches, and countless approaches that mix and match methods in different ways. It is hoped, however, that this summary gives a reader a basic feel for the kind of questions that must be asked of any XML storage architecture, and how to evaluate whether an answer makes basic sense or not. Although some of these architectural details may seem obscure, it is also hoped that the reader sees clearly how they can suddenly become important, even dominant issues in making an XSLT script or other XML task tractable or intractable.

Identifier types

Many operations can be facilitated by using node identifiers that express some of the tree structure, rather than opaque identifiers. The best candidate for this use is XPointer child sequences.

Merely from the child sequence itself, one can test for several important tree relationships, thus saving any need to consult the document or its structural representation (this is how the Shin method avoids many of the references otherwise needed). For example:

  • precede/follow: Begin at the top of the child sequence, and discard numbers that are equal. If a difference is found before either list runs out, the node with the greater number in the difference is later. For example, 1/28/4 precedes 1/28/5.
  • contain/occupy: Begin at the top of the child sequence, and discard numbers that are equal. If one list continues but the other runs out, the node represented by the shorter list contains the other. For example, 1/7/3/8/2 contains 1/7/3/8/2/4/29.
  • direct-contain/directly-occupy: If the previous test shows that node A contains node B, that containment is direct if and only if the child sequence for B has exactly one more step than that for A.

Other operations are also eased, such as measuring the distance between two nodes on a sibling, ancestor, or descendant axis. These, however, are seldom a priority.

These features can save a great deal of time in an implementation. One can also generate the next-sibling, preceding-sibling, any ancestor, and nth-child identifiers, all from the identifiers alone (although one cannot test whether the node corresponding to any generated identifier other than ancestors actually exists).
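A sketch of these identifier-only computations in Python, representing a child sequence as a tuple of integers, e.g. (1, 28, 4) for /1/28/4; the helper names are illustrative.

   def precedes(a, b):
       """True if the node at child sequence a precedes (and does not contain) b."""
       for x, y in zip(a, b):
           if x != y:
               return x < y
       return False                     # equal, or one contains the other

   def contains(a, b):
       """True if a is a proper ancestor of b (i.e., b occupies a)."""
       return len(a) < len(b) and b[:len(a)] == a

   def directly_contains(a, b):
       return contains(a, b) and len(b) == len(a) + 1

   # Identifiers that can be generated without consulting the document
   # (whether the corresponding node actually exists is another matter):
   def following_sibling(a):
       return a[:-1] + (a[-1] + 1,)

   def preceding_sibling(a):
       return a[:-1] + (a[-1] - 1,) if a[-1] > 1 else None

   def parent(a):
       return a[:-1] if len(a) > 1 else None

   def nth_child(a, n):
       return a + (n,)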

The key difficulty is that child-sequences change with document editing. If a node is inserted or deleted, the child-sequences for all following siblings, and all their descendants, must change. Thus, update can be expensive.

Tumblers

The editing problem can be avoided if children are numbered not by integers, but by some system where a new number can always be created between any two existing numbers. For example, a node inserted between nodes 1/5/9 and 1/5/10 could be numbered 1/5/9.5. With any such system you can no longer tell how many nodes might exist between any given two nodes. However, finding all siblings, ancestors, or children, or testing order and containment, still work fine, and these are the more frequently needed tests.

Rational numbers have the required property: between any two distinct numbers there is always another, so between 3 and 4 there is 3.1, between 3.1415926 and 3.14159265 there is 3.14159261, and so on. However, computers normally approximate real and rational numbers to perhaps 10-20 digits of accuracy, and so about that many insertions in the worst possible order is enough to break this method. LISP, Java, and a few other programming languages support arbitrary-precision numbers, but to my knowledge no RDB does.

Ted Nelson originated the idea of "tumblers" for another purpose [Nels], but they work well here, too. A tumbler can be thought of as an infinite-base number. Each "place" can be not just a digit from 0-9, but any integer. To make this readable, the places are separated by dots: 123.4.56789.123.1. With this syntax one can always insert any number of numbered nodes between any two previously-existing nodes. Nelson also presents a more compact binary form, using a "humber" for each individual place, that can count from 0 to 2**1024.

Tumblers have never caught on, perhaps because they seem to have few uses other than tracking edits on complex documents, and that is a hard enough problem to begin with. But they do address the update cost problem for systems that use child-sequences as node identifiers, at the cost of looking (and being) rather complicated. After a great deal of editing, a node might have an identifier of:

1/4.5.123.2.6/3.2.4/28/12.8.7

The slashes as usual separate steps of a child sequence; but the child-number in each step is a tumbler rather than just an integer. Tumbler comparisons are not difficult -- they work just like child sequence comparisons (skip through all the equal components and then compare the unequal ones if any). Indeed, one can think of tumbler-based child-sequences as recursive child sequences, where the inner (tumbler) sequences encode the versioning history of the document.
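A sketch of that comparison in Python, representing each step's tumbler as a tuple of integers, so that /1/4.5.123.2.6/3.2.4/28/12.8.7 becomes ((1,), (4, 5, 123, 2, 6), (3, 2, 4), (28,), (12, 8, 7)); the logic is the same skip-equal-steps-then-compare rule as for plain child sequences.

   def tumbler_precedes(a, b):
       """True if the node identified by a precedes (and does not contain) b.
       a and b are sequences of steps; each step is a tuple of integers,
       so Python's lexicographic tuple comparison orders sibling tumblers."""
       for step_a, step_b in zip(a, b):
           if step_a != step_b:
               return step_a < step_b
       return False                     # equal, or one contains the other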


Bibliography

[[Abit]] Serge Abiteboul et al. 1997. "Querying Documents in Object Databases." In International Journal on Digital Libraries 1(1): 5-19.

[[Andr]] Jacques André, Richard Furuta, and Vincent Quint (eds). 1989. Structured Documents. Cambridge: Cambridge University Press. ISBN 0-521-36554-6.

[[Bray98]] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation 10-February-1998.

[[Coom]] James H. Coombs, Allen H. Renear, and Steven J. DeRose. "Markup systems and the future of scholarly text processing." Communications of the Association for Computing Machinery 30, 11 (1987), 933-947.

[[DeRo93]] Steven J. DeRose and David G. Durand. "Making Hypermedia Work: A User's Guide to HyTime." Kluwer Academic Publishers, 1993.

[[DeRo]] Steven J. DeRose and Andries van Dam. 1999. "Document structure in the FRESS Hypertext System." Markup Languages 1(1) Winter. Cambridge: MIT Press, pp. 7-32.

[[Frag]] Steve DeRose and Paul Grosso. "Fragment Interchange: SGML Open Technical Resolution 9601:1996." November 7, 1996. http://www.oasis-open.org/specs/tr9601.html

[[ISO 10744]] International Organisation for Standardisation. 1992. ISO/IEC 10744. Hypermedia/Time-based Structuring Language: HyTime.

[[ISO 8879]] International Organization for Standardization. 1986. ISO 8879: 1986(E). Information Processing: Text and Office Information Systems: Standard Generalized Markup Language.

[[Knut]] Donald E. Knuth 1973. Fundamental Algorithms (2nd ed.). Volume 1 of The Art of Computer Programming. Reading, MA: Addison-Wesley. ISBN 0-201-03809-9

[[Nels]] Theodor H. Nelson. Literary Machines. Mindful Press, 1987. Available from http://www.eastgate.com/catalog/LiteraryMachines.html

[[Pugh]] William Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668--676, June 1990. ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf

[[Sedg]] Robert Sedgewick. Quicksort. New York: Garland Pub., 1980.

[[Shin]] Dongwook Shin, Hyuncheol Jang, and Honglan Jin. 1998. "BUS: An Effective Indexing and Retrieval Scheme in Structured Documents". In Digital Libraries '98: The Third ACM Conference on Digital Libraries. NY: ACM Press, 1998, pp. 235-243. Ed. by Ian Witten, Rob Akscyn and Frank M. Shipman III. {shin, hcjang, hljin}@comeng.chungnam.ac.kr http://portal.acm.org/citation.cfm?id=276675.276702

[[Smit]] J. B. Smith and S. F. Weiss. "Formatting Texts Accessed Randomly." Textlab Report TR85-031. Chapel Hill: U. of North Carolina, 1985.

[[TEIP4]] Text Encoding Initiative. The XML Version of the TEI Guidelines. TEI P4: Guidelines for Electronic Text Encoding and Interchange. The TEI Consortium. Edited by C. M. Sperberg-McQueen and Lou Burnard. XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. http://www.tei-c.org/P4X/

[[XPtr1]] Paul Grosso, Eve Maler, Jonathan Marsh, Norman Walsh (eds.) XPointer element() Scheme. W3C Recommendation. 2001. http://www.w3.org/TR/xptr-element/

[[XPtr2]] Steve DeRose, Eve Maler, Ron Daniel (eds.) XPointer xpointer() Scheme. W3C Working Draft 19 December 2002. http://www.w3.org/TR/xptr-xpointer/


