Frozen streams: an experimental time- and space-efficient implementation for in-memory representation of XML documents using Java

Alex Brown

Abstract

As XML becomes a pervasive technology for data storage and processing, many adopters of the technology face a practical problem caused by the perceived slow performance of many XML processing operations, particularly in comparison to tried and trusted RDBMS-based solutions that are being replaced.

Earlier this year, a lengthy thread on the xml-dev mailing list on XML Performance1 agonised over XML performance in transactions and rehearsed several arguments over whiy/if XML processing is slow.

Taking that discussion as a starting point, this paper will consider whether indeed XML processing operations are “slow“ and if so what the fundamental causes of this slowness are. In particular it will consider the case of parsing larger (i.e., more than a few megabyte) documents when using XML system programmed with Java.

The paper will then outline a method, termed a “frozen stream” of in memory-storage of XML that differs from the common tree based approach and instead relies on a view build on stream-based XML parsing methods

The approach is described, and it features and options set out. A benchmark is shown for comparison with other methods of in-memory XML representation

The Paper goes on to demonstrate how this approach may be integrated with existing XML processing systems, and considers further enhancements that may be made in future.

Keywords: Processing; Parsing; Java

Alex Brown

Alex first became interested in structured markup when analysing literary texts for his doctorate (on early Shakespeare editions) in the late 1980s. Following this he worked as a developer on heavily object-oriented C++ application framework for cross-platform multimedia publishing, at the height of the CD-ROM boom. In 1997 Alex was one of the founding directors of Griffin Brown Digital Publishing Ltd, a UK-based company providing XML-based services and products. He is responsible for leading the company's XML consulting and implementation, and his work includes advising clients on XML/IT strategy and practice, mentoring clients” staff, writing DTDs and Schemas, and designing and developing XML software systems in C++, Java and other languages. In 2002, Alex was invited to join the British Standards Institute (BSI) Technical Committee IST/41, where he contributes to ISO/IEC JTC1/SC34 in its formation of the DSDL ISO standard, among other things. Alex writes and speaks regularly on structured markup technologies and their application to information management.

Frozen streams: an experimental time- and space-efficient implementation for in-memory representation of XML documents using Java

Alex Brown [Griffin Brown Digital Publishing Ltd]

Extreme Markup Languages 2006® (Montréal, Québec)

Copyright © 2006 Alex Brown. Reproduced with permission.

How slow is slow? How much memory is lots?

To arrive at some idea of the kind of memory footprint and speeds XML users might experience throughout this paper a sample document will be used for benchmarking. The sample comes from a real world project we undertook last year. The document is:

  • Well-formed XML with no governing DTD or schema.
  • Approx 71 MB in size
  • Record-like, in that the document element has approx 79,000 children of the same element type.
  • Document-like, in that some of the elements contained narrative text (titles, paragraphs, etc.)
  • English: all the characters are US-ASCII

— in other words, a fairly unexceptional, large, XML document.

Table 1: Benchmarks for operations on sample XML document
  Time taken2 Memory required3
Build a DOM Document 19.5 s 477 MB
XSLT Identity Transform 38.3 s 403 MB
Parse (SAX) 5.7 s < 2 MB

The table above shows the time and memory use for some common XML operations performed by applications running within a Java virtual machine. Rather than use diagnostic tools to measure memory use, the “memory required” figure was determined by finding the smallest amount of memory in which the operation could be performed. This “transient usage” figure is a meaningful one when considering users’ experiences of XML efficiency – as it represents a firm requirement. If there are not (for example) 477 MB of memory that can be allocated to this XSLT process, it simply cannot be completed.

From this consideration I propose a general principle for assessing the space-efficiency of an XML process: that the figure to be considered is the minimum amount of memory required for that process to complete. Processes which are eventually memory efficient after high transient usage may have their uses, but are of less use than processes which are memory-efficient throughout their execution.

Of course the inefficiency of DOM is well known4, and the DOM implementation used ([Xerces2]) makes no particular claims for high performance

[TODO: discussion of other in-memory model memory usage (XOM,JDOM)]

Speed vs. memory footprint

It is often held in computing that there is a trade-off between memory used and speed of execution.5 [Bentley] challenges this general assertion (“It has been my experience more frequently […] that reducing a program’s space requirements also reduces its run time”) and proposes that rather than accepting any trade-off that a path of “mutual improvement” should be pursued.6

Propositions

  • The amount of memory used to construct in-memory trees is excessive, and that there is scope for reducing this amount.
  • Any reduction should be possible without a significant adverse affect on performance.

The role of Java

Java is a popular choice of programming language for developing applications which process XML. While there are alternatives, it is usually Java that has the most depth and breadth of implementation for the various XML technologies, and anybody working with XML in commercial development will usually find the requirement to implement solutions in Java occurs fairly frequently, not least because of JVMs' reasonable cross-platform abilities.

Model vs. Implementation

XML models, DOM in particular, are such that they tend to impose or imply certain things about implementation. The lynchpin of the DOM model is the org.w3c.Node Interface, and implementing classes using its method signatures are steered towards an Object-rich representation in Java simply because they return and expect as parameters Object-subclassed objects. Indeed a naïve implementation of this interface would have a Node with many String properties, for its name, Namespace URI, etc.

As we shall see below, there are good reasons for resisting this when it comes to implementing XML in-memory schemes using Java, and indeed for bypassing what might be seen as a central Java tenet, the central role that Object and its subclasses play.

Bloaty Java?

The major contributory factor to the bloatiness of Java DOMs is an underlying bloat associated with the creation of Java objects. Consider the following simple Java program:

class Objs
{
  public static void main( String[] args )
  {
    // create one million empty Strings
    String[] objs = new String[ 1000000 ];
    for( int i = 0; i < 1000000; i++ )
    {
      objs[ i ] = "";
    }
  }
}
The memory necessary for this program to execute to completion is 43 MB (again using our simple JVM memory test). So even allowing, in this example, for the memory taken by the array's references themselves, we can still as a rule of thumb reckon that every Java String costs approximately 40 bytes of memory, even before we give it any content!7

Clearly, any memory-efficient XML storage implementation needs to work around this inefficiency.

The frozen stream

Given the XML document

<root a='value'>
  <e>foo</e>
  <e>bar</e>
  <e>zxc</e>
</root>
Then the usual programmatic representations of this document are as a tree (when using, for example, a DOM parser), or as a series of events (when using a SAX based parser).

As we have seen, the disadvantage of the tree-based models as implemented in Java is their profligacy with memory; the disadvantage of stream-based models is that they are intractable — we need to represent the entire document in memory for it to be efficiently queried.

We have also seen that an approach which implements XML document as a large number of Java Objects is doomed to use several times as much memory as the serialised document.

Thus a key component of a more memory approach must be to minimise the number of Java Objects used. More primitive types must instead be used.

Figure 1
[Link to open this graphic in a separate page]

XML document as stream

Notice that the events recorded here are more finely-grained than the events generated by SAX parsing. Where in SAX the start of an element is a composite of all that element's namespace and attribute information, in the model above the start element event is separated out so that (for example) each attribute and attribute value is an event also. This is necessary so that every phenomenon in the XML document being parsed can have a corresponding event generated so that the entire infoset can be represented without information loss.8

Notice also that many events have some additional text information associated with them, which are represented in the diagram by a box to their right.

Representation of events

In order to avoid the Object overhead discussed above the stream representation must be held in memory using primitives. A natural choice is to use one byte per event, with certain values corresponding to certain events.

So a “start document” event might be represented by the byte 0x80, a “start element” by the byte 0x81, etc. Bytes in the range 0x80 - 0xFF are reserved for representing events. The purpose of bytes in the range 0x00 - 0x7F is discussed below.

String storage

A large proportion of most XML documents is the textual data of various names (e.g. element and attribute names) and character data content (e.g. element text and attribute values). By representing these as indexes into some dictionary it is possible to reduce the amount of storage space required, particularly if the XML document has many re-used strings. Such an approach is taken by the Fast Infoset initiative ([Fast Infoset]) in which many string values are replaced by indexes into a number of string tables.

An extension of this method, and one that is particularly suitable for the frozen stream model, is to replace all strings with indexes into a table of all the distinct strings encountered during parsing

Figure 2
[Link to open this graphic in a separate page]

XML document as stream, with string table

As can be seen, the stream becomes one in which certain events have a string associated with them. In memory these are stored immediately following the event with which they are implicitly associated: so, the start of a element named “root” is represented by the byte value for an element start (0x81 say) and then a index into the string table for the string “root” – 1 in the example above. The model is similar to that of operands and arguments in assembly language programming, whereby certain operands affect the interpretation of subsequent bytes.

Encoding index value

When the number of an index value is greater than can be held in a single byte, some form of encoding is required. Because the values 0x80 - 0xFF are reserved for bytes indicating events, 7 bit values (i.e. bytes in the range 0x00 - 0x7F can be used to encode index values. Two advantages of this mechanism are that an iterator over the bytes can clearly identify the extent of bytes in an encoded value (any values with an eighth bit set acts a delimiter), and that only as many bytes as are necessary to efficiently encode the index value, are used.

Findings

To build a frozen stream representation with the same document and environment as the sample document at the beginning of this paper, the following results were found.

Table 2: Benchmarks for operations on sample XML document
  Time taken Memory required
Frozen stream construction 18.8 s 123 MB

Extending the model

The demon of scanning

The frozen stream model requires, when queried, a lot of byte-by-byte scanning. So, for example, finding a following sibling element from a start position of any element, requires that all intervening events are iterated over. Such an operation might be represented by the pseusocode:

1. Iterate forward over events
2. When a start element event is found, increment a counter
3. When an end element event is found, decrement a counter
4. If an end element is found and our counter is in its initial state, any following sibling 
   element is the next start element event
Altohugh modern computers are fast at byte-by-byte examination, and Java's performance comes close to that of C or assembly language, such byte-by-byte scanning could represent an unacceptable performance bottleneck for certain kinds of application and/or document. Two approaches that mitigate this are examined below.

Leveraging modern chip design

If byte-by-byte scanning is too slow, one approach is simply to increase the amount of stream examined for each iteration during scanning. So, if say we were examining 64 bits at a time, rather than 8 (i.e. 8 bytes rather than one), then our scanning time would be reduced if the underlying system was capable of performing such multi-byte operations efficiently - as modern computer systems are.

This is where the beauty of using bit masks comes to the fore: because we know that any byte representing (say) a start element event has a distinctive bit pattern, it is possible to construct multi-byte masks that may be logically compared with a multi-byte section of stream in order to ask the question: is there a start element event in here? The result of such a logical operation will tell us if there is, and if so which byte(s) are matched.

While this approach is possible with high-level languages, it lends itself most naturally to an assembly-language (or even hardware) implementation. Parallel pipelines and dual core chips open up further possibility for a highly optimised assembly language stream scanner.9

Signpost pseudo-events

Another (or complementary) approach is to incorporate 'signpost pseudo-events' into the stream. These use reserved values (i.e. with the high bit of each byte set) to indicate that they should be interpreted as events, not as data, but they do not correspond to anything in the XML infoset. Instead, their purpose is to route iterators more efficiently through the stream. So, for example, a signpost pseudo-event might express "next following sibling - 5000 bytes forward". An iterator searching for such a following sibling would thus be able to move straight to that position without any intervening scanning.

Again, these signposts would take an operand-followed-by-data like form as bytes in the stream representation of the XML, with the data being an encoded form of an integer indicating a byte offset.

This approach does trade-off memory use and performance. Introducing such signpost events obviously increases the size of the stored stream. It also increases the computation necessary when building the stream. The payoff however, is a dramatic decrease in the amount of memory scanning required when iterating an in-memory stream.

Integration with other software

An efficient in-memory representation of XML is all very well, but is of little use if all it does in get built and occupy memory. Clearly, to be useful there should be some means of making it usable with other XML processing software.

For example, the popular Jaxen XPath engine ([Jaxen]) provides mechanisms whereby it can theoretically be used to query any in-memory XML document representation irrespective of how it is implemented.

From events to nodes

There is clearly a close relationship between stream events and XML nodes. So, for example, an event in the stream for the start of an element corresponds to an element node; an event for a comment to a comment node, and so on.

Thus it should be possible to fake a Node-based model (and not necessarily org.w3c.Nodes) by developing a layer to adapt the event model to Nodes. But how to do this without creating large number of Objects?

Flyweights

With the event/node translation task in mind, an immediate hurdle to our approach to XML modelling becomes apparent once inter-library interfacing is attempted. Jaxen provides a Navigator class, instances of which may be used to navigate over the axes of an XML document representation. However, these iterators are predicated on returning Objects as its iterees, while we have taken care to avoid using Objects to represent phenomena in the XML.

The solution to this is to use what have been called (by [GoF]) flyweights, whereby a small number of objects are used at any one time to represent a sample from a pool of virtual objects. 10

API

While it may be argued there are already a good number of APIs for programmatic treatment of XML, and it is certainly true that designing good ones is hard, the frozen stream model naturally suggests its own kind of API, very different from tree-based, pull-based, or SAX-based ones.

The need to use flyweights suggests an iterator (or cursor) based API, since we only want to reify a single event as a node at any one time. This fits well with some likely scenarios for usage of this efficient model, particularly for XPath querying of content which - in theory - requires just such an API for an implementation.

At its simplest such an API needs a single Iterator class (perhaps implementing the java.util.Iterator Interface), and the ability to specify, in XPath terms, which axis is being iterated. Further enhancement would see the ability to set a filter on the iterator (so it iterating only certain events), and to have an "iterate-to" feature, whereby one could set a condition which when met, stopped iteration.

Conclusions

  • Existing in-memory models tend to bloat
  • Java implementations have a particular tendencies to bloat when large numbers of Objects are used
  • Much existing XML software implies or required large numbers of Objects
  • Efficiency can be required by building a system around known limitations - in the case of Java by using primitives
  • There is scope in such an implementation for architecture aware optimisations that may bring substantial performance benefits
  • Adapters and Flyweights may then be used to interface such models with existing XML software libraries

Notes

1.

see [xmldev].

2.

Timings are taken running from the command line on a 2.4GHz Pentium desktop machine using a Sun JVM (1.4.2_08-b03) running Microsoft® Windows™ XP.

3.

This was the smallest number of megabytes (given to the JVM using the -Xmx option) in which the process would run successfully to completion.

4.

See for example [Harold] p. 453.

5.

[Wikipedia] cites this as a typical Trade-off in computer science: “a program can often run faster if it uses more memory”

6.

[Bentley] p. 7.

7.

For comparison, creating one million Integer objects requires 19 MB.

8.

The view of an XML document as a fine-grained event sequence is taken even further by Simon St.Laurent's “Ripper” parser ([StLaurent]).

9.

Is there also potential for using some of the multimedia-targetted features of modern computer systems for stream processing, since the XML 'stream' might be thought of as close to a video or audio stream? Interestingly, for example, copying a subtree when using streams becomes merely a memory copy (blit) operation.

10.

The example given by [GoF] is of using objects to represent characters in a word processor, where it would be undesirable to represent every character as an object.


Bibliography

[Bentley] Jon Bentley. Programming Pearls. 2nd Ed., Addison-Wesley, 2000.

[DeRose] Steven J. DeRose. Architecture and Speed of Common XML Operation. Extreme Markup Languages 2005. http://www.mulberrytech.com/Extreme/Proceedings/html/2005/DeRose01/EML2005DeRose01.html.

[DOM] W3C DOM Working Group. Document Object Model. http://www.w3.org/DOM/DOMTR

[Fast Infoset] Paul Sandoz, Alessando Triglia and Santiago Pericas-Geertsen. Fast Infoset. Sun Developer Network. http://java.sun.com/developer/technicalArticles/xml/fastinfoset/.

[GoF] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissidies. Design Patterns. Addison-Wesley, 1995.

[Harold] Elliotte Rusty Harold. Processing XML with Java. Addison Wesley, 2003.

[Jaxen] Jaxen: universal XPath Engine http://jaxen.org/

[Meyer] Bertrand Meyer. Object-oriented Software Construction. 2nd Ed., Prentice Hall 1997.

[StLaurent] Simon St.Laurent. What can you do with half a parser? Extreme Markup Languages 2003. http://www.mulberrytech.com/Extreme/Proceedings/html/2003/StLaurent01/EML2003StLaurent01.html

[Wikipedia] Wikipedia. http://wikipedia.org/.

[Xerces2] The Xerces2 Java Parser. http://xerces.apache.org/xerces2-j/.

[xmldev] Thread on xml-dev, Subject “XML Performance in a Transacation”. Initiated by David Carver 22 March 2006.



Frozen streams: an experimental time- and space-efficient implementation for in-memory representation of XML documents using Java

Alex Brown [Griffin Brown Digital Publishing Ltd]