Building a C++ XSLT Processor for large documents and high-performance

Kevin Jones
Jianhui Li
Lan Yi


XSLT is increasingly being used for processing very large XML documents. Existing implementation models place hard limits on the size of document that can be processed. We report on efforts to build a C++ XSLT processor capable of handling gigabyte-sized documents with performance characteristics equivalent to the best known existing implementation models.

We show that by representing XML data using a record format with minimal inter-record links, the memory consumption of the XSLT processor can be significantly reduced without detrimentally affecting performance; indeed, it aids the processing performance of large documents. We also show that using a simultaneous XPath processing model over this format can further significantly increase performance for large documents while not hindering the performance of small document processing.

Keywords: XSLT; Processing; Transforming

Kevin Jones

Kevin is a principal engineer and architect for XML processing at Intel®. He has worked on software XML processing solutions for several years, concentrating on XSLT processing and Web services security. He joined Intel via the acquisition of Sarvega Inc., an XML appliance vendor, in August 2005. At Sarvega he led much of the XML core software development that enabled the appliance solutions to be performance competitive.

Jianhui Li

Jianhui is currently a software architect in XML processing in the Intel® China Software Center. Since Jianhui joined Intel® in September 2000, he has worked on multiple projects, including the IA-32® Execution Layer, an Itanium® back-end for a JIT compiler, and research and product development in the XML processing area. Before joining Intel, he worked on a parallelizing compiler at Fudan University. His technical focus is static and dynamic compilation technology and XML processing acceleration technology.

Lan Yi

Lan is currently working as the software development manager of the XML Transformation & Query team in the Intel® China Software Center. Lan joined Intel® in December 2004 and worked on the XML processing acceleration project as a senior software developer. Lan graduated from the School of Computing at the National University of Singapore, receiving his Ph.D. degree in 2004. His technical and research interests are focused on compilation technology, XML processing acceleration, data mining, Web mining, and search engines.

Building a C++ XSLT Processor for large documents and high-performance

Kevin Jones [Intel]
Jianhui Li [Intel]
Lan Yi [Intel]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Kevin Jones, Jianhui Li, and Lan Yi. Reproduced with permission.



This work is largely driven by customer requests for large document handling, where large is typically considered to be anything above the limits of existing implementations but below a size which mandates disk-based storage. For 32-bit systems this can be defined as approximately 0.3GB-2GB documents, allowing for processing and operating system memory overheads. In addition, we find synergy between the needs of this type of processing and hardware XML parsers, which do not interface easily with existing models.

The approach described is to employ a data representation which supports minimal inter-record linking to provide a small in-memory representation and to offset any reduction in basic data access speed that results from this approach by using simultaneous XPath evaluation. We compare this work to the performance achieved by the recently announced Intel® XSLT Accelerator Software library[IntelXSLT 07].

XML Data Representation in XSLT

The design of most XSLT processors has been strongly influenced by the needs of processing transitory data. This is not an inherent requirement of the underlying XPath data model, as XQuery implementations over RDBMSs, among many others, have demonstrated [Huck 99] [Infonyte PDOM], but a reflection of what are considered the common use cases for XSLT.

For transitory data scenarios the representation of XML documents as object models has been a frequent design choice because object models allow for simple implementations of XPath and exhibit good performance. At least two processors, Saxon and Xalan, have chosen to employ an "array of records" representation [Saxon TinyTree] [Xalan DTM], but largely in response to Java performance considerations, where the memory overhead imposed by how objects are represented in memory and by garbage collection is a significant drawback to an object model design.

For our own C++ processor, an object model has always underpinned a significant performance advantage over other implementations, but to meet the demands of large documents we must now re-think this approach without significantly degrading performance.

Representing the XPath data model in an object model is a rather straightforward process in which each data item is stored in an object and these objects are connected together via pointers into a tree structure that corresponds to how the data items in the data model are connected. The primary advantage of the object model approach is simplicity, both in construction and in its use to evaluate XPath expressions. XPath implementations based on naïve sequential searching of object models also perform well on simple expressions over small data sets, which makes them a good choice for the types of XPath processing commonly found in XSLT.
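As a concrete illustration, the object model just described can be sketched in a few lines of C++. This is a simplified sketch of the general approach, not our processor's actual classes; the field set is deliberately minimal.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each data item is an object; pointers connect the objects into a tree
// mirroring the XPath data model. A real model would carry many more
// fields (namespaces, attributes, text content, etc.).
struct Node {
    std::string name;
    Node* parent = nullptr;
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;
};

// Naive sequential search along the child axis: the kind of scan a
// simple 'child::test' XPath step performs over an object model.
std::vector<Node*> childAxis(const Node& ctx, const std::string& test) {
    std::vector<Node*> out;
    for (Node* c = ctx.firstChild; c != nullptr; c = c->nextSibling)
        if (test == "*" || c->name == test)
            out.push_back(c);
    return out;
}
```

The four pointers per element, plus per-object allocation overhead, are what drive the memory ratios discussed next.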

The primary limitation of the object model approach is the memory required to create the tree. It is frequently suggested that a good 'rule of thumb' is 3-5x the document size being represented. To demonstrate this effect we have measured the memory usage of the libxslt processor[libxslt] while performing four simple transformations on a 155kB document. The results are shown in Figure 1.

Figure 1: libxslt Memory Use
[Link to open this graphic in a separate page]

In Figure 1(a) the style sheet contains one template which matches the root node of the document but contains no processing instructions. Here we see a steep rise in memory usage during parsing and object model construction, followed by a steeper fall as the object model is destroyed. The total peak memory used is approximately 800kB, leading to an object model/document size ratio of 5.1. This figure is slightly higher than expected but includes non-document-specific memory usage, so it is an overestimate. In 1(b) the style sheet is changed to include a count of all the data items in the document. Again we see the same rise, but this time the memory usage is sustained for a short period during processing, as we would expect, before it declines.

In 1(c) the style sheet used is an identity transform, a transform that copies the input to the output unmodified. In this case we see two growth phases: an initial one to load the input document and style sheet, followed by a second during the construction of the output document object model. In many use cases there is no need to produce an output object model as this example does, but it can be required for pipelining of processing in some scenarios. The final diagram shows the memory use of a more typical style sheet for the input document. The primary difference from 1(c) is a higher first-phase memory use (from the compilation of the more complex style sheet) and a slower second-phase growth from the added processing complexity in the style sheet.

This type of memory profile is typical of an object model based XSLT processor, although it is often hidden from analysis by the use of arena memory management and/or garbage collection. There are two primary concerns with the large memory use: the size of document that can be processed is limited, and the scalability of application servers can be reduced.

Large document handling is often cited by customers not as a current need but as a safeguard for future system capacity planning purposes. One option for limiting this risk is to partition the data before transformation and re-assemble it afterwards. While this can work in some cases, it requires exceptional processing for large data sets and is cumbersome to implement. The more fundamental problem of server scalability can currently only be addressed by limiting the number of concurrent transformations. This is very undesirable because of the increase in latency it can cause and the possibility of significant underutilization of multi-core processors, which require either parallel workloads or high numbers of concurrent single-thread workloads to achieve maximum throughput.

In this paper we describe the result of our attempts to re-think how our XSLT processor is implemented to allow the processing of transitory data to extend beyond its current limits and to ensure XSLT processing can continue to play a central role in future web & application server designs.

Record representations

An alternative to representing an XML document as an object model is to use a sequence of records. These records can be viewed as binary encodings of events produced by an XML parser. For example, a set of record types can be used to represent typical SAX-like events such as document begin & end, element begin & end, and namespace, attribute, and text events. If the records are stored in document order they can be replayed sequentially to simulate the parsing of the document. The records may also contain links to other records so that a user can iterate over the records in a more flexible manner than scanning them in sequential order; e.g., an element record may link to its parent and following-sibling records. With these links, a record representation can support searches along the various XPath axes with performance approaching that of searching over an object model. There are many variants of the record-based approach that trade off performance, memory size, and complexity for other features such as the ability to process data as it is being parsed, the ability to re-locate the data in memory, and the ability to stream data to a third party as it is being created.

Using a record representation reduces memory consumption when compared to an object model approach. This is achieved primarily via a reduction in the number of links between the records/objects, but it can also come from the per-object overhead that some languages impose, as is the case for processors written in Java. In a classic representation of a tree, each element object has four pointers, one each for parent, first child, previous sibling, and following sibling. In a record representation, as the records are logically located in a linear contiguous address space, no link is needed to locate any record from another record. In practice, some minimal set of links is always needed to improve search efficiency.

Record representations that can be stored or streamed can also be used as more general-purpose "Binary XML" formats. In all "Binary XML" formats there is a tradeoff between size and processing speed. We aim for near-maximum speed while minimizing size and avoiding the need for schema agreement between the two parties. The recent work of the W3C EXI working group suggests that compression is the key goal, but that profile does not match our needs.

Our data format

The format we use is a binary encoding of the structure and text contained in XML documents, based on the XPath data model, that provides features enabling it to be processed and transmitted more efficiently than the originating XML documents.

The format contains three distinct types of data that are merged into one data stream that can either be stored or streamed to a recipient as required. The core data set contains the event records for each parser event type. The element event record contains a next-sibling link and a parent link to support efficient searching. The event records are supplemented by a symbol table for common name storage and a DataGuide [Goldman 97] that provides a structural summary of the data to aid indexing and searching operations. The symbol table aids sequential searching operations by allowing symbol identity searches to be performed (as opposed to string matching searches) and helps improve conversion times on other platforms which have expensive string construction.
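The symbol-table idea can be sketched as follows. The class and its interface are our illustration, not the format's actual structures: every distinct name is interned once, records store only the integer id, and a name test during a search becomes a single integer comparison instead of a character-by-character string match.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class SymbolTable {
public:
    // Return the id for `name`, assigning a fresh id on first sight.
    uint32_t intern(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        uint32_t id = uint32_t(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }
    // Reverse lookup, used when decoding records back to names.
    const std::string& name(uint32_t id) const { return names_[id]; }
private:
    std::unordered_map<std::string, uint32_t> ids_;
    std::vector<std::string> names_;
};
```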

Figure 2: Data format layout
[Link to open this graphic in a separate page]

The format is streamable in the sense that a producer can encode blocks of data and a receiver can decode them as they arrive without the need to wait for subsequent blocks before starting processing. The streaming ability of the format can be used either "in process" to construct pipelines or across a network to improve latency.

The encoding is memory-position independent, as all records fit in a linear logical address space and the links used in element records are offsets between two addresses. This feature allows us to deploy a storage model which maps the logical data address space to actual physical locations. A virtual storage interface allows the use of replaceable memory managers so that the data may be held in main memory, shared memory, or disk, or whatever combination of these is optimal for the processing being performed. The format is also designed to support documents in excess of the 32-bit boundary size; it currently has a theoretical limit of 32GB per document. XSLT processing has been successfully performed on multi-gigabyte documents on modest hardware by employing this format.

Hardware parsing interface

Parsing XML in hardware is a conceptually straightforward operation but presents challenges in interfacing the output of the hardware to a software stack. There are at least three possible models for such an interface:

  • Shared memory model -- both the hardware and software have some shared memory space through which to communicate. This is conceptually the easiest approach but hard in practice given current hardware architectures for commodity servers.
  • Structure Stream -- the hardware parser creates a stream that describes the structure of the data which can be transferred into main memory. The stream is relative to the original document so that both are needed for processing.
  • Full Stream -- the hardware parser creates a stream that describes the full aspect of the data which can be transferred into main memory. The stream can be processed standalone without reference to the original document.

Our format is consistent with the last approach, a full stream model. The primary advantage of the shared memory model is that it could create an object model representation directly, but we aim to show that this is not needed to achieve high performance, and hence the complexity it imposes is not required. The structure stream approach has approximately the same implementation complexity as a full stream but is more restrictive for downstream operations, as it is more difficult to update, and it is less memory efficient than a full stream model because of the need to maintain the XML document in memory as well as the structure information.

Simultaneous XPath

In order to maintain our current performance levels we have to balance the slower navigation speed of the record-based data format with improvements elsewhere. The area we chose was to replace sequential scanning in XPath with a simultaneous evaluation approach.

Simultaneous evaluation of a set of expressions is an important performance enhancement in the areas of XML document classification and publish/subscribe systems. Traditional simultaneous XPath implementations compile XPath expressions into automata. The created automaton accepts events from a parser and performs actions to produce results. Since the acceptance of each event can be viewed as a constant-time operation, the processing time for any given document is constant with respect to the number of expressions being evaluated.

Suciu implemented a Lazy DFA approach [Suciu 02] for fast simultaneous XPath evaluation with acceptable memory usage. However, this model suffered from limited XPath axis support and limited predicate-handling functionality. Building on the Lazy DFA model, we employ a new Path Map (PMap) algorithm for more general simultaneous XPath processing.

PMap query model

The PMap algorithm uses multiple bi-directional automata to describe a pull-processing search space for a set of expressions. In implementation, the PMap approach consists of a compiler and a runtime. The compiler accepts XPath expressions and produces virtual machine code. The runtime implements the virtual machine and executes the code.

In the PMap algorithm, we use a graph of nodes with multiple entry points to describe each query. The query graph defines an algorithm to be followed by the runtime to collect the relevant result data for each XPath expression. In the query graph we logically define:

pmap_node ::= [(axis, pattern_entries)] ;
pattern_entries ::= [(pattern, actions, next_node)] ;
where axis is the XPath axis, pattern is an extended name/node-test pattern, actions is a list of result actions (possibly empty) to be performed, and next_node is a link (i.e., an edge in the graph) to the next pmap_node for further processing following the given pattern. According to this definition, a query graph consists of pmap_nodes as shown in Figure 3. For convenience, we simply call a pmap_node a node in the remainder of this paper.
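Transliterated directly into C++, the grammar above might look like the following; the concrete field types are our assumptions for illustration, not the engine's real code.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

enum class Axis { Child, Attribute, Descendant };  // subset, for brevity

struct PMapNode;

// pattern_entries ::= [(pattern, actions, next_node)]
struct PatternEntry {
    std::string pattern;       // extended name/node-test pattern
    std::vector<int> actions;  // result actions (possibly empty); here,
                               // ids of result sets to collect into
    PMapNode* next = nullptr;  // edge to the next node for this pattern
};

// pmap_node ::= [(axis, pattern_entries)]
struct PMapNode {
    std::vector<std::pair<Axis, std::vector<PatternEntry>>> entries;
};
```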

Figure 3: PMap node structure
[Link to open this graphic in a separate page]

For a given axis in any node, its associated pattern_entries are strictly ordered and mutually exclusive, which means that the processing of a data item need only follow the first matched pattern entry to continue. When a data item matches a pattern, the corresponding actions are first performed and then processing continues with the corresponding next_node link if one is provided. When processing transits from a previous node to the current node, the previous node is pushed onto a stack that records a path through the graph. After searching all axes of the current node, processing transits back to the previous node recorded at the top of the stack.

Figure 4: PMap query graph for expression 'a/b', 'a/c' and 'a/*'
[Link to open this graphic in a separate page]

To give a small example, the expressions 'a/b', 'a/c' and 'a/*' can be represented by the query graph in Figure 4, which has two nodes. Their result node sets are collected into R1, R2, and R3 respectively.

An interpretation of this query graph would be:

  • At node1, search the child data items of the context item; for each item found:
    • If the item matches the pattern 'a', continue at node2 (with the matching item as the new context data item)
  • At node2, search the child data items of the context item; for each item found:
    • If the item matches the pattern 'b', save the location of this item to R1 & R3; continue with the next item
    • If the item matches the pattern 'c', save the location of this item to R2 & R3; continue with the next item
    • If the item matches the pattern '*', save the location of this item to R3; continue with the next item
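The interpretation above can be rendered as a small self-contained C++ sketch. This is our toy code over a toy tree, not the real runtime: it is restricted to the child axis, and the key PMap property -- entries tried in order, only the first match followed -- is the `break` in the inner loop.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Item {                  // minimal document item
    std::string name;
    std::vector<Item> children;
};

// One pattern entry: pattern, ids of result sets to fill, next node
// index in the graph (-1 = no further descent for this pattern).
struct Entry { std::string pattern; std::vector<int> results; int next; };
using PMapNode = std::vector<Entry>;  // child axis only, for brevity

void run(const std::vector<PMapNode>& graph, int node, const Item& ctx,
         std::map<int, std::multiset<std::string>>& R) {
    for (const Item& c : ctx.children)
        for (const Entry& e : graph[node]) {          // ordered entries
            if (e.pattern == "*" || e.pattern == c.name) {
                for (int r : e.results) R[r].insert(c.name);
                if (e.next >= 0) run(graph, e.next, c, R);
                break;                                // first match only
            }
        }
}
```

Running it on a document `<a><b/><c/><d/></a>` with the graph for 'a/b', 'a/c', and 'a/*' collects b into R1, c into R2, and all three children into R3, in a single pass over the data.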

PMap query boundary

Predicates in XPath expressions can contain any XPath expression, so it is not feasible to evaluate them in the general case as part of a query search. However, in some cases it is possible to "inline" the predicate handling directly within the PMap search. For example, in "a[@id='foo']/b" the query structure could be built as in Figure 5. As shown in the PMap query graph, the search transits from node2 to node3 only when the predicate is true for the current node 'a', and it will backtrack to node1 after all attribute nodes are checked.

Figure 5: PMap query graph for expression "a[@id='foo']/b"
[Link to open this graphic in a separate page]

In the general case the PMap query results are post-processed to correct for predicates that were not handled directly. To achieve this, the query results store context information around the parts of an expression that have predicates in them. Care has to be taken with this approach as it can create many intermediate results that need post-processing. Work is ongoing to find the ideal split between evaluation via the PMap and the post-processing stages to limit this effect.

In some use cases, mainly related to document classification using XPath, it is desirable not to wait for completion of a query to report results. We term this process of reporting quickly early matching. It is achieved in the current implementation via a static analysis that inserts callback instructions into the PMap query graph as needed.


Result set representation

The XPath processing module needs to provide the matched nodes as a result set (a nodeset) in a way that is efficient for an XSLT processor or other uses. Ideally we would like nodesets to be represented as arrays, as that would allow for minimal iteration costs and help improve memory locality. However, as our XPath module processes multiple XPath expressions simultaneously, the results for each expression are interleaved.

A simple scheme to handle this issue would be to reorganize the results during post-processing. This solution can work well when the document scanning cost is high, but in XSLT many XPath expressions are relatively trivial to evaluate and are used on small documents, leading to the reorganization cost being large relative to the search cost.

The wide variability in the types, and hence costs, of XPath evaluation motivated us to modularize the XPath engine (which was previously considered part of the XSLT processor for efficiency reasons) so that it can provide an iterator API for accessing the results of processing. This allows the XPath module to select different data organization strategies for different scenarios by using dynamic profile information gathered on previous result sets. The storage model becomes adaptive to the type of evaluations it is being asked to perform, minimizing the impact of result interleaving and the associated access costs.
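A sketch of what such an iterator API could look like; the naming is ours, not the module's actual interface. Callers pull results one at a time, and the module is free to back the iterator with whatever organization -- a pre-copied flat array, or the raw interleaved results -- profiling suggests is cheapest for that expression.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using NodeRef = size_t;  // logical address of a matched record

class NodesetIterator {
public:
    virtual ~NodesetIterator() = default;
    virtual bool next(NodeRef& out) = 0;  // false when exhausted
};

// One possible backing: results copied into a flat array up front,
// giving minimal per-step cost and good memory locality.
class ArrayIterator : public NodesetIterator {
public:
    explicit ArrayIterator(std::vector<NodeRef> v) : v_(std::move(v)) {}
    bool next(NodeRef& out) override {
        if (i_ == v_.size()) return false;
        out = v_[i_++];
        return true;
    }
private:
    std::vector<NodeRef> v_;
    size_t i_ = 0;
};
```

The virtual-call boundary is what lets a lazily-decoded or interleaved backing be swapped in without the XSLT engine noticing.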

Modularizing the XPath processing component also provides a number of other benefits for internal development and third-party integration and, more importantly in our context, for integration with future hardware XPath engines. We are particularly pleased to find that efficient use of the PMap approach requires a better code organization that frees us from compromising modularity to achieve our performance goals.

Current results

We first ran experiments to study how the record representation impacts XPath search performance and how much simultaneous XPath processing helps on the XPathMark benchmark. We also carried out a series of experiments to compare the newly constructed XSLT processor to the original object model based XSLT processor. The experimental environment is as below:

  • OS: Red Hat Enterprise Linux AS release 4, kernel 2.6.9-42.ELsmp
  • Hardware: Woodcrest 2.66GHz with 4 CPUs, 4MB cache, 3GB RAM
  • Compiler: icc 9.1.038, multi-threaded release build

In our experiments, we used a performance measurement methodology similar to that used for the Intel® XSLT Accelerator 1.1 [PerfMeasure 07].

Performance impacts of record representation and PMap

We use XPathMark [XPathMark 05] as the benchmark in our experiments to study the performance impact of the record representation and simultaneous XPath (i.e., PMap) evaluation. XPathMark consists of a set of queries which covers the main aspects of the XPath 1.0 language, including different axes, node tests, Boolean operators, references, and functions.

In the experiment, the baseline model, named naïve-DOM, uses a naïve XPath search algorithm and the object model representation. We compare its performance with a naïve XPath search over the record representation using the same algorithm, named naïve-stream. To show the performance improvement of the PMap algorithm, we also measure the PMap algorithm with the record representation, named PMap-stream.

We measured the throughput for the three query models on a range of input document sizes, specifically 64KB, 128KB, 256KB, and 512KB. As shown in Figure 6, the throughputs are not affected much as the input document size increases. Figure 7 shows the geometric mean throughput of each model based on the results in Figure 6.

Figure 6: Performance of naïve-DOM, naïve-stream, and PMap-stream on XPathMark
[Link to open this graphic in a separate page]
Figure 7: Average throughput of naïve-DOM, naïve-stream, and PMap-stream on XPathMark
[Link to open this graphic in a separate page]

XPath search performance is heavily dependent on the data format of the XML representation. In Figure 7 we can see that the throughput of the naïve query model drops about 25% when the data representation changes from the object model to the record model. This is caused by the higher overhead of iterating event records compared to that of iterating an object model. As we expected, the PMap query on the record model reduces the number of iterations over event records and thus significantly improves the throughput, to about 1.5x that of the naïve query on the original object model.

Large XML document transformation

To study the capability of our newly constructed XSLT processor to transform large input documents, we built a series of XML documents with sizes scaling from 10MB to 1.05GB by duplicating the data from a real customer case. We measured the memory consumption and throughput of the different models, shown in Figure 8 and Figure 9 respectively.

Figure 8: Memory consumption of naïve-DOM, naïve-stream, and PMap-stream XSLT
[Link to open this graphic in a separate page]
Figure 9: Throughput of naïve-DOM, and naïve-stream, and PMap-stream XSLT
[Link to open this graphic in a separate page]

In Figure 8, we can see that the naïve-DOM XSLT processor uses about 2GB of memory to process the 250MB document. Due to this sharp increase in memory consumption (i.e., 6-12x the document size in our experiments), the naïve-DOM XSLT processor crashes when processing documents larger than 300MB. Both the naïve-stream and PMap-stream XSLT processors need much less memory (about 1.6x the input document size in our experiments) and thus can process input documents larger than 1GB.

The throughput data in Figure 9 shows that the naïve-stream XSLT and PMap-stream XSLT perform better than the naïve-DOM XSLT on average. Both the naïve-stream XSLT and the PMap-stream XSLT can process documents of 300MB or larger while the naïve-DOM cannot. Furthermore, the PMap-stream XSLT performs about 20% better than the naïve-stream XSLT on average.

Performance results on XSLTBench

To study how the record representation and PMap affect XSLT processing performance, we tested the three XSLT processors (i.e., naïve-DOM, naïve-stream, and PMap-stream) on the XSLTBench test suite.

The XSLTBench test cases were selected with a set of simple guidelines on what can be considered a good test case.

  • The test should be publicly available so that anyone can examine how it works.
  • The test should solve a real-world production environment problem.
  • The test should be typical of a class of problem commonly solved using XSLT.

XSLTBench covers a variety of usage scenarios, including producing HTML documents by transforming input XML documents, an algorithm for solving the eight-queens problem, highlighting keywords in a document, an RDF parser, etc.

The final throughput comparison result is shown in Figure 10.

Figure 10: Performance of naïve-DOM, naïve-stream and PMap-stream XSLT processing on XSLTBench
[Link to open this graphic in a separate page]
Figure 11: Average throughput of naïve-DOM, naïve-stream and PMap-stream XSLT processing on XSLTBench
[Link to open this graphic in a separate page]

The input document sizes in XSLTBench vary from about 0.5KB to 150KB, so the cases in XSLTBench are relatively small XSLT examples. The naïve-stream XSLT and PMap-stream XSLT have been tuned with a few extra optimizations (e.g., record representation specific optimizations, pattern match optimization, etc.) and perform better than the naïve-DOM XSLT in our experiment. This shows that even when processing small XSLT cases the record representation, with optimizations, can still be comparable to or even outperform the object model on average. However, in Figure 11 we observe that the PMap query model does not benefit XSLT processing as much as it does XPath evaluation (see Figure 7). After investigation, we attribute this to the relatively small scale of the XPath expressions searched in the XSLTBench cases, which benefit little from the PMap query compared to the naïve query.


Conclusions

From the experiments in XPath evaluation and XSLT processing, we can conclude that:

  • The record representation consumes much less memory than the object model representation
  • For large documents, XPath and XSLT processing based on the record representation performs much better than that based on the object model due to lower memory consumption
  • For small documents, XPath and XSLT processing based on the record representation may lose some performance
  • PMap can improve queries on the record representation, especially when the iteration over the document is heavy and multiple XPath expressions can be processed simultaneously

Related work

This work fits within a broader strategy for improving XML processing at Intel that also encompasses exploring models for parallel processing of XML data by hardware, software, and combined solutions. The most directly related effort is a parallel-processing software run-time for XSLT based on the same XSLT engine as this work. The parallel XSLT processor exploits natural coarse-grained parallelism at the XSLT instruction level to automatically allow a transform to be executed with multiple threads and hence over multiple processor cores. This XSLT processor will be particularly useful for reducing the total execution time of the larger transforms that this work enables, in turn reducing system memory pressure via swifter execution.

A more direct route being explored is to increase the processing efficiency of XML by exploiting fine-grained parallelism opportunities at the instruction set level. Work is ongoing to influence future hardware design in this area. An example of this approach was announced at the fall 2006 Intel Developer Forum: new instructions to be made available in the Nehalem processors. Two new instructions are being provided which are specifically designed to speed up character processing, and they can be employed as part of an XML parser.

While these projects are significant individually, the thrust of the work on XML at Intel is in long term analysis and planning efforts to ensure good alignment between future processors and the needs of applications. The management of in-memory XML data is a critical area to ensure that both hardware and software can be best employed in combination to achieve significant processing gains. The data models described in this document have already evolved to aid this goal better and are likely to continue to do so for the foreseeable future.


Bibliography

[Goldman 97] Roy Goldman and Jennifer Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. Technical report, Stanford University, 1997.

[Huck 99] G. Huck, I. Macherius, and P. Fankhauser. PDOM: Lightweight Persistency Support for the Document Object Model. OOPSLA Workshop on Java and Databases: Persistence Options, Denver, 1999.

[Infonyte PDOM]

[IntelXSLT 07], 2007.


[PerfMeasure 07] Performance White Paper -- Boosting XSLT Transformation Performance with Intel® XSLT Accelerator 1.1, 2007.

[Saxon TinyTree]

[Suciu 02] Dan Suciu et al. Processing XML Streams with Deterministic Automata. University of Washington, 2002.

[Xalan DTM]

[XPathMark 05] M. Franceschet. XPathMark: An XPath Benchmark for XMark, 2005.
