STnG — a Streaming Transformations and Glue framework

K. Ari Krupnikov
ari@cogsci.ed.ac.uk

Abstract

STnG (pronounced “sting” 1) is a framework for processing XML and other structured text. In developing STnG, it was our goal to allow complex transformations beyond those afforded by traditional XML transforming tools, such as XSLT, yet make the framework simple to use. We claim that to meet this goal, a system must:

  1. support and encourage the use of small processing components
  2. offer a hierarchical tree-like view of its data
  3. factor out facilities for input chunking through a pattern/action model
  4. not provide processing facilities of its own, instead invoking processors written in existing languages
STnG is built around common XML tools and idioms, but can process arbitrary structured text almost as easily as XML.

In the first part of this paper, we show how these requirements result in powerful and flexible systems, and how they can be achieved. The balance of this paper describes a processing framework we have developed in Java that implements these requirements. It is available for download at http://stng.sf.net.

Keywords: Processing; Transforming

K. Ari Krupnikov

STnG — a Streaming Transformations and Glue framework

K. Ari Krupnikov [Research Associate; University of Edinburgh, HCRC Language Technology Group]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 K. Ari Krupnikov. Reproduced with permission.

Requirements

Small processors

Experience shows that combinations of simple, small and interchangeable processors are easier to develop and maintain than ones that are large and monolithic. While everyone agrees in theory with this sentiment, in practice, one finds a surprising number of software products on the market that seem to implement the opposite strategy. We can observe that some frameworks encourage small, independent components, while others don’t.

UNIX pipes are a poster child for small, cooperating components:

Sometimes a problem at hand can be solved by the application of a single filter, but more commonly it breaks down into subproblems solvable by filters joined together into a pipeline. This use of tools is often cited as the heart of the UNIX programming environment. [KerPike 1984]

We speculate that a key to encouraging small components is providing simple interfaces between them. The interface between components in a UNIX pipeline is about as simple as it can get — all a process needs to participate is to read a stream of bytes and write another one. This “filtering” model is very powerful and is extensively documented as a software pattern [BegedDov 1995] [Meunier 1995]. Many different variations are are possible. In particular, the stream need not consist of characters; higher-level constructs can flow between filters in a pipeline. As an example, SAX filters [SAXF] exchange markup-level events. A common characteristic of filtering pipelines is their linearity; every datum passes through every filter in order (Figure 1). Each filter must maintain its own state.

Figure 1: A linear filtering pipeline
[Link to open this graphic in a separate page]

It is interesting to note that UNIX pipelines are typically started by a shell. Although the shell provides some rudimentary looping and branching, it does not specify any processing. It merely puts together building blocks written in other languages.

Another key to encouraging modular design is the availability of pre-packaged, parameterized components — again, UNIX is a good example. Section 3 offers a sampling of pre-packaged STnG components (“stingies”) included in the distribution.

Hierarchical structures and recursive decomposition

The structure of many text documents is complex. For some processing, representing text as a one-dimensional sequence of strings (as sed [sed] does) is adequate. For some tasks, a slightly more complex, two-dimensional matrix of records and fields (as used in awk [awk]) works well. But for many tasks, these are inadequate, and a generalization of the model is required. A useful generalization is a tree model.

Some data are already encoded in tree formats, such as XML or SGML documents, and matrices as used in simpler processing models can readily be represented as trees. Surprisingly, perhaps, data that at first glance appear tabular often have much richer structure. Consider the output of ls, a UNIX utility that lists files in a directory:

Figure 2: ls -l /
total 67
drwxr-xr-x    2 root     root         2048 Nov 16 22:30 bin
drwxr-xr-x    2 root     root         1024 Mar  2 18:51 boot
drwxr-xr-x    6 root     root        34816 Mar  2 18:52 dev
drwxr-xr-x   23 root     root         2048 Mar  2 18:53 etc
drwxr-xr-x    4 root     root         4096 Mar 30  2002 home
drwxr-xr-x    4 root     root         3072 Mar 30  2002 lib
drwxr-xr-x    2 root     root        12288 Mar 30  2002 lost+found
drwxr-xr-x    4 root     root         1024 Mar 30  2002 mnt
drwxr-xr-x    2 root     root         1024 Aug 23  1999 opt
dr-xr-xr-x   58 root     root            0 Mar  2 18:49 proc
drwxr-x---    3 root     root         1024 Mar  2 18:32 root
drwxr-xr-x    3 root     root         2048 Nov 16 22:30 sbin
drwxrwxrwt    4 root     root         1024 Mar 15 04:02 tmp
drwxr-xr-x   20 root     root         1024 Mar 30  2002 usr
drwxr-xr-x   17 root     root         1024 Nov 16 18:58 var

Ignoring the first line, this listing may appear to consist of 15 records that each contain 9 fields — a 15x9 matrix that can be easily processed with, e.g., awk. But in fact the structure is different; some fields have internal structure while others are parts of larger structures. The first field contains 10 separate data: an entry type character that indicates whether the record describes a directory, file, socket, device, etc., followed by a linearly serialized 3x3 matrix of user permissions. Conversely, fields 6, 7, and 8 are all parts of one attribute — the last-modified date. To make matters worse, if time is given in field 8, it has two sub-fields within it — one each for hours and minutes.

This issue of implicit structure is not limited to “flat files”; in XML documents, one often finds delimited or otherwise structured data in character content or attribute values. An example that is built into the XML 1.0 Recommendation is tokenized attribute types: NMTOKENS, IDREFS and NOTATIONS — whitespace-delimited lists of values. Many XML applications mandate other implicitly-structured data; SVG is one such application:

Figure 3: Source XML of an SVG graphic and a rendered image
<svg width="3cm" height="50px" viewBox="0 00 60 50">
  <polygon points="15,20 , 45,20 35,40  5, 40"/>
</svg>
[Link to open this graphic in a separate page]

Experience has shown that attempts to require explicit markup for all structured data in an XML document are unrealistic, and if implemented, result in instance documents that are so verbose as to be almost impossible to author manually. It may be possible to imagine a fully marked-up version of SVG (Figure 4), or even XSLT (an early draft called for an XML encoding of XPath expressions), but it would be wrong to require it, because different processors need different levels of structure to be exposed [Dodds 2001]. A processor that displays a date in a user interface can treat it as an obscure string; one that sorts on dates must know their structure. To be flexible, a processing framework must not only provide a tree-like view of its input data — it must also allow individual processors to specify their own ways of breaking that input into a tree.

Figure 4: A possible SVG-like document with explicit structure
<svg>
  <width units="cm" width="3"/>
  <height units="px" height="50"/>
  <veiwBox x1="0" y1="00" x2="60" y2="50"/>
  <polygon>
    <point x="15" y="20"/>
    <point x="45" y="20"/>
    <point x="35" y="40"/>
    <point x= "5" y="40"/>
  </polygon>
</svg>

Small chunks of data

A point that was not obvious to us when we started working on this project was that to stay simple, a processor must work on small chunks of input at a time. Clearly, if a program attempted to process something as complex as a binary encoding of a popular office document format in a single pass, it would be prohibitively complex. But what we found was that much simpler processors chunk their input. This logical chunking is independent of physical buffer size and efficiency. A few familiar examples illustrate this point:

  • sed operates on its input one line at a time (lines are separated by newline characters). Characters from preceding or subsequent lines are unavailable without some effort.
  • awk operates on its input one “record” at a time, which is further subdivided into “fields”. What constitutes records and fields is configurable at runtime. In our experience, one of the most useful features of awk is the ease with which a user can specify this chunking.
  • XSLT operates on tree representations of XML documents one node at a time. Although all nodes in a document are available to all templates, good design usually dictates keeping processing as local as possible.

All of these tools can be forced not to chunk their input, or not to chunk it intelligently (you can set RS="$ ^" in awk, or you can process an entire document from the root template in XSLT), but they allow and encourage chunking.

Chunking and the pattern/action model

An important corollary to chunking is that the chunker is in a position to decide what component should process a chunk.

An awk program explicitly consists of patterns and actions:

    /^-/     { print "regular file", $0 }
    /^d/     { print "directory", $0 }
    $5>20000 { print "is really big" }
The interpreter breaks its input into records, and then checks which patterns each record matches, executing the action specified for each matched pattern. Patterns in awk are usually formulated as regular expressions to be tested against a record or a specific field in it. Patterns can also be “relational expressions” that compare values of variables, fields, or the current record. It is important to observe that this relatively simple syntax relieves the programmer of writing an explicit loop over input bytes, breaking them into lines and then testing individual characters before deciding what action to take on a line.

awk’s processing model breaks for input that cannot be reduced to a two-dimensional matrix (see Subsection 1-2on hierarchical decomposition above).

In XSLT, “template rules”, or actions are applied to nodes that match specified XPath patterns:

    <xsl:template match="section/para[3][img]">
      <p class="thirdParaWithGraphics">
        <xsl:apply-templates/>
      </p>
    </xsl:template>

After a parser has chunked the input into a tree structure, the processor recursively walks that tree, checking each node in it against the available patterns and “applying” the appropriate action to each node.

XSLT’s processing model breaks for input that has structure that is not explicit in markup (see Subsection 1-2 on hierarchical decomposition). It is possible to process arbitrary strings in XSLT, and version 2.0 will introduce more facilities for this, but the programmer needs to specify explicit looping or recursion where explicitly marked-up data would be processed automatically.

Pattern/action model and non-linear filtering

We can generalize this idea of recursive chunking, pattern matching, and performing actions on chunks into a chunk-match-action model.

In this model, we can factor out the common task of matching chunks against patterns and choosing processors to perform actions. Since our system operates on trees, patterns must be expressed in a language that can address parts of trees. XPath [XPath] has two characteristics that make it a good choice: users are familiar with it, and a number of implementations exist for different environments.

A component that can track document state and choose processors for parts of that document resembles a switch construct in many imperative languages. As with switches in imperative languages, our switches can be nested to arbitrary depth. Such nesting results in a non-linear pipeline, where a datum’s path through the pipeline is determined at runtime by the switches it passes through.

Figure 5: A non-linear switched pipeline
[Link to open this graphic in a separate page]

Since development cost of the switch component is amortized over the lifetime of the framework, it can be more elaborate and better optimized and debugged than a subsystem in any individual filter. Moving state-keeping to a separate component also reduces the combined memory footprint of a pipeline since state is not duplicated between filters.

Removing issues of context from individual action components makes them simpler. Consider a document that describes a tube fitting assembly (Figure 6). To successfully complete the assembly, a technician would need tools and parts listed in assembly instructions, but assemblies in which this product is used are irrelevant. A filter can look up availability of parts in a warehouse and mark missing ones, but it doesn’t need to know how to discriminate between required and non-required parts. A switch can decide whether or not it is necessary to look up a part’s availability.

Figure 6: Which parts numbers are required for assembly?
<product>
  <name>156 MINSTAC - Tube Fitting System for leakproof .156" OD tubing connections<name>
  <used-in>
    <part number="TUTC9531910L"/>
    <part number="TUTC9531915L"/>
    <part number="TUTC9531930L"/>
    <part number="TUTC9531960L"/>
  </used-in>
  <assembly-instructions>
    <!-- ...  -->
    <step number="3">Slip the <part number="TMBA9501910Z">Ferrule</part>
       onto pin of the <tool part-number="TTTA9500127A">Coupling
       Assembly Tool</tool>. <note>See page 129 for 1/4-28 Flat Bottom
       Ferrule</note>.</step>
    <!-- ...  -->
  </assembly-instructions>
<product>

A switch can keep track of context while filters take action on data.

[Link to open this graphic in a separate page]

A fragment of a product description document, adapted from http://www.theleeco.com/EFSWEB2.NSF/0/28dbb3e7044c2f378525672f004d8070?OpenDocument

While we have identified switching as a common task, both chunking and chunk-processing are tasks which vary from application to application. Input to a pipeline usually comes from an XML document chunked by an XML parser, but fine-grained chunking (often called microparsing) can also be thought of as a way of transforming content. As an example, a microparser that can convert SVG to an explicit structure, such as in Figure 4 (or back), can simply be included as a filter in the pipeline. A number of standard chunker and “de-chunker” filters are included in the STnG distribution, and new ones are added as use cases are identified.

An important corollary to using chunker-filters is the ability to process arbitrarily structured text, not just XML. In extreme cases, input to a pipeline may come from a “parser” that simply reports a fictitious start element tag, followed by the entire content of a text document as character data with no XML structure, followed by a fictitious element end tag. All chunking in these cases is accomplished by filtering microparsers.

Chunking and streaming

Input chunking straightforwardly leads to output streaming: given adequately self-contained chunks, a processor can output the result of transforming a chunk of input as soon as it is finished with it, without waiting for additional input, or even end of input. Streaming is useful for three reasons:

  • reduced memory footprint of individual processors
    This is particularly important for inputs whose size is comparable with available memory. Inability to stream is sometimes cited as a drawback of XSLT in this respect and has inspired non-standard extensions in a number of implementations [SaxonP], as well as XSLT-like streaming languages [STX].
  • reduced latency between processors
    Data are available to consuming processes (or users) before the entire input has been processed.
  • exploitable parallelism
    In multi-tasking environments, or when processes run on separate physical machines, parts of a pipeline can execute concurrently.

Note that some filters cannot stream their output. A sort algorithm must wait until it reads the last chunk of input before writing the first chunk of output. Inasmuch as streaming is an implementation detail of a specific filter, a pipelined framework should allow streaming and non-streaming filters without discrimination. Benefits of streaming are manifest even in pipelines that include some non-streaming components.

Use existing languages to write processors

There are plenty of general-purpose and domain-specific languages in use today. To justify inventing a new language, one must show convincingly a substantial benefit, as the costs are high and clear: cost of implementation, barriers to adoption and difficulties integrating “legacy” components (i.e., those written in existing languages). Elsewhere in this paper, we have observed that one of the most successful modular processing frameworks, the UNIX shell, specifies no processing of its own. Conversely, awk does specify a language for data manipulation. It has been our experience that the procedural aspects of awk are lacking compared to other procedural languages; it is awk’s ability to chunk input and direct chunks for different processing that makes it so useful, not its procedural logic.

We have already mentioned that we use XPath for pattern matching. Initial chunking is normally performed by a standard XML parser. For filtering, we defer to three existing facilities: XSLT transformers, the underlying operating system, and the language in which the framework is implemented.

For subtrees that exhibit explicit marked-up structure, XSLT templates are often the easiest way to specify a transformation. XSLT stylesheets can be included verbatim or by reference in a pipeline, and branches of an input document that match specified patterns will be transformed according to templates in those stylesheets. A simple use of the XSLT filter is to break a document that is too large to fit in memory into a number of smaller chunks, each of which can be transformed with XSLT.

On operating systems that support the notion of standard input and output streams, external OS-level filter processes can be invoked from within a pipeline. These filters may or may not be XML-aware. If the filter is not XML aware it can still be used simply by wrapping it in a component which strips XML markup or translates it into a form that the filter expects. In this way, a pipeline can incorporate filters that are shipped in binary-only form.

Where procedural logic is required for processing, chunkers and filters may be written as routines that can be invoked from the pipeline engine. Issues of integrating components written in different languages within a single OS-level process are beyond the scope of this work. Our implementation, described in the next section, is written in Java, and as such supports standard Java compile- and runtime linking mechanisms. Future implementations in other languages may offer other linking models.

Irrespective of implementation language, two standard APIs are supported for custom processors, [SAX] and [DOM]. Other tree-oriented models such as [JDOM] and [XOM] can be accommodated.

Implementation

In the first part of this paper, we have enumerated our requirements for a streaming processor and the rationale behind them. The following sections describe our implementation of STnG that is based on these requirements. STnG is written in Java and supports standard Java APIs and frameworks for XML: SAX, DOM, and JAXP. Where possible, we tried to use existing Free Software components instead of writing our own. In particular, we used an XPath implementation that is part of the Saxon XSLT processor from Michael Kay [SAXON]. STnG is distributed under the GNU [GPL] and is available for download from the STnG website, http://stng.sf.net.

STnG processing model

SAX2 is the underlying streaming model used in STnG. A pipeline starts with a SAX parser and ends with a SAX content handler. The content handler is typically a serializer that outputs the result of a transformation as an XML document. Between the parser and the serializer, a STnG pipeline specifies filters that affect information items that pass through the pipeline and switches that control the flow of these items. The smallest chunk of information that can pass through a STnG pipeline is a SAX callback.

Switches and cases

The most important component supplied with STnG is an implementation of the switch that was discussed in Subsection 1-3-2. A switch consists of a number of cases and an optional default.

Figure 7: A STnG switch
<st:switch context="preceding-sibling">
  <st:case xpath="a" recursive="false">
    <!-- filters -->
  </st:case>
  <st:case xpath="b" recursive="true">
    <!-- filters -->
  </st:case>
  <st:default>
    <!-- filters -->
  </st:default>
</st:switch>

Cases are expressed as XPath patterns. An event that passes through a switch will be compared, in turn, against all cases in that switch until an XPath pattern is found that matches that event. In contrast with XSLT, no precedence rules are defined: the first case that matches is selected. A case may be recursive, meaning that all events that represent child nodes of an event that matched this case will be routed to this case without further matching. A recursive case with pattern “p” is functionally equivalent to a non-recursive case with pattern “node()[ancestor-or-self::p]”, if it appears as the first case in a switch.

A switch operates by building a tree representation of the input document. As events are reported to switch, the more memory is required to hold their corresponding nodes. In many situations, these nodes are not required for processing — e.g., if they are not addressed by patterns in any cases. To minimize runtime memory usage, a switch may discard some nodes. In the current implementation, the user may specify one of three discarding strategies.

  • A switch may keep all nodes it has been notified of. This makes all nodes on the “preceding” axis available in addition to the current node.
  • Upon receiving notification of an element’s end tag and routing that event to the appropriate filter, a switch may discard that element’s children. This makes nodes on the “preceding-sibling” axis available in addition to the current node.
  • Upon receiving notification of an element’s end tag and routing that event to the appropriate filter, a switch may discard that element altogether. This makes nodes on the “ancestor-or-self” axis available.

Note that the amount of context a switch keeps affects functions such as “position()” and “count()”.

The ContentFilter interface

The main interface in STnG, ContentFilter, is an extension of the SAX ContentHandler interface. A ContentFilter is a handler that can report events to another handler (with or without modification). A ContentFilter takes notification of SAX callbacks such as start-element and characters, performs some processing based on these events, and reports these, or other events to a downstream content filter or handler. All standard components shipped with STnG, including the switch, implement this interface. For a discussion on important differences between this filtering model and traditional SAX filters, see Section 4-1 in appendix A.

STnG syntax

A STnG pipeline is expressed as an XML document. Figure 8 shows an example:

Figure 8: A sample STnG
<st:transform
     xmlns:st="http://ns.lib.aero/STnG"
     xmlns:java="http://ns.lib.aero/foreign/java.sun.com"
     xmlns:sax="http://ns.lib.aero/foreign/sax.xml.org/java/contentHandler/methods"

     java:class="TestTransform"
     java:package="uk.ac.ed.ltg.kari.stng.sample">
  <st:chain>
    <st:switch context="preceding-sibling">
      <st:case xpath="a" recursive="true">
        <st:switch context="ancestor-or-self">
          <st:case xpath="x:*[@*]" recursive="true">
            <st:dom-handler java:class="org.example.stng.DOMTransformer"/>
          </st:case>
          <st:default/>
        </st:switch>
      </st:case>
      <st:case xpath="b" recursive="false">
        <st:content-filter>
          <sax:characters access="public" return="void" arguments="char[] ch, 
            int start, int length" throws="SAXException">
            System.err.println (length + " characters read");
          </sax:characters>
        </st:content-filter>
      </st:case>
      <st:case xpath="processing-instruction()" recursive="false">
        <st:markup-stripper/>
      </st:case>
      <st:case xpath="c" recursive="true">
        <st:xslt href="file:///home/ari/STnG/test-data/ids.xslt"/>
      </st:case>
      <st:default>
        <st:passthrough/>
      </st:default>
    </st:switch>
    <st:unique-id-filter/>
  </st:chain>
  <st:content-handler java:class="uk.ac.ed.ltg.kari.xml.util.PrettyPrinter"/>
</st:transform>

Events flow through the pipeline from the top down; an event enters the pipeline and passes through each filter in the order they appear in the STnG document. If a switch is specified in the pipeline, the event is routed to a branch of that filter that matches the event.

Using XML to specify STnG pipelines makes it easy to interpret them — and not just for the runtime environment. An XSLT stylesheet is included in the STnG distribution that generates SVG visualizations of STnG pipelines. Figure 9 shows the same pipeline graphically.

Figure 9: An SVG visualization of listing in Figure 8
[Link to open this graphic in a separate page]

STnG uses a number of different namespaces to specify pipelines. The main namespace is “http://ns.lib.aero/STnG”. All elements and attributes that describe core STnG functionality are in this namespace. We used a separate namespace to describe features that are specific to the Java implementation. In the sample above, the content handler is specified by it Java class name:

<st:content-handler java:class="uk.ac.ed.ltg.kari.xml.util.PrettyPrinter"/>
We hope that this separation will make it easier to port STnG to other languages.

Compilation

STnG can operate in either interpreted or compiled mode. In interpreted mode, a runtime engine reads a pipeline specification and instantiates filters using Java reflection. In compiled mode, an XSLT stylesheet generates a Java source file that can then be compiled and executed by a standard Java compiler and runtime environment. An intriguing option for future experimentation is to incorporate compiled XSLTC translets [XSLTC] into compiled mode when XSLT stylesheets are used from within STnG pipelines.

Both interpreted and compiled modes have been prototyped. At present, development is concentrated on compiled mode.

STnG and XPath

STnG uses standard XPath [XPath] as conditions in switches. Using XPath allowed us to use existing components and relieved us of inventing a new tree-addressing language and our users from learning one. There are, however, a number of limitations on XPath in a streaming environment. While any XPath pattern can be used in a STnG switch, some patterns that would match nodes in XSLT match nothing in STnG, and some expressions return results that would be different if used in XSLT.

In STnG, the context node is always the event that is passing through a switch; no information is available on nodes that follow this node in the input document. This means that predicates that address nodes on the “following” or “descendant” axis will always fail. As an example, the pattern “a[b]” will never match any events. Depending on how much context is available to a switch (see Section 2-1-1 above), predicates that address nodes on the “preceding” or “preceding-sibling” axis may fail even if the document contained nodes that would have matched these predicates.

XPath functions such as position() and count() have similar constraints, and their results, too, depend on amount of context available to a switch. Note that the expression “position()=last()” always evaluates to true, as the context node is always the last node that has been seen by the switch.

SAX makes attributes available in the start element event, which means that predicates that address nodes on the “attribute” axis will succeed or fail as expected (e.g., “xsl:stylesheet[@version='1.0']”). Note, however, that since SAX does not define a separate event to report attributes, patterns that address attributes specifically (e.g., “@version”) will never match any events.

A real-world example using STnG

At the time of this writing, [SVG] is not accepted as a format for illustrations in EML [Extreme Markup Languages] papers. Proceedings of this conference are published in two formats: web and print. For web output, PNG bitmaps offer wide support in user agents. For print, SVGs can be converted directly without loss of quality. All of the graphics in this paper started their life as SVG documents; some were authored manually, others were generated from STnG documents. In this section, we present a STnG we used to convert the XML source of this paper into one that is valid with respect to the EML DTD, complete with the required PNG files.

In the EML DTD, graphic elements are declared EMPTY and have a figname attribute which points to an external graphics file. In our source document, SVG fragments appeared as children of graphic elements.

We chose this example because many of the readers will be familiar with the problem domain if not the specific DTD. This example is also interesting because it requires using different techniques to satisfy its simple requirements — no single tool can effectively handle structure-driven transformations, using external libraries and validation.

Figure 10: A STnG to generate PNG from inline SVG
<st:transform
     xmlns:st="http://ns.lib.aero/STnG"
     xmlns:java="http://ns.lib.aero/foreign/java.sun.com/"
     xmlns:svg="http://www.w3.org/2000/svg"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <st:reader>
    <!-- (3.1) set features and properties to control the parser -->
    <st:feature name="http://xml.org/sax/features/namespaces" value="true"/>
    <st:feature name="http://xml.org/sax/features/namespace-prefixes" value="false"/>
    <st:feature name="http://xml.org/sax/features/validation" value="true"/>
  </st:reader>

  <!-- (3.2) duplicate the stream - strip SVG form one, keep it in the other -->
  <st:tee>
    <st:chain>
      <st:switch>
        <!-- (3.3) build DOM nodes of document fragments that contain SVG -->
  <st:case xpath="graphic/svg:svg" recursive="true">
    <st:dom-handler>
            <!-- (3.4) use inline Java code for non-XML transformations -->
      <java:handle-dom
              access="public" return="void" arguments="Node node" throws="SAXException">
        try {
      PNGTranscoder t = new PNGTranscoder();
      OutputStream ostream =
                      new FileOutputStream(<st:value-of select="../@figname"/>);
      TranscoderOutput output = new TranscoderOutput(ostream);
      t.transcode(new TranscoderInput(node), output);
      ostream.flush();
      ostream.close();
        } catch (Exception e) {
      throw new SAXException (e);
        }
      </java:handle-dom>
    </st:dom-handler>
  </st:case>
  <default>
    <passthrough/>
  </default>
      </st:switch>
      <!-- (3.5) validate the result of a transformation -->
      <st:validate language="http://www.w3.org/XML/1998/namespace"
                   href="dtd/extremepaperxml.dtd"/>

      <!-- (3.2) split the stream again, generate HTML from one branch,
           write the other to standard output -->
      <st:tee>
  <st:chain>
          <!-- (3.6) use an external stylesheet to transform this part of the pipeline -->
    <st:xslt href="xslt/extreme.html.xsl"/>
    <st:serializer href="paper.html">
      <st:output method="html"/>
    </st:serializer>
   </st:chain>
       </st:tee>
    </st:chain>
    <st:chain>
      <st:xslt>
        <!-- (3.6) use an inline stylesheet to transform this part of the pipeline -->
        <!-- this stylesheet overrides a single template from extreme.fo.xsl -->
  <xsl:transform version="1.0">
    <xsl:import href="xslt/extreme.fo.xsl"/>
    <xsl:template match="graphic[svg:svg]">
      <fo:block>
        <fo:instream-foreign-object max-width="336pt">
    <xsl:for-each select="@scale">
      <xsl:attribute name="content-width">
                    <xsl:value-of select="."/>%</xsl:attribute>
    </xsl:for-each>
    <xsl:copy-of select="svg:svg"/>
        </fo:instream-foreign-object>
      </fo:block>
    </xsl:template>
  </xsl:transform>
      </st:xslt>
      <st:serializer href="paper.fo"/>
    </st:chain>
  </st:tee>
  <st:serializer>
    <st:output doctype-system="EML2003KRUP0522.dtd"/>
  </st:serializer>
</st:transform>

(Numbers in parenthesis refer to descriptive sections below)

This STnG reads a source XML document and produces:

  • A set of PNG files
  • An HTML document with links to the generated PNGs
  • An XSL-FO document with inline SVG
  • An XML document which is identical to the source, except that all inline SVGs are stripped
Below is a description of components or “stingies” used in this transformation and how they interact to produce the desired output. Section numbers correspond to numbers in parenthesis in Figure 10.

Controlling the parser

STnG is built around SAX2. Transformations start with a SAX XMLReader that generates evens that will flow in the pipeline. If no reader is specified in a STnG, a system-default reader is used. You can control the reader’s behavior through standard or custom SAX properties and features. In this example, the parser is configured for namespace-aware validation.

Applying several processors to one stream

It is possible to duplicate event streams in a STnG pipeline and apply different processing to each copy. In this example, we “fork” the stream twice: one branch strips SVG fragments and converts them to PNGs, while the second, unmodified branch is converted to FO with an XSLT stylesheet. The first branch is then validated against a DTD and duplicated again to produce HTML on one hand and XML on the other.

The name “tee” refers to a UNIX utility that performs a similar function.

Tree-oriented document fragment processing

STnG makes it easy to combine streaming and random-access processing models. In this example, STnG builds branches of the source document below a specified path as DOM nodes. The switch directs svg elements to a DOM builder, while all other events pass downstream unmodified. The case is recursive, which means that the switch will send all descendants of an svg element to the same handler without further XPath checks. Conveniently, this satisfies two requirements at the same time: removing SVG fragments from the document and building PNGs based on them.

A case in a switch can be left empty; in this case STnG will simply drop all nodes that match that case. Conversely, a DOM handler may modify the contents of the fragment it handles, or substitute another fragment for it and then send that new fragment down the pipeline.

Inline procedural code

To generate PNG, we needed to use an external library — in this case, Apache [batik]. While it uses an XML DOM, this part of the transformation can hardly be achieved with pure XML tools — to link to the library, we need at least some procedural code. Since STnG is compiled to Java source as an intermediate step (Subsection 2-3 above), it is easy to include literal Java statements in a STnG; the content of java:handle-dom is compiled into a Java method2.

Literal procedural code that is included in a STnG has full read access to the document context. Note the following line:

new FileOutputStream (<st:value-of select="../@figname"/>);
The value-of element is translated to a method call that evaluates the XPath expression at runtime. This string is then used to create a file as a destination for the PNG. Naturally, this string is different for every graphic. figname is an attribute in the EML DTD that specifies the path where EML expects to find an external graphic, so that is where this STnG puts the generated file.

While we have no immediate plans to port STnG to other programming environments, there is very little in STnG that is Java-specific. The features that are specific to Java are clearly marked with a separate namespace to make the distinction between core features and implementation-dependent ones as clear as possible. In much the same way i18ned messages often appear side by side, it should be possible to include a sibling element to handle-node, one in a namespace that refers to a different programming language.

Validation

Often, one wants to validate the result of an intermediate transformation. As often, one wants to validate some parts of a document but not others — a schema might only be available for certain element types, or the processor may treat parts of the document as obscure payload that will be validated by another component or at another site, or part of the input may come from a reliable source while others need debugging.

STnG makes it easy to insert validators at arbitrary points in a pipeline. A validator is simply another filter that can be inserted anywhere a filter can be used, e.g., in a switch. STnG uses the [JARV] architecture for validation; JARV implementations exist for XML DTDs, W3C XML Schema, and RELAX NG. In our example, validation occurs at the beginning and end of the pipeline — the parser is configured for validation though a SAX property; the XML output is validated by an explicit instruction.

While not strictly a validation feature, STnG makes it easy to to view intermediate results, which can often help debug a transformation. You can insert a serializer between any two filters to see what one is sending to the other.

XSLT

STnG can include XSLT stylesheets either inline or by reference; our example shows both.

To generate HTML output, we used an unmodified EML stylesheet because the document that was fed into that transformation conformed to the EML DTD. We couldn’t use the standard stylesheet for FO output because our input document included inline SVG which the stylesheet is not designed to handle. To include SVG in the generated FO, we overrode the XSLT template that handles graphics. The resulting stylesheet is small, and since it is not used anywhere outside this STnG, we included it verbatim.

Conclusion

The entire transformation requires only one pass on the source document and is comfortably described in one STnG. We could achieve the same effect with standalone XSLT stylesheets, Java programs, and perhaps a makefile to describe dependencies between these components, as well as between intermediate results. While for some elaborate tasks such complexity is warranted, STnG can simplify many common processing scenarios considerably.

Note that beyond complexity that grows with the number of different components required to accomplish a task, a standalone application would include considerably more code than the DOM handler fragment in this STnG, as it would need to instantiate and configure a parser and navigate to the desired fragments using custom code, as well as handle potential errors — all tasks factored out into STnG. More likely than not, this custom code would not be as robust as a standard, reusable component.

Appendix A: related work

SAX filters

SAX filters [SAXF] offer a way to combine SAX-based processing components in a linear pipeline. There are two important distinctions between SAX filters and STnG. First, the fact that SAX filters can only be chained in a one-dimensional linear pipeline means that components must maintain their own state. Second, the base interface for SAX filters, XMLFilter, derives from XMLReader, while the base interface in STnG, ContentFilter, derives from ContentHandler. A SAX filter “pretends to be” a parser, and programs interact with it as they would with one — by setting content and error handlers and calling “parse()” methods. An important upshot of this is that a SAX filter can only process entire documents — no smaller granularity or “chunking” is possible.

Figure 11: STnG (top) vs. SAX Filter (bottom) pipeline
[Link to open this graphic in a separate page]

Since SAX filters hide their content handlers and only expose methods related to parsing, SAX filters cannot be included as filters in a STnG pipeline without modification. However, since SAX filters normally encapsulate a content handler, required modifications to their source code are usually small. SAX filters that derive from XMLFilterImpl can often be converted to STnG filters simply by declaring ContentFilter as an implemented interface, since XMLFilterImpl already implements all methods required for ContentFilter (albeit in different interfaces). Note that a STnG pipeline as a whole can be compiled as a SAX filter.

STX

“STX is an XML-based language for transforming XML documents into other XML documents without building a tree in memory. An STX processor transforms one or more source streams of SAX2 events according to rules given in an XML document called STX stylesheet and generates one or more result SAX2 streams. Each incoming event invokes one or more rules that can, e.g., emit events to the result stream or access a working storage.” [STX] STX was inspired by XSLT, but attempts to remedy the latter’s requirement to hold an entire document tree in memory.

STX defines its own data model (based loosely on SAX, inasmuch as SAX can be a basis for a data model), its own language for addressing nodes in a document (STXPath) based on XPath, and its own language for generating output based on input.

Two prototype implementations of STX exist: Joost, a Java version by Oliver Becker (http://joost.sourceforge.net/) and XML::STX (http://stx.gingerall.cz/stx/xml-stx/index.html), a perl implementation.

XPipe

“XPipe is an approach to manageable, scaleable, robust XML processing based on the assembly line principle, common in many areas of manufacturing.” [XPipe] It is an ambitious proposal by Propylon’s Sean McGrath that extends from simple component reuse all the way to distributed p2p pipelines (XGrid) orchestrated by elaborate job control languages (XJCL). Sadly, the XPipe website hasn’t been updated since February 2002.

We find this observation from a section of [XPipeP] titled “XPipe philosophy” particularly in tune with some of our own:

  • Get data into XML as quickly as possible.
  • Keep it in XML until the last possible minute.
  • Bring all your XML tools to bear on solving the data processing problem.

Regular fragmentation

Regular fragmentation [regfrag] is a package written by Simon St. Laurent. It “is designed to allow developers to fragment chunks of element content into smaller pieces during the course of parsing with a SAX-compliant parser.” RegFrag chunks character data on regular expressions, wrapping elements around matching chunks. RegFrag is a SAX filter, intended for use in linear pipelines, and, as such, includes its own mechanisms for state-keeping.

Figure 12: A regular fragmentation
<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
  <element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
  <element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
  <element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

Work is under way to repackage regular fragmentations as STnG filters.

XML pipeline definition language

XML Pipeline [xpdl] is a W3C note written by Norman Walsh and Eve Maler. It describes linear pipelines that operate on entire documents (document-sized chunks, if you will) and is mainly concerned with the order of operations to be performed on a document and input and output redirection. In its operation, it is similar to the UNIX make utility.

Notes

1.

STnG stands for Streaming Transformations and Glue. It does not, as some have suggested, stand for “STnG, The next Generation.”

2.

The actual Java code required for SVG to PNG conversion was slightly linger; we trimmed non-essential parts of it here. Most of the code dealt with converting a generic DOM that DOMHandler builds to a Batik SVG DOM — a surprisingly involved procedure.


Acknowledgments

The work reported in this paper was made possible, in part, by by the UK Engineering and Physical Sciences Research Council under a grant to the University of Edinburgh, and in part by Sun Microsystems.

The author is indebted to Tiphaine Dalmas who took an early interest in STnG and provided valuable user feedback. She also suggested the word “stingies” to describe STnG components and wrote the code to generate SVG visualizations.

The author would also like to thank Tim O’Donnell and Henry Thompson, head of the HCRC Language Technology Group, for their helpful comments on a draft of this paper.


Bibliography

[awk] Aho, Alfred V., Brian W. Kernighan, and Peter J. Weinberger. “AWK — pattern scanning and processing language.” Softw. Pract. Exper. 9, 4 (April 1979), 267-279.

[batik] Batik SVG Toolkit, Apache Software Foundation. http://xml.apache.org/batik/.

[BegedDov 1995] Beged-Dov, Gabe, et al. The Pipeline Design Pattern. Rogue Wave, 1995.

[Dodds 2001] Dodds, Leigh. Parsing the Atom. http://www.xml.com/pub/a/2001/04/25/deviant.html.

[DOM] XML Document Object Model. http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/.

[GPL] Stallman, Richard, et al. GNU General Public License, Free software foundation. http://www.fsf.org/licenses/gpl.txt.

[JARV] “A vendor-neutral, implementation-independent and schema language independent interface for XML validators.” http://iso-relax.sourceforge.net/JARV/.

[JDOM] “The better mousetrap,” Java DOM. http://www.jdom.org/.

[KerPike 1984] Kernighan, Brian W., and Rob Pike. The UNIX Programming Environment. Prentice-Hall, 1984.

[Meunier 1995] Meunier, Regine. The Pipes and Filters Architecture. (Ch. 22) In Pattern Languages of Program Design. Addison-Wesley, 1995.

[regfrag] Breaking Element Content into Child Elements. http://www.simonstl.com/projects/fragment/; see also http://regfrag.sourceforge.net/.

[SAX] Simple API for XML. http://www.saxproject.org.

[SAXF] SAX Filters. http://www.saxproject.org/?selected=filters.

[SAXON] Kay, Michael. Saxon XSLT processor. http://saxon.sourceforge.net/.

[SaxonP] Saxon preview mode. http://saxon.sourceforge.net/saxon7.4/extensions.html#saxon:preview.

[sed] sed(1).

[STX] Streaming Transformations for XML. http://stx.sourceforge.net/.

[SVG] Ferraiolo Jon, et al. Scalable Vector Graphics. http://www.w3.org/TR/SVG/.

[XOM] Harold, Elliotte Rusty. XML Object Model. http://www.cafeconleche.org/XOM/.

[XPath] Clark, James, and Steve DeRose. XML Path language, 1999. http://www.w3.org/TR/xpath.

[xpdl] Walsh, Norman, and Eve Maler. XML Pipeline Definition Language, W3C Note. http://www.w3.org/TR/xml-pipeline/.

[XPipe] McGrath, Sean. “Big Aspirational Plan.” http://xpipe.sourceforge.net/.

[XPipeP] XPipe presentation. XML SIG NY, Feb 2002. http://xpipe.sourceforge.net/BinaryStuff/xpipeny.ppt.

[XSLTC] XSLT Compiler. http://xml.apache.org/xalan-j/xsltc/.



STnG — a Streaming Transformations and Glue framework

K. Ari Krupnikov [Research Associate, University of Edinburgh, HCRC Language Technology Group]
ari@cogsci.ed.ac.uk