What programming language designers should do to help markup processing

Sam Wilmott

Abstract

Programming languages can either directly implement application-specific functionality or provide tools for implementing such functionality. Typical languages do both. There is a particular tool — coroutines — that is missing from mainstream programming languages. The availability of coroutines in these languages would help significantly in a wide range of applications, including markup language processing and the text processing operations usually associated with it. This presentation describes what coroutines are, what their role is, various attempts at implementing them, and what it will take to make them available.

Keywords: Processing; Programming

Sam Wilmott

Sam Wilmott is the lead programming language researcher at Stilo Corporation, and architect of the OmniMark programming language. He has worked on document markup standards since the late ’70s, has served as Canadian representative on the ISO SGML committee, has implemented an SGML parser, and has built the interface between an XML parser and the OmniMark language.

What programming language designers should do to help markup processing

Sam Wilmott [Stilo Corporation]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 Sam Wilmott. Reproduced with permission.

Some background and context

The general thesis of this presentation is that to be effective tools, programming languages need to fit the needs of the applications implemented in them. This can be done in two ways: being specialized to an application domain, or being general.

A programming language can directly implement required functionality. This is the role of domain-specific or specialized languages. A programming language can provide the tools with which required functionality can readily be implemented. This is the role of a general-purpose language. In real life, almost every programming language is a mixture of specific and general features:

  1. there are some things they do well and easily “right out of the box”,
  2. there are some things that can readily be implemented with the tools they provide, and
  3. there are some things that are hard to do — this doesn’t, in and of itself, indicate a shortcoming in the language — it’s just a fact of life.

Early computers were primarily used to process numbers in scientific, engineering, or commercial contexts. Programming languages had two goals:

  1. to make it easier to do numerical processing, and
  2. to make it easier to control the resources of the computer — “system programming”.

Early programming languages include:

  1. Fortran — General (numeric processing)
  2. Cobol — General (numeric processing)
  3. C — Specialized (bit fiddling/system programming)
  4. Lisp — Specialized (data structure manipulation)
  5. Snobol — Specialized (text processing)

Increased availability of more computer resources — most notably the introduction of more powerful graphics facilities — inspired two complementary trends in the ’80s and ’90s:

  1. the availability and use of “libraries” of pre-packaged services, and
  2. object-oriented programming.

Object-oriented programming represents a major step away from a numerically-oriented approach to computer programming, towards a more general approach to computer-stored data. It is a major programming feature, in that it has a significant impact on how one writes programs.

That’s where things now stand, more-or-less, at least with respect to “main-stream” programming languages: C, C++, Java, and C#. Specialized programming languages are still around — Python, Perl, etc., and, of course, OmniMark — which support alternative styles of programming.

The distinction between general and specialized programming languages continues to exist, and probably always will. Specialized languages can afford to be experimental — you can always go back to the main-stream languages when they don’t do the job. General-purpose languages are conservative — you do similar things in similar ways in C++, Java, and C#, with variations only in the details. Interesting features can be found in most specialized programming languages — features that we’d like to see in general-purpose languages, but which those languages’ innate conservatism keeps from joining the main-stream. However, coroutines are such a basic functionality, of such general use, that they should be part of all main-stream general-purpose languages.

About coroutines

The particular thesis of this presentation is that coroutines should be part of main-stream programming languages:

  1. They are a well-tried facility — they’ve long since passed out of the realm of the experimental.
  2. They are a basic language facility, like control structures (conditionals and loops).
  3. Most importantly, coroutines are useful.

For the text-processing and markup-language processing community, coroutines are especially important — there are many examples where they are the best way of doing things.

The following illustrations attempt to explain the what and why of coroutines.

Many information-processing applications actually consist of two or more processes, with one process producing data and another making use of or consuming it:

[Graphic]

(The idea here is that the individual processes travel from left to right as they do their work — it might have helped if there were arrows on the blue lines pointing to the right.)

There’s nothing particularly markup-processing-specific about this. Rather, it describes not only the relationship between a markup parser and its client, but many of the other related activities, such as filtering and converting data both on input and output. The producer-consumer relationship between parts of a program, or between programs, is very general:

[Graphic]

There are two key points to note about coroutines. First of all, the processes in all these illustrations are synchronous. This means that there is only one process running at a time, and control is passed back and forth between them explicitly. There’s no explicit synchronization mechanism required when using coroutines.

Secondly, the relationship between processes is intrinsically symmetric. For example, this is the first illustration flipped upside-down — in spite of the relabeling, it describes the same processes and the same relationships:

[Graphic]

Coroutines differ from functions in their implementation in requiring a different organization of the information usually stored on the computer’s stacks. A typical function-based program has one stack:

[Graphic]

Each entry on the stack typically corresponds to the invocation of a function. “Frames” are pushed on the stack with each call, and popped off with each return from a function:

[Graphic]

Computer architectures since the early ’70s have supported this kind of program organization by providing machine registers and instructions that directly model and manipulate a stack.

The relationship between called functions and their callers is primarily encapsulated in what happens on the stack. Coroutines require two or more stacks, because each process has its own path through the program’s logic:

[Graphic]

The relationship between coroutines cuts across stack boundaries. Within both the producer and the consumer, functions may be called and may return, with data passed between the coroutines at any stage:

[Graphic]

The single most important producer-consumer relationship for markup language processing is that between a markup parser (an XML parser, for example) and the application that processes the marked-up data. For example:

[Graphic]

The ideal arrangement is that illustrated above, where the processing application can think of itself as being in charge, requesting information from the markup parser when needed. In practice, with XML parsers implementing the SAX interface, for example, it’s the markup parser that’s in charge:

[Graphic]

Worse — and here’s the key problem we have when using programming languages like C, C++, C#, and Java — is that the true picture is more like the following:

[Graphic]

What’s going on is that the SAX parser calls an application-supplied method for each piece of element information. It’s a function-call relationship, not a coroutine relationship.

Program state can be represented as data or as control-flow state. A function/method-based process requires all persistent state to be represented as data. In the SAX parser case, the application has to store away, in objects, where it is between each piece of element information. It can’t just ask the SAX parser for the next piece of information. In contrast, a coroutine-based SAX-like interface would allow persistent application state to be represented both as control-flow state and as data state.

Coroutines are not limited to two-process relationships. It’s often useful to have one or more filtering/conversion processes intervening between primary producer and consumer processes:

[Graphic]

And communication between coroutines is not limited to pairs. It’s commonly the case that some data requires an intervening processor, and some not:

[Graphic]

Producer-consumer relationships are typically a property of large (i.e., real-world) programs, so any small examples of coroutining code will require a bit of imagination and faith on the part of the reader. Here’s a very basic example of what programming with coroutines in C++ would look like. First, the “main program” — which is itself a coroutine; everything is — that creates a new coroutine and then starts talking to it:

Figure 1
Coroutine* the_other = new Coroutine (myself);
for (;;)
{
  the_other->resume ();
  if (the_other->terminated ()) break;
  // Use the data produced by the coroutine.
}

And here’s the coroutine’s code:

Figure 2
void my_coroutine (Coroutine* invoker)
{
  for (;;)
  {  // Do some stuff to produce some data.
     invoker->resume ();
  }
  // The invoker resumes when the coroutine exits.
}

The key thing to note about the two coroutines is their symmetry: they are both sitting in “for” loops while communicating with each other. Here’s a slightly larger example, giving, in outline, what a coroutine-based use of a markup parser would look like:

Figure 3
// The markup parser's client application.
void main ()
{
  Parser* markup_parser = new MarkupParser ();
  for (;;)
  {
    markup_parser->resume ();
    if (markup_parser->terminated ())
       break;
    switch (markup_parser->eventType ())
    {
      case elementStart:
        // invoke a function/method for the element
    } // etc. ...
  }
}

// A client-side handler for a chapter element.
void chapterElementHandler (Parser* parser)
{
  parser->resume ();
  if (parser->eventType () == elementStart
      && strcmp (parser->elementName (), "TITLE") == 0)
    // process chapter title
  for (;;)
  {
     parser->resume ();
     if (parser->eventType () == elementEnd)
        return;
     // process other chapter content
  }
}

The SAX model of markup processing uses control-flow state on the XML parser side of things, and uses data state on the client processing side of things. An alternative processing model is the DOM model of markup processing, which uses data state on both sides — the markup parser builds a tree, and the client application uses it. This makes sense for a wide range of applications. But where it makes sense is where the nature of the data encoded in the XML markup is in some “natural” way best represented by a tree structure. It’s the data, at the end of the day, that determines what kind of processing best suits it. Where you’ve got serial, narratively structured data, a more serial or streaming form of processing is more appropriate. This is where a SAX-like model is best, and where coroutining is desirable.

Implementing coroutines

There are a number of ways of going about getting coroutining. First of all, you can use threads. Threads do the job. Each thread has its own stack — that’s the key thing.

But threads are an expensive way of managing a tightly-coupled relationship between processes. Threads are asynchronous, whereas coroutines are synchronous. Besides the fact that coroutines can be easier to use — their user doesn’t have to think about the asynchronicity issues — there’s a difference in the processing cost of managing asynchronicity. We’ve measured up to a 12:1 difference in performance in tests.

Another approach to implementing coroutines is to do it oneself. After all, it doesn’t take much low-level code. Here’s what it takes to switch between coroutines on an Intel platform:

Figure 4
void CoroutineSwitch (coro_stack_info* fromCoroutine,
          coro_stack_info* toCoroutine)
{
   __asm {
      push   ebx
      push   esi
      push   edi
      mov    eax,fromCoroutine
      mov    dword ptr [eax],esp
      mov    eax,toCoroutine
      mov    esp,dword ptr [eax]
      pop    edi
      pop    esi
      pop    ebx
   }
}

(As a side note, the code illustrates a motivation for coroutines; switching happens fast — as fast as calling and returning from a function.)

(As another side note, the above code is what’s expected by the Microsoft Assembler. Following is the same thing — though not quite the same code — for “gcc”, as used under Linux.)

Figure 5
void CoroutineSwitch (coro_stack_info* fromCoroutine,
          coro_stack_info* toCoroutine)
{
   asm ("movl %ebx,-4(%ebp)");
   asm ("movl %esi,-8(%ebp)");
   asm ("movl %edi,-12(%ebp)");
   asm ("movl 8(%ebp),%eax");  /* fromCoroutine */
   asm ("movl %esp,(%eax)");
   asm ("movl 12(%ebp),%eax"); /* toCoroutine */
   asm ("movl (%eax),%ebp");
   asm ("movl -4(%ebp),%ebx");
   asm ("movl -8(%ebp),%esi");
   asm ("movl -12(%ebp),%edi");
}

As noted above, this form of coroutine switching performed up to 12 times faster than using threads. But the experiment failed — when used in a program that also used threading, it caused the program to crash under both Windows and Linux. It turns out that the Linux run-time library uses the stack pointer to identify which thread it’s in when accessing “thread-local data”. Moving the stack pointer around the way the above code does confuses its access attempts. The Windows run-time library seems to use the stack pointer to determine the edge of the “heap” when allocating memory (“new” or “malloc”), and moving the stack pointer around seems to make it think that memory has become corrupted.

There have been other attempts to directly implement coroutining facilities for C and C++:

  1. Copy out the whole of the computer’s stack on each control transfer. This is very expensive, especially when switching back and forth rapidly.
  2. Put the “stacks” of each coroutine one after the other, nested on the “real” computer stack. This may not work if the stack pointer is not at the edge of the “real stack”. In any case it is fragile, probably suffering from the same difficulties as our attempt.

One approach to coroutining is for a service to implement it itself. The Apache C++ SAX parser implements coroutining in its “progressive parse” functionality — call the “parseFirst” method to create the parser coroutine, and repeatedly call “parseNext” to resume it. But “parseNext” doesn’t return all of the interesting events — in particular, the opening and closing of input streams are not returned as “parseNext” events. So it’s useful for some purposes, but not for all — though it is definitely a step forward. And it’s, in general, an expensive way of providing such functionality: it puts a large burden on the shoulders of the programmer implementing the service.

Another way of implementing symmetric processes is to go beyond coroutining, to stackless programming languages, which don’t use a “stack” at all for invoking functions. In such languages, you can suspend a process at any time and then return to it later. “Stackless Python” is the currently popular embodiment of this idea, although the idea has been around for a long time: a programming language called Oregano did it in about 1971, for example.

Lots of programming languages support coroutines. Icon and Scheme are only two of many currently in use. But they’re all special purpose languages. None of the “mainstream” languages support coroutines. The argument of this presentation is that we need this functionality in the mainstream languages because they are used a lot and are still the best way of writing applications in many contexts.

So how do we get coroutines?

We need coroutines in our main-stream programming languages. Why? because:

  1. They are a major programming feature — they have a significant impact on how one programs.
  2. They are a tool for implementing user-level functionality.
  3. They have been around for a long time; there’s a great deal of experience with them.
  4. They would be very easy to add to all main-stream languages.
  5. They can’t be easily and reliably supported by a service library, object-oriented or not.
  6. They support a further major step away from the traditional numerically-oriented bias of the main-stream programming languages.
  7. They have been sadly overlooked.
  8. They would help significantly for the processing of markup languages and related work.

Why shouldn’t we just ignore the current so-called “main-stream” languages and go with languages that do implement coroutines? Because:

  1. The main-stream languages are widely used, are supported on a wide range of platforms, are generally stable, and are popular.
  2. They are currently used, and will continue to be used, in a wide range of applications that would benefit from the use of coroutines.

Why should the markup community worry about this stuff? Isn’t it the job of some other community? Maybe yes, but we can’t afford to wait for them:

  1. Academics know about this stuff — have for over 30 years — but they have more interesting things to do.
  2. Main-stream programming language vendors think things are fine the way they are.

The only folk with an interest in getting something done are the user community. It will take a while. But we’ve got to start somewhere.

So how are we going to start the ball rolling? Well:

  1. Get coroutine support added to main-stream languages — through the standards committees and compiler product companies.
  2. Encourage further research on implementing coroutines.
  3. Spread the word.
