All About Pattern Matching

Sam Wilmott
sam@wilmott.ca

Abstract

The content of XML documents is the poor relative of structured markup, and seems to not have been given the attention it deserves. More capable text processing tools are required, like more flexible pattern matching and better support for markup language processing. This presentation discusses the need for more powerful text processing capabilities, discusses different key features of pattern matching, and provides an example of improved pattern matching facilities in the Python programming language.

Keywords: Processing; Markup Languages

Sam Wilmott

Sam Wilmott is an innovator and implementer of programming languages and markup languages, with a long and productive experience in these fields. Sam has a long history in the publishing industry in all its forms, and is the architect of the OmniMark programming language.

All About Pattern Matching

Sam Wilmott

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004 Sam Wilmott. Reproduced with permission.

And Now For Something Completely Different

Textual data has become an increasingly important component of all data processing over the last 20 years. Markup languages, XML included, are the most obvious embodiment of this development. The wide-spread popularity of programming languages like Perl and Python is another embodiment. And of course, there's the Internet, which is largely a text-based application.

Text processing requirements are not well served by "main-line" programming langugaes like C++ and Java. Their support for XML is part of what's needed, but isn't all that's required all of the time. If we don't want to spend an inordinate amount of time developing our text processing applications, we have to use something like Perl, Python or OmniMark. But even these languages don't always provide us with as powerful tools as they could do.

This presentation starts with a background on programming languages, on text processing and on the close relationship of markup language technologies and general text processing technologies. It is followed by a discussion of the need of powerful pattern matching features and more declarative ways of processing marked-up data in our programming languages, and illustrations of improvements in both these areas using the Python programming language.

Text Processing And Markup Languages

Markup language description languages, like XML, are an important component of a wide range of applications. However they are not everything. General text processing functionality is also an important component of a wide range of applications, both in conjunction with an separate from the use of markup languages.

Why Do We Need Text Processing?

The increased widespread popularity of markup languages, especially XML, in the last decade has greatly helped in the handling of textual data. However, there are many aspects of text processing not fully dealt with by markup, including:

  • Where all the potential information in a document is not marked up, and later uses of the data require revision or enhancement of the markup. This is the normal state of any rich textual data for reasons of incomplete knowledge -- who knows what future users of data will want of it? -- and economics -- who can afford to investigate and allow for all future uses?
  • Where data has its own structure -- such as that in telephone numbers, most part number systems and specification languages -- and even when it's know this structure is of use, it's both uneconomical to fully mark it up, and doing so would compromise the clarity of the data.
  • Where data is being converted to or from some other form than XML.
  • Where interesting information is encoded in attributes.
  • Where there are better encodings of data than using XML. (Yes, Virginia, for a significant number of uses this is the case.)
  • And because what we can do is limited, in part, by the tools we have available to us.

The State of the World Vis-a-Vis Text Processing

Commonly used programming languages have not kept pace with the changing requirements of applications and their developers, especially in support of text processing. This is largely because the languages that are most widely used all have a non-text bias, or are descended from such languages -- primarily numerical, data base, or systems programming languages -- languages in which, traditionally, text processing played a minor role -- and it's hard, and even impossible in many cases, for a programming language to break free from its original design. Even current languages with a text processing intent, such as Python, could be beneficially improved in its text handling capabilities.

Because of the limits in what can done with an already designed programming language, text processing has been ghettoized -- in spite of the heroic efforts of many, including the members of the W3C and the teams behind programming languages such as Perl, Python and OmniMark -- either to functional libraries that provide limited functionality and don't always fit well into their containing language, or to specialized languages that are limited in their use and in their support of other kinds of functionality.

Original design isn't the only thing that limits programming language development and use. An equally limiting constraint is the pernicious influence of "familiarity" -- the often reasonable reluctance amongst users to have to learn new techniques when what they already know suffices. But what happens when the familiar doesn't suffice? Often it's the case that the familiar form is the one made available, even when a more effective tool could be provided. Later in this presentation there's an example of this phenomenon occuring in Python, in its pattern matching capability.

To a significant extent the emphasis on structured data processing in the XML community has contributed to the ghettoization of text processing in general. It's not that structure isn't important. It's just that it's not everything. In the XML context, the text, the content, is the poor relative of the text. Which is unfortunate, because after all, it's the data that the XML is there to serve -- structure of the XML sort is a technique for organizing and packaging the data.

The solution to the current shortcomings of major programming languages is not best dealt with by inventing yet another programming language. Doing so would not change that unfortunate situation -- such a language would sit in a corner and be ignored by most programmers. Text processing is a general requirement. Text, structured and un-, is increasingly part of all applications. Consequently, all programming language designers need to acknowledge the requirements of text processing.

So What Does This Have To Do With Markup Languages?

Markup language and general text processing technologies are closely related technologies -- joined at the hip, as it were -- and need to be considered together:

  • Markup languages are a tool: a tool used to organize and explicate the structure of data, typically textual data. For reasons including those stated earlier, structured markup is a partial solution to the problem of organizing and encoding textual data so that it is readily processible. The role of structured markup, and the tools for processing it, cannot be well understood in isolation from the other processsing requirements of the data.
    This is not to say that there are not a large number of applications where using structured markup and tools for processing it are not sufficient. There are, after all, a large number of applications wherein structured markup is of no use. It's just that there are a large number of applications were more than structured markup is needed. (The world is a big place.)
  • Markup language users include the implementors of markup language technology. Members of this community need to be aware of and have available the tools required to effectively implement markup language technologies. Text processing tools are a key component of what such implementers need to be aware of and use.
  • Technologies such as the Internet, network protocols in general, and data base systems are all "somewhere out there" with respect to markup language processing -- they are where data for processing comes from and where goes once processed. General text processing technology is different. It's not arms-length from markup language processing in the sense that these other technologies are. Rather it's used in the same process as markup language processing -- the tools for doing this processing, to be most widely effective, need to have both markup language processing and general text processing functionality available. Tools that have one or the other can be very useful, but are limited in their applicability.
  • The users of markup languages are currently, quite rightly, the majority of those who do text processing. And those users, again appropriately, largely identify themselves with this community. So they are the most appropriate community for discussing issues involving general text processing.

Things go wrong when it's not understood that markup language processing is often just part of an application's requirements:

  • There is such a things as excessive use of XML. In part this excess is the result of inadequate alternate tools, or lack of awareness of better alternate tools that address some of the problems posed by user applications. The markup language community and its reputation are not well served by inappropriate use of its technology. Excess parochialism detracts from the technology's credibility.
  • While XML is undoubtedly of widespread use in encoding data and process specifications, and while there are significant tools for displaying/presenting XML-encoded data, it's considerably less clear that XML is the most appropriate medium for the human entry either of data or of process specifications. Front-end tools obviate much of the data-entry burden, especially for textual data, but sometimes we just need a language that we can type in.
    Yet we see wide-spread use of XML for the human entry of processing specifications. Sometimes this is a matter of preference, and sometimes a marketing issue. But sometimes it's because tools are readily available for implementing such specifications in XML, and tools are not readily available -- they don't exist, they are not known, or they are hard to get to (e.g. too expensive) -- for alternative implementations.
  • Major and important projects are unnecessarily abandoned for lack of cost-effective ways of completing them. Markup languages have themselves enabled major and important projects: the Web being a case in point. But projects that require other implementation technologies in conjunction with markup language technology can be and are abandoned when those other technologies are not sufficiently supported. Worse still is situation where there is sufficient support for the other technologies, but those involved are not aware of them.

These are all issues that need to be considered by the markup language community. Not necessarily by everyone in that community, but certainly by those who consider themselves experts. This presentation attempts to discuss some of the requirements and possibilities of general text processing, and the discussion should be understood in the context of the requirements of complex markup language applications.

And for those of you who still think that text processing is just about markup: those who only know their own technology are not experts, they are salesmen.

One Step Forward -- Pattern Matching

It would be a step forward in supporting text processing to improve the pattern matching functionality in the programming languages we use.

About Pattern Matching

So what could be done about the shortcomings of major programming languages vis-a-vis text processing? One possibility would be to improve the support of pattern matching in these languages.

Many early non-text oriented programming languages are still well known: COBOL, FORTRAN, PL/I, Pascal, Algol, C, Lisp and Smalltalk. But the major text processing languages from the same era have largely been forgotten, especially SNOBOL4 and Icon, and they don't have any commonly known descendants. It's in these latter languages that we can find lessons for current use, especially in how they approached text processing and pattern matching issues.

Why pattern matching? Well for a start, pattern matching is the most natural way of processing textual data. Just about everything we find in text can be most easily described by patterns. And pattern matching can be used in a lot of ways, not just for text processing.

Pattern matching is primarily used for three purposes:

  • as a test to recognize or validate input data,
  • as a way of extracting data from input, and
  • as a way of processing/transforming data.

There are two models of pattern matching that currently appear in programming languages:

Regular Expressions

Regular Expressions expressed in the style used in the Unix "grep" program. This style is used in the Perl and Python programming languages.

Algebraic Expressions

Pattern matching incorporated into a language's general expression syntax. This style is used in the SNOBOL4, Icon and OmniMark languages.

This presentation uses these five languages for examples of the two basic forms of pattern matching, and as the basis for discussing features of pattern matching.

Regular Expression Pattern Matching

The Regular Expression form of patterns is more familiar and popular than the Algebraic form, primarily because of the languages and tools that support it.

Just by way of an example, here's a Regular Expression pattern that matches an XML start tag (including allowing ">" inside a quoted attribute value):

<[a-zA-Z]([^>"']+|"[^"]*"|'[^']*')*>

This pattern requires that the text it matches starts with a less-than and letter, and matches up to and including the next greater-than, allowing greater-than's and the "other" quote with quoted parts of the start tag.

(This example is used to illustrate a variety of features of pattern matching. It's not trivial, but it's not too large either. You might notice that when used with grep, Perl or Python, you have to quote and add extra escaping to the above pattern. That's a requirement of how strings are entered in the language itself, not part of pattern matching, so I'll ignore it for the purposes of this example.)

I'm expecting most of you to be familiar with this form from experience with grep, Perl or Python. If you're not, it works as follows:

  • Find a "<".
  • Find one of "a" to "z" or "A" to "Z" -- i.e. match a letter.
  • Find zero or more instances of what's inside "(" and ")*". The "*" says "zero or more, and the "(" and ")" bracket what's repeated.
  • Within the "(" and ")*", each time round, find one of:
    • one or more of anything but ">", """ or "'" -- the "^" says "anything but",
    • a """ followed by everything up to the next """, and the following """ itself, or
    • a "'" followed by everything up to the next "'", and the following "'" itself.
  • Finally, make sure you find a ">".

Regular Expression syntax packs a lot of expression into a small space, and it's familiar to a lot of people. The down side is that it's not easy to read, and it's not easy to see what is going on at a glance, especially as patterns get larger. Perl supports extra spaces in pattern to help with readability.

Algebraic Pattern Matching

Algebraic pattern matching probably won't be familiar if you haven't used SNOBOL4, Icon or OmniMark. These aren't the only languages that use it, but the others are even more obscure.

(The term "Algebraic" is used here largely because there doesn't seem to be a commonly-used word or phrase for this kind of pattern matching syntax.)

The key difference between Regular Expression patterns and Algebraic patterns is that:

  • Regular Expression patterns are typically a "sub-language". The patterns are a separate language, not part of the language in which they are used. For this reason, they are generally quoted in strings (Python) or delimited in some special way (Perl).
  • Algebraic patterns are part of the language. They are expressed using functions, operators, variables and string literals in the usual manner for the language in question.

The above example of matching a start tag is expressed in Algebraic pattern matching programming languages in quite a different way from the "grep" syntax. Most obviously, the pattern isn't itself a string, although it can contain strings. In SNOBOL4 it's:

"<" (&lcase | &ucase) arbno (break (">'" '"') | '"' break ('"') '"' | 
"'" break ("'") "'") ">"

In Icon it's:


="<" & any (&lcase ++ &ucase) & |(upto (">'\"") | 
"\"" & upto ("\"") & "\"" | "'" & upto ("'") & "'") & =">"

In OmniMark it's:

"<" letter ( [\">%"'"]+ | '"' [\'"']* '"' | "'" [\"'"]* "'" )* ">"

What's going on is the same as for the Regular Expression example.

You can probably see that these languages are quite similar in the approach they take. There's a bit of syntactic difference, but it's not a big deal, although it adds a small impediment to jumping from one language to another.

The first obvious difference from Regular Expression syntax is increased verbosity. It helps for a number of reasons:

  • It makes complex patterns more easily readable.
  • Processing can be incorporated into the pattern itself.
  • Pattern matching can readily be incorporated into other kinds of expressions.
  • Pattern matching syntax is easily extensible, both by the language implementers and by language users.
  • These advantages don't necessitate the performance hit of a fully interpreted language.

These features all put more of the pattern matching processing burden on the shoulders of the programming language itself.

Perl's pattern matching sub-language supports the essential functionality described here: you can evaluate Perl code within a pattern, and have expressions that evaluate to patterns within patterns. But not without a fully interpreted implementation.

Invoking Pattern Matching

To be used, patterns need to be applied to a string or source of input. There are different ways of doing this, applicable in different contexts and for different purposes.

The most obvious thing you can apply a pattern to, and what you're largely limited to in SNOBOL4, Icon, Perl and Python is a string:

SNOBOL4:

<string> <pattern> = <replacement> : <gotos>
<string> <pattern> : <gotos>

Icon:

<string> ? <pattern>
<string> ?:= <pattern>

Perl:

<string> ~= s/<pattern>/<replacement>/
<string> ~= /<pattern>/

Python:

sub (<pattern>, <replacement>, <string>)
search (<pattern>, <string>)

OmniMark:

<string> matches <pattern>
do scan <string> match <pattern> ...

As in (where the contents of the parentheses is captured in "x" where possible):

SNOBOL4:

"ab(cde)fg" "(" break (")") . x ")" : s(foundit)

Icon:

if "ab(cde)fg" ? (arb & ="(" & (x := arb) & ")" then ...

Perl:

if ("ab(cde)fg" ~= /\((.*)\)/) {
   x = $1; ...
}

Python:

result = search (".*\\(\(.*\\))", "ab(cde)fg")
if result:
   x = result.group (1)
   ...

OmniMark:

do scan "ab(cde)fg" match unanchored ("(" any** => x ")") ...

Some Pattern Matching Issues

There are a few issues that need to be considered before this presentation rolls to a close. These are really implementation issues, but they all have a major impact on what functionality is available to users. The main reason for presenting these issues here is so that users who are looking for pattern matching functionality in the programming languages hey are using have an idea what they need and what they can live without.

Backtracking

The classical "grep" pattern matching algorithm attempts to find the longest string that can be matched by the pattern it's given. So, for example, ".*a" will match up to and including the last "a", not the first one.

Matching the longest string requires the algorithm to shoot past what it's looking for -- to see if there's another one -- and then back up until it stops at the last match. This is called backtracking. So it can be expensive, especially when the input text is large.

Backtracking is useful. However, it is (arguably most) often the case that it's the next whatever that you want to find, not the last. As a consequence, the grep variants have introduced "non-greedy" versions of "*" and "+" (i.e. "*?" and "+?"). So, for example, ".*?a" matches up to and including the first "a", not the last one. Another argument against backtracking is that it's relatively easy to write a pattern that finds the last whatever without using backtracking.

Another, slightly more subtle use of backtracking is to retry alternatives when a later part of a pattern fails. For example, the grep pattern:

(a|ab)bc

applied to the input:

abbc

firstly succeeds in the "a" alternative of the "or", and succeeds in matching the "b" following the "or" group, but fails to match "c". It then backs into the "or" group and tries the "ab" alternative, and this time succeeds to the end.

This second use of backtracking is harder to argue against -- many people find it "obvious". However, one can live well without it, because the non-backtracking form (that only tries the first successful alternative) is easily explained, because a programmer can write patterns that don't need backtracking (such as abb?c for the above), and because non-backtracking algorithms generally run faster than backtracking ones.

This is not intended as an argument for or against backtracking, but rather as a discussion of the issues.

The Joy of Being Lazy and Gluttonous

If one wants to match things in a large chunk of text, the match-one-pattern-in-a-string approach to invoking pattern matching often fails. This is where "lazy" input streams come to the fore. A lazy input stream is one that delivers its input data a bit at a time -- like generators in Python deliver data items one at a time.

If one wants to match many separate things in input text, it's valuable to have the pattern matcher "gobble-up" what it matches, so that the remainder of the input is available to the next attempt to match. Gluttonous pattern matching is a good fit for lazy input data, because it's in large chunks of input data that one is most likely to look for and find many things.

The only language of the five discussed that fully supports lazy input and gluttonous pattern matching is OmniMark, although these techniques can be used to some extent in Icon and Python.

Scale issues

The lazy and gluttonous approach is really a scale issue: as things get bigger, the techniques that work for small things become less effective, and new ways of doing things are needed. In OmniMark, for example, there are three basic ways to invoke pattern matching, each of them working at a different level of scale:

  • The "input MATCHES pattern" form matches input against a pattern. A "matches" can be used inside any expression and is useful at that level.
  • The "do scan" and "repeat scan" matching statements combine the "matches" logic with a conditional statement, and are useful when what is being done is of the statement sort.
  • The "submit" and "find" facilities feed input to a set of rules -- which are essentially functions that are driven by matching input data rather than being explicitly called.

The division of functionality into the "statement" and "rule" level is used to good effect in XSLT, where it's applied to element and content selection -- sometimes defined by top-level "rules", and sometimes nested within other rules.

Not all these levels are required all the time, but it's useful to have them a lot of the time.

Implementing Better Pattern Matching

What we need is more text processing power in commonly available programming languages.

Pattern Matching in Python

Contrary to common expectation, it turns out that one doesn't need text processing specific features to have powerful text processing power. What one does need is for a language itself to be sufficiently general that text processing capabilities can readily be implemented in the language itself. This, in the same way that a programming language typically doesn't directly supply trigonometric functions, but is sufficiently powerful for trigonometric libraries to be implemented using the language itself.

Python makes a good first target for an attack from the Algebraic-philes. The latest releases of Python have much of the general functionality required to support powerful text processing.

Powerful text processing doesn't mean Python's "re" library. The Python 2.3 Tutorial says that the language's functionality is inspired in part by Icon, so the "re" (Regular Expression) pattern matching is a bit of a disappointment. It may be familiar to grep and Perl programmers -- and that's good -- but it's not really a good fit with the language. Amongst other things, significant parts of the advance functionality in Perl patterns isn't available in the Python implementation, largely because Python isn't fully interpreted the way Perl is, and so you can't readily embed Python code within strings the way you can in Perl.

A New Pattern Matching Model

An alternative to Python's "re" library would be to implement Algebraic pattern matching using (Version 2.2) Python's advanced function definition capabilities. Here's the example given earlier for the five languages, reformulated in the Algebraic form:

subject = MatchingInput ("ab(cde)fg")
pattern = NoneOfP ("(") [0:] & IsP ("(") & \
          NoneOfP (")") [0:] >> "x" & IsP (")")
if subject ^ pattern: ...

The first thing you'll see is that there's definitely verbosity! It isn't any more readable than the Regular Expression form, but then it's not notably less readable than the Regular Expression form either.

To help understanding, the functions used are:

  • "MatchingInput" converts a text input object (e.g. string or file) to a pattern matchable input.
  • "&" is an operator that matches only if both its argument patterns match.
  • "NoneOfP" matches any character that is not one of those given.
  • "[0:]" matches zero or more of the pattern it suffixes.
  • "IsP" matches a piece of text.
  • ">>" sets a dictionary entry named by its second argument to the text matched by its first argument, if any.
  • "subject ^ pattern" applies the pattern the given input, and returns True or False, depending on whether the pattern matched. "Subject" is an Icon term for a pattern matchable input.

("\" is Pythonese for continue the expression on this line on the next. The default is to end an expression at the end of a line.)

That's it. In this implementation, both patterns and subjects are "first class" values, meaning that one can assign them, pass them to other functions and repeatedly use them. Patterns are callable functions, which makes them particularly easy to use.

There's no actual need to assign the subject and pattern to variables -- the whole thing can be done in one statement:

if "ab(cde)fg" ^ NoneOfP ("(") [0:] & IsP ("(") & \
                 NoneOfP (")") [0:] >> "x" & IsP (")"):

The only thing wrong with giving a string (or MatchingInput) directly to a pattern match is that that's the only pattern that get's a whack at that input. Assigning a MatchingInput to a variable name and using that is best most of the time.

The real advantage of what's going on here is that it's very flexible in how it can be used.

Fun and Games with Python

To show how easy Algebraic pattern matching was to implement in Python, and to illustrate what's meant by a language with sufficiently powerful features to implement text processing, here's an implementation of the NoneOfP pattern matching function:

class NoneOfP (charset):
   # Match one character so long as it is not a member of "charset".
   def Match (subject):
      if not subject.AtTheEnd () and \
           subject.Buffer [subject.Pos - subject.Offset] not in self.charset:
         subject.Pos += 1
         return True
      else:
         return False
   def __init__ (self, charset):
      self.charset = charset

The two uses of NoneOfP with their repetitions in the previous example match up to and not including a "(" in the first case or to a ")" in the second case.

NoneOfP doesn't actually do any pattern matching itself, but rather returns a function that can do the pattern matching when required. What NoneOfP is provide the true matching (the "Match" method) with the pattern-specific information it needs to do its match: in this case the set of characters on which it is to "break", or stop its matching. All patterns return an object with a "Match" method that takes just one argument, the input to be matched against, so they can be used interchangeably.

Going Further

To keep things simple this presentation limits its examples to pattern matching on single-octet character streams. Most of what's said here applies to other interesting cases, including Unicode streams, variable-width characters (as in UTF-8) and non-text streams (e.g. streams of parsed markup tokens, lexical tokens of a programming or specification language, or binary data).

The implementation of Python pattern matching that accompanies this presentation isn't going to blow your socks off performance-wise. It is, after all, intended as a prototype and demonstration. However, it seems to be important enough a feature to make it reasonable to consider what needs to be added to the Python programming language (and others) to provide a better implementation. But that's another story.

Conclusion

So what's to be learnt from all this?

The thesis of this presentation is that basic text processing facilities are an important part of any "modern" programming langauge, and that such facilities are not that far away from us. Also that the features of apparently different programming languages can be usefully provided in the same context.

This has been a short description of how major features of text processing functionality can be readily provided in a sufficiently powerful programming language. In and of themselves these features are important and useful. But also they doesn't cover everything we could make good use of. But again that's another story.

Most importantly, what's wrong with programming languages vis-a-vis text processing is not that they don't have sufficient specific text processing functionality, but rather that those languages are not sufficiently powerful in general. Text processing doesn't need more ghettoization -- the world of programming languages needs to be more accepting of different programming techniques, and needs to provide the tools in which they can be implemented. And the last thing we need is yet another programming language.

So why aren't sufficiently powerful and general programming languages all around us? And why has it taken Python this long to get this far? Python is more the victim than the culprit in all of this. Most of the good ideas in programming languages are too well hidden -- out in plain sight, they're easy to find if you're looking -- but hidden nonetheless.

This presentation is very much a work in progress. I hope and intend to take it further.

The major contributor to hiding many good ideas is the way the industry focuses on individual techniques and aspects of technology: object-oriented programming, XML, Java, the Web. As valuable as all these are, they blind the industry to what else is going on in the world. In the words of Marshall McLuhan, "Knowledge creates ignorance."

Postscript and Links

If you want to find more about the language features discussed in this presentation, most accessible information appears as part of the description of languages that implement the features. The easiest places to start are the Python and Icon programming languages.

I strongly recommend looking at the Python programming language if you haven't already. If you are a Python programmer and the "curried" functions are new to you, they were introduced in Python 2.2. They are well worth looking at. If you've been wondering what people mean when they wax eloquent about "functional programming", this is their killer tool.

Another good source of ideas is the Icon programming language. It's where Python got its inspiration for its generators and iterators, and is the origin of the Algebraic form of pattern matching used in the above Python examples.

More information on each language discussed here is available from their respective primary web sites:

Icon: www.cs.arizona.edu/icon
OmniMark: developers.omnimark.com
Perl: www.perl.com, www.perl.org, www.cpan.org
The Perl 6 and the Parrot Project are moving Perl and Python in the direction of providing the core tools we need to advance our work: www.parrotcode.org
Python: www.python.org
This is one answer to getting rid of a lot of the problems with what we do: getting rid of the stack. Visit www.stackless.com
SNOBOL4: www.snobol4.com and www.snobol4.org

The Python work described in this presentation can be found at: www.wilmott.ca/python


All About Pattern Matching

Sam Wilmott
sam@wilmott.ca