When “It Doesn’t Matter” means “It Matters”

B. Tommie Usdin
btusdin@mulberrytech.com

Abstract

Opening plenary address

Keywords: Markup Languages; Modeling; Information Architecture

B. Tommie Usdin

Tommie Usdin has been working with XML and XSLT since their inception, and with SGML since 1985. Ms. Usdin co-chairs the Extreme Markup Languages conference, and chaired Markup Technologies and the first few years of the SGML'XX conferences. She was co-editor of Markup Languages: Theory & Practice, a peer reviewed quarterly publication published by the MIT Press. Ms. Usdin has led teams doing document analysis and DTD development for medical reference works, scientific and technical textbooks, industrial manuals, legal treatises, and historical literature. She has taught SGML and XML to executives, managers, technical writers, publications staffs, and typesetters. Her courses have varied from high-level overviews of the concepts underlying SGML and XML, to the impact of conversion to these markup languages on the workplace, the technical details of DTD development and maintenance, document analysis, how to tag and correct autotagged documents, and details of particular SGML and XML applications.

When “It Doesn’t Matter” means “It Matters”

B. Tommie Usdin [Mulberry Technologies, Inc.]

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 B. Tommie Usdin. Reproduced with permission.

A little while ago I had a very strange conversation, in which I was attempting to help the non-technical “owner” of a collection of narrative documents communicate with the programmer who had written some XSLT stylesheets for use with those documents. The document “owner”, a manager of a group of subject matter specialists who wrote and maintained the documents, was furious because the documents were getting mangled on display. He had tried to talk to the support programmer about the problem, and the programmer had, apparently, been completely non-responsive. By the time I got there they were both angry, defensive, sure that the other was irrational and stupid. After just a little investigation it became clear that the problem was that contents of the documents were not appearing in the end-use display version in the sequence in which they had been created. The writers of the documents (and their manager) were very upset about it. I didn’t actually find that difficult to understand; the creators of most prose — and most verse for that matter — spend a lot of time deciding on the sequence of structures such as paragraphs and lists. The programmer didn’t see what the problem was; all the text was there according to the rules of the document model. And it was true that the document model…. (Oops, I should probably be politically correct and call it a schema — at least a schema with a small “s”, even if it was a DTD and thus not a schema with a large “s”. Sometimes this stuff seems awfully silly.) Anyway ... it was true that the document model specified the sequence of the top two or three levels of the document hierarchy, and that below that anything could appear in any sequence as often as was needed. Which, he claimed, meant that the sequence, by definition, didn’t matter, so it didn’t matter if the XSLT stylesheet that converted the XML into the markup for the proprietary typesetting system rearranged them.

And the programmer knew that this was okay in XML. After all, I had told him that the sequence of attributes on an element didn’t matter, and that XML processors might resequence them without believing that they had made any change to the document, and in fact that they had not made a change to the logical tree that was the XML document. So, I had told him, nothing he wrote was allowed to count on the sequence of attributes. Changing the sequence in which element content appeared when the model is an optional repeatable or group seemed like the same thing to him.
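The distinction the programmer missed can be sketched in a few lines of Python, using the standard library's ElementTree parser. The element and attribute names here are hypothetical, invented only for illustration. The parser exposes attributes as a mapping, in which order is not part of the data, and children as a sequence, in which order is.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings that differ only in attribute order.
a = ET.fromstring('<step id="s1" level="2"/>')
b = ET.fromstring('<step level="2" id="s1"/>')

# Attributes are exposed as a mapping, so comparison ignores their order.
attrs_equal = a.attrib == b.attrib

# Two hypothetical bodies that differ only in the order of child elements.
x = ET.fromstring("<body><para>First</para><list/><para>Second</para></body>")
y = ET.fromstring("<body><para>First</para><para>Second</para><list/></body>")

# Children are exposed as an ordered sequence, so order is part of the data.
x_order = [child.tag for child in x]
y_order = [child.tag for child in y]
children_equal = x_order == y_order

print(attrs_equal)     # True
print(children_equal)  # False
```

A content model that says "any of these, in any order, as often as needed" does not change this: it leaves the order to the author, it does not erase it.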

And after his users started making a fuss he called a friend of his who had a lot more XML experience than he did, and asked if the sequence of elements in an or group mattered. And the friend said:

If the sequence had mattered they would have specified it. For example, in my invoices, the metadata has to come first, then information about the purchaser, then information about the seller, then information about the products. The sequence of that stuff matters, and it is specified. But inside the information about the products, for each product, I can supply product number, product name, quantity, unit price, and total price in any sequence I want. The people who receive the invoices will sort them out if they aren’t in the sequence they expect, so it doesn’t matter if I put them in alphabetical order by tag name, or shortest to longest, or leave them in the sequence they come out of the database in. Resequencing really doesn’t matter.

Our programmer had tried explaining this to his users, but they had all started talking at once, and getting very excited, so he decided to ignore them. Sometimes the best thing an IT person can do is to ignore the user’s requests and give them what they really need, after all. He found it easier to write and maintain an XSLT stylesheet that output all of the paragraphs before the lists, followed by the tables. Mixing them all up, the way the writers had, would have required a far more complex stylesheet. He took pride in the small number of templates in his XSLT stylesheets. Everyone knows that short programs are less likely to be buggy than long ones.
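What the stylesheet did can be approximated as follows. This is only a sketch in Python, not the programmer's XSLT, and the element names and sample text are assumptions made for illustration. The first function gathers content by element type, as his stylesheet did; the second simply walks the children in the sequence the writers created.

```python
import xml.etree.ElementTree as ET

# A hypothetical narrative section with mixed content, in author order.
doc = ET.fromstring(
    "<section>"
    "<para>Intro</para>"
    "<list>Steps</list>"
    "<para>Explanation of the steps</para>"
    "<table>Results</table>"
    "<para>Conclusion</para>"
    "</section>"
)

def render_grouped(section):
    """Emit all paragraphs, then all lists, then all tables."""
    out = []
    for tag in ("para", "list", "table"):
        out.extend(child.text for child in section.findall(tag))
    return out

def render_in_document_order(section):
    """Emit every child in the sequence the author wrote it."""
    return [child.text for child in section]

print(render_grouped(doc))
print(render_in_document_order(doc))
```

In the grouped output the explanation of the steps arrives before the steps it explains, which is the kind of mangling the writers were complaining about.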

But the users weren’t buying it. So, where was the problem? Well, there were a lot of problems here.

I think these last two are closely related; the differences between the two main categories of XML users, tools, philosophies, etc. are:

These two camps, not yet armed but clearly thinking about it from time to time, are fighting for control of the XML standards space in some interesting ways. For example, most agree that the 80/20 rule should govern XML specifications. But unfortunately many seem to mean by that: “we need a spec that provides my 80 percent of the possible capabilities but that doesn’t slow me down by providing your 20 percent”.

Loose and Tight Specifications

Let’s talk about loosely and tightly specified document types.

Ideally, it seems to me that a tightly specified document type can be thought of as one in which all knowledgeable encoders would create the same XML files given identical content. There is one and only one way to tag any particular content. Variations in tagging the same content are due to errors. Thus, any differences in correct XML for two instances of the document type are meaningful. Any time a specification allows more than one way to encode the same content, and there is no documented difference in the meaning of the two encodings, there is looseness in the specification of the document type. Two people with the same understanding of the content might produce different, correct, XML documents that have no difference in meaning. (The syntax may differ but the semantics do not.)1

And this is not only not a good thing, it is a very bad thing. Users keep demanding flexibility and extensibility, and in one sense giving them syntactical choices is giving them flexibility. But it is not giving them useful, usable flexibility; it is giving them headaches. They want flexibility in the content they create and communicate; they want rich semantics. They are not interested in flexible syntax!

I keep talking about syntax and semantics. Let’s take a side trip there. How many of you are sure that you know the difference between syntax and semantics? Oh, good. There are quite a few of you who know a moderate amount about it. (If you are new to this jargon-space you may not know that syntax is the format in which the data is encoded — XML syntax includes rules such as tags are enclosed in pointy brackets, and attributes reside inside start tags.2 If you have been in the jargon-space for a long time, you may be ready to talk about how fuzzy the line between syntax and semantics can get when you push the markup metaphor really hard.) So, what I’m saying when I say that “we want rich semantics not rich syntax” is that we want the ability to express more and more meaning, not that we want more and more complicated tagging schemes. I do not pretend to believe that you can tightly specify a document type if you limit the specification to that which can be checked using a validating parser. Such a tool can check to ensure that the document meets all of the syntax rules that were expressed in a DTD or other schema. But it cannot check to ensure that the content of the data means what it is expected to mean.

Inferred information

Another problem with loosely specified document types is the same as a common problem with loose specifications. I once worked with a group of technical writers who produced and maintained the documentation for a very complex type of machinery. Most users of the machinery worked with several of the machines and up to a dozen of the manuals from one time to another. So, the technical writing group had a specification (which they called a “style guide”) that said in what sequence things should be described, what needed to be defined and where the definitions should be put, and what the style and font were for each structure in the books. And there were, as I suppose is not surprising, several things that could be expressed in several equivalent ways. One of these, unfortunately, was warnings. If there was a warning associated with a step in a procedure list they could:

  • include the warning in the step to which it applied, typically in a box immediately after the title of the step but before the description of the step; or
  • put a boxed warning before the first step in the Procedure; or
  • put a box around the entire procedure list in conjunction with either of the boxed warning locations.

It turns out that most of the writers put the warnings inside the steps, but a few boxed the entire procedure and put the warning at the beginning of the procedure. They all agreed that the two displays meant the same thing; but most of the writers believed that it was ugly to box the whole procedure if the warning was in a step, and besides it was hard to get the box properly balanced. A few of the old-timers in the technical writing group persisted in using the boxed warning at the beginning of the procedure, and boxing the entire procedure, because it looked better.

Are you surprised to hear that the users of these manuals inferred from the two formats that one of these warning displays indicated that there was a more serious condition than the other did? The users were inferring something the authors had not implied.

Encoding Documents

If you give me several ways to encode the same content and don’t give me guidance on what each of them means, you have not given me increased expressive power, you have given me:

  • increased effort to encode (because I have to decide which method to use; I don’t want to think about that; I want to think about my content!)
  • increased likelihood that the recipients of my content will fail to use it or, worse, misinterpret it
  • reduced ability to add future complexity to the system by adding meaningful alternate encodings (by using them up in meaningless ways)
  • increased complexity of all down-stream applications that need to process the content.

Let me give a few examples. (First, let me apologize to the people who worked very hard on the specifications I am about to discuss. All of these were developed by hard-working people with good intentions, and all are being used successfully in some environments. And nobody likes to be called typical, or to have their work called out as an example of an unfortunate practice. So, I apologize. Now, get over it.)

We don’t want rich syntax, we want rich semantics. Variations in syntax without documented associated semantics are harmful rather than helpful. I don’t want ten ways to express the same thing, or even two ways to express exactly the same thing.

An example: I remember a table model (I think it was in an early version of the TEI, but it might just have been in a discussion draft of an early TEI version) that allowed users to model a table as either a series of rows that contained cells (implicitly creating columns) or as a series of columns which contained cells (implicitly creating rows). My first thought on seeing this was that it was very powerful. Sometimes the dominant organizing principle of a table is rows. Rows are totaled, rows are described, and if you are interested in only part of the table it is likely to be a group of rows. And sometimes the dominant organizing principle of a table is columns; it is columns that are totaled, a reasonable subdivision of a table would be a few columns, etc. I loved it! And I told one of the editors of the specification how powerful I thought it was. That distinction was important, and none of the table models I had seen to date allowed me to express the primary organizing principle of the tabular data. And he pulled out a hatpin and popped my balloon. I was misinterpreting the specification; it allowed either encoding because they were logically equivalent and there was no rational reason to require people to use one or the other. From either encoding the same matrix could be constructed for display or computation and the guidelines did not want to be unnecessarily restrictive. There was no difference in the meanings of the two approaches; I was inferring something they had not implied (or in another jargon, I was making an unlicensed inference). Oh. So this meant that I could receive the same data in dramatically different sequences and had to know that it was the same. This meant that there were no guidelines on which to use, and that tools that were to read, process, even compare document content had to be very complex. This was not cool. It was putting an unreasonable burden on text encoders (deciding which orientation to use without guidance) and an onerous burden on those intending to receive content from several sources and re-use it. Which was the major intent of the guidelines in the first place (as it is for most XML applications to this day).
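To make the downstream burden concrete, here is a minimal sketch in Python of what a receiving tool would have to do if both orientations were allowed. The element names (`row`, `col`, `cell`) and the sample data are hypothetical and are not taken from any TEI draft. Before two tables can even be compared, each has to be normalized to a common matrix.

```python
import xml.etree.ElementTree as ET

# The same data, encoded row by row ...
by_rows = ET.fromstring(
    "<table>"
    "<row><cell>1</cell><cell>2</cell><cell>3</cell></row>"
    "<row><cell>4</cell><cell>5</cell><cell>6</cell></row>"
    "</table>"
)
# ... and column by column.
by_cols = ET.fromstring(
    "<table>"
    "<col><cell>1</cell><cell>4</cell></col>"
    "<col><cell>2</cell><cell>5</cell></col>"
    "<col><cell>3</cell><cell>6</cell></col>"
    "</table>"
)

def to_matrix(table):
    """Return the table as a list of rows, whichever orientation was used."""
    rows = table.findall("row")
    if rows:
        return [[c.text for c in r.findall("cell")] for r in rows]
    cols = [[c.text for c in col.findall("cell")] for col in table.findall("col")]
    # Transpose the columns so the result is row-major.
    return [list(r) for r in zip(*cols)]

same_data = to_matrix(by_rows) == to_matrix(by_cols)
print(same_data)  # True
```

Every tool that reads, processes, or compares such tables needs its own version of this normalization step, and the encoder's choice of orientation carries no information in return.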

To be fair, when I was writing this talk I went to the TEI to check my facts on this, and found that in the current version of the TEI Guidelines (P4 for those of you who don’t follow this particular specification) the table rules are row-based only, and the guidelines simply describe how to encode tables. In P3, however, there is a footprint of this bizarre thinking:

It is to a large extent arbitrary whether a table should be regarded as a series of rows or as a series of columns. For compatibility with currently available systems, however, these Guidelines require a row-by-row description of a table.

Well, they have ignored the possibility of using these different orientations as information, but at least they have picked one. This will make interchange of data a lot easier!

Please don’t misunderstand me. My friends in the TEI world are not especially prone to this sort of sin. I call them out not because they did this silly thing, but because they saw the error of their ways and reformed. I could name a host of current XML efforts that are unreformed, unrepentant, and in fact ignorant of the error in their ways, who are doing the same thing!

Nested structures versus recursive structures

In many (perhaps most) of the groups I have worked with to design and develop document models, there comes a time when a decision on how to handle nested narrative structures must be made. There are basically two options:

  • recursion: sections contain titles, followed by paragraph-like stuff, perhaps followed by sections, which contain titles, followed by paragraph-like stuff, perhaps followed by sections, which … you get the idea
  • designated levels: section-1s contain titles, followed by paragraph-like stuff, perhaps followed by section-2s, which contain titles followed by paragraph-like stuff, perhaps followed by section-3s, which ...
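The two options carry the same structure, which a small Python sketch can show. The tag names `section`, `section-1`, and `section-2` are assumptions made for illustration. Renaming every designated-level element to a plain `section` yields exactly the recursive encoding, because the level is already implied by the nesting depth.

```python
import re
import xml.etree.ElementTree as ET

# Matches hypothetical designated-level names such as section-1, section-2.
LEVEL_TAG = re.compile(r"^section-(\d+)$")

def to_recursive(elem):
    """Return a copy in which every section-N element is named 'section'."""
    tag = "section" if LEVEL_TAG.match(elem.tag) else elem.tag
    copy = ET.Element(tag, dict(elem.attrib))
    copy.text = elem.text
    copy.tail = elem.tail
    for child in elem:
        copy.append(to_recursive(child))
    return copy

designated = ET.fromstring(
    "<section-1><title>A</title><para>p</para>"
    "<section-2><title>B</title><para>q</para></section-2>"
    "</section-1>"
)
recursive = ET.fromstring(
    "<section><title>A</title><para>p</para>"
    "<section><title>B</title><para>q</para></section>"
    "</section>"
)

same = ET.tostring(to_recursive(designated)) == ET.tostring(recursive)
print(same)  # True
```

Either encoding alone is workable. A document type that allows both forces every recipient to carry a conversion like this one.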

Most of you have heard of, if not participated in, these discussions. They tend to become very passionate, in my opinion for several reasons:

  • there are real advantages, at least to some people some of the time, to both approaches;
  • there are perceived disadvantages, at least to some people, of each approach (although when pushed to think about it, all knowledgeable people agree that the disadvantages can be overcome with some one-time work);
  • people who haven’t understood what has been going on in the meeting for hours think they understand this issue, and chime in (often dogmatically, often based on irrelevant experience).

The only wrong answer to this particular question is “either”. Unfortunately, this is the easiest answer for committees to agree on. Why? Because nobody in the design committee has to “lose”. And the committee justifies this by saying to itself “we’ll allow either approach because flexibility is a good thing”.

My skin, like the skin of most people my age, has scars. Some were caused by accidents, some by efforts to help me (surgeons may be helping, but they also cut holes in people that leave scars, you know).

Look at your favorite XML specification. How many places does it allow two (or more) options that mean the same thing? These are scars; some left by accidents, most left by efforts to help. However, like the scars left by physical wounds, they are not only ugly, they reveal structural weaknesses. Every time a specification development committee “solves” a problem by allowing two ways to do something it reduces the information-carrying potential of the information; it makes an internal committee problem into a user problem. This, it seems to me, may be expedient, but ignores the whole point of the exercise. So, I have a few general principles for those of you involved in specification development. By the way: by specification I mean not only the various flavors of schema we write (DTDs, W3C XML Schema, RELAX NG, and the other toddlers in the kindergarten), I also mean the specifications for those specification languages. It is just as harmful when there are several ways to express the same constraint in a schema language as when there are several ways to tag the same data in a document.

General principles for specification development:

  • specify as tightly as possible
  • specify as clearly as possible
  • recognize that it is likely that part of your specification will be machine validatable and that part of it may not be. That doesn’t mean that the part that cannot be machine validated is any less a part of the specification.

Do not specify more tightly than possible. The goal is to increase the expressive power of the language, not to decrease it.

By specifying as tightly as possible you:

  • …reduce the number of decisions needed when using the specification
  • increase the quality of interchange and the information value for users of the specification
  • reduce the mis-information that users will infer even if you didn’t imply it.

Don’t overdo it! Many of the people, especially technical writers, who hated SGML, and many who are coming to hate XML, are fighting to express themselves despite the constraints of overly limited document models. If you specify more tightly than “possible” you take expressive power away from content creators, just as you do if you provide too limited a vocabulary for them to express themselves. And what will they do about it? Sabotage. Tag abuse. Ignore the cool new system and use the un-cool but effective old one. As they should. Some of the under-specification I see is a direct response to experience with over-specified applications.

So, to those of you who work on/with specification development: make decisions! Specify as tightly as possible. Note: I am not asking you to tie the author’s hands; I am not asking you to specify more tightly than possible. If there is information that an author could create, allow a way to express it. If the invoice would mean something different when the number of pairs of shoes comes before the shoe sizes, then allow them to come in either sequence in the document type definition and have the creator of the invoice decide which sequence will convey the appropriate information. But if it means the same thing whichever sequence they come in, then pick one and require it! Under-specification causes different problems than over-specification, but it is no less dangerous.

Under-specification leads to increased costs to create, manipulate, and use information according to the specification. This is more difficult to see immediately than the problems with over-specification, but perhaps more dangerous.

If there is information to be conveyed in the sequence of the paragraphs, lists, and tables, then let the specification allow them to be in author-determined sequence. If there is information to be conveyed in which elements are used, either select one for each meaning or document the differences in meanings between the various elements.
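For the case where sequence really carries no information, "pick one and require it" might look like the following Python sketch. The field names and the chosen order are hypothetical, loosely modeled on the invoice example earlier in this talk. One function checks the required sequence; the other rewrites a loosely ordered record into it, which is safe only because the order was agreed to be meaningless.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the required order is an arbitrary choice.
REQUIRED_ORDER = ["number", "name", "quantity", "unit-price", "total-price"]

def in_required_order(product):
    """True if the children appear exactly in the required sequence."""
    return [child.tag for child in product] == REQUIRED_ORDER

def canonicalize(product):
    """Return a new element with the children in the required sequence."""
    rank = {tag: i for i, tag in enumerate(REQUIRED_ORDER)}
    ordered = ET.Element(product.tag, dict(product.attrib))
    ordered.extend(sorted(product, key=lambda child: rank[child.tag]))
    return ordered

loose = ET.fromstring(
    "<product>"
    "<name>Boot</name><quantity>2</quantity><number>B-7</number>"
    "<total-price>90</total-price><unit-price>45</unit-price>"
    "</product>"
)

print(in_required_order(loose))                # False
print(in_required_order(canonicalize(loose)))  # True
```

Applying the same rewrite to paragraphs, lists, and tables in a narrative would destroy information, which is why the two cases need different specifications.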

The poor misled programmer I talked about at the beginning of this talk had several problems. I can’t leave that poor guy without mentioning that the goal of having the smallest number of templates possible in an XSLT stylesheet is quite perverse. It is a recipe for heavy use of deeply nested for-each instructions, and for the development of un-maintainable pretzel code.

But that isn’t his biggest problem: He didn’t understand the difference between “it doesn’t matter; there is no information here” and “it can’t be specified because the content creator will supply it”. Given the current state of many of our specifications, I can see how he got so confused. I can even sympathize. But I think the appropriate solution is to specify those document types more tightly where nothing but confusion and expense are lost by tightening the specification, and to respect the information content that is left.

So, for the rest of this conference, and perhaps even after you go home, please listen carefully when someone says “it doesn’t matter”. Figure out if they mean “there is no information to be conveyed here” or if they mean “there is a great deal of information to be conveyed here”. Or if they don’t know which they mean because they haven’t thought it through. And if they haven’t thought it through, question it. Sometimes “It Doesn’t Matter” means “It Doesn’t Matter”, and sometimes “It Doesn’t Matter” means “It Matters”.

Notes

1.

At Extreme 2002 Anne Wrightson disagreed with this position, suggesting that it was "almost" right. The exception is that when a model (such as a schema or DTD) is designed to merge content created in several places, it might be a good idea to allow the "styles" of each of the source data formats. I agree that there are times (and models intended for consolidation of content from several sources are among them) when balancing requirements may make compromising this goal desirable. But that doesn't make the goal less desirable, even in these situations; it is one of several competing design goals, and simplicity of data conversion may be another.

2.

During the presentation of this at Extreme 2002 Ken Holman objected to my descriptions of syntax and semantics, insisting that the use the application made of the data was the semantics. I disagreed then, and I disagree now. This “backwards” view is only possible if the people who create the data have a consistent view of the intended semantics of the content. Even in the case that an application does something other than what was intended with data, that can only be successful to the extent that data was created with a consistent enough semantic view.

