The Dichotomy of Markup Languages

Sam Wilmott

Abstract

Markup languages serve a variety of purposes and uses. A useful categorization of these uses is to divide them into computer uses and human uses — what machines need and what people need.

Current markup languages and schemas tend to serve a mixture of computer-support and human-support purposes. This mixture is the cause of some of the difficulty users have with markup languages and schemas, and a source of confusion in both the development and the use of standards.

This presentation first describes the contrasting and common requirements of computer and human use of markup languages. It then goes on to describe how existing markup languages and schemas can be used within these application areas. Finally, it suggests what can be done in the standards area to help reduce the confusion.

Keywords: Markup Languages

Sam Wilmott

Sam Wilmott is the lead researcher at OmniMark Technologies, and architect of the OmniMark programming language. He has also worked on document markup standards since the late '70s, and has served as Canadian representative on the ISO SGML committee.

The Dichotomy of Markup Languages

Sam Wilmott [OmniMark Technologies]

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 Sam Wilmott. Reproduced with permission.

Who Are Markup Languages For?

This is the big question when it comes to determining the "why" of markup languages.

The answer is quite simple. Markup languages are for the use of:

  • computer programs, which create and consume marked-up data, and
  • human beings, who enter and read textual data as well.

Computers (or more properly, computer software) and humans share a number of requirements in common — we're denizens of the same universe, after all. At the same time, each has its own very distinct requirements. In determining the design of a class of markup languages, one has to take into consideration the specific needs of its users (computer or human). Good design also requires that the common features of computer and human use be taken into consideration — so that different classes of markup languages are not arbitrarily distinct.

In the following, the phrase "markup language" covers any generalized data encoding scheme that captures both data and the structure of that data. For human uses, that means what we're used to calling a markup language: XML etc. For computer use we can include a variety of similar ways of doing things, such as the "binary" markup language, ASN.1.

Characteristics of Computer-Use Markup Languages

Data formats for machine use — for computer-to-computer communication — should serve the needs of programming. To this end they should make writing the programs that parse and create marked-up data as simple as possible. The idea is to minimize the work required — by both the programmer and the computer — and to allow programmers to focus on the applications they are creating, rather than on using the marked-up or encoded data.

The following all contribute to making the job as simple as possible:

  • It should be easy to recognize, and to distinguish between, data and markup.
    It should be easy to look at a piece of data and say "this is data" or "this is markup".
  • There should be a minimum of variants in data formats.
    Each character or other piece of data, and each mark, should have, ideally, one encoding. This is generally called "normalization" — having a unique encoding for a piece of data.
    As well, similar things should be encoded similarly, reducing the number of different cases computer programs have to deal with even when there are a variety of different things being encoded.
  • Markup and character recognition should be context-free.
    This is another take on minimizing variations. For data, this means that the same character, for example, should be encoded the same way no matter where it occurs. For markup, and for "escaped" data, the same mark or escape should be encoded the same no matter where it occurs.
    Amongst other things, for marks, this means that different elements should have different names, even where context would readily distinguish between same-named elements in different contexts. In particular, a "title" of a "chapter" should be a different element, and should have a different element name, than the "title" of an illustration — unless, that is, they really are the same kind of element in every respect. (See the short example at the end of this list.)
  • There should be nothing in the data and markup that is not required for the transmission.
    Most importantly, this means that computer-to-computer data should not have extra "white space" or comments within it — neither serves the needs of the computer. If comment-like annotations need to be transmitted, for later information or other use, then these annotations should be marked up like any other data, because they are data — something is either data, or it is not.
  • Binary encoding is preferable, but not necessary.
    There is no need for a computer-to-computer transmission to be "clear text" — be directly readable by human beings. Where such reading is required, some kind of computer-to-human translation of the data can usually be done without a lot of difficulty.
    Generally, it's preferable to go with a "binary" encoding, where the "bytes" and "words" of the data stream are explicitly numbers — numbers are "computer friendly" in the way that words are "human friendly". Clear-text and non-binary encodings tend to require extra processing.
    That said, there's nothing really wrong with a clear-text encoding for computer-to-computer data, so long as it's kept simple.
  • Redundancy is a mixed blessing.
    Duplicated information is often just a waste of time. Anyone who's ever written a program to parse RTF knows all about unnecessary duplication of information and the trouble it causes.
    On the other hand, duplication can help — where the alternative is hunting around for information or having to store away information just in case it's needed later. Knowing where such duplication helps tends to be dependent on what use will be made of the data.
    So redundancy, for computer purposes, is generally a bad thing.
  • Data is serialized for transmission.
    Data is usually communicated between computers and between computer devices in a serialized or linear form: one bit or byte followed by another bit or byte, and so on. Serial communication is typically cheaper, easy to implement, and lends itself to high-performance techniques. Most of the things you plug into your computers these days are serial devices — the parallel devices are hidden inside.
    Convenience for transmission isn't the same thing as convenience for use. A lot of computer programming is better served by tables, by tree structures and other direct-access organizations of data. You really don't want a program searching for a file on your hard drive starting at the edge and examining the data track-by-track until it gets to the hub in the middle.
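
The context-free naming point above can be illustrated with a small DTD fragment — a sketch only, with hypothetical element names: a chapter title and a figure title get distinct element names, so a consuming program can dispatch on the element name alone, without tracking where it is in the document.

    <!-- Hypothetical DTD fragment: distinct names for distinct things,
         rather than one context-dependent "title" element. -->
    <!ELEMENT chapter       (chapter-title, paragraph+)>
    <!ELEMENT chapter-title (#PCDATA)>
    <!ELEMENT figure        (figure-title, graphic)>
    <!ELEMENT figure-title  (#PCDATA)>
    <!ELEMENT paragraph     (#PCDATA)>
    <!ELEMENT graphic       EMPTY>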

None of these are absolute requirements for computer-to-computer communication. Computer software can deal with violations of all of these criteria — but only with extra work — which can be costly and irritating when repeated across a variety of applications. The key point is, there is no good reason to violate any of these criteria.

Characteristics of Human-Use Markup Languages

Data formats for human use — for human-to-computer, computer-to-human or even human-to-human communication — have a substantially different set of constraints on their design.

Again, the overriding need is for ease in parsing (reading) and creation (writing). But it's humans doing it this time, not computers. And therein lies a difference.

Decoding vs. Reading

When parsing data, computers decode — humans read. Thinking about what "decode" means when people do it gives you an idea of the difference between these two activities:

  • Decoding is looking at each character and each markup delimiter individually, and assembling them into some sort of organization.
  • Reading is looking at whole chunks of data at once — whole words, whole phrases, whole marks — and interpreting what's read at this higher level.

Decoding demands ease at the character and delimiter level, but reading demands ease at the word and paragraph level.

Design Criteria For Human-Use Markup Languages

The criteria for designing human-use and readable markup languages are more numerous and less precise than those for computer-use markup languages. Are you surprised? You shouldn't be. Human beings are way more complex than current computers, and will be for a while yet.

Human-use criteria include the following:

  • Minimize noise.
    For human readability, this is the biggie. Nothing kills readability like having to pick your way amongst a forest of tags and fancy character encodings. "Noise" is anything that isn't the data (in an "as is" sense) and that doesn't need to be there for comprehensibility.
    This is where SGML wins over XML: just the ability to omit end tags in sequences of same-level elements greatly reduces the required tagging, and greatly reduces the "noise level" of a marked-up document. (See the short example following this list.)
  • White space is your friend.
    White space greatly improves readability:
    White space groups groups of things.
    White space separates separate things.
    White space isn't noisy — it is the absence of noise.
    In general, white space provides a strong visual clue as to the logical organization of information.
    But care has to be taken when processing white space: sometimes it's data and sometimes it's just white space — just put there to improve readability. You have to be able to distinguish between these two uses, in a manner that corresponds to human use, to be able to use white space for readability.
  • Meaningless optional variations are fun, but are generally not a good thing.
    Meaningless optional variations in markup and data entry occasionally help in human data entry. Markup that's appropriate in one context may not be appropriate in other contexts. People from different cultures — human cultures, or computer-use cultures — may be used to different conventions for encoding data. Dates are an example.
    Meaningless optional variation may be fun, but it's not necessarily a good thing. It can quickly get to be difference for difference's sake, and can make reading difficult — forcing the reader to switch back-and-forth between conventions. The less successful shorthand features of SGML tended to encourage arbitrary variation in markup.
    The main appeal of meaningless optional variations is political rather than technical — giving people a choice encourages them to feel that they are in control.
  • Meaningless required variation is no fun at all.
    The one thing worse than meaningless optional variation is meaningless required variation — forcing people to use different forms for the same thing in different contexts. With required variation, data entry operators and other readers are continually asking themselves "where the heck am I now?"
    Like meaningless optional variation, meaningless required variation has a political aspect, but a negative one — it seems to deny control to the human — with the tail wagging the dog.
  • Meaningful variation is a good thing.
    Meaningless variation shouldn't be confused with meaningful variation. Meaningful variation is, for example, the omission of a component when there's a legitimate case for data missing.
  • Big things should be big, and small things should be small.
    Tags for large structures can legitimately have long names. It's a lot easier to see large structures if their marks are correspondingly large.
    On the other hand, small structures should have small tags, or even shorthand marks. For small marked-up things, it's very easy for the markup to be as big as or bigger than the thing being marked up — and that makes the thing being marked up hard to read. It's a matter of making marks fit what they are marking up.
  • Common things should be small, and rare things should be big.
    Bigness and smallness notwithstanding, there's another consideration in how big a well-designed mark should be: how common or how rare is the thing being marked up? Something that's rare should have a largish mark, even if the thing marked up is small, to make it easier to figure out what it is — it's easier, with good design, to make a large mark (with two or three words in it, even) easy to figure out than it is a small one.
    On the other hand — and there's always an on the other hand, that's the whole point — commonly occurring things should have small marks. A comma makes a great list item separator, for example.
    What about commonly occurring big things you ask? You've said that rare small things can have large marks. So should common big things have small marks? Well, big things can't occur all that commonly, or at least the marks for them can't — their size means that relatively few of them will fit in a document.
    Really small marks have got a bad reputation because, like any system of marks, there has to be a conventional understanding of their meaning amongst their users, both human and computer. In the "good old days", pre-computers and when computers were first being used for text processing, there was a whole system of conventional marks for typographic properties — italic, bold, paragraph starts and the like. That's where the reverse "P" for paragraph break comes from. We'll probably not see a new set of conventions for really short marks any time soon, but it's a possibility sometime in the future.
  • There should be as little escaping as possible.
    Escaping is what you do when you don't have a key for a character or a mark. Using character references (e.g. "&lt;", "&#60;" or "&#x3C;") or using "<" to start a tag are both escapings.
    The point here is that although escaping is almost always necessary in a markup language, it's still not nice, and shouldn't be used unless there's no other way.
  • Escaping should be as uniform as possible.
    What escaping is used should have as little variation as possible. Amongst other things, this means that as few characters as possible should be "used up" in escaping. The fact that "<" and "&" start all of the marks and escapings in XML is a good thing. It might have been better had character entities looked like "<#lt>", "<#60>" and "<#x3C>".
  • Escaping shouldn't be ugly.
    Ugly means distracting — making reading harder.
  • Use context.
    This is the "last but not least" of this list. Human readers are good at adapting to context, and reading is generally made easier by judicious use of context.
    Don't confuse this with optional or required variation. "Good" context use doesn't mean having more than one encoding for a particular mark, but using context to qualify unique-in-context marks.
    There are many examples of context, and good uses of context:
    The above-cited examples of using a comma as a list separator, no matter what kind of list you're in.
    It makes sense to call a "title" a "title", no matter what it's a title of, and no matter what properties a title may have in different contexts.
    What needs escaping can depend heavily on context. At the simplest level the dichotomy of paragraphs and "as is" text parts draws a line between where "code" characters need escaping and where they don't.
    Context is the main indicator of whether spaces, tabs and new-lines are data or just white space.
    Making marks small often requires some help from context to be made unambiguous.
  • Clear-text encoding is necessary.
    Humans are used to reading certain kinds of codes — the alphabet for Western readers, for example. As much as possible, recognizability of the codes should be aimed for. On the other hand, newly invented codes are appropriate for some kinds of new use. For example, mathematical and chemical symbols work well in their domains of discourse.
    What the clear-text requirement does say is that "binary" encodings of data have no place in human-use markup.
  • Redundancy is a good thing — up to a point.
    Humans have no more use for unnecessary duplication than do computers. But some redundancy does help readability. And where humans prepare data, redundancy in the markup can be a significant aid in detecting errors.
    Redundancy is used to detect errors in computer encoding of data — that's what parity bits and cyclic redundancy checks are all about. But such checking is usually done outside of a markup scheme.
  • Text is serialized for reading.
    It's said that "a picture is worth a thousand words". But that doesn't take into consideration the cost of a picture. I'd say most pictures cost a lot more than do a thousand words.
    The point is, in spite of the appeal and importance of multimedia — be it the views, sounds and smells of a spring day, or the latest noisy movie or computer game — human beings communicate through language. And language is a linear medium: one sound follows another, one word follows another, one letter follows another.
    This doesn't mean that this is the way human beings work: we aren't linear beings, nor is our view of the world linear. It's just about how we communicate.
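
As a small illustration of the tag-omission point made under "Minimize noise" above, compare fully tagged XML with the same content in SGML, where the DTD declares the item's end tag omissible (e.g. <!ELEMENT item - O (#PCDATA)>). The element names here are illustrative.

    <!-- Fully tagged (XML): every end tag present -->
    <list>
      <item>first item</item>
      <item>second item</item>
      <item>third item</item>
    </list>

    <!-- SGML with omissible end tags: the next <item>, or the </list>,
         implies the end of the previous item -->
    <list>
      <item>first item
      <item>second item
      <item>third item
    </list>

The information content is identical; the second form simply has less markup for the eye to pick through.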

Writing Is Reading

The description of the human-use requirements for markup languages focuses on readability. You might argue that people write or enter marked up documents, and only read some rendering, formatting or display form of them — that they don't read the markup.

But writing is very much a complementary activity to reading. We read as we write. Readability is a very important requirement of any writing system. One can enter data, or write, using a system that's data-entry oriented and that makes the data hard to read, but then you're back in the land of encoding rather than writing — focusing on each individual piece of data, and not looking at overall patterns.

Comparing and Contrasting Computer-Use and Human-Use Markup Languages

The human and computer requirements for markup languages differ in many ways, but are also similar in many ways.

The Differences

The computer-use requirements and human-use requirements described above are quite different from each other, and actually conflict in a number of aspects:

  • Computers most easily recognize data and markup when examining data on a byte-by-byte basis — humans most easily recognize data in context.
  • Computer processing is impeded by extraneous filler (white space) and redundancy — human reading is often helped by extraneous filler and redundancy.
  • Computers most easily process "binary" encoded data — humans "clear text".
  • Only the uniqueness of codes and markup matters to a computer — for a human other design criteria are also important.

Based on these differences it's easy to see that one might want quite different markup languages for computer use and for human use.

However, if the computer-use and human-use designs end up close enough to each other, then it's probably a good idea to go back to a single design for both. Generally speaking, when there's a human vs. computer disagreement as to preferences, computers can conform to the constraints of human-use markup much more easily than vice versa — and compromises can be made that way.

What They Have In Common

Appearances notwithstanding, computer use and human use of markup languages have a lot in common:

  • They both abhor ambiguity.
  • They both have little use for arbitrary variation in a markup language — difference for difference's sake.
  • "Clear text" encoding (using readable characters rather than binary numbers) is acceptable to both: usable by computers, necessary for humans.
  • Both computers and humans prefer to communicate using serialized data (words following words) to non-serialized data.

The Role of Schema

Schemas, be they DTDs, XML Schemas or some other form, have somewhat different roles and statuses when used with computer-use and human-use markup languages.

DTDs and Schema have a variety of purposes:

  • They serve as a promise as to the form of a marked-up document.
  • They serve as a standard against which a document can be validated.
  • They serve as a guide in authoring.
  • They serve as a set of invariants against which a document can be parsed (allowing SGML-like markup shorthands).

The first two purposes (promise, validation) have a role in transmitting both computer-use and human-use marked-up documents. The last two (authoring, parsing invariants) are almost exclusively a matter for human-use markup languages.

The different roles of DTDs and Schemas vis-à-vis computers and humans lead one to think that there should be two kinds:

  • A description of the allowed elements and their structure, the allowed attributes of each element, and the allowed data forms (datatypes) for each.
  • A description of human input conventions, including shorthands and name aliases (the meaning of "title" in different contexts).

At present, the various standards for DTDs and Schema, to different degrees, mix these two roles, and none of them are either one or the other, or even particularly rich as both together.
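
As a rough sketch of what the first kind of description looks like on its own, here is a fragment in W3C XML Schema syntax (the "xs" prefix is assumed to be bound to the XML Schema namespace, and the element and attribute names are illustrative): it says what the element is, what attribute it carries, and what datatype its content must have — and nothing about how a human might conveniently enter it.

    <xs:element name="price">
      <xs:complexType>
        <xs:simpleContent>
          <!-- content must be a decimal number,
               e.g. <price currency="CAD">19.95</price> -->
          <xs:extension base="xs:decimal">
            <xs:attribute name="currency" type="xs:string"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:element>

No current standard offers an equally focused notation for the second kind of description — input conventions, shorthands and name aliases.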

A Short Aside — Surface Structure vs. Deep Structure

The differences between computer-use and human-use markup languages echo a property of human languages postulated by Noam Chomsky in the '60s: the presence of both "surface structure" and "deep structure" in language:

  • Deep structure is an underlying grammar that captures information. The deep structure of the languages we speak has a lot of commonality. It's even been postulated that our DNA codes some of the major properties of human language deep structure. On the other hand, in the same way that we each see the world slightly differently, each of us applies a (slightly) different deep structure to the language that we hear and read.
  • Surface structure is the structure of the sounds of our language. It describes each of our languages, and the differences between them. It's the "grammar" we learn in school.
    Surface structure and deep structure are related to each other by what Chomsky calls a "Transformational Grammar" — a mapping between the deep structure and surface structure — and the other way around. Formally, surface structure is defined in terms of deep structure.

In short, deep structure is about what's common between human languages — surface structure is about what's different between them.

The deep structure and surface structure dual view serves us well in thinking about the differences between computer-use and human-use markup languages:

  • There's one deep structure. The deep structure captures the actual information content of structured data.
  • There can be many surface structures. The surface structure is a means of conveying information between people, in a culture-specific manner.

The culture-specific appropriateness of different surface structures notwithstanding, we humans need common languages to understand each other. And we need common human-use markup languages for the same kinds of reasons. And we're of one technoculture in any case.

Developing Markup Language Standards For Wider Use

How do current developments satisfy the needs of humans and computers?

Where Are We Now

First of all, a markup language and its schema mechanism together form a language — or more properly, a language definition mechanism, because each DTD and each XML Schema defines a markup language. So we've got to look at them in pairs, which is mostly XML and something else:

  • "Well formed" XML — DTD-less XML — is close to a computer-use markup language — it doesn't make much concession to human use other than it being clear text. But there are still some human-use factors:
    White-space handling, as indicated by the "xml:space" attribute, is a human-use issue.
    Optional variants in attribute value delimiters (quote vs. apostrophe) are a human-use thing as well — a convenience for the human, but of no use to computers.
    XML's built-in character entities are occasionally optional variants, but their primary use (e.g. using "&lt;" in text, because "<" isn't allowed except as part of markup) is where they are the only or the primary form allowed.
    XML use is closest to satisfying its computer-use requirements when it's a direct representation of the XML "information set" for the document.
  • XML's DTDs serve primarily the promise and validation role of DTDs and schema, and so fit the computer-use model.
    Entities can be used to introduce a bit of optional variation, but their primary use is in defining cross-document linkages, which is very much consistent with computer use. However, putting entity definitions in the DTD is very human-use oriented. A computer-use linkage is best served by embedding standard-form linkages (URLs, XPath specifiers) directly in the markup (usually in the attributes). DTDs make life easier for human use by factoring the details of linkage information out into the DTD, adding an extra reference step through the entities' names, and leaving the human user with only the entities' names to deal with in the marked-up document.
  • SGML's DTDs and SGML Declarations go much further than XML's DTDs in serving the needs of human use. Quite a bit of markup shorthand and optional variation is available.
    XML's residual human-use features (at least in conjunction with XML DTDs) are largely a hold-over from its SGML heritage.
  • XML Schemas add to the functionality of DTDs in both computer-use and human-use directions:
    Data type validation — checking what's allowed as data in different contexts — is consistent with computer-use. It's about what's allowed as data and information.
    Contextually defined element types are a human-use markup thing. They are about using friendlier and shorter names based on where they occur.
  • Relax NG takes the human-use aspects a step further, extending contextual definition to allowed attributes. Its strength in this area contributes to its popularity amongst those using XML as part of an information modeling system.
  • XML Namespaces, although they have a superficial resemblance to contextually-defined elements and attributes, are actually a computer-use feature. Namespaces solve the problem of having more than one kind of document within a processing stream, and needing to distinguish between the names used in each document kind. They do this by namespace-qualifying the names in each document kind separately, as in the small sketch below.
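
A minimal sketch of that last point — the vocabularies, prefixes and URIs here are illustrative. Two elements are both called "number", but a consuming program tells them apart by their namespaces, not by where they sit in the document:

    <order xmlns:inv="http://example.com/ns/invoice"
           xmlns:shp="http://example.com/ns/shipping">
      <!-- same local name, different namespaces, no ambiguity for the program -->
      <inv:number>12345</inv:number>
      <shp:number>A-678</shp:number>
    </order>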

Where We Seem To Be Going

If it seems from the above analysis that the SGML and XML family are schizophrenic, then you understand the point I'm trying to make — they are.

For many purposes there's nothing wrong with this. XML's few and useful human-use features don't detract from its primary computer use, and they save having to have two kinds of markup languages that would otherwise be closely similar.

On the other hand, human-use features creeping into XML Schema languages is not a good thing, at least not in the way it's happening. What we're getting from that is markup languages that fit into the big gray fuzzy area between computer-use and human-use ones, with no real idea of what purpose they are serving.

Because they are neither fish nor fowl, none of these fuzzy schema languages satisfy well-defined requirements, and people will just keep inventing new ones, adding features they want, and removing those they don't. It's not that the new things they add — like W3C XML Schema's and Relax NG's contextually-defined elements — are bad ideas in any way in and of themselves, it's that they don't address real world requirements — they're just good ideas.

A major issue here is that while quite generalized computer-use requirements are easy to state and satisfy, even stating, much less satisfying, generalized human-use requirements is very much harder. Human-use requirements consequently tend to get addressed in a case-by-case manner — specific, rather than general, human-use requirements are much simpler to deal with. The difficulty of dealing with generalized human-use requirements for markup languages (or for most anything else, for that matter) means that we're not going to completely solve the problem for a long time yet, if ever. But it does mean we have to understand the difficulty and where the problems lie, or we'll go nowhere — or at least progress will be much slower than it need be.

Syntaxes For Schemas

A problem closely related to that of fuzzy-targeted markup languages is the kind of syntax used in schemas themselves:

  • Things started out with specification notation that was totally uncoupled from what it was describing: things like BNF and the similar syntax rules used to describe ASN.1 documents. These syntax-describing notations were intended primarily for human use, and weren't necessarily the notation used for the syntax definitions used by software parsers.
  • Next came the SGML-originated DTD syntax, which was more tightly coupled to the document. DTDs are intended for use both by humans and computers, and the notation is a compromise between those two kinds of requirements, plus the need to distinguish a DTD from the document it prefixes.
  • It was deemed by many that the DTD syntax is user unfriendly, and should be abandoned in favour of something more familiar, so XML Schemas are themselves written in an XML-defined language.
    As well, it was recognized that, because a schema describes a class of documents, placing it in each instance of the class is inappropriate. This is not the same as packaging the schema with the instance for transmission purposes, which is a good thing — when of use.
    Defining schemas using an XML language has some benefits:
    • ease of processing the schema,
    • separation of the schema from the documents it describes, and
    • using a notation that doesn't have to be learnt prior to learning what it notates.
  • XML-notated schemas, on the other hand, quickly become hard to read — the markup is "noisy". So there has started to be a move toward using more human-friendly alternatives and translating them into the "normative" XML notation for use.
    An example of this is Relax NG, which James Clark — himself one of its designers — says he never writes in its normative XML notation; instead he has written a translator from a C-like compact notation to the normative one, and uses that. (A short comparison of the two notations follows this list.)
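
For a sense of the difference, here is a small pattern of the kind used in the Relax NG tutorial — an address book of cards, each with a name and an email — first in the compact, C-like notation, then in the normative XML notation. Treat the fragment as a sketch; the element names come from that tutorial example.

    # Compact notation
    element addressBook {
      element card {
        element name { text },
        element email { text }
      }*
    }

    <!-- The same pattern in the normative XML notation -->
    <element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
      <zeroOrMore>
        <element name="card">
          <element name="name"><text/></element>
          <element name="email"><text/></element>
        </element>
      </zeroOrMore>
    </element>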

So we seem to have gone full circle, moving back to human-use languages for schemas. And maybe further in that direction for other notations.

Where We Should Be Going — IM(NS)HO

(IM(NS)HO: "In My (Not So) Humble Opinion")

Ideally, our standard markup languages and markup language definition languages (a.k.a. schemas) should be clearly focused as to the requirements they are addressing. Human use vs. computer use is the first division we can make between requirement sets.

It turns out, except for that problematical DTD syntax, XML as it now stands ain't so bad. It's very close to being an as-good-as-it-gets clear-text computer-use oriented markup language syntax. So that requirement is satisfied.

Schemas for computer use are somewhat satisfied by the current generation of XML Schema and related work. The XML syntax is fine for computer use. But work is needed in two areas:

  • Harmonization is required between the different schema languages. (Yes, yes, I know. That's a motherhood and apple pie of a statement, but it's still true.)
  • As part of the harmonization, the human-use features — especially including white-space treatment, contextually-defined elements and optional variations — need to be removed from normalized computer-use schemas.

There is one area in which it's reasonable to expect quite a bit of activity yet — current claims of pending stability in the XML family of standards notwithstanding: human-use features in schema languages. This area is unlikely to settle down until people stand back and get a clearer view of what the requirements are. Until that happens we'll continue to have what we have now — an increasing mixture of different schema languages, each with something of use, but always with desirable things missing.

SGML has some things to teach us in this area, but SGML isn't the solution — its shorthand definition mechanisms have been rightly criticized — the amount of work you have to invest in them is too often more than is justified by what you get in return.

Apart from their intrinsic difficulty, human-use markup languages also have the problem that they are a small market, percentage-wise, compared to computer-use markup languages. But being relatively small doesn't mean the market is small in absolute terms. And it doesn't mean that we don't need much better support for human-use markup than standards now give us.

In the meantime, if you've got a need for human-use features, use what's the best fit of what's currently available:

  • If you really hate DTDs, Relax NG might be the best thing in the short term.
  • SGML is still the solution for some people.
  • Roll-your-own is still the solution of choice for many people, especially where the purpose-built language has an XML normalization for computer-to-computer communication.
