Microformats: Contaminants or Ingredients? Introducing MDL and Asking Questions

Liam Quin

Abstract

The use of weakly-defined sets of HTML class attributes to identify semantic information has recently come to be called the use of microformats. Although the class attribute was first introduced into HTML in 1995, the act of naming its use in this way took another ten years. The result of microformats is HTML documents that are peppered with additional information in an uncontrolled manner: geographical information recorded using an HTML abbreviation element with a class attribute set to location, for example, or an emphasis element specialised to be the volume number of a journal in a bibliography.

Is the burgeoning adoption of microformats a good thing or a bad thing? Are these class attributes a poisonous contamination or are they an ingredient of a new sort of food? If they are a contamination, how do we eradicate them? If they are an ingredient, what is being cooked, and should we place an order?

This paper suggests that microformats are in fact both a contamination and an ingredient, and furthermore that microformats are an indication of growing awareness of the importance of retaining authorial intent. A language, MDL, the Microformat Definition Language, is presented, for the purpose of documenting microformats, of decontaminating HTML to produce namespace-polluted XML, and for providing validation checks for the decontaminated industrial waste that results. This language enables us to discuss the importance of grounding of microformats in URI-space, the mechanism used by the World Wide Web architecture to achieve scalability [TAG].

What better audience could there be to address a philosophical question about the relationship of XML, HTML, the Web, semantic markup and the changing nature of markup fashion than that of the Extreme Markup Conference? A discussion is both expected and anticipated.

Keywords: Markup Languages; Metadata

Liam Quin

Liam has been working with text and markup since the early 1980s, with SGML since 1987, and with XML since before it was called XML. He is now the XML Activity Lead (also known as Mr. XML) at the World Wide Web Consortium (W3C). His interests include digital representation of historical texts, barefoot hiking and schema co-occurrence constraints, and in his spare time he publishes images scanned from his collection of old and dusty books.

Liam has attended SGML and XML conferences starting with SGML 89 in Atlanta, America. He lives in a rural part of southern Ontario in Canada with his husband.

Liam Quin [W3C]

Extreme Markup Languages 2006® (Montréal, Québec)

Copyright © 2006 Liam Quin. Reproduced with permission.

Introduction: What are Microformats?

Microformats came out of the blogging community, and are a way to identify groups of XHTML elements as belonging to a domain-specific vocabulary. The best-known Microformats Web site, www.microformats.org, positions microformats as follows:

Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns (e.g. XHTML, blogging).

Although microformats most often use the class attribute to carry meaning, there are a number of other XHTML idioms, such as rel="tag" to identify a blogging search facet on a link, rel="nofollow" (taken from Google, which interprets it on an HTML link to imply a neutrality of endorsement by the page author of the linked resource), and others. It would be tempting, but overly cynical, to say that a microformat is anything easy to specify; we should look further than that, and say instead that a microformat is a name given to an idiom or pattern applied to XHTML. This definition allows us to include using a class="geo" attribute on an XHTML abbr element, but to exclude the general idea (or pattern) of using class attributes on abbr elements to represent something, as is done when one also wants a dotted line under the value and a pop-up tool tip when the mouse pointer is over the item.
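To make the class="geo" idiom concrete, here is a minimal sketch in Python of the kind of screen-scraper that such markup is meant to assist. The sample markup, the title value and the names GeoAbbrParser and find_geo are assumptions made for illustration; they are not taken from any microformat specification.

```python
from html.parser import HTMLParser


class GeoAbbrParser(HTMLParser):
    """Collect the title of every abbr element whose class list contains 'geo'."""

    def __init__(self):
        super().__init__()
        self.locations = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # class holds space-separated tokens, so split before testing.
        class_tokens = (attributes.get("class") or "").split()
        if tag == "abbr" and "geo" in class_tokens:
            self.locations.append(attributes.get("title"))


def find_geo(html_text):
    """Return the title values of all abbr elements marked with class geo."""
    parser = GeoAbbrParser()
    parser.feed(html_text)
    parser.close()
    return parser.locations


# Hypothetical sample: the title format shown here is an assumption.
SAMPLE = (
    '<p>Held in <abbr class="geo" title="45.5;-73.6">Montreal</abbr>, '
    'not <abbr title="for example">e.g.</abbr> elsewhere.</p>'
)
print(find_geo(SAMPLE))
```

The second abbr element is ignored because it carries no geo token, which is exactly the distinction the definition above draws between a named idiom and the general use of abbr.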

Evolution of Microformats

Dave Raggett had first proposed the HTML class attribute in March 1995 in his draft for HTML 3, which added support for style sheets by including a STYLE tag and a CLASS attribute. The latter was to be available on every element to encourage authors to give HTML elements styles, much as you do in desktop publishing [DSR98]. Eleven years later, support for style sheets is considered an essential part of a Web browser, and Cascading Style Sheets, by far the most common styling mechanism on the Web, are very widely used.

Although the class attribute has been kicking around since 1995, and although there were even commercial products (such as SoftQuad HiP) that made use of it, it was CSS that gave the attribute a more widely perceived purpose.

The original design of the World Wide Web included the notion of typed links [TBL90a] ; that is, the idea that one would annotate a link with the relationship it represented. This was made explicit in HTML using the rel and rev attributes to linking elements. An Internet Draft listing a set of values for these attributes that had been found useful in practice was published in December 1995 by Murray Maloney and Liam Quin [MQ95], but there was little interest at the time, and the draft expired.

An obvious problem with allowing only a fixed set of labels to indicate link roles is that people may have needs for new link roles. The list must be extensible, and must also make clear which labels are intended to be ad-hoc or experimental, and which are intended to be taken from a standard globally shared list. A common convention is to use X- in front of ad-hoc names, but this means there must be a standard registry for the other names. In 1995 the nearest thing to a standards body for specifying such things was the IETF, but their registries tend to be slow moving.

The emergence onto the Web of specialist markup languages such as CML [Chemical Markup language] [PMR95], which had previously been common in SGML circles, raised the awareness of many Web people at the time that in fact the HTML DTD would not answer everyone's needs indefinitely. At a venue such as Extreme Markup, attended by many of the world's top markup experts, this assertion may seem surprising, but one might easily recall that many HTML users were unaware of SGML, and many others were thinking that the Web would obsolete what they saw as more complex markup.

CML [Chemical Markup Language] helped give impetus to the creation of XML, and it should not be forgotten that Peter Murray-Rust was actively involved in that work, and created the xml-dev mailing list to help. The Chemical Markup Language is a good example of a problem domain that was not well-suited to extending HTML with class attributes (even had that idea been widely known at the time), partly because of the desire to associate very specific browser behaviour (such as 3D molecule visualisation) with elements, partly because of the need to do validation, and partly because of the need to do other processing. We shall return to the issue of validation later.

Large SGML vocabularies such as DocBook [NWLM99] and that of the TEI [Text Encoding Initiative] [TEI] have used qualifying attributes on elements in SGML since the early 1990s, and more recently in XML, with role and type respectively.

The hope of some of the developers of the original XML specification that XML documents would gradually replace HTML was always unrealistic in this writer's view, but in any case such hope quickly faded. Over time, XML and HTML communities continued to develop separately.

Some ten years later, blogging interfaces have appeared: Web forms, often heavily scripted, that let people write diary entries and post them onto a Web page. People want to include simple markup, emoticons, graphics and animations in their diary entries. In addition, people want to scrape screens; that is, they want to write programs that process diaries, or blogs, to extract specific information. To facilitate this, people can include extra markup in their blogs that is still valid HTML but that assists the screen-scrapers. It is unfortunate that the notion of valid HTML in many people's minds precludes the use of extra elements or modified DTDs.

In addition, the increased awareness that use of CSS [Cascading Style Sheets] can help reduce the bandwidth a Web site uses, make the site easier to maintain, and make it look and work better, has also led to an increased use of the HTML class attribute in markup.

By 2002, with the increased use of JavaScript, and increased availability of implementations of the W3C DOM [Document Object Model] [DOM98], Web programmers were beginning to see some of the benefits of separating code from markup. Stuart Langridge introduced the term Unobtrusive JavaScript to describe the idea that behaviour should be triggered by markup rather than included directly in source documents [SL02]. This approach allows people to share JavaScript libraries to handle common idioms.

The significance of Unobtrusive JavaScript may have escaped the XML community, since it was to them nothing new, but in fact a starting point. The concept of separating out the logical structure of a document and any processing or behaviour had already been built in by design to SGML (e.g. see section 4.1.1 of the SGML standard, most conveniently accessible in the SGML Handbook [CFG90]). But to Web programmers used to thinking about, for example, the <BR> command, it opened new realms of possibilities.

The main driving forces behind the appearance of microformats, then, were the increased use of CSS [Cascading Style Sheets], the desire for more accurate screen-scraping of HTML (or XHTML) Web pages, and the desire to associate specific behaviour in Web browsers with elements based only on declarative markup.

All of the pieces for microformats were already in place and in widespread use when the term was coined: all that is new is a name. But a name can be a powerful thing.

Strengths of Microformats

Microformats are decentralised: anyone can make one, without getting permission. This, of course, is also true of HTML and the Web, and is a large part of the success both of the Web and of microformats.

Microformats are also simple, both to describe and to use. In particular, and this is crucial to understanding the culture from which they arose, there is no need to perform formal analysis in advance: you can type a Web page into your favourite text editor and add attributes whenever you feel like it. This is a strength, in that it is immediate, something now in a world characterised by the attention span of a scrollbar.

Microformats can be displayed in an existing Web browser: they use existing HTML elements that the browsers already know how to render.

Standardised markup idioms also facilitate the sharing of code libraries to augment browser behaviour, the so-called unobtrusive JavaScript mentioned earlier.

Disadvantages of Microformats

Microformats are decentralised, but also have no global identifiers, no equivalent to the namespace URI: name conflicts seem destined to occur with increasing frequency.

In order to try to reduce possible conflicts, there is a single repository for microformats; clearly this is not scalable; neither is it universally recognized as authoritative.

Documents marked up with microformats are, in effect, superimposing a second grammar onto the markup. But this is hard to validate, not only because there are no published schema fragments, but also because existing schema languages for XML do not support the idea of a grammar constraining attribute values on essentially arbitrary elements. Where they do have some such support (such as Schematron), the possibility of a single element having multiple (space-separated) values for class makes validation difficult.
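The idea of a second grammar superimposed on class values can be sketched in a few lines of Python. The rule table below is hypothetical (it says only that a page token must sit inside an element carrying a chapter token), and the code assumes well-formed input without namespaces; it is an illustration of the difficulty, not a proposal for a schema language.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: each microformat token maps to the token that must
# appear on some ancestor element (None means it may appear anywhere).
REQUIRED_ANCESTOR = {"chapter": None, "page": "chapter"}


def class_tokens(element):
    """Split the space-separated class attribute into individual tokens."""
    return element.get("class", "").split()


def check(element, inherited=frozenset(), problems=None):
    """Walk the tree and report tokens that break the ancestor rule."""
    problems = [] if problems is None else problems
    tokens = class_tokens(element)
    for token in tokens:
        # Tokens outside the rule set (styling hooks, say) are ignored.
        needed = REQUIRED_ANCESTOR.get(token)
        if needed is not None and needed not in inherited:
            problems.append(
                f"'{token}' on <{element.tag}> needs an ancestor "
                f"with class '{needed}'"
            )
    for child in element:
        check(child, inherited | set(tokens), problems)
    return problems


GOOD = '<div class="chapter wide"><p>See <span class="page red">12</span></p></div>'
BAD = '<div class="wide"><span class="page">12</span></div>'
print(check(ET.fromstring(GOOD)))
print(check(ET.fromstring(BAD)))
```

Every test has to begin by splitting the attribute value into tokens and discarding the ones that belong to style sheets, which is the step that grammar-based schema languages have no natural way to express.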

When markup is used to enable behaviour, the document author may fall into a trap of thinking about the behaviour rather than the relationships being modeled. This is subtle but can lead to problems when the JavaScript library is changed or replaced, or when other uses for the content arise. The distinction between making a document to achieve a particular effect in a particular piece of software and making a document to represent properties of information is an important one, although of course either purpose can be equally valid depending on context.

Contamination

Is XHTML a semantic language or a presentational one? The intent is a part of each. If anything, the semantics of HTML have always been closest to the needs of computer documentation. The class attribute was originally designed not for object-oriented subclassing but for use in style sheets [DSR98].

The values of the class attribute do not have meaning defined by XHTML. Nor are they defined to have (or not to have) meaning. Meaning is sneaking in without the computer noticing. Subversive semantics.

Microformats help people to avoid using XML. They are at the same time subverting HTML. Microformats are a contamination.

Ingredients

XHTML is mostly a presentational language, used for rendering. Web designers are longing for more. Microformats offer a way to go beyond presentation.

Microformats are a sprinkling of flavour in a bland meaningless soup. They are one step closer towards using XML. They are part of a bigger picture, helping computers to exchange information. They are ingredients.

Turning contamination into ingredients

Whatever one's opinion, it is clear that one can write an XSL Transformation to turn microformat-contaminated XHTML documents into XHTML documents with extra XML elements in them, floating in some alien namespace.

Two plausible approaches spring to mind:

  1. Using QNames to establish connections between the names used in microformat-based markup idioms and URIs;
  2. Without changing the markup idioms, using an external transformation.

The first approach I shall call in this paper namespaced microformats. The syntax of namespaces is not beautiful. There are surely some who would claim that if HTML documents that use microformats are the teenagers of Markup, namespaced microformats would represent their acne. More to the point, people already struggling with HTML or XHTML markup might be resistant to adding extra namespace declarations and colons in their documents unless they perceive some immediate benefit. But such short-term benefit is there, since JavaScript libraries would no longer be in as great a danger of interfering with each other.

It might be that establishing a convention for finding out about a microformat from its namespace would also help encourage people to use namespaces.

There are two drawbacks with this approach, though:

  1. It involves using QNames In Content, an XML practice that is known to lead to problems. In particular, transforms that alter the prefix of namespace bindings will not in general change the prefix in the content where it occurs. This is mitigated in XSLT 2, because the XHTML class attribute can be declared appropriately and the XSLT processor can perform automatic namespace fixup. But not everyone is using XSLT 2, and it has not to date been implemented in Web browsers.
  2. The W3C CSS Recommendations do not include support for namespaces.
    Since a large part of the motivation of the HTML class attribute is for document styling with CSS, this is a problem. Fortunately, common browsers today do work with namespaces, as long as one uses fixed prefixes. This means that the namespace prefix must be considered a fixed part of the microformat, and since prefixes are not guaranteed to be robust across transformations or other XML processing, this sounds like a recipe for much confusion.

None the less, the idea is simply to change <dt class="title"> into <dt class="x:title"> where x is a bound namespace prefix.

We can now no longer match our dt element using standard CSS, but if we are willing to go out on a limb and use a Working Draft, we can do:

      @namespace lq url(http://www.holoweb.net/~liam/xml/01);
      dt.lq\:title { color: red; }

This approach lets us make the microformat names globally unique; as a Semantic Web person might say, they are grounded in URI space. Now my book titles, image titles and person titles (Mr, Mrs) don't conflict but I still can't validate as I'd like.
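What grounding in URI space means for a processor can be shown with a short Python sketch. The function name expand_class_tokens is invented for this illustration, and the prefix bindings are passed in explicitly as a dictionary (an assumption; a real processor would read them from the namespace declarations in scope).

```python
LQ_NS = "http://www.holoweb.net/~liam/xml/01"


def expand_class_tokens(class_value, bindings):
    """Expand each class token into a (namespace URI, local name) pair.

    Unprefixed tokens get None as their namespace, so ordinary styling
    classes pass through untouched.
    """
    expanded = []
    for token in class_value.split():
        prefix, colon, local = token.partition(":")
        if not colon:
            expanded.append((None, token))
        elif prefix in bindings:
            expanded.append((bindings[prefix], local))
        else:
            raise KeyError(f"no namespace bound to prefix '{prefix}'")
    return expanded


print(expand_class_tokens("lq:title red", {"lq": LQ_NS}))
```

The same sketch also shows the QNames In Content hazard: if some intermediate tool rebinds the namespace to a different prefix but leaves the attribute value alone, the lookup fails, because nothing tells the tool that the class value contains a prefix.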

Without entirely abandoning the approach of namespaced microformats, we should stop and try the second approach, that of an external transformation. The idea here is that, for the purpose of validation, we transform the input XHTML and produce XHTML with explicit non-HTML elements corresponding to the markup idioms defined by the microformats in use.

This sounds like a job for XSLT. We must be careful, however, not to activate any of XSLT's namespace fixup, as otherwise namespace prefixes may change, and people's CSS may stop working.

A transformation from, for example, <div class="vcard"> to <v:card> is sufficiently straightforward not to need sample XSLT in this paper; the implied strategy does not handle all cases of markup idioms, and in particular cannot handle an element that participates in more than one microformat. But it is enough to enable simple schema validation.

The simple and obvious XSLT to do this work needs to be edited for each new microformat. Note also that one wants to leave style-based class values unchanged, to cope with elements with multiple class values, and also to handle other sorts of microformat. None the less, this approach starts to migrate documents toward the Higher and more Spiritually Satisfying essence of True markup. Or, more pragmatically, it lets you validate the document and process it more easily with an XML Query engine and other tools.
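For readers who would like to see the shape of such a transformation, here is a minimal sketch written in Python rather than XSLT. The namespace URI bound to the v prefix is hypothetical, the input is assumed to be well-formed and (for brevity) not in the XHTML namespace, and, like the strategy described above, the sketch renames an element for the first matching token only.

```python
import xml.etree.ElementTree as ET

VCARD_NS = "http://example.org/ns/vcard"  # hypothetical namespace URI

# One entry per microformat token; this table must grow with each new microformat.
RENAMES = {"vcard": f"{{{VCARD_NS}}}card"}


def lift_microformats(root):
    """Rename elements whose class list contains a known microformat token."""
    for element in root.iter():
        tokens = element.get("class", "").split()
        for token in tokens:
            if token in RENAMES:
                element.tag = RENAMES[token]
                # Style-based class values are left in place.
                remaining = [t for t in tokens if t != token]
                if remaining:
                    element.set("class", " ".join(remaining))
                else:
                    del element.attrib["class"]
                break  # an element in two microformats is not handled
    return root


ET.register_namespace("v", VCARD_NS)
doc = ET.fromstring(
    '<body><div class="vcard sidebar"><span class="fn">Liam</span></div></body>'
)
print(ET.tostring(lift_microformats(doc), encoding="unicode"))
```

The RENAMES table is the part that has to be edited by hand for every new microformat, which is the maintenance burden noted above.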

MDL: Microformat Definition Language Introduced

The previous section introduced a way to convert HTML containing class-based microformats into an equivalent representation that is more amenable to XML technology. But the method shown required programming, and microformats are simple, and appeal to people who might not wish to follow such a route. In this section we introduce MDL [Microformat Definition Language], a very simple and declarative way to associate both automatic processing and human-readable documentation with a microformat. It should be stressed that the purpose of MDL here is as a thought experiment: it is perfectly implementable, but that is not the point. Conceptually, a simple (and unchanging) XSL Transformation takes an XHTML file and one or more MDL files and uncontaminates the XHTML to produce namespace-polluted XHTML.

      <mdl version="1.0">
        <microformat version="0.6">
          <name>book</name>
          <author>Liam Quin</author>
          <class>
            <inhtml>chapter</inhtml>
            <inxml>chapter</inxml>
            <notes>class=chapter occurs on a div element to indicate
              a chapter of a book.</notes>
          </class>
          <class>
            <inhtml>page</inhtml>
            <inxml>folio</inxml>
            <notes>a span with class=page turns into an XML folio element.</notes>
          </class>
          <class>
            <inhtml>abstract</inhtml>
            <inxml>abstract</inxml>
            <isalso>chapter</isalso>
            <notes>a div with class="chapter abstract" turns into an abstract
              element and the chapter class value can be discarded.</notes>
          </class>
        </microformat>
      </mdl>
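As a rough illustration of how little machinery the MDL sample implies, the following Python sketch reads a trimmed copy of it and applies it to a document. The paper's conceptual processor is an XSL Transformation; Python is substituted here only for brevity. The namespace URI, the function names and the rule for choosing between competing tokens (prefer the entry whose isalso values are present) are all assumptions, since MDL as presented does not define them.

```python
import xml.etree.ElementTree as ET

BOOK_NS = "http://example.org/ns/book"  # hypothetical namespace URI

# A trimmed copy of the MDL sample above (author and notes omitted).
MDL = """<mdl version="1.0">
  <microformat version="0.6">
    <name>book</name>
    <class><inhtml>chapter</inhtml><inxml>chapter</inxml></class>
    <class><inhtml>page</inhtml><inxml>folio</inxml></class>
    <class><inhtml>abstract</inhtml><inxml>abstract</inxml>
      <isalso>chapter</isalso></class>
  </microformat>
</mdl>"""


def load_rules(mdl_text):
    """Map each HTML class token to (XML element name, tokens to discard)."""
    rules = {}
    for entry in ET.fromstring(mdl_text).iter("class"):
        html_name = entry.findtext("inhtml").strip()
        xml_name = entry.findtext("inxml").strip()
        also = {e.text.strip() for e in entry.findall("isalso")}
        rules[html_name] = (xml_name, also)
    return rules


def decontaminate(root, rules, namespace=BOOK_NS):
    """Rename elements according to the rules, keeping unrelated class tokens."""
    for element in root.iter():
        tokens = element.get("class", "").split()
        matches = [t for t in tokens if t in rules]
        if not matches:
            continue
        # Assumed tie-break: prefer the rule with most isalso tokens present.
        chosen = max(matches, key=lambda t: len(rules[t][1] & set(tokens)))
        xml_name, also = rules[chosen]
        element.tag = f"{{{namespace}}}{xml_name}"
        leftover = [t for t in tokens if t != chosen and t not in also]
        if leftover:
            element.set("class", " ".join(leftover))
        else:
            element.attrib.pop("class", None)
    return root


doc = ET.fromstring(
    '<body><div class="chapter abstract">Summary</div>'
    '<div class="chapter"><span class="page red">12</span></div></body>'
)
ET.register_namespace("b", BOOK_NS)
print(ET.tostring(decontaminate(doc, load_rules(MDL)), encoding="unicode"))
```

The code itself never changes when a new microformat appears; only the MDL file does, which is the property the unchanging transformation is meant to have.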

Advantages of MDL

With MDL, users can reliably arrive at XML documents without having to edit an XSLT stylesheet. They can also do some validation on the result. One might, for example, use W3C XML Schema to say that the textual content of a particular XML element must match a regular expression.
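The kind of check meant here can be imitated in Python for illustration; it stands in for, and is not, a W3C XML Schema pattern facet. The pattern itself (a folio is either arabic digits or lower-case roman numerals) is a hypothetical constraint chosen for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical constraint: arabic digits or lower-case roman numerals.
FOLIO_PATTERN = re.compile(r"[0-9]+|[ivxlcdm]+")


def invalid_folios(root):
    """Return the text of every folio element that fails the pattern."""
    return [
        element.text
        for element in root.iter("folio")
        # fullmatch is used because schema patterns apply to the whole value.
        if not FOLIO_PATTERN.fullmatch((element.text or "").strip())
    ]


sample = ET.fromstring(
    "<chapter><folio>12</folio><folio>xiv</folio><folio>page 3</folio></chapter>"
)
print(invalid_folios(sample))
```

Such a check is only straightforward once the value has an element of its own; while it is still one class token among several on a span, there is nothing convenient to hang the constraint on.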

Splitting content of essentially different types into different elements will also help XML Query-based repositories to index the data efficiently. Mixing dates, geographic locations, names and sock colours in the same element precludes an index for that element that is anything other than string-based, for example.

A reverse transformation is clearly possible, and one can imagine taking an XML document and one or more MDL documents and generating XHTML that is contaminated with embedded microformats.
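A sketch of the reverse direction, again in Python and again under assumptions: the MDL sample records the HTML element to use only in its human-readable notes, so the table below, which says which HTML element and class tokens each XML element becomes, is hypothetical, as is the namespace URI.

```python
import xml.etree.ElementTree as ET

BOOK_NS = "http://example.org/ns/book"  # hypothetical namespace URI

# Hypothetical reverse table: XML local name -> (HTML element, class tokens).
REVERSE = {
    "chapter": ("div", "chapter"),
    "folio": ("span", "page"),
    "abstract": ("div", "chapter abstract"),
}


def recontaminate(root, namespace=BOOK_NS):
    """Turn namespaced elements back into HTML elements with class tokens."""
    prefix = f"{{{namespace}}}"
    for element in root.iter():
        if not element.tag.startswith(prefix):
            continue
        local = element.tag[len(prefix):]
        if local in REVERSE:
            html_tag, classes = REVERSE[local]
            element.tag = html_tag
            # Microformat tokens go first; existing style tokens are kept.
            existing = element.get("class", "").split()
            element.set("class", " ".join(classes.split() + existing))
    return root


xml_doc = ET.fromstring(
    '<body xmlns:b="http://example.org/ns/book">'
    '<b:chapter><b:folio class="red">12</b:folio></b:chapter></body>'
)
print(ET.tostring(recontaminate(xml_doc), encoding="unicode"))
```

That a separate table is needed at all suggests that a fuller MDL would want the target HTML element recorded in a machine-readable field rather than in the notes.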

The MDL file also acts as a place to contain some simple documentation. Since microformats are generally very simple indeed, this is likely to be sufficient for most uses.

In some ways, MDL builds on the HTML Profile concept that is used, for example, by XMDP [XHTML Meta Data Profiles] [XMDP]. XMDP is too simplistic a profile format for our purposes, as it merely contains a list associating natural language text with identifiers. But we could use a similar mechanism to link between an XHTML document and one or more MDL descriptions.

The Micromodels work should also be mentioned at this point. Micromodels are an attempt to use RDF [Resource Description Framework] [RDF04] to describe certain aspects of Microformats [MM]. There is a Wiki Web page to link together microformats, Micromodels, and XSL transformations that extract RDF from documents marked up with a given microformat [ESW05]. But one doesn't always want to get RDF, and there is still no clear way to process HTML containing microformats based solely on the markup. What is needed is automatic discovery of tools to process microformats, and this is what MDL is intended to provide.

Disadvantages of MDL

The biggest single problem with MDL is that the same people who see a need for such a thing are probably exactly those people who would also be happier editing an XSL Transformation than learning some new language, however simple.

If there were a large number of microformats in use, we are also back to a scalability problem: how to manage 1,000 or more MDL files? Perhaps one could define a Microformat Aggregation Language. But at that point maybe the microformat people will move en masse to using XML with namespaces.

MDL as presented here is a thought experiment. It is, of course, easy to write code in XML Query or XSLT to read an MDL definition and generate XSLT that, given an XHTML document, grounds the idiomatic markup in that document in URI space and enables further processing.

Questions Raised

The primary purpose of this paper is to bring some questions before the markup community. There are no conclusions here, only questions:

  1. Should the traditional markup community ignore the Webheads, or react?
  2. If we do react, how should we react?
  3. Do we try to find ways to make it easier to get into XML, and to use and understand semantic markup?
  4. Do we pressure vendors and the W3C to extend CSS to reduce or eliminate the need to abuse existing XHTML elements? Following this route, it should be possible to reproduce Web browser behaviour exactly and entirely on any element, using CSS. Of course, this is already possible using XSLT.
  5. How important is it that labels introduced by microformats be grounded in URI space? Should they be globally unique, or is the IETF-style rough consensus and running code enough even for the world of machine interchange?
  6. Should MDL be developed further? Making it more complex might defeat its purpose.
  7. The biggest question of all: do we try to help people who are not so far along the path, or do we stop and wait for them, or do we leave them behind? Or do we throw rocks at them from above and try to dislodge them?

Bibliography

[CFG90] Goldfarb, Charles F. The SGML Handbook, Oxford University Press 1990

[DOM98] W3C DOM Working Group: Document Object Model (DOM) Level 1 Specification, November 1998, http://www.w3.org/TR/REC-DOM-Level-1

[DSR98] Raggett, Dave, Raggett on HTML 4, Addison Wesley 1998; Chapter 2, quoted here, is online at www.w3.org/People/Raggett/.

[ESW05] MicroModels Wiki page, online at http://esw.w3.org/topic/MicroModels

[MM] XMDP: XHTML Meta Data Profiles http://gmpg.org/xmdp/

[MQ95] Maloney, Murray and Quin, Liam, Internet Draft: Hypertext Links in HTML, IETF Dec 1995 (expired June 1996). Online at http://www.holoweb.net/~liam/papers/1996-draft-ietf-html-relrev-00

[NWLM99] Walsh, Norman & Muellner, Leonard, DocBook: The Definitive Guide, O'Reilly, 1999 http://www.docbook.org/

[PMR95] Murray-Rust, Peter, Request to use HTML3 for new DTD (Chemical Markup Language), Mail to the IETF HTML Working Group in August 1995, online at https://listserv.heanet.ie/cgi-bin/wa?A2=ind9508&L=html-wg&T=0&P=51642

[RDF04] W3C, RDF Primer, 2004 http://www.w3.org/TR/rdf-primer/ According to www.w3.org/TR/, this document supersedes Resource Description Framework (RDF) Model and Syntax Specification published on 22 February 1999.

[SL02] Langridge, Stuart, Unobtrusive DHTML, and the power of unordered lists, November 2002, online at http://www.kryogenix.org/code/browser/aqlists/

[TAG] Jacobs, Ian and Walsh, Norman, Eds., Architecture of the World Wide Web, Volume One, W3C 2004, online at http://www.w3.org/TR/webarch/

[TBL90a] Berners-Lee, Sir Tim, HyperText Design Issue: Link Types 1990, online at http://www.w3.org/DesignIssues/LinkTypes.html

[TEI] Sperberg-McQueen, C. M. and Burnard, Lou (Eds.), Guidelines for Electronic Text Encoding and Interchange, TEI, 1991 and later. http://www.tei-c.org/

[XMDP] XMDP: XHTML Meta Data Profiles http://gmpg.org/xmdp/


