XMLR: XML Reduced: A Thought Experiment – or, Why I Demand Coherence

Liam Quin
liam@w3.org

Abstract

The Extensible Markup Language (XML) 1.0 Specification has a number of features that are rarely used, or are poorly defined, or that interact badly with common implementation strategies.

The widespread deployment of XML 1.0 has led to its adoption in many new areas. As new needs have arisen, new features have been added in ad hoc ways, beyond the scope of the original specification. This has led to incompatible approaches to solving similar problems.

This additional complexity becomes a burden upon both users and implementors. New specifications building upon XML find the areas in which the Extensible Markup Language is not extensible – for example, permitting only a single link to a DTD, or requiring the use of DTD syntax for entities – to be limiting and frustrating.

There have been a number of proposals to form a subset of XML 1.0, and to label that subset XML 2.0. This proposal takes an alternative approach: it implements all of the functionality of XML 1.0, but with a reduced syntax. The additional semantics of linking are expressed using a subset of the Resource Description Framework, RDF.

The result of combining some built-in support for metadata processing along with a simplified XML specification will be, it is claimed, specifications that are more powerful and higher-level, and will facilitate the creation of more powerful applications. In particular, widespread adoption of infrastructure with direct support for explicit handling of metadata will lead to human and business benefits in fields as diverse as web services, Topic Maps, metadata searching and the semantic web of the future.

Keywords: Metadata; RDF; Technology Adoption; Web Services

Liam Quin

Liam Quin is currently the XML Activity Lead at the World Wide Web Consortium. He has been using computers since 1976, holds a degree in Computer Science from Warwick University (UK), and first worked with SGML in 1987. Liam was very closely involved in the design and creation of XML, worked at SoftQuad Inc. as technical lead for both SoftQuad Panorama and SoftQuad HoTMetaL, has a keen interest in computing for the humanities, in typography and music, and in taking long barefoot walks in the country.

XMLR: XML Reduced

A Thought Experiment – or, Why I Demand Coherence

Liam Quin [XML Activity Lead; W3C]

Extreme Markup Languages 2002® (Montréal, Québec)

Copyright © 2002 Liam Quin. Reproduced with permission.

Introduction

The XML 1.0 Specification has a number of features that are rarely used, or are poorly defined, or that interact badly with common implementation strategies.

Such features include, for example, notations, entities and CDATA marked sections. They are discussed in more detail in later sections in this paper.

The widespread deployment of XML 1.0 has led to its adoption in many new areas. As new needs have arisen, new features have been added in ad hoc ways, beyond the scope of the original specification. This has led to incompatible approaches to solving similar problems.

Such features include links to style sheets and schemas being different from a link to a DTD, a link to an included fragment, a link to metadata, other metadata itself, and all of these differing in form or syntax from a hypertext link.

This additional complexity becomes a burden upon both users and implementors. New specifications building upon XML find the areas in which the Extensible Markup Language is not extensible – for example, permitting only a single link to a DTD, and requiring the use of DTD syntax for entities – to be limiting and frustrating.

We need to rework XML and the surrounding specifications to make them fit together better: to produce coherence between them.

There have been a number of proposals to form a subset of XML 1.0, and in some cases to label that subset XML 2.0. This proposal takes an alternative approach: it implements all of the functionality of XML 1.0, but with a reduced syntax. The additional semantics of linking are expressed using a subset of RDF.

The result of combining some built-in support for metadata processing along with a simplified XML specification will be, it is claimed, more powerful and higher-level specifications, and hence applications. In particular, widespread adoption of infrastructure with direct support for explicit handling of metadata will lead to human and business benefits in fields as diverse as web services, metadata searching and the Semantic Web of the future.

As the World Wide Web enters its adolescence, it must grow strong. Its parents gave it a syntax and also, through the Resource Description Framework (RDF), a form of self-expression. Like all adolescents, it must now rebel and find itself.

XML Mayhem and Mishap: The Situation Today

Before we start to suggest changes, we should understand what we already have. We must use a dispassionate and critical eye, so that we neither neglect the weaknesses of XML, nor disparage its strengths.

Relationship to SGML

The Extensible Markup Language was originally made by taking a subset of ISO 8879:1988, Standard Generalized Markup Language (SGML). Because of some difficulties in using SGML on the Web, and with the use of the Unicode character repertoire, the SGML Working Group at ISO made some changes to SGML, so that XML could remain a profile of SGML and yet still be useful on the Web.

The fact that every valid XML document is also a valid SGML document is important both for the sake of longevity of data, and for organizations with a policy of preferring technologies standardized by the International Organization for Standardization.

This leads us to reflect that we should not stray so far from SGML as to render XML inexpressible; yet we should also not forget that SGML can be changed. There can be little doubt that XML has far greater deployment, with very many more tools and much better support, than was ever the case for (pre-XML) SGML. However, there are still situations where XML is insufficient, and the greater flexibility and expressiveness of SGML is used. One possibility is that XML could subsume that greater expressiveness. For example, some uses of the SGML OMITTAG minimization feature are accommodated in XML using XSLT.

XMLR is written to remain syntactically compatible with SGML except where indicated.

Features Bolted On

Since it's pretty hard to change XML, we've added a number of facilities that overload the meaning of existing language features. Perhaps this really started when processing instruction syntax was used for the XML declaration at the start of an XML document, which led to the bug that if you start an XML document with a space, the XML declaration is instead parsed as a processing instruction with the reserved target xml. The specification didn't say that was illegal, so your document is processed, but the encoding declaration is now ignored, and you may silently get an incorrect data stream.

If the overloading of processing instruction syntax to represent the version of XML, the document encoding and other metadata seems strange, overloading a processing instruction as a link to another resource seems even stranger. Yet this is exactly what the W3C Recommendation Associating Style Sheets with XML Documents does.

Since processing instructions (like comments) are not considered part of the document's tree, you can't control them with a schema or DTD, and you can't (for example) restrict the set of stylesheets someone can use for a document; neither can you verify that the link is in place correctly. Those are, however, exactly the sort of use cases for which Schemas and DTDs are designed.
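
To make the point concrete, the following is a minimal sketch in Python, using the standard library's xml.dom.minidom, of what an application has to do to recover a style sheet link. The function name stylesheet_links and the regular expression are illustrative assumptions, and only double-quoted values are handled. The "attributes" of the processing instruction are a convention only, so the parser hands them over as one unparsed string and the application must pick it apart itself.

```python
import re
from xml.dom import minidom

DOC = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet href="mystyle.css" type="text/css"?>\n'
    '<sample>text</sample>'
)

def stylesheet_links(xml_text):
    """Return the pseudo-attributes of each xml-stylesheet processing instruction."""
    doc = minidom.parseString(xml_text)
    links = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # node.data is plain text; the name="value" pairs are not real attributes.
            pairs = re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data)
            links.append(dict(pairs))
    return links

print(stylesheet_links(DOC))
```

Nothing in a DTD or schema can say that the href must be present, or which style sheets are acceptable; any such check has to live in application code like this.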

The use of special-purpose attributes, such as xml:space and xml:lang, is less of a problem in that at least they can be controlled by DTDs and schemas.

The Internal Document Type Definition Subset

The internal document type subset is a part of an XML document that can reference an external Document Type Definition (DTD) using a Document Type Declaration with an identifier that's resolved externally (e.g. a URL or URN). It can also include element, entity and attribute definitions.

The syntax of the subset is such that it can only appear at the start of a document. In particular, you can't have an internal subset or doctype declaration on an external entity that you want to include, which means you can't use a DTD to validate that entity directly. You also can't include one XML document inside another directly, because an XML document can only have one internal document type declaration; it is instead necessary to merge the DTDs, a technically difficult problem.

The peculiar and hard-wired syntax for referring to an external DTD also only lets you refer to a single DTD, even though you might have multiple DTDs you want to use for different purposes. Experience with W3C XML Schema, RELAX NG and other validation languages has shown that use of multiple schema processes can be very useful.

The solution in XMLR is to consider an external DTD to be a resource to be linked to from an XML document in the same way as a stylesheet, as we shall see later.

Special Characters, Attributes and Entities

All XML documents use the Unicode character repertoire. Documents may be transmitted in a character encoding other than one specified by Unicode, for example ISO 8859-15 (Latin 1 augmented with the Euro). There are often characters that it is inconvenient to enter directly, or that the character set used for transmission or editing can't support. In these cases, character references such as &#38; may be used to refer to the character at position 38 (decimal) in the Unicode specification referred to by the W3C XML Recommendation.

Since people generally find names easier to remember and work with than numbers, people use entity names such as &amp; or &w-hat; to refer to particular characters. Apart from five names built in to XML, these names are user-defined, in a DTD. This means that they can be in the document author's own language, a clear requirement.
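
The three ways of naming a character can be seen side by side in a short Python sketch using the standard xml.etree.ElementTree module. The sample document and the helper name parsed_text are assumptions made for illustration; the entity w-hat is declared in an internal subset because, as noted above, it is not one of the five built-in names.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE p [
  <!ENTITY w-hat "&#373;">
]>
<p>AT&#38;T, AT&amp;T, &w-hat;</p>"""

def parsed_text(xml_text):
    """Parse the document and return the text of its root element."""
    return ET.fromstring(xml_text).text

# A numeric reference, a built-in name and a user-defined name,
# all resolved by the parser before the application sees the text.
print(parsed_text(DOC))
```

The application receives ordinary characters in every case; no trace remains of which form the author used.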

Unfortunately, there are a number of difficulties with entity definitions, of which the hardest is that, like the DOCTYPE declaration mentioned in the previous sub-section, a document declaring local entities can't be nested inside another document.

One way round this would be to allow any XML element to have its own internal document type definition subset. Another is to consider external resources declaring entities to be associated with XML namespaces in some way.

One important use case is to be able to refer to special characters, such as a superscripted letter, inside XML attribute values. We shall return to this point when we discuss attributes.

A trickier problem is that of associating sets of entity names with namespaces. For example, one widely used HTML and XHTML browser[1] automatically introduces the HTML predeclared entities (some 250 names) when the HTML namespace is encountered, and does the same for MathML, but although this seems highly desirable, it cannot be represented in XML, and documents that rely on this behaviour are not well-formed.
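
A quick check with a conforming parser illustrates the consequence. This is a minimal sketch in Python; the helper name is_well_formed is an assumption, and the XHTML fragment is invented for the purpose.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Report whether the standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# An HTML entity name used without any declaration: rejected.
print(is_well_formed('<p xmlns="http://www.w3.org/1999/xhtml">caf&eacute;</p>'))
# The same character written as a numeric reference: accepted.
print(is_well_formed('<p xmlns="http://www.w3.org/1999/xhtml">caf&#233;</p>'))
```

Declaring the namespace makes no difference to the parser; only a declaration in a DTD would make the first document acceptable.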

This is an issue for the W3C XML Core Working Group to investigate further, perhaps.

Processing Instructions and Comments

Processing instructions in XML are second-class citizens. The theory is that they exist outside the document tree, and also do not participate in the data, but are there to instruct some proprietary processor or other what to do. For example, you might have <?font start italic?> and <?font end italic?>.

Of course, you might well want to constrain documents, for example, to ensure that there are no page breaks inside the abstract. In that case, you'd rather have a processing namespace, <proc:font start="italic" />, and since you can then use a schema or DTD to constrain the occurrence of such processing-instruction-like elements, that is what XMLR suggests.
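
Once processing hints are elements, even a few lines of ordinary tree code can enforce such a constraint. The sketch below is in Python; the namespace URI, the proc:pagebreak element and the function name violations are all hypothetical, chosen only to illustrate the idea. A schema or DTD would express the same rule declaratively.

```python
import xml.etree.ElementTree as ET

PROC_NS = "http://example.org/proc"  # hypothetical namespace URI

DOC = f"""<paper xmlns:proc="{PROC_NS}">
  <abstract>Short <proc:pagebreak/> summary.</abstract>
  <section>Body <proc:font start="italic"/>text<proc:font end="italic"/>.</section>
</paper>"""

def violations(root, container, forbidden):
    """Return each container element that holds a forbidden processing element."""
    return [c for c in root.iter(container)
            if c.find(f".//{forbidden}") is not None]

root = ET.fromstring(DOC)
pagebreak = f"{{{PROC_NS}}}pagebreak"
print(len(violations(root, "abstract", pagebreak)))  # page breaks found in abstracts
```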

It is also worth pointing out that there is no way to escape the closing ?> sequence inside the text of a processing instruction, and entities are not processed within them by an XML processor. These arbitrary restrictions go away if elements are used.

XML comments should be able to contain the string --; that they cannot is an artifact of the way XML was represented in SGML, and not a language design feature.

Notations

XML Notations were taken from SGML, which has a fundamentally different processing model. The SGML Handbook suggests that the declaration of a notation in a DTD could be the name of an external program used to handle non-SGML data, which would presumably be invoked automatically when the document was processed in some way. One must always assume that data obtained over the network, especially from an external site, may be malicious. A DTD loaded remotely might name a command to format your disk, or to remove files or install a virus.

Security issues aside, there is an architectural problem with XML notations. In an SGML world, the data is King (or Queen): if a document says that an image is in TIFF format, then the software can believe it. But on the World Wide Web, using the HyperText Transfer Protocol (HTTP), we have to deal with content negotiation, a process by which a browser says which content types it understands, in order of preference, and the server sends the data in the most appropriate format.
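
For readers unfamiliar with content negotiation, here is a deliberately simplified sketch in Python of the server's side of the exchange. The function names are assumptions, and real HTTP negotiation has further rules (for instance, a more specific media range takes precedence over a wildcard) that are ignored here.

```python
def parse_accept(header):
    """Return (media_range, quality) pairs from an HTTP Accept header value."""
    preferences = []
    for item in header.split(","):
        fields = [field.strip() for field in item.split(";")]
        quality = 1.0  # the default when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                quality = float(field[2:])
        preferences.append((fields[0], quality))
    return preferences

def matches(media_range, media_type):
    """True if a range such as image/* or */* covers the concrete type."""
    if media_range == "*/*":
        return True
    if media_range.endswith("/*"):
        return media_type.startswith(media_range[:-1])
    return media_range == media_type

def choose(header, available):
    """Pick the available type the client rates highest, or None."""
    best, best_quality = None, 0.0
    for media_type in available:
        quality = max((q for r, q in parse_accept(header)
                       if matches(r, media_type)), default=0.0)
        if quality > best_quality:
            best, best_quality = media_type, quality
    return best

print(choose("image/png, image/tiff;q=0.5", ["image/tiff", "image/png"]))
```

The inputs to choose are the client's stated preferences and the formats the server happens to hold; nothing in the document itself takes part in the decision.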

To handle content negotiation, current XML practice is to include a list of MIME media content types in the notation declaration. But this is architectural nonsense: the list of media types comes from the software processing the data, not from the data.

XML Notations do not (as specified and designed) fit into the World Wide Web architecture. They should be removed from XML entirely, and an alternative mechanism substituted.

Minimization

An early decision made in the development of XML was that there would be no minimization or shorthand features. One single shorthand convenience feature did sneak past the vigilance of the working group: CDATA marked sections.

CDATA sections have a syntax too ugly to reproduce on these erudite (and possibly virtual) pages, and that can't nest, and in which there's no escaping. This means that programs generating CDATA sections have to check for the sequence that ends a CDATA section, and, if they find it, end the CDATA section, escape the sequence, then start a new section. This is almost always more complex than scanning for <, > and & characters, so the main use of CDATA sections is for human authors.
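
The difference in effort can be sketched in Python. The function name as_cdata is an assumption, and the splitting trick shown (breaking the terminator across two adjacent sections) is one common variant of the end, escape and restart procedure just described, not the only one.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def as_cdata(text):
    """Wrap text in CDATA, splitting any terminator across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

SAMPLE = "if (a[b[0]]>c && d<e)"

# Both forms parse back to the same character data.
via_cdata = ET.fromstring("<x>" + as_cdata(SAMPLE) + "</x>").text
via_escape = ET.fromstring("<x>" + escape(SAMPLE) + "</x>").text
print(via_cdata == via_escape == SAMPLE)
```

A generator that simply escapes the three special characters never needs to look for a multi-character sequence at all.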

If data is to be treated fundamentally differently, it should be marked up differently. One possibility to address CDATA sections would be to add syntax such as <element [attributes] {content}element>.

If character references were still recognised in content, escaping } with a character reference would be sufficient for software generating output. But it's not clear that this is better than simply removing the CDATA section feature.

Summary

This section has discussed a number of places where XML has some warts. A standard is about consensus: it's about getting everyone to use the same specification, so that we get interoperability. It is not, first and foremost, about technical elegance, although that can certainly help to encourage adoption.

XML has warts, and it's possible that a future version of XML could address some of these warts, and others not mentioned here. The purpose of XMLR is to put some of the issues down in writing, and also to experiment with a particular strategy for addressing some of those warts, and that's the subject of the next section.

Relationships in XML and XMLR

XML documents, whether for human consumption or for machines, whether a chapter from a novella or a description of a web service, or a remote procedure call, typically have several components:

  1. The document itself, that is, the text or marked up data
  2. A document type definition (DTD)
  3. External resources defining text entities
  4. External document fragments, included as entities or with XInclude
  5. One or more W3C XML Schemas
  6. RELAX NG, Schematron or other schemas
  7. One or more CSS style sheets
  8. One or more processing pipeline specifications, perhaps using XPipe, a submission by Sun Microsystems and others published by the W3C as a Note
  9. Metadata, such as author, date, version, and content management information
  10. Other external resources, such as images
  11. Linked documents, such as the remote targets of hypertext links, that may be considered part of a related publication or document set

Figure 1: Sample XML document

A document management system will typically need to understand all of these relationships; an editor, browser or viewer might need fewer of them, and publishing software might simply deal with a merged XML Information Set (Infoset) and not need to track where fragments were included.

Consider a sample XML document that has a DTD, some entity files, three included chapters and an image, and that links to another document. Such a document is represented diagrammatically in Figure 1.

We could annotate the relationships between the various entities depicted in Figure 1, and then we might arrive at Figure 2.

Figure 2: Relationships in and around a sample XML document

We could list the relationships depicted using RDF triples, as follows:

    <theDocument> <http://www.w3.org/XMLR/displays> <image1>
    <DTD> <http://www.w3.org/XMLR/includes> <dtdFragment>
    <DTD> <http://www.w3.org/XMLR/uses> <entity3>
    <DTD> <http://www.w3.org/XMLR/uses> <entity2>
    <DTD> <http://www.w3.org/XMLR/uses> <entity1>
    <theDocument> <http://www.w3.org/XMLR/includes> <inc1>
    <theDocument> <http://www.w3.org/XMLR/includes> <inc2>
    <theDocument> <http://www.w3.org/XMLR/includes> <inc3>
    <theDocument> <http://www.w3.org/XMLR/conformsTo> <DTD>
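
Triples in this form are easy to query. As a minimal sketch, assuming nothing more than a Python list of tuples (the function name match and the use of None as a wildcard are illustrative choices, and the URLs remain the illustrative ones used above):

```python
XMLR = "http://www.w3.org/XMLR/"

TRIPLES = [
    ("theDocument", XMLR + "displays", "image1"),
    ("DTD", XMLR + "includes", "dtdFragment"),
    ("DTD", XMLR + "uses", "entity3"),
    ("DTD", XMLR + "uses", "entity2"),
    ("DTD", XMLR + "uses", "entity1"),
    ("theDocument", XMLR + "includes", "inc1"),
    ("theDocument", XMLR + "includes", "inc2"),
    ("theDocument", XMLR + "includes", "inc3"),
    ("theDocument", XMLR + "conformsTo", "DTD"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which fragments does the document include?
included = [o for _, _, o in match(TRIPLES, s="theDocument", p=XMLR + "includes")]
print(included)
```

The same question asked of the XML 1.0 syntaxes would need separate code for each kind of link.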
  

Of course, we should also write an RDF schema, but this paper does not go so far down that road; it's a thought experiment, and we can break off our thoughts wheresoever we choose. Figure 3 shows a graphical representation in the RDFAuthor program, and one possible RDF representation is shown below. It must be emphasized that the URLs used here to show relationships are illustrative only.

Figure 3: Relationships shown in the RDFAuthor package

<?xml version='1.0'?>
<rdf:RDF
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xmlns:RDFNsId1='http://www.w3.org/XMLR/'>
  <rdf:Description rdf:about='theDocument'>
    <RDFNsId1:conformsTo>
      <rdf:Description rdf:about='DTD'>
        <RDFNsId1:includes rdf:resource='dtdFragment'/>
        <RDFNsId1:uses rdf:resource='entity1'/>
        <RDFNsId1:uses rdf:resource='entity2'/>
        <RDFNsId1:uses rdf:resource='entity3'/>
      </rdf:Description>
    </RDFNsId1:conformsTo>
    <RDFNsId1:includes rdf:resource='inc3'/>
    <RDFNsId1:displays rdf:resource='image1'/>
    <RDFNsId1:includes rdf:resource='inc1'/>
    <RDFNsId1:includes rdf:resource='inc2'/>
  </rdf:Description>
</rdf:RDF>
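
Verbose or not, this form is regular enough that the triples can be recovered with a few lines of code. The following Python sketch handles only the nesting pattern used in the example (rdf:Description elements with rdf:about, properties carrying rdf:resource or a nested description); it is not a general RDF/XML parser, it leaves relative identifiers unresolved, and the function names and the shortened sample are assumptions.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XMLR = "http://www.w3.org/XMLR/"

DOC = """<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:r='http://www.w3.org/XMLR/'>
  <rdf:Description rdf:about='theDocument'>
    <r:conformsTo>
      <rdf:Description rdf:about='DTD'>
        <r:uses rdf:resource='entity1'/>
      </rdf:Description>
    </r:conformsTo>
    <r:includes rdf:resource='inc1'/>
  </rdf:Description>
</rdf:RDF>"""

def triples(node):
    """Yield (subject, predicate, object) for one rdf:Description element."""
    subject = node.get(f"{{{RDF}}}about")
    for prop in node:
        # ElementTree writes names as {namespace}local; join them into a URI.
        predicate = prop.tag.replace("{", "").replace("}", "")
        target = prop.get(f"{{{RDF}}}resource")
        if target is not None:
            yield (subject, predicate, target)
        for nested in prop:
            yield (subject, predicate, nested.get(f"{{{RDF}}}about"))
            yield from triples(nested)

def read_triples(xml_text):
    """Collect the triples of every top-level description in the document."""
    root = ET.fromstring(xml_text)
    found = []
    for description in root.findall(f"{{{RDF}}}Description"):
        found.extend(triples(description))
    return found

for triple in read_triples(DOC):
    print(triple)
```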

We should compare this to the existing syntax before writing it off as unutterably verbose. The following fragment shows some of the syntaxes used today, but, to save space, does not correspond to the complete RDF example in the figures.

<!DOCTYPE sample SYSTEM "DTD" [
    <!NOTATION IMAGE SYSTEM "image/*">
    <!ENTITY image1 SYSTEM "image1" NDATA IMAGE>
    <!-- an unparsed entity can only be named in an ENTITY attribute -->
    <!ATTLIST picture src ENTITY #REQUIRED>

    <!ENTITY % dtdfragment SYSTEM "entities-found-here">
    %dtdfragment;
]>
<?xml-stylesheet href="mystyle.css" type="text/css"?>
<sample>
    <link xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:href="otherdoc">a link</link>
    <picture src="image1"/>
</sample>

The RDF representation lets us reason about these relationships; the mixture of XML syntaxes does not facilitate that without an intermediate step. It would be possible to write software to gather this information and represent it in RDF, but it turns out not to be easy: the DOM does not expose parameter entity inclusions in a DTD (although that may change in the future), and XML processors do not in general expose information about included parsed entities; other APIs may or may not provide information about processing instructions. Documents being served on the web may also use server side includes which use comments:

<!--#include virtual="somefile.xml"-->
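
To give a flavour of the gathering code this implies, the following Python sketch uses the standard xml.dom.minidom module to collect two of these relationships, the DTD link and server side includes, as triples. The sample document, the function names and the short predicate names are assumptions for illustration; each further syntax (style sheet processing instructions, entity inclusions, XLink) would need its own separate branch, where the API exposes the information at all.

```python
import re
from xml.dom import minidom

DOC = """<!DOCTYPE sample SYSTEM "sample.dtd">
<sample>
  <!--#include virtual="somefile.xml"-->
  <p>text</p>
</sample>"""

def comments(node):
    """Yield the text of every comment below the given node."""
    for child in node.childNodes:
        if child.nodeType == child.COMMENT_NODE:
            yield child.data
        else:
            yield from comments(child)

def relationships(xml_text, subject="theDocument"):
    """Collect DTD and include relationships as (subject, predicate, object)."""
    doc = minidom.parseString(xml_text)
    found = []
    if doc.doctype is not None and doc.doctype.systemId:
        found.append((subject, "conformsTo", doc.doctype.systemId))
    for text in comments(doc):
        include = re.match(r'#include\s+virtual="([^"]*)"', text)
        if include:
            found.append((subject, "includes", include.group(1)))
    return found

print(relationships(DOC))
```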

If we want to use RDF to replace these syntaxes, we need to have a simple syntax, and to say where it can go, whether in documents, DTDs, Schemas, or externally. The next section explores one possibility.

Somewhere to put metadata

XML needs to learn from HTML and allow a separate area for metadata. XMLR introduces two elements, <xml:Description> and <xml:Body>, to keep metadata separate from data, and to provide a standard document header.

The importance of a document header has long been understood; it's used in specifications ranging from the Text Encoding Initiative to HTML and SOAP. A placeholder for an open-ended set of metadata, with predefined XML-specific fields and with a strong recommendation of the Dublin Core will be a major step forward.

An optional <xml:Sources> element can be used between these two containers; its primary function is described under Entities, below, but it could also hold pipeline information.

An application that renders documents should not normally render data from anything except the <xml:Body> element, much as is done for HTML.

An XMLR document therefore has the following form:

<?xml version="xmlr-1.0"?>
<xmlr>
    <xml:Description>
        metadata information
    </xml:Description>
    <xml:Body>
        actual content
    </xml:Body>
</xmlr>
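
A renderer following the rule above might be sketched as follows in Python. The sample document and the function name renderable_text are assumptions; the container is the <xml:Body> element named in the text, and the XMLR version declaration is left out of the sample because an XML 1.0 parser cannot be relied upon to accept it.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

DOC = """<xmlr>
  <xml:Description><title>Not for display</title></xml:Description>
  <xml:Body><p>Actual <em>content</em></p></xml:Body>
</xmlr>"""

def renderable_text(xml_text):
    """Return the text of the body container only, with whitespace normalized."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{XML_NS}}}Body")
    if body is None:
        return ""
    return " ".join("".join(body.itertext()).split())

print(renderable_text(DOC))
```

Everything in the header stays available to tools that want it, but never reaches the reader's screen by accident.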

General (internal parsed) Entities

Entities need to be visible after parsing, so they should work like elements. One should also not need a DTD to declare them, since DTD processing is optional, but unknown entities can make a document not well-formed.

In XMLR, the syntax &name; is taken to be a short form for <xml:include src="#name"/> where name is the value of an xml:id attribute appearing on an element whose start and end tags appear earlier in the document. This rule eliminates the possibility of forward references and also that of recursion. The optional <xml:Sources> section of XMLR documents is intended to hold content that's supposed to be reused in this way.
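
The rule can be sketched in Python as a single pass over the tree. This is an illustration under stated assumptions, not a specification: the function name expand and the sample document are invented, the long form <xml:include/> is used because an XML 1.0 parser would reject an undeclared &name;, and dropping the xml:id from the copy is a design choice made here to avoid duplicate identifiers.

```python
import copy
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"
INCLUDE = f"{{{XML_NS}}}include"

DOC = """<xmlr>
  <sources><name xml:id="co">Example Corp</name></sources>
  <body><p>Welcome to <xml:include src="#co"/>.</p></body>
</xmlr>"""

def expand(elem, seen=None):
    """Replace each xml:include with a copy of an element completed earlier."""
    if seen is None:
        seen = {}
    for index, child in enumerate(list(elem)):
        if child.tag == INCLUDE:
            name = child.get("src", "").lstrip("#")
            if name not in seen:
                raise ValueError(f"no earlier element with xml:id={name!r}")
            replacement = copy.deepcopy(seen[name])
            replacement.attrib.pop(XML_ID, None)  # avoid duplicate identifiers
            replacement.tail = child.tail         # keep the text after the include
            elem[index] = replacement
        else:
            expand(child, seen)
            if XML_ID in child.attrib:
                # Registered only once the element is complete, so it cannot
                # be included from inside itself or from earlier text.
                seen[child.get(XML_ID)] = child
    return elem

root = expand(ET.fromstring(DOC))
print("".join(root.find("body/p").itertext()))
```

Because an element is recorded only after its own content has been processed, a forward reference or a self-reference fails with an error instead of looping.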

Entities in Attributes

If we want entities to be visible, attributes are no longer plain strings, but can contain structure.

This may cause problems with DOM-based systems, and it may be necessary to provide an alternate attribute syntax for compatibility. The XSLT xsl:attribute element could be used for this purpose, to allow structure inside attribute values, but this is a subject for future work.

Conclusion: Advantages of XMLR

The preceding sections have discussed various problems with XML 1.0, and have suggested some different approaches for addressing these problems.

Perhaps the most controversial of these is the use of RDF, sketched here, to replace other syntaxes, all of which link to other resources.

Since this document is intended as a Thought Experiment, we should do some thinking. If every XML processor had an API for accessing RDF triples, and doing simple queries based on them, how might the world change?

Firstly, it would get creators and users of XMLR documents thinking about metadata, about labeling their content with author and editorial information, and about document management issues.

Secondly, it would let document management software traverse relationships between all of the entities involved, and do a proper job of managing the entire document environment.
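
As a sketch of what such traversal might look like, assuming the processor hands over relationships as simple (subject, predicate, object) tuples, a few lines of Python find everything a document depends on, directly or indirectly. The function name reachable and the abbreviated predicate names are hypothetical.

```python
from collections import deque

TRIPLES = [
    ("theDocument", "conformsTo", "DTD"),
    ("theDocument", "includes", "inc1"),
    ("theDocument", "displays", "image1"),
    ("DTD", "includes", "dtdFragment"),
    ("DTD", "uses", "entity1"),
]

def reachable(triples, start):
    """Return every resource linked, directly or indirectly, from start."""
    seen, queue = set(), deque([start])
    while queue:
        subject = queue.popleft()
        for s, _predicate, obj in triples:
            if s == subject and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen

print(sorted(reachable(TRIPLES, "theDocument")))
```

A document management system could use such a set to decide what must be checked out, archived or republished together.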

For those of us interested in ontologies and cataloguing, documents would become first-class citizens, asserting their own place in the hierarchies, in a standardised manner. Would that be a foundation for topic maps with fewer layers?

But the single most exciting advantage of XMLR is that all of the specifications building on XML would have an extensible shared representation of metadata. Instead of a list of warts and barnacles, we'd have a coherent architecture that can easily be understood.

And this, above all, is why I demand coherence.

Acknowledgements

Eric Miller and Sandro Hawke, for discussions about the ideas here

Aaron Swartz, for help with RDFAuthor

Notes

1. No, not that one.

