Publishing applications, specifically applications for writing prose in XML, can benefit from some features of SGML that were dropped. There is a strong case for resurrecting SGML features such as CONREF, and DTD features such as entity declarations, in XML schema languages in order to build tools that meet the needs of publishers. These features do not affect the basic non-validating XML parser and only minimally affect validating parsers. While there is no going back to SGML, the XML specification can and should be modified to provide better support for publishing applications. If the XML specification is not so modified, vendors of XML products that serve the publishing community can offer the proposed enhancements as product features.
The purpose of this presentation is to discuss issues affecting publishing applications from the perspective of two standards for marking up text: the Standard Generalized Markup Language (SGML) and the Extensible Markup Language (XML). I assume that the audience is familiar with these standards and that the references I will make to various aspects of them will be understood. One point I would like to make before I start is that XML is a subset of SGML, and the standards community has made a strong effort to keep the two standards in sync by adding several constructs to SGML to ensure support for XML. For details of these changes and enhancements, please see the ISO/IEC JTC1/SC34 publications on Document Description and Processing Languages.
The Standard Generalized Markup Language (SGML) was adopted as an international standard in 1986 as ISO 8879. It was the first successful attempt to standardize the way text is tagged based on its content and not its appearance. Prior to SGML, there were some proprietary systems that attempted to promote this generic approach to text markup, but for the most part, text was marked up with formatting codes to achieve a desired presentation on paper.
The Extensible Markup Language (XML) was approved as a Recommendation of the World Wide Web Consortium (W3C) in 1998 primarily for the transmission and processing of complex text documents over networks. As a subset of SGML, XML has its roots in the publishing community. XML was quickly recognized as an application-neutral way to identify data in a text stream and to transport that data over the Internet. Since then, it has become ubiquitous in a broad range of both publishing and non-publishing applications.
The popularity of XML has raised an issue for converting publishing applications from SGML to XML. However, there are some major shortcomings in XML that directly affect the publishing process. In this presentation, I will attempt to explain the limitations of XML with regard to publishing applications so that a decision to stay with SGML can be supported. On the other hand, if XML incorporates these features, the transition to XML for a publishing application can be a more practical option.
There are many activities going on to enhance XML to provide features that will better support a wide range of applications, publishing being only one of them. This presentation focuses on publishing applications and is not meant to be all encompassing. As new features are added, the complexity of the standard will increase, but so will its effectiveness. The features presented here do not require changes to the basic XML parser and only minimal enhancements to the validating parser; i.e., programming effort is relatively minor to add support for the recommended features.
For people who do not know the history of text processing on computers, SGML is looked at as a bloated standard with many diverse features that are difficult to comprehend. When SGML was being developed, techniques for processing text were all over the place. There was no consensus to put tags in angle brackets and no generally accepted method for describing the structure of text. SGML, as a matter of course, tried to be all things to all people. The feature set was defined to support as many existing systems of the day as possible, to gather enough support behind the standard that it would be adopted. The politics of the day mandated that approach.
Converting text from one proprietary system to another was (and still is) a difficult process, especially if the text includes markup generated by some proprietary application. Take a look at text generated from Microsoft Word or any other text processing application that focuses on formatting. Even HTML is a problem because it is primarily a formatting language, not a text processing language. While some could say that HTML can be used for text processing, in fact, today's browsers tend to support it more for formatting and that limits its text processing potential. In other words, HTML is processed by browsers as a linear formatting language, not as a hierarchical markup language.
Formatting codes do not describe a document's structure or content. The same visual effect can be created with a variety of codes. For example, an indent can be created by a tab, where the amount of indent is determined by where the tab is set, or it can be created by typing a number of space characters. Indents can also be specified by a formatting function where the amount of indent is one of the function's parameters.
Even if you could figure out that the indent exists, what does it mean? Does it mean that the text is the start of a paragraph? What about hanging indents, where the first line is not indented but all the other lines are? Does it mean that the text is a list item contained in a paragraph? If so, does this list item element have restrictions on content that are different from those of the containing element? For example, it may contain a paragraph, but that paragraph may not contain another list.
There are many complexities to text processing that are unique to the medium on which it is displayed. What happens when you embed codes in the text that do not relate to the text content, but are necessary for the proper rendering of that text on different media?
Each type of application has its own unique problems to solve. For example, if you are writing a credit card application, how do you manage the timing of the issuance of a new credit card to someone who reported one stolen? There may still be some valid transactions outstanding that must be processed even though the credit card itself has been cancelled.
This type of problem doesn't occur with text, but that doesn't make text processing less important. For the same reason, XML should not apply only to simplified text just because most applications outside of publishing are relatively simple from a text processing perspective. The publishing community still has needs for some of the features and constructs presented by SGML. How do we decide what stays and what goes?
Even as a long-time SGML advocate (after all I was on the committee), there are some really nice syntax improvements brought to the fore in XML. I especially like the way empty elements are indicated with the single tag; e.g. <empty/>. I wish we had thought of that. Of course, the purpose of using this syntax is to indicate an empty element without the need to refer back to the Document Type Definition (DTD) during processing, a concept not considered relevant with SGML since the DTD is not optional. (You could process SGML data without processing the DTD, but you would not be using an SGML parser.)
Another major improvement is the elimination of many little used or unused features of SGML, such as DATATAG, RANK, LINK, and CONCUR. While inherently consistent with the way text processing and tagging was done before standard markup languages existed, these features have not proven relevant in the majority of today's applications. For the most part, vendors have not offered these features, even in SGML-based products.
Several other SGML constructs that were omitted in XML are considered useful in keyboarding-intensive applications, such as the SHORTTAG enabled constructs of unclosed start- and end-tags, empty start- and end-tags, and undelimited attribute values. Minimization, especially end-tag minimization, substantially reduces the number of keystrokes for text data capture. These features can be incorporated into the software used to capture the data such that the file generated has full tagging and the basic XML parser need not address these issues. In general, the lack of these features is less relevant in the overall scheme of things because skilled keyboarders don't really care. They key what they need to key and do it well.
There are several omissions or limitations of constructs and features in XML that have a direct impact on publishing applications. These aspects of XML will be explored in more depth to determine how to overcome these restrictions in a publishing application.
One of the things I want to clarify before getting into some of the publishing issues this presentation will address is the difference between features that affect the end user versus features that affect the programmer. It is important to distinguish between these two areas because one of the primary goals of XML is to simplify the parser. This objective was met successfully as can be seen by the wide acceptance of XML, mainly due to the availability of free parsers and inexpensive software products that process XML files.
This simplicity comes at a price, however, because some of the features that benefit the end user were omitted from XML to achieve this goal. XML was designed as a delivery language for static documents at the end of the processing cycle. Features and constructs that benefited users during the data capture and editing process for the document were lost. In the context of this presentation, the end user is implementing a publishing application, but I am sure other applications can benefit from these features too.
While the SGML parser may be considered overly complex because of the constructs and features that proved irrelevant, there are features that could have been included in XML that would not have adversely affected the parser. In other words, the degree of difficulty to implement these features is not prohibitive.
In the case of available software, all of the features in this presentation have been successfully implemented in publishing-oriented software by vendors who marketed SGML-aware applications before the name change to XML.
Using the SGML declaration and the DTD declaration as my sources, I reviewed the constructs and features both included and omitted from XML. This presentation will not delve into the SGML or DTD declarations or attempt to address every nuance of the specification. Here are some of my thoughts and conclusions.
The character set for SGML was based on ASCII, which was the preeminent code set of the day. The change to UNICODE for XML allows XML to avoid many issues that SGML needed to address in character set processing. The most significant issues deal with "special characters": characters that are not represented in the basic ASCII character repertoire, such as bullets, stars, daggers, and the like.
A typical SGML publishing application used entity references to represent special characters because accessing these characters for presentation varied from one application to another. Even though UNICODE has a character representation for these special characters, it is still a good idea to use entity references to allow the text content to be adaptable to applications that do not use UNICODE.
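As a minimal sketch (the entity names here are illustrative; a production application would more likely use a standard set such as the ISO entity sets), special characters can be declared once and referenced by name, so that only the declarations need to change for an application that cannot handle UNICODE:

```xml
<!-- Illustrative declarations; names and code points are a sketch -->
<!ENTITY bull   "&#x2022;"> <!-- bullet -->
<!ENTITY dagger "&#x2020;"> <!-- dagger -->
<!ENTITY star   "&#x2605;"> <!-- black star -->
```

In the instance, the keyboarder simply keys &bull; and each downstream application decides how the bullet is actually rendered.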
Since the applications that support publishing, especially long-standing editorial and typesetting systems, do not all support UNICODE, and since UTF-8 is backward compatible with ASCII, many products support only the ASCII subset of UTF-8. In that case, even if the specification calls for UNICODE support, special character handling may still be an issue.
Tag names in SGML were, by default, not case sensitive primarily because having the same tag for different elements based on its capitalization is an error waiting to happen. Entity names, however, should be case sensitive because capitalization can be a useful aspect of the name itself. For example, a Greek letter could be represented by an entity reference called “BETA” or “Beta” for uppercase, and “beta” for lower case.
Of course, some languages do not have simple upper and lower case letters like English and many European languages, so case folding would not be desirable. SGML allows users to change the default from "YES" to "NO" in the SGML Declaration. XML has no such mechanism.
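As a sketch of the distinction, the case of the character can be carried by the case of the entity name itself (the names below are illustrative, in the spirit of the ISO Greek entity sets):

```xml
<!ENTITY Beta "&#x0392;"> <!-- uppercase Greek beta -->
<!ENTITY beta "&#x03B2;"> <!-- lowercase Greek beta -->
```

If entity names were case folded, these two declarations would collide and the distinction would be lost.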
For most publishing applications, keyboarding is still a major method for capturing data. Not everyone has an XML-aware editor and, in many cases, keying the tags in directly as part of the data entry process can be faster and more economical. Since the naming rules do not affect the parser, I recommend that vendors provide options through their products to control tag capitalization. For languages that support case folding, I also recommend that case insensitivity for tags be specified as the default.
One of the changes that XML "inspired" was the replacement of the DTD with a schema that could be expressed in XML syntax. Programmers will say it's easier to parse, but that is their primary concern. My primary concern is for the person who must define the document's structure.
In a text application, the effort to define a document is achieved by interacting with the end user, not the programmer. While there are many nuances to the DTD of which the user need not be aware, such as notation declarations or even entity declarations (although they are pretty simple too), the element declaration is really easy to understand. There have been several occasions where I used the notation on a whiteboard (or blackboard) to reflect the results of an information gathering session with a group of editors. Before long, the editors would begin to "correct" the notation as it became clear to them what it expressed.
Don’t forget that the DTD is relatively straightforward in XML. Many of its complexities had to do with minimization notations, SHORTREF declarations, and other features omitted by XML, so they won’t appear in a DTD anymore.
Regardless of my emotional attachment to the DTD, I realize that there is a more serious problem to be addressed. There is a lack of support in the XML schema for some basic DTD declarations, specifically entity references and exceptions (although a form of the latter has been included in XML Schema 2).
For many non-publishing applications, the schema could be developed programmatically by the programmer from information in electronic format, such as a database. The programmer could decide the order in which elements should occur, and then derive the characteristics of the content from the database’s data dictionary.
Many document structures, especially in a non-publishing application, are very limited in scope and easily defined. The depth of the hierarchy is usually just a few levels and the occurrence of elements (optional or required, repeatable) is well known. The data that forms the content of these elements is often extracted from or input to a database. Such schema features like data types are helpful to ensure that the data is valid before it is processed, especially for a database load application.
These issues are completely different in a publishing application. Free-form text can have very complex data hierarchies, especially when documents are made up of a series of other documents. Anthologies are, by definition, collections of other works, like poetry or plays. Each of these other works may be its own document type with its own declaration.
Constructs like exceptions – inclusions and exclusions – are inherent in the editorial process for building and revising text documents. Since XML was defined for data delivery over the Web, where the need for exceptions does not arise, they are not supported. Editing applications, on the other hand, deal with data capture and revision of complex text content. Exceptions express where certain optional elements may or may not be used. The intent is to reduce the tag set by allowing these exceptions to modify the basic content model for an element and its children.
Here is a simple example. Let's say that the content model for a simple paragraph element (para) is as follows (in DTD notation, of course):
<!ELEMENT para (#PCDATA | fig | fnref)* >
which says that a paragraph consists of parsed character data (simple text), figures (fig), and footnote references (fnref).
Now let's say that we want to use this paragraph in an element called chapter, so we declare:
<!ELEMENT chapter (title, para+, fnote*) >
which says a chapter element consists of a title followed by one or more paragraphs, which in turn may contain figures and footnote references. Footnotes, if any references occur, appear at the end of the chapter.
What's in a footnote? Most likely a paragraph (or several) of text. It is unlikely that these paragraphs would contain figures or footnote references; they could, but for this example, they do not. If we declare it this way:
<!ELEMENT fnote (para+)>
we can assume that figures and footnote references will not occur; however, we cannot validate the document to prevent or at least detect the occurrence of these elements in a footnote. We have two options:
<!ELEMENT fnote (fpara+) >
<!ELEMENT fnote (para+) -(fig, fnref)>
The first option seems reasonable until you start to do it over and over again for different elements in different circumstances. You may also have variations where the figures are permitted, but the footnote references are not and vice versa. Before long you may find yourself adding hundreds of additional elements to the declaration to handle all the variations of the content model that could occur.
Exclusion exceptions allow you to define a complex paragraph that contains all the possible elements it could contain, and then tailor it to be more restricted in certain contexts. Inclusions provide similar benefits when an optional element can appear almost anywhere, like an index reference element. Rather than try to specify it in every content model, you can specify it as an inclusion in a higher level element's content model and let it propagate to the lower level elements.
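In SGML DTD notation, the two exception types just described might be sketched as follows (indexref is a hypothetical element name, and this notation is not valid in an XML DTD today):

```xml
<!-- Exclusion: footnotes may not nest figures or footnote references -->
<!ELEMENT fnote (para+) -(fig | fnref)>

<!-- Inclusion: an index reference may appear anywhere within a chapter -->
<!ELEMENT chapter (title, para+, fnote*) +(indexref)>
```

The exclusion propagates down through para and any of its children; the inclusion propagates the same way, so indexref need not be named in any of the lower-level content models.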
XML was originally designed as a way to tag text for transport over the Internet. For that purpose, exceptions have no meaning. The text already exists so there are no exceptions. Elements are either there or not there. However, because of its roots in SGML, XML immediately morphed into a simplified markup language for text processing and wound up being used in input and editing applications. All of a sudden, the handling of exceptions becomes a relevant feature.
The good news is that exceptions are a shorthand notation to simplify how to express the addition or omission of elements within the document's structure. Therefore, an XML-aware editor can support this feature without the basic XML non-validating parser having to deal with it. The only requirement is for the DTD parser to be allowed to accept the notation to support the editorial requirement. Of course, the editor software has to support the occurrence or omission of these elements, but then you are writing an editor, not a free parser per se. So this feature can be supported by a vendor product without impacting later processing of the document by a basic XML parser.
Entity reference declarations are not part of the schema specification. Why? Because these references have been resolved into the document and the entities have replaced them. Remember, XML is a specification for delivering text over the Internet, not necessarily a text processing specification. Apparently the use of entity references in non-publishing applications is not relevant, hence their omission from the schema.
Publishing applications make extensive use of entity references, both internal and external. In addition to special characters, many aspects of a document's content could be organized by the expeditious use of entity references. For example, all illustrations, from simple symbol graphics to figures and complex tables, are usually represented by entity references. One reason is because these aspects of the content are not captured at the same time the text is created. Illustrations have their own production process and are incorporated into the document when the document is assembled for presentation close to the end of the production cycle. An external entity reference is used as a place holder until such time as the actual illustration is available.
Tables, especially complex tables, or tables generated from another source like a database, are also kept external to the document. When table content may change between the start of the production process and the end, it is best kept external to the document until the last possible moment. An external entity reference is used as a place holder to allow automatic inclusion at processing time.
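A sketch of how such placeholders are declared (names and paths are hypothetical): an illustration is an unparsed external entity pointed to through an ENTITY attribute, while a table can be a parsed external entity pulled in by reference:

```xml
<!-- An illustration as an unparsed entity -->
<!NOTATION tiff SYSTEM "image/tiff">
<!ENTITY chart1 SYSTEM "art/chart1.tif" NDATA tiff>
<!ELEMENT figure EMPTY>
<!ATTLIST figure source ENTITY #REQUIRED>

<!-- A table kept external as a parsed entity, referenced as &ratetable; -->
<!ENTITY ratetable SYSTEM "tables/rates.xml">
```

In the instance, <figure source="chart1"/> stands in for the artwork; when the final illustration arrives, only the entity declaration (or the catalog entry behind it) changes, never the instance itself.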
While the XML specification includes public identifiers, they are not widely supported by existing XML applications, especially by the free parsers. Of course, XML products that have SGML parentage support them, but that’s because the logic is already there.
What are public identifiers and why do publishers want them? Public identifiers are formalized constructs that represent external content. They are used in lieu of physical directory paths and file names to represent the location of a file within a specific application environment. The catalog is a file of mappings between the public identifier and the physical file. The catalog format was standardized by OASIS, an open standards organization, as a vendor-neutral approach for supporting the public identifier.
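A catalog in the plain-text OASIS (TR9401) style is simply a list of mappings from public identifiers to files; the identifiers and paths below are hypothetical:

```
-- Map public identifiers to local files --
PUBLIC "-//MyPub//DTD Book v1.0//EN"    "dtd/book.dtd"
PUBLIC "-//MyPub//NONSGML Chart 1//EN"  "art/chart1.tif"
```

The instance carries only the public identifier; the catalog decides what file it resolves to in a given environment.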
The advantage of using a public identifier instead of a URL to point to a file is that it allows the reference to change without affecting the instance. Why is this important? Again, we are talking about publishing applications, not Web delivery systems. The URL is meant to be a static reference to a file retrieved over the Internet. It has not gained use as a generic form of file reference in non-Internet applications. Nothing has, really, but for publishing applications that have used SGML, public identifiers are the normal way of handling file references.
There are many good reasons for using public identifiers in publishing applications. First, the referenced file can change during the publishing process. For example, early in the process, an illustration may not exist, so the catalog can be set to point to a dummy image suitable for use as a placeholder. When the actual illustration becomes available, the catalog is changed and no modification to the instance is required. This is especially helpful if the illustration is used multiple times throughout the publication.
Second, different file formats may be required based on the needs of the presentation environment. For example, an application that drives the output to a Web page may use a jpg or gif format for a graphic because that's the best format for the Web environment. However, an application that drives the output to a printed page may need to use a tif version of the image to get the resolution required for the printed product.
Public identifiers also benefit cross-platform publishing environments. Different path and filename syntax may be required based on the production platform for a particular application. Editing may be done on Microsoft PCs where the path names are separated by a backslash (\). Typesetting may be run on a Unix platform where the path names are separated by a slash (/) or on a VMS or mainframe platform where the syntax is completely different.
One way to handle this type of processing is to support multiple catalogs, one for each platform. Of course, there should be only one master catalog from which the other catalogs are generated programmatically. Changing the syntax for file paths or the extension associated with a file can be done programmatically from easily established standard conventions set up for the publishing environment by the system architect. In any event, having the information external to the instance and developing standardized methods for resolving the public identifier provides more options in putting together a publishing environment than a simple URL embedded in the instance.
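For instance, a master catalog entry can be transformed mechanically into platform-specific variants; the entries below are hypothetical:

```
-- Master catalog (Unix path syntax) --
PUBLIC "-//MyPub//NONSGML Chart 1//EN" "art/chart1.eps"

-- Generated catalog for the Windows editing platform --
PUBLIC "-//MyPub//NONSGML Chart 1//EN" "art\chart1.eps"
```

Only the catalogs differ from platform to platform; the public identifier in the instance never changes.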
One last thing: system identifiers should be optional, not required, in a declaration to better support publishing applications. This feature may be best served as a vendor option. Parsers that do not support public identifiers can just report missing system identifiers as a warning (or an error, if you must). Processing, however, should continue.
For the most part, the rules defined in XML for resolving entity references in the parser are sufficient. However, the rules are different from SGML when you are dealing with non-validating parsers, so you must understand the difference so you can compensate in your approach to defining entity references.
In a non-validating parser, one that does not reference the DTD, an external entity reference that points to character data that cannot be parsed is forbidden in content. The reason for the restriction is to allow a well-formed document to be parsed without a DTD and not crash the parser.
While this is an understandable restriction, it can have an adverse effect on some of the older SGML applications that do not support UNICODE. As mentioned earlier, special characters – stars, bullets, daggers, etc. – are usually represented by an entity reference and declared as an external entity reference to a symbol file. In other words, each character is treated as an image and displayed when the entity reference is processed by a formatting application. Many SGML-compliant typesetting systems support this approach.
General entities are used so that the approach to generating these characters can be addressed externally. Based on the processing pass, the relevant external entity is resolved for that instance by manipulating the catalog that maps the public identifier to the external file. In this way, the character may be shown differently on a display screen from how it might appear on the final printed page. What is important, however, is the fact that the content of the instance never changes.
Controlling the approach through standard SGML constructs avoids the problem of having to ask each vendor to change its systems to support an in-house convention, something that vendors seldom want to do.
In an XML application, each special character can be entered as a character reference to the appropriate UNICODE character. If vendors only support the basic ASCII character set of UTF-8 (in other words, they are really only ASCII, but hype it a little in their PR by saying they support UNICODE), the actual presentation of these characters on the final medium may still be an open issue for the vendor and the user to resolve.
When using a DTD, XML allows entity declarations to appear in the internal subset in the document's prolog, not just in the external DTD. One of the advantages of this feature is to allow document parts stored in separate files to declare the external entities used only in that segment; e.g., illustrations or tables that occur only in that file.
The XML schema specification does not provide this capability because it is more an aspect of the publishing process than of other data processing models. Nonetheless, this capability makes the publishing process easier to organize by allowing these entities to be considered part of the document segment.
This feature also simplifies the DTD by not requiring it to be updated whenever an external entity is added to or removed from the document’s content during ongoing revision cycles.
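A sketch of a document segment that declares its own external entities in the internal subset (file names are hypothetical, and the tiff notation and the figure attribute list are assumed to be declared in book.dtd):

```xml
<?xml version="1.0"?>
<!DOCTYPE chapter SYSTEM "book.dtd" [
  <!-- Declared here because this illustration occurs only in this chapter -->
  <!ENTITY chap3fig1 SYSTEM "art/chap3fig1.tif" NDATA tiff>
]>
<chapter>
  <title>Chapter Three</title>
  <para>The results are shown here: <figure source="chap3fig1"/></para>
</chapter>
```

Adding or removing an illustration during a revision cycle touches only this file, not the shared external DTD.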
Content reference attributes are an interesting construct. The primary purpose of this attribute specification is to indicate through the DTD that the content model for an element is optional. When the attribute occurs on the element’s start-tag, the element has no actual content and there is no end-tag for the parser to find. When the attribute is not specified, the element contains content that is parsed according to the model and an end-tag is required. The element cannot be declared EMPTY in the DTD since the element may sometimes contain content, so that is why the distinction was made.
Since XML has an empty-element notation that can be used for elements with optional content, the usefulness of this construct is diminished. The reason to support the declaration in the DTD is to document the intention of the element-attribute relationship. When the element is empty, the attribute is required; when content exists, the attribute must be omitted. Without this notation, a document that uses the construct could not be validated against a DTD. Non-validating parsers are not affected by this construct, and the logic to be added to a validating parser is minor.
The purpose of the attribute is an application issue. The element may represent computer-generated data that will be added to the document during some process, or the attribute may be a reference to data stored in a database and retrieved by the application at some point in the production process. The DTD does not specify what the application does, only that when the attribute occurs on the element's start-tag, that particular occurrence of the element is empty. Other occurrences of the element, on which the attribute does not occur, have content and are parsed according to the content model.
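In SGML notation, a content reference attribute might be declared like this (the element and attribute names are hypothetical, and #CONREF is not part of XML):

```xml
<!ELEMENT chart (para+)>
<!ATTLIST chart dataref ENTITY #CONREF>
```

An occurrence written as <chart dataref="salesdata"> is empty and has no end-tag (assuming an entity salesdata has been declared); an occurrence without the attribute, <chart>...</chart>, is parsed against the (para+) model.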
While the purity of the DTD would suggest that support for the construct is a good thing, the reality of the situation is that you could get by without it. If the application managed the presence or absence of the attribute relative to the element’s content, the desired results can be achieved. Vendors should take note that this is a useful feature to provide in their software applications.
This topic continues to surface whenever people talk about XML for publishing applications. Which is better? The answer is both are needed to support the full range of applications to which XML applies. Here is where I stand.
For non-publishing applications, XML schema is better because you are dealing with a different type of data than free-form text. Describing document structure is not the primary goal of these applications, just describing data content in a text format. The data for many of these applications is computer generated, so the need for a visual notation is not necessary.
The direction of XML has been diverted more to the non-publishing application since its inception because the markup syntax is recognized as extremely versatile. It is a self-describing language that identifies elements of data without the need to understand an application's inherent use of them. There is no need to specify an application's API for passing data or other unique syntax. The recipient of the data becomes responsible for integrating it with their application instead of the sender, a fact that greatly supports information interchange.
For publishing applications, the DTD is better. As a visual notation, it is easier for an end user to follow the document model while developing the document structure. Other constructs can be handled by a more technical specialist, but they are still easier to write because the intent of the DTD is to be user-friendly, not necessarily programmer-friendly.
An XML DTD is simpler to parse than a full SGML DTD because of the removal of many features and the elimination of the SGML declaration, so it should not be a major programming issue. The XML schema specification does not support entity references, exceptions, or external subsets, all of which are used in publishing applications.
Publishing deals with text content almost exclusively, so the lack of data typing is not as relevant. If there are purists out there, then add a <!DATATYPE …> declaration to the DTD, but I don’t think it’s necessary.
Oh, and one last thought: DTDs ARE NOT OBSOLETE, so publishers should not be afraid to use them. As a matter of course, publishers should demand that their vendors support them. Remember the old adage: "Use it or lose it."
Support for SGML beyond its current level is not practical in light of what the industry has determined is useful, but working applications should not rush to change unless some specific benefit for moving to XML can be identified. On the other hand, the XML specification can be modified to provide better support for publishing applications. If the XML specification is not modified, vendors of XML products that service the publishing community can offer the proposed enhancements as product features.
Publishers should understand the limitations of the XML standard with regard to their application requirements and be prepared to offset these limitations through vendor products or their own development efforts. Ignoring these requirements would be foolish because the need exists, as the SGML applications developed over the years have proven. All the applications in which I was involved used some or all of these features.
For the most part, vendor support has been there, especially from vendors that support(ed) SGML. Some of the newer vendors of XML products that came on the scene when the standard was first presented have little or no experience with publishing applications. These vendors are focused on Web output more than paper output, and few, if any, address the total publishing experience. While their products may be less expensive than the earlier SGML-based products, they also do less.
Because there is a large market for XML products outside the publishing industry, publishers have accepted the limitations of these products and rationalized the limits as "cost savings" they get by using them. That is more a perception than a fact, because the features must be implemented somewhere: if not in the vendor product, then in other programs, usually developed in-house. Most of the alternative solutions require additional processing and, if errors occur, additional passes of the data through the error correction process. These costs are not usually attributed to the deficiency of the XML feature set.
Not having a system-independent way of defining entities can have a major impact on cost. How do you manage external entities that change during the processing cycle? How do you manage cross-platform processing?
If you have to process multiple document instances as part of the production process, how do you associate external entities with the appropriate instance to minimize the opportunity to introduce errors into your data? How many instances are involved and how many entities need to be managed?
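XML Catalogs, an OASIS standard descended from the SGML catalog mechanism, offer at least a partial answer to the cross-platform question: instances reference entities by a stable public identifier, and each site maps those identifiers to its own local storage. The identifiers and paths below are illustrative:

```xml
<!-- catalog.xml: maps stable public identifiers to site-local files -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//Acme//ENTITIES Chapter 3//EN"
          uri="chapters/chapter3.xml"/>
  <public publicId="-//Acme//NOTATION Figure 1//EN"
          uri="art/figure1.eps"/>
</catalog>
```

Each platform maintains its own catalog while the document instances never change, which is exactly the separation of reference from storage that publishing workflows need.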
As you can see, there are no easy answers, but there are always many questions. While SGML attempted to address these issues, the result became a cumbersome, complex standard that was never fully implemented. XML, on the other hand, was simplified so far that it fails to address even these very basic issues.
The logical solution is to provide a version of the standard in which these issues are covered, so that publishers are not forced to make these determinations themselves. If the onus is put on the vendors to support these features, then that is where it belongs. The publisher should not be shortchanged just to simplify the vendor's product or to make the programmer's life easier.
Making XML more relevant for the publishing community does not require substantial change or investment. Yes, some minor tweaks are required, but I think the features presented here minimize the effort. For the most part, all this code exists somewhere on the web. It is just a matter of putting it together for the benefit of everyone.