Odd Customizations

Syd Bauman
Julia Flanders

Abstract

Like many XML vocabularies intended to be widely deployed in a variety of environments, the TEI Guidelines, an XML vocabulary for markup of textual materials of interest for research and teaching, were designed to be customized. There are several mechanisms for customizing the TEI Guidelines, all with disadvantages as well as advantages. The social and technical advantages and disadvantages, both of specific customization techniques and of vocabulary customization in general, are discussed. Perfect interchangeability of data would require a perfectly restrictive encoding specification, and would preclude the TEI world as we know it, with its diverse disciplinary and methodological perspectives held in relationship by a flexible and capacious tag set. Customizability threatens interchange to the extent that it introduces variation that cannot be controlled or predicted. The challenge is to provide both in ways that are meaningful and functional.

Keywords: TEI; Schema Languages

Syd Bauman

Syd Bauman is the Programmer/Analyst for the Women Writers Project, where he has worked since 1990, designing and maintaining a significantly extended TEI-conformant DTD for encoding early printed books. He also serves as the North American Editor of the Text Encoding Initiative Guidelines. He has an AB from Brown University in political science and has worked as an Emergency Medical Technician since 1984.

Julia Flanders

Julia Flanders is the Director of the Brown University Women Writers Project, where she has worked (as proofreader, textbase editor, project manager, and other things) since 1992. She currently serves as the Chair of the Text Encoding Initiative Consortium and the Vice-President of the Association for Computers and the Humanities. She holds degrees in English history and literature, but her current research focuses on problems of text encoding theory and digital textuality, and for fun she sometimes binds books.

Odd Customizations

Syd Bauman [Brown University Women Writers Project]
Julia Flanders [Brown University Women Writers Project]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyleft 2004 by the authors. Reproduced with permission.

Interchange is a social and political problem, as well as a technical one.[TEI-L:Piez]

Introduction

The TEI [Text Encoding Initiative] is an international membership consortium devoted to developing and maintaining a text encoding standard for humanities documents.[tei-c] That standard is the TEI Guidelines for Electronic Text Encoding and Interchange, which comprises a set of DTDs and accompanying documentation.[TEI] The TEI Guidelines are widely used by digital libraries, text encoding projects, and individuals to encode an extremely broad range of documents including dictionaries, linguistic corpora, literary and historical documents, archival materials, letters, early manuscripts, and many other document types.

Like interchange, customization is built into the conceptual heart of the TEI. In fact, in a sense they are the twin poles around which the TEI universe revolves, two ends of the same problem. Perfect interchangeability of data would require a perfectly restrictive encoding specification, and would utterly preclude the TEI world as we know it, with its diverse disciplinary and methodological perspectives held in relationship by a flexible and capacious tag set. Customizability threatens interchange to the extent that it introduces variation that cannot be controlled or predicted. The challenge is to provide both in ways that are meaningful and functional. These issues are crucial for the TEI, but they are extremely relevant for any widely shared specification whose goal is to facilitate interchange of data.

As Wendell Piez points out, the challenges here are not only technical but also social and political.[TEI-L:Piez] The two are closely related: in both domains, customization is an act of establishing a formal, functional communicative linkage between the individual and the community: expressing a relationship between one's own position and that of the standard. The relationship—whether technical or social—may be quite tight or quite tenuous, and the degree of tightness in both cases determines the functionality that one can expect to achieve.

The idea of interchange implies not simply the transfer of data from a sender to a recipient, but more importantly a successful transfer: one in which meaning is transferred as well, one which results in comprehension or successful behavior at the recipient's end. On the automated side, some computational process operates on the received data (for instance, by validating it, displaying it successfully, transforming it into some other format). From the human perspective, successful interchange means, for instance, that a human being familiar with the vanilla system can make sense of the customized data being transferred and act on the basis of that understanding (for instance, by adapting existing software to process the information, by writing documentation describing it, or by having a successful conversation about the data with the data's originator at an Extreme Markup Impromptu).

The human dimension of interchange is both more complex and more robust than interchange which takes place through automated processing, and in many cases it is also an essential support to successful automated processing. Human readers are able to make intuitive connections, based on contextual knowledge, between different names for the same fundamental element (e.g. between <p> and <para>), and they are not dependent on strict adherence to formal rules in understanding what markup “means”; they can draw to some extent on the complex shared universe of human language to support their attempts at comprehension. For instance, a human reader can make sense of a <list> element even if it appears in a context that is forbidden by the DTD. However, humans also complicate the problem of interchange by their cleverness: two human encoders may use an identical tag set in two fundamentally different ways to produce profoundly different encodings of the same document or to produce two identical encodings of very different documents. For all of these reasons, human-readable documentation is an essential part of any effort at interchange, and particularly so when a customized encoding system is involved.

We begin by discussing the technical challenges surrounding TEI customization and propose some solutions. These proposals do not describe the current practice or plan of the TEI, but are experimental explorations which we hope the TEI will consider in the future. Although we are using the TEI as our case study here, these issues apply much more broadly. In particular, the “complete customizability” mechanism discussed below would permit any standard (or other document for that matter) written in TEI (as the TEI Guidelines indeed are) to be completely customizable via a clean mechanism, even if the source were read-only, whether due to technical, legal, social, or moral constraints.

However, it is true that the TEI occupies a very distinctive position with respect to the idea of standardization: strictly speaking, it is not a standard, but is poised between a standard and a consensus, possessing some characteristics of each, in ways that have very interesting consequences for extension and interchange. The TEI is also rare, although far from alone, among standards (or things that resemble standards) in allowing for customization, and is rarer still for its ardent encouragement of customization. As a case study, therefore, the TEI brings the complexities of these issues to the fore very usefully.

Categories of customizations

We can immediately categorize customizations of an XML schema into three groups:
restrictions

While there (could) exist document instances that are valid against the vanilla schema but not against the customized schema, any instance that is valid against the customized schema is definitionally valid against the vanilla schema. E.g.:

vanilla <!ELEMENT sp (speaker?, p+)>
custom  <!ELEMENT sp (speaker, p+)>
DocBook uses the term “subset” for this [db:cust]. See figure 2.

isomorphisms

The syntax and semantics of markup in the customized schema are exactly the same as those in the vanilla schema, except that the names are different. E.g.:

vanilla <!ENTITY % n.forename "forename">
        <!ENTITY % n.surname "surname">
custom  <!ENTITY % n.forename "firstName">
        <!ENTITY % n.surname "lastName">
In this case the two DTDs could satisfy the same architectural form.

A Venn diagram of this sort of customization is not very helpful, as obviously if a renamed element appears (which will always be the case if one of the renamed elements is required, e.g. the root element) a document will not be valid against the other schema. Thus, when required elements are renamed there is no overlap whatsoever. However, if we imagine that the Z-axis of a three-dimensional Venn diagram represents the structure or architecture of the document space, then the result would be two circles with the exact same X- and Y-coordinates on different planes along the Z-axis.

However, the important point here is that the transformation either of a document instance that is valid against one of the schemas to an instance that is valid against the other, or even of one of the schemas so that it validates the other set of documents, is likely to be trivial.
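
For instance, assuming the forename/surname renaming shown above, the entire transformation back to the vanilla names could be expressed as an XSLT 1.0 identity transform with two renaming templates. The stylesheet below is a sketch of ours, not part of any TEI deliverable:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity: copy everything not matched by a more specific template -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- map the customized names back onto the vanilla ones -->
  <xsl:template match="firstName">
    <forename><xsl:apply-templates select="@*|node()"/></forename>
  </xsl:template>
  <xsl:template match="lastName">
    <surname><xsl:apply-templates select="@*|node()"/></surname>
  </xsl:template>
</xsl:stylesheet>
Swapping the match patterns and the result elements gives the transformation in the other direction.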

extensions

While there (could) exist document instances that are valid against the customized schema but not against the vanilla schema, any instance that is valid against the vanilla schema is definitionally valid against the customized schema. E.g.:

vanilla <!ENTITY % a.xPointer '
                   %a.pointer;
                   doc ENTITY #IMPLIED
                   from %extPtr; "ROOT"
                   to %extPtr; "DITTO"'>
custom  <!ENTITY % a.xPointer '
                   %a.pointer;
                   doc ENTITY #IMPLIED
                   from %extPtr; "ROOT"
                   to %extPtr; "DITTO"
                   url CDATA #IMPLIED '>
Here the url= attribute has been added to the xPointer attribute class. Instances that specify a url= attribute on any element which is a member of this class (<xptr> and <xref>) will be valid against the custom schema, but not the vanilla.

DocBook also calls this an “extension”. See figure 3.

Customizations that include extensions pose more of a potential obstacle to automated interchange. However, the nature of the extension determines the extent of the obstacle. Some will probably pose no difficulty: for instance, extensions which simply allow required elements to be optional, or extensions which allow elements to appear in a wider range of places. Others are more challenging: for instance extensions which create new optional elements. These are much harder to handle, because it may be quite unclear what the element is for, or how it should be processed. Accompanying human documentation can make a big difference to the success of interchange in such cases.
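
For instance, inverting the <sp> example given above under restrictions yields a harmless extension of this first kind; the declarations are purely illustrative, not the TEI's actual ones:

vanilla <!ELEMENT sp (speaker, p+)>
custom  <!ELEMENT sp (speaker?, p+)>
Every instance that is valid against the vanilla declaration is also valid against the custom one; a recipient need only be prepared for the occasional absent <speaker>.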

However, it is clear that this categorization is insufficient. While each individual change to a schema may fall into one of the three categories above, a single customization can obviously combine changes of more than one type. Furthermore, many customizations (even small, simple ones) will be so drastic in their effect as to prevent any overlap in the sets of documents that will validate against the vanilla and customized schemas.
transmogrifications

A combination of restrictions and extensions. E.g.:

vanilla <!ENTITY % a.xPointer '
                   %a.pointer;
                   doc ENTITY #IMPLIED
                   from %extPtr; "ROOT"
                   to %extPtr; "DITTO"'>
custom  <!ENTITY % a.xPointer '
                   %a.pointer;
                   url CDATA #IMPLIED '>
As above, a url= attribute has been added, but the doc=, from=, and to= attributes have been removed. So, as above, an instance that specifies a url= will be valid with respect to the customized schema but not the vanilla one; however, an instance that specifies doc= and from= will be valid against the vanilla but not the custom schema. See figure 4.

Transmogrifications pose the same kinds of challenges as extensions.

disjoints

A customization, other than an isomorphism, that results in a DTD against which no document that is valid against the vanilla DTD is valid. E.g.:

vanilla <!ELEMENT list ( item+ ) >
custom  <!ELEMENT list ( head, item+ ) >
The sets of documents valid against the vanilla and against the customized DTD are non-intersecting.

Disjoints potentially create the most severe problems, partly because all documents from the customized set are definitionally invalid, but here again the real practical severity depends on the nature of the change and the elements affected. See figure 5.

Even this, however, may be an insufficient taxonomy. It may be useful, e.g., to further categorize “disjoint” customizations.

Obviously, a customization that consists only of restrictions presents no obstacle to automated interchange. Conversely, customizations that include extensions (and thus transmogrifications) do pose an obstacle. Isomorphisms also present an obstacle, but generally one that is easily overcome.

Certain sets of customizations (e.g., internationalizations) will consist entirely of isomorphisms. Many projects will find customizations that consist entirely of restrictions very useful — e.g., to restrict to a closed list of discrete values the many attributes that the TEI does not constrain (say, “foreword”, “introduction”, “chapter”, “section”, “subSection”, and “index” for type= of <div>), or to enforce a project-wide decision to eschew numbered <div>s in favor of un-numbered <div>s.
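
A sketch of such a closed-list restriction for type= of <div> follows; the vanilla declaration is simplified for illustration (the actual TEI declarations are assembled from parameter entities and carry many more attributes):

vanilla <!ATTLIST div type CDATA #IMPLIED >
custom  <!ATTLIST div type ( foreword | introduction | chapter
                           | section | subSection | index ) #IMPLIED >
Any document that confines itself to the six listed values remains valid against the vanilla schema, so the restriction costs nothing in interchange while buying project-internal consistency.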

However, it is reasonable to believe that at least a significant minority, if not a vast majority, of all customization modules result in a transmogrified (as opposed to restricted or extended) schema. (Note that here we are not counting those “customization” modules that are used only to select TEI tag sets, and not to override any declarations within the selected sets.) Indeed, TEI Lite, far and away the most popular customization of TEI, is a transmogrification schema. Many elements and some attributes have been deleted (restrictions), but several elements have been added (extensions).1

In almost all of the types of customization described above, it has been clear that however much you can accomplish with automated interchange, you can accomplish more when human intelligence can also be brought to bear on the problem. In any case where a new element is being created (whether optional or required), in any case where elements are being renamed, and in most cases where elements are being relocated (either required in new locations, allowed in new locations, or forbidden from old locations), providing an explanation of the motive for the change and the semantics of any new elements, attributes, or element relationships can make a huge difference to the effectiveness of interchange. Without human intervention, isomorphisms are potentially opaque to automated processing, whereas with human intervention, they are potentially trivial.

Technical Challenges

Another paper in this conference gives a detailed explanation of the TEI's literate encoding system, affectionately known as ODD [One Document Does it all], and we will not attempt to duplicate that material here.[ODD] (Readers interested in literate encoding would do well to start with [LitEnc].) However, it may be useful to briefly recapitulate the basic facts. In the TEI ODD system, all aspects of the TEI Guidelines (both the specification of formal constraints, and the accompanying prose documentation) are expressed in a single XML document which is itself encoded in TEI, using the module for tagset documentation. From this ODD source various forms of output may be generated, including formal reference documentation for the various DTD or schema components, descriptive documentation such as the chapters of the TEI Guidelines, and DTD or schema fragments. In the conventional terminology of literate programming, “tangle” processes produce source code files, schema fragments, and the like, while “weave” processes produce prose documentation.[LitProg]
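
For readers who have not seen an ODD source, the following deliberately simplified sketch suggests what a single element specification might look like. It uses only the <elementSpec>, <desc>, <attList>, and <attDef> elements that also appear in the customization example later in this paper; the descriptive prose is paraphrased, and the formal content model (whose notation is described in [ODD]) is elided:

<elementSpec ident="biblScope">
  <desc>defines the scope of a bibliographic reference, for example
    as a list of page numbers or the name of a subdivision of a
    larger work</desc>
  <!-- formal content model elided -->
  <attList>
    <attDef ident="type" usage="opt">
      <desc>characterizes the kind of scope information given,
        e.g. page numbers</desc>
    </attDef>
  </attList>
</elementSpec>
From such a specification the tangle process emits the corresponding declarations in the target schema language, and the weave process emits the corresponding entry in the reference documentation.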

The basic problem for the TEI is that it is impossible to predict all of the needs of a diverse humanities encoding community. Ideally, users should be able to produce documents that use the “standard” markup scheme where it is applicable, and use home-grown markup where it is needed. Users doing this should, of course, be able to generate a schema with which to validate such documents. We identify six different ways of creating such a schema:
Hack the ODD source

User takes the ODD source of the Guidelines, changes it directly, runs the tangle (and perhaps the weave) process over the changed source, and uses the resulting schema (and perhaps documentation as well).

advantages

Direct approach; user ends up with both a custom schema and the customized reference (and perhaps prose) documentation for it; reasonably easy on user for initial changes; user has complete control; permits customized version to be published in same manner as original.

disadvantages

Unsystematic approach; potentially very difficult for user to incorporate her changes into new releases of the Guidelines; no easy mechanism for software or humans to ascertain differences between vanilla and customized versions of Guidelines — an xmlDiff program will work, but is likely to prove cumbersome; TEI has no control; requires Guidelines be published as open source.

Hack the tangled schema

User takes a copy of the schema produced from the original ODD source (or, perhaps even worse, a “flattened” version of such a schema) and makes changes directly to it.

advantages

For many users initial change is even easier than changing the ODD source, as they already know the schema language.

disadvantages

Can be extremely difficult (especially in the “flattened” case) to propagate changes to new releases of the Guidelines, or even to make more extensive changes to the customized schema later on; provides no documentation of changes; neither humans nor software have any easy mechanism for rapidly determining what the differences between the customized and vanilla schemata are — only mechanism at all would be a fancy use of a ‘diff’ program.

Personal tangling

User writes her own tangle processor that reads the same ODD source of the Guidelines, but produces a (slightly?) different schema.

advantages

Easy to propagate to new releases of the Guidelines; cleanly separates customizations.

disadvantages

Puts the logic of changes in the wrong place (a processor); likely to be quite difficult to do.

Automated hacking

User writes a transform that reads in the tangled schema as published, and writes out a modified version.

advantages

Cleanly separates customizations; easy to propagate to new releases of the Guidelines.

disadvantages

Non-declarative approach; does not provide documentation; potentially hard to do.

Apply a customization to schema

User writes a schema fragment (DTD for TEI P4, RELAX NG for TEI P5) following a prescribed method, which is read by the main DTD or schema, and overrides some of it. (A sketch of the P4 flavor of this method follows this list.)

advantages

Cleanly separates customizations; easy to propagate to new releases of the Guidelines; facilitates interchange of just the customization; generally easy for others (who know the schema language) to read and understand; provides humans an easy way to quickly determine the differences between customized and vanilla schemata; provides a hook so that computer software could potentially determine this difference.

disadvantages

Does not provide for any documentation; requires user understand schema language and TEI class system.

new! Apply a customization to ODD source

User writes an ODD fragment following a prescribed method, which is read by both the tangle and weave processors, and overrides some of the original ODD source.

advantages

All of the above, plus: user does not have to learn as much of the underlying schema language; since ODD is TEI, user can use the same tools (editors, validators, etc.) she uses for her instances; provides woven documentation of end-result customized schema.

disadvantages

User needs to be familiar with TEI class system and ODD tagset.
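
To make the fifth method above concrete, here is roughly what the prescribed TEI P4 DTD mechanism involves; the local file names are our own, and the declarations are simplified for illustration:

<!-- in the document's DTD subset or a driver file:
     point the TEI extension hooks at local files -->
<!ENTITY % TEI.extensions.ent SYSTEM "project.ent" >
<!ENTITY % TEI.extensions.dtd SYSTEM "project.dtd" >

<!-- project.ent: switch off the standard declarations for <sp> -->
<!ENTITY % sp 'IGNORE' >

<!-- project.dtd: supply a stricter replacement (the corresponding
     <!ATTLIST sp ...> would likewise need to be re-supplied) -->
<!ELEMENT sp (speaker, p+) >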

As of March 2003, the TEI already envisions applying user-written ODD fragments as at least an available, if not the preferred, method of customizing the formal declarations contained in the Guidelines, and thus (after a bit of tangling) the schema and simultaneously (after a bit of weaving) the reference documentation for the schema.

However, as we discuss in more detail below, the idea of applying user-written ODD fragments as customizations to the informal prose of the Guidelines as well should be seriously considered.

It is this last idea — applying user-written customizations to the entire Guidelines, not just to the formal declarations of elements and attributes that are tangled into a schema — that triggered our current investigation. We are interested in the consequences of the shift to using the ODD language as the customization method for an entire standard, both practically and theoretically.

Extending the current ODD customization system to include customizations to the prose of the Guidelines should not prove difficult. The basic idea is pretty simple: allow almost any element to bear two new attributes (neither or both should be specified), mode= and mTarget=. The value of the former must be one of “add”, “delete”, “replace” (essentially a delete followed by an add), and “change”. The value of the latter is an XPointer, the referent of which is to be deleted, changed, or replaced; or is the spot at which the new element should be inserted.
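
In DTD terms the proposal amounts to something like the following sketch; the parameter entity name att.customization is ours, and these declarations are illustrative rather than part of the existing ODD tagset:

<!ENTITY % att.customization '
          mode    ( add | delete | replace | change )  #IMPLIED
          mTarget CDATA                                #IMPLIED '>
<!-- then, for each element that may carry a customization: -->
<!ATTLIST p    %att.customization; >
<!ATTLIST div  %att.customization; >
<!-- the co-occurrence constraint (neither or both attributes)
     cannot be expressed in a DTD; the customizer would enforce it -->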

The customization process (“customizer”) is now simply a matter of reading both the source ODDs and the customization files, and performing any of the additions, deletions, etc. that have been requested. So for example, if a particular project wished to require that type= of <biblScope> be specified, something like the following would appear in their customization module:

<elementSpec mode="change" ident="biblScope">
  <attList>
    <attDef ident="type" usage="req"/>
  </attList>
</elementSpec>
This would have the effect that after a tangle operation, the generated schema would require that conforming instances have the type= attribute specified on <biblScope>, and that after a weave operation the generated reference documentation would indicate that this attribute is required. However, the prose of the Guidelines generated by the weave operation would still have the text “The <att>type</att> attribute on <gi>biblScope</gi> is optional: …” in the 8th paragraph of section 6.10.2.3 “Imprint, Pagination, and Other Details”.[ipod] Adding something like the following to the customization module would alleviate this inconsistency.
<p mode="delete" mTarget="./p2cobi.odd#xpointer(//*[@id='COBICOI']/p[8])"/>

Giving a user the ability to effectively change parts of the source presents some challenges. We have listed a few here; we are quite confident others will crop up if an implementation is attempted.

  • What happens when there is a cross-reference to something that has been effectively deleted? Should the customizer throw a fatal error, merely issue a warning, or not even check for such problems? If one of the latter two, what should weave and tangle processors do when they try to resolve such a reference?
  • Currently (in TEI) the semantics of markup is described in prose. While it may be the case that some semantics could be expressed in a programmatic or machine-readable manner, such a discussion is not only outside the scope of this paper, but is currently outside the scope of the TEI. (Interested readers may wish to see [semantics].) But if a user can change the prose, she can change the semantics of some markup. The consequences of such changes are hard to predict.
  • The prose of the Guidelines often refers directly to markup constructs defined by the Guidelines. E.g., the aforementioned “The <att>type</att> attribute on <gi>biblScope</gi> is optional: …”. To accommodate user overrides of the prose, such references should be indirect, in case the markup construct being referred to now has a different name or no longer exists. Thus, something like “The <attRef key="type"/> attribute on <elementRef key="biblScope"/> is optional: …”, in which case the weave processor would find the element declaration with ident="biblScope" and, e.g., substitute its name (which may or may not be “biblScope”) inside angle brackets in a fixed font.
    Note that while this level of indirection makes reading and writing the ODD sources more difficult, it is quite reasonable to use a weave process that creates, e.g., output similar to TEI Lite or DocBook from the ODD sources for easy searching, etc.
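
The weave processor's resolution step for such indirect references might look roughly like the following XSLT fragment. The <elementRef> element is the hypothetical one proposed above, and recording a renamed element's new name in an <altIdent> child is likewise an assumption made for the sake of the sketch:

<!-- resolve <elementRef key="..."/> to the element's current name -->
<xsl:template match="elementRef">
  <xsl:variable name="spec"
                select="//elementSpec[@ident = current()/@key]"/>
  <gi>
    <xsl:choose>
      <!-- a customization may have supplied an alternative name -->
      <xsl:when test="$spec/altIdent">
        <xsl:value-of select="$spec/altIdent"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$spec/@ident"/>
      </xsl:otherwise>
    </xsl:choose>
  </gi>
</xsl:template>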

Social and political issues and consequences

Customization of a standard should involve not simply making a change, but also expressing that change as a relationship of difference between the original and the new version. We have identified the formal mechanism for expressing that relationship: the ability to apply user-written customizations to the entire Guidelines, not just to the formal declarations of elements and attributes that are tangled into a schema. We are interested in the consequences of the shift to using the ODD language as the customization method for an entire standard, both practically and theoretically. Given the technical means to modify the prose specification of an encoding system, as well as its formal DTDs or schemas, what are the social ramifications of such changes? How do we characterize and measure change? What relationships might hold between the customized specification and the original version, and how would this relationship affect our understanding of the TEI universe in particular?

Customization universes

We see three different possible relationships between the customized specification and the vanilla original, three different universes with distinctive belief systems about the relative status and importance of these two documents.

The “wiki” model

a universe in which the original specification has no special authority, and its users can modify it ad libitum to match their evolving needs. Although this sounds like a very chaotic place to be, in fact it might turn out (for some user communities) to be a workable way of managing the development of a shared specification (though perhaps not the ongoing maintenance of an existing one). Customization in this universe could be a way of gradually changing the specification to match the actual usage patterns of the user community, and although this evolution would necessarily be a bit wobbly, it might very well involve a steady vector of progress.

The “web” model

in which the vanilla version has special authoritative status and is preserved unchanged, but in which changed versions and commentary may be created which document alternative practices. These customizations have no formal relationship to the original version; there may be a loose connection (for instance, human-readable documents which explain the relationship between the two) but it is not one that can be automatically traversed in any useful way.

The “blog” model

in which the original vanilla version has a special status, and is preserved unchanged, but with accretions of commentary bound together into some sort of organized universe. Users would express their customizations as formal divergences from the standard, using a method which would allow a complete custom specification to be generated from the vanilla version plus any one (potentially more) of the customizations. In this model, the original standard, the customization, and the relationship between the two (the path that leads from one to the other) are all considered important.

Although the TEI has always operated under the blog model for its DTD and schema customizations, for the prose Guidelines it has thus far operated under the web model: there is a central standard (the big green books) whose authority is recognized, and local customizations whose relationship to the vanilla version is expressed informally and usually only in human-readable terms. However, the ODD customization method we have described above—which provides for customization of the specification through a formal expression of difference—makes the blog model possible for the prose documentation in the future.

Types of change

The textual editing tradition provides us with one useful dimension in discussing changes, namely the distinction between “substantives” and “accidentals” (first articulated in [RoCT]). In editorial terms, an “accidental” is a textual fact such as spelling, punctuation, layout, word division, and the like, which may affect how we perceive the text but does not affect the “author's meaning or the essence of his expression”. Along the same lines, a “substantive” is anything that affects the sense of the text: changes to words or word order. In modern terms—and particularly in light of arguments made by Johanna Drucker and Jerome McGann in various places (see for instance [reText] or [imText])—this distinction becomes problematic when used to deflect attention from the physical presentation of the text. But without losing sight of the important kinds of meaning which are conveyed through what Greg terms “accidentals”, it may still be useful to acknowledge that some kinds of textual change make effectively no practical difference to the meaning of a text encoding standard. So we can say in a preliminary way that changes to accidentals of the standard (either the ODDs, the derived DTDs, or the documentary prose) would be considered insignificant from the standpoint of TEI customization. We can also observe that the domain of the “accidental” is somewhat larger in this context than in textual editing; in addition to the items Greg lists, we might add that in many cases even the wording of the standard can be changed without affecting its sense. Similarly, some aspects of the DTD or ODDs might be subject to “accidental” changes such as reordering of elements within an OR group.

This dualism between accidentals and substantives takes us only so far, and once we have identified the possibility that a word change might or might not change the sense of the DTD or documentation, we need a more complex model for understanding the various axes along which change can take place. Here the conceptual framework provided in the FRBR [Functional Requirements for Bibliographic Records] may be helpful.[FRBR] FRBR identifies four basic “entities” through which we can understand how ideas are expressed through documents: the work, the expression, the manifestation, and the item. Of these, the “work” would represent the ideas and intellectual structures that characterize the TEI Guidelines in the largest sense, and the “expression” would represent the particular words and DTDs or schemas (and potentially other expressive methods) through which the Guidelines are instantiated and communicated. This distinction is particularly apt in the case of the TEI, because the TEI has taken care to indicate from the start that its essence—the Guidelines—is a specification of a method, not of a particular metalanguage schema; that is, they are not inherently based on any particular encoding language such as SGML, and (previously in theory, in the future in fact) may be validated using any one of a number of schema languages.

Using this distinction, we may then say that for our purposes there are four categories of change we can envision: changes to the accidentals (e.g. layout, punctuation) which affect neither the work nor its expression; changes to the expression (e.g. words or word order) which do not affect the meaning of the work; changes to the expression which do affect the meaning of the work; and changes to the work itself (i.e. changes to the fundamental ideas or encoding structures) which necessarily affect its meaning and presumably result in a change to the expression.

Table 1: In tabular format, this might be summarized as follows.

  type of change              Δ expression?        Δ work?
  “accidentals”               ⊖                    ⊖
  insignificant Δ content     ⊕                    ⊖
  significant Δ content       ⊕                    ⊕
  fundamental Δ               ⊕ (necessarily)      ⊕

The TEI permits and even encourages several of these: it allows the republication of the Guidelines in other formats (e.g. on the web, in print) which necessarily involve changes to the accidentals of presentation, and it also encourages the production of additional documentation which seeks to explain the meaning of the TEI “work” in different terms, perhaps for a different audience or in a different language, or simply with greater lucidity, but without altering the fundamentals of the TEI universe. Most significantly, of course, it permits and encourages customizations to the TEI work itself, through the TEI extension mechanism, though it has historically channeled those changes through the constraints of TEI-conformance.

These observations suggest that in the idea of the “work” there is some further quality we can identify—at least in the TEI universe—that binds together the various inflections of the TEI Guidelines, ODDs, DTDs, and customizations as recognizably related parts of a single whole. In a customization universe governed by the “blog” model described above, a TEI-conformant encoding system might simply be one which expresses an explicit relationship to the authoritative TEI (both in its formal constraints and in its prose documentation): TEI conformance in this sense would acknowledge the desire to operate within the TEI universe and to allow for successful (or at least improved) interchange with other TEI projects. Even if every tag has a different name, or if every content model is different, as long as the differences are expressed so as to allow their relationship to the original TEI to be traced, then some meaningful participation in the TEI ecology is possible. Even though it may be very difficult to identify something that is essentially TEI-ish about the TEI universe, the idea of conformance could function as a way of achieving some notion of commonality and centrality in the absence of such an essence, by registering a motivation to conform.

Conclusions

We have been describing a technical change that could result in a profound change in the TEI universe: one which would allow customization of the prose of the Guidelines as well as of the DTDs or schemas, while also placing such customizations in a clearer formal relationship with the original Guidelines. What would this shift do for—or to—the TEI universe as we know it? The TEI is not, strictly speaking, a standard, but in many ways it seeks to function like one. Standards have a range of potential functions not all of which apply in all cases: to discourage gratuitous divergence of practice and associated waste of effort, to simplify the development of tools which will operate on the product of the standard—but also, potentially, to establish the social identity of the body which has the authority to define practice, and to permit more consistent and efficient scrutiny and regulation of behavior.

In the case of the TEI, customizability and interchange have to be understood as complementary goals to be held in balance, rather than mutually exclusive categories. The TEI does operate as a standard in many useful ways, and its very customizability turns out—bizarrely—to be a powerful force in support of that function, because it encourages degrees of conformity (and hence a tendency to seek some degree of conformity) rather than insisting on an all-or-nothing choice. The result is what we might call a “voluntary standard” which establishes a community of practice that somewhat resembles a galaxy: the further away from the center you are, the greater your divergence from the standard and the less likely it is that there is anyone else out there with you, but there is no simple line beyond which lies the “not-TEI”. At the center, there is a core of established practice that is well-documented and exhibits a high degree of voluntary uniformity, enforced by the desirability of the results (for instance, the ability to use common tools). Further out, there is a gradually thinning penumbra of shared practice, some of which differs for accidental reasons (for instance, no one bothered to coordinate efforts) and some of which differs for good reasons (such as divergent disciplinary or project methodology).

Figure 1

The graphic is of a galaxy. The labels “‘vanilla’ TEI”, “TEI in Libraries”, “Model Editions Partnership”, “Corpus linguistics”, “TEI manuscript encoding”, “EpiDoc”, “Women Writers Project”, and “Comic Book Markup Language” point to various stars within the galaxy, spiralling outward.

The TEI Galaxy

Toward the edges of this penumbra, we expect to see an increase in the degree of customization and a proportional decrease in the possibility of easy interchange with projects elsewhere on the perimeter. However, we also see the formation of clusters of well-defined and self-aware subcommunities, each representing some specialized application of the TEI (e.g. TEI in libraries, TEI for scholarly editions, TEI for documentary editions, TEI for manuscripts, etc.). Within these groups we see fairly high degrees of interchange, as well as high degrees of optimization for specific materials and methodologies. Between groupings, we see diminished—but still important and useful—degrees of interchange. These groupings therefore seem to operate as local standards, allowing tight bonds of affiliation between members of the group, while retaining looser bonds between the group and the larger community.

Even for groupings and projects at the very edges of the TEI galaxy, we can increase the possibility of interchange through careful use of customization mechanisms that maximize both the amount of information available about customizations and the degree of formalism with which they are expressed. Achieving this through the ODD customization mechanism described above seems highly desirable, since it allows for highly functional customization, both of DTDs and schemas, and of prose specifications. For the TEI, given its multidisciplinary community and its goal of furthering humanities research (rather than achieving optimally efficient interchange), this approach improves both the technical and political dimensions of interchange.

Customization Diagrams

Figure 2

The diagram is of the world of XML documents. Each circle represents the set of documents that are valid against a particular DTD. In this case, there are no documents which are only valid against the custom DTD; all custom documents (represented by the inner circle) are also valid against the vanilla DTD.

Venn diagram of restrictive customizations

Figure 3

The diagram is of the world of XML documents. Each circle represents the set of documents that are valid against a particular DTD. In this case, there are no documents which are valid only against the vanilla DTD; all vanilla documents (represented by the inner circle) are also valid against the custom DTD.

Venn diagram of extension customizations

Figure 4

The diagram is of the world of XML documents. Each circle represents the set of documents that are valid against a particular DTD.

Venn diagram of a simultaneous restrictive and extensive customization

Figure 5

The diagram is of the world of XML documents. Each circle represents the set of documents that are valid against a particular DTD.

Venn diagram of a disjoint customization

Notes

1.

They are <ident>, <code>, <eg>, and <kw>, for those who are interested.


Bibliography

[db:cust] Walsh, Norman, and Leonard Muellner. DocBook: The Definitive Guide, ISBN: 1-56592-580-7. http://www.oasis-open.org/docbook/documentation/reference/html/ch05.html.

[FRBR] IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records: Final Report. UBCIM Publications-New Series. Vol. 19, München: K. G. Saur, 1998. http://www.ifla.org/VIII/s13/frbr/frbr1.htm.

[imText] Drucker, Johanna, and Jerome McGann. “Images as the Text: Pictographs and Pictographic Logic”. http://jefferson.village.virginia.edu/~jjm2f/pictograph.html.

[ipod] Sperberg-McQueen, C. M. and Lou Burnard, eds. Guidelines for Electronic Text Encoding and Interchange, March 2002, p. 181. http://www.tei-c.org/P4X/CO.html#id954393.

[LitEnc] Cover, Robin. “SGML/XML and Literate Programming”. http://www.oasis-open.org/cover/xmlLitProg.html

[LitProg] Knuth, Donald. Literate Programming, ISBN 0-9370-7380-6. See also Sewell, Wayne. Weaving a program: literate programming in WEB, ISBN 0-442-31946-0.

[ODD] Burnard, Lou, and Sebastian Rahtz. “Relax NG with Son of ODD”, Extreme Markup Languages 2004 proceedings. ../Burnard01/EML2004Burnard01.html

[reText] McGann, Jerome. “Rethinking Textuality”, http://jefferson.village.virginia.edu/~jjm2f/jj2000aweb.html.

[RoCT] Greg, W. W. “The Rationale of Copy-Text”, Studies in Bibliography, vol. 3 (1950–51), pp. 19–36.

[semantics] Sperberg-McQueen, C. M., Huitfeldt, C., and Renear, A. “Meaning and interpretation of markup”. Markup Languages: Theory and Practice 2, 3 (2000), 215–234.

[TEI] Sperberg-McQueen, C. M. and Lou Burnard, eds. Guidelines for Electronic Text Encoding and Interchange, March 2002. http://www.tei-c.org/P4X/.

[tei-c] Text Encoding Initiative Consortium http://www.tei-c.org/.

[TEI-L:Piez] Piez, Wendell. “Re: modifying TEI DTD” on TEI-L (posting of 2003-07-14 13:13-0400 http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind0307&L=tei-l&P=R2273).


