Principles, Patterns and Procedures of XML Schema Design — Reporting from the XBlog Project

Anne Brüggemann-Klein
Thomas Schöpf
Karlheinz Toni

Abstract

Our weblog system, XBlog, is being built using off-the-shelf XML-based publishing technology. We propose principles, patterns, and procedures that we discovered when translating from the conceptual model of XBlog articles into a schema language for XML. The project is part of a larger endeavor in which we explore to what extent novel publishing applications such as weblog systems can be composed from appropriately configured XML software with a minimum of programming. Our goal is to discover principles, patterns, and procedures that reduce complexity and ensure sustainability when developing and maintaining Web applications. Thus, the project serves as a case study for research in document engineering. In addition, XBlog is being used to teach XML technology to undergraduate and graduate students.

Keywords: Modeling; Programming

Anne Brüggemann-Klein [Technische Universität München]
Thomas Schöpf [Technische Universität München]
Karlheinz Toni [Technische Universität München]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Anne Brüggemann-Klein, Thomas Schöpf, and Karlheinz Toni. Reproduced with permission.

Introduction

Our weblog system, XBlog, is being built using off-the-shelf XML-based publishing technology. In this paper, we propose principles, patterns and procedures that we discovered when translating from the conceptual model of XBlog articles into a schema language for XML.

The XBlog project is part of a larger endeavor in which we explore to what extent novel publishing applications such as weblog systems can be composed from appropriately configured XML software with a minimum of programming. Our goal is to discover principles, patterns and procedures that reduce complexity and ensure sustainability when developing and maintaining Web applications.

The XBlog system is no end in itself. Rather, the XBlog project serves as a case study for research in document engineering. In addition, XBlog is being used to teach XML technology to undergraduate and graduate students.

This paper is organized in four further sections. First, we briefly introduce the domain of weblogs. Next, we model weblog documents conceptually, using UML class diagrams. Then, we translate the conceptual document model into XML Schema. Finally, we summarize the principles, patterns and procedures that we have applied during modeling and translation.

A number of diagrams can be found in the appendix, which is available from http://www11.in.tum.de/~brueggem/papers/SchemaPrinciplesEML/.

Weblogs

Weblog systems are one of the newly emerging forms of social software [WikipediaWeblogs] [PicotWeblogs]. In a weblog, its author publishes a list of topical contributions on the Web in reverse chronological order. Some sources [BargerWeblogHistory] [WeinerWeblogHistory] trace weblogs back to the practice of logging browsing sessions on the Web for fellows of kindred spirit, comparing the first weblog authors to the trail blazers that Bush [AsWeMayThink] envisioned for the Memex. Today, weblogs typically exhibit short contributions of a personal nature; it is expected that new articles are appended frequently.

In its more developed form, a weblog goes far beyond publishing a personal journal on the Web: Interested parties can not only read its articles globally but also comment on them publicly and point to them with back-traceable links from weblogs of their own. Thus, a weblog, being a novel kind of social software, provides an Internet service that enables communication and collaboration within a social network of people.

Built on the infrastructure of the Internet and the Web, which enables location-independent reading and writing, the concept of a non-interactive weblog has the charm of simplicity. Ubiquitous weblog technology makes even communication and collaboration within a weblog's social network as easy as writing e-mail. Not being hampered by hurdles of technology, weblog users are free to focus on the dissemination and discussion of content. Most weblogs are capable of generating RSS feeds, seamlessly tying into the publish-and-subscribe world of alerting or notification services.

A weblog is author-centered in that a single person or a small group of authorized persons contribute articles to it. This characteristic distinguishes weblogs from another popular type of social software, the document-centered wikis that have an unbounded number of persons jointly authoring a single document.

The reading of "weblog" as "we blog" has given rise to the terms blog for weblog and blogger for weblog author.

Since the early 1990s, the lightweight nature of HTML has given new meaning to word processing and hypertext systems. Similarly, weblog technology is set to give new meaning to content management, effectively and flexibly supporting knowledge work at the individual, the social-group, the enterprise or the global levels. We have chosen weblogs for one of our case studies since we believe in the high potential of this kind of publishing application for research collaboration and knowledge dissemination.

The conceptual document model

We base XBlog's document model on its domain model (in the appendix) that, generally speaking, identifies the key concepts of the application domain and their relationships in a conceptual model, expressed as a class diagram that uses classes, class attributes and relations.

Our selection of key concepts has been informed by studies of weblog systems [WestnerWeblogServiceProviding] [PicotWeblogs] [RenDAWeblogs]. The conceptual model aims at being extensible rather than at being complete, with attributes such as styles and policies of Blog System and body of Article being particular points of extension. We treat the conceptual model as a skeleton that will be fleshed out iteratively as development progresses.

Two concepts of the domain model represent documents, namely Article and Comment, with Comment objects being parts of Article objects. In this section, we extend the domain model regarding these documents. This extension is still on the conceptual level, independent of the specific language in which we represent documents, namely XML. In the next section, we decide on a schema language for our documents and convert our conceptual model for documents into an implementation model that can then be directly translated into a schema.

When extending the conceptual model regarding documents, we follow Maler and El Andaloussi [MalerEtAlDTDModeling] who propose to classify document constituents into one of four categories, which we call metadata, organizational items, information items and information snippets.

Metadata are commonly blocked together and associated with the document as a whole or with its major divisions, but may also be associated with more fine-grained document constituents. As the name "metadata" implies, they hold information about a document constituent rather than being part of it. Typical high-level metadata are author, publisher, publication date and so on as standardized by the Dublin Core initiative and others. Examples of low-level metadata are the height, depth or format of a picture.

Organizational items structure a document into high-level units. They typically form a hierarchy, of which each level is organized as a sequence of specific and often repeatable items. Typical organizational items are books with frontmatter, a number of chapters and backmatter, of which each chapter is organized into a title, a number of introductory paragraphs and a number of sections.

Information items are smaller units of discourse that can be semantically understood out of context, such as paragraphs, lists, or quotations. Characteristically, an organizational item of the lowest level will be allowed to contain an arbitrary number of information items whose type may be freely chosen from a repertoire.

Information items may be shallowly organized into sub-items, as a list is organized into list items, but will eventually contain just text, possibly mixed with the smallest and lowest type of document constituents, namely information snippets.

Finally, information snippets are small units of information that normally cannot be semantically interpreted out of context. Typical information snippets are emphasized phrases, cross references and technical terms. Characteristically, information snippets may contain text and possibly further information snippets that are freely chosen from some repertoire.

Thinking in terms of this classification helps us to recognize and to identify the conceptual constituents of our documents:

Starting from the domain model of XBlog (in the appendix), all the attributes of the main—organizational—class Article with the exception of body and all the attributes of the—organizational—classes Trackback and Comment with the exception of content are classified as metadata. The relation of author within Article and Comment is also considered metadata. The types of all these metadata are left unspecified at this point.

The article body is organized into an optional introductory "teaser" and a number of further information items that we will explicate shortly. Following Maler and El Andaloussi [MalerEtAlDTDModeling] once more, we group the repetitive trackback and comment items into container constituents called Trackbacks and Comments; their role is also organizational.

So far we have captured metadata and organizational constituents of XBlog articles (the outer XBlog document model in the appendix) in a hierarchical fashion, terminating with the information item level constituents Trackback, Teaser and Information Item. The latter two are of a textual nature and are modeled next.

The main information item is modeled as an abstract class, Information Item, so that it can be extended later on. Information items are divided into special information items such as polls that are specific to article bodies, and general information items that may also be used within comments. Both sub-divisions are represented as abstract classes, to keep them extensible. The only general information item that is currently provided is Paragraph. Future extensions might be lists or quotations. The only special information item that is currently provided (but not further detailed) is Poll. Future extensions might be tables or graphics.

Paragraphs typically consist of text, interspersed with information snippets, some of which are primitive and some of which may be recursively decomposed.

Software engineering provides a data modeling pattern [GammaEtAlDesignPatterns] that fits exactly this situation, namely the pattern Composite. The scenario is a number of primitive items and of composite items that are recursively decomposed into further items, either of primitive or composite type. The Composite pattern supports the requirement that primitive and composite items must be treated uniformly by parts of the system. It does so by introducing a common abstract super-class for the primitive and composite items and by modeling a composite item as a composition of objects of that super-class. The prototypical application of Composite is a graphics program that manages a number of primitive geometric objects and groups of objects homogeneously by introducing the abstract super-class Shape, as illustrated in the appendix (the Composite pattern and an instance). The Composite pattern supports extension: both primitive and composite types can be added by sub-classing.

We employ the Composite pattern for information snippets in our model. The abstract super-class is Snippet; the composite class, Composite Snippet, is also made abstract, allowing for an open, extensible repository of concrete sub-classes such as Em for emphasized phrases. Note that we model text as a primitive information snippet with a characters attribute within the Composite pattern.

Consequently, we can model Paragraph to consist of a number of homogeneous Snippet constituents, conveniently hiding the distinction between primitive and composite snippets at this level and providing a single point of extension for further information snippet constituents.
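The Composite structure for snippets can be sketched in code. The following is a minimal illustration in Python, not part of the XBlog implementation: the class names follow the conceptual model, whereas the field name parts and the operation plain_text are hypothetical and serve only to show that primitive and composite snippets are treated uniformly.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


class Snippet(ABC):
    """Abstract super-class shared by primitive and composite snippets."""

    @abstractmethod
    def plain_text(self) -> str:
        """Hypothetical uniform operation: the characters without markup."""


@dataclass
class Text(Snippet):
    """Primitive snippet: plain text held in a characters attribute."""
    characters: str

    def plain_text(self) -> str:
        return self.characters


@dataclass
class CompositeSnippet(Snippet):
    """Abstract composite: a composition of further Snippet objects."""
    parts: List[Snippet] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Python blocks instantiation only while abstract methods remain,
        # so the abstractness of this class is enforced by hand.
        if type(self) is CompositeSnippet:
            raise TypeError("CompositeSnippet is abstract")

    def plain_text(self) -> str:
        return "".join(part.plain_text() for part in self.parts)


@dataclass
class Em(CompositeSnippet):
    """Concrete composite snippet for emphasized phrases."""


@dataclass
class Paragraph:
    """General information item: a homogeneous list of snippets."""
    snippets: List[Snippet] = field(default_factory=list)

    def plain_text(self) -> str:
        return "".join(snippet.plain_text() for snippet in self.snippets)


para = Paragraph([Text("Weblogs are "), Em([Text("author-centered")]), Text(".")])
print(para.plain_text())  # Weblogs are author-centered.
```

A further kind of snippet is added by sub-classing Snippet or CompositeSnippet; Paragraph does not have to change, which is the single point of extension mentioned above.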

We express our model as UML class diagrams, employing attributed (concrete or abstract) classes for document constituents and relating them to each other via part-of and inheritance relations. The classifications of document constituents are made explicit in the UML class diagrams via stereotypes.

The schema

From the XBlog project point of view, this section's task is to translate the conceptual model of an article into a concrete schema language for XML. We choose among schema technologies that are grammar-based and, thus, constructive; that is, prescribe how conformant XML documents are built—leaving aside declarative technologies such as Schematron [SchematronISO] that express document constraints in a rule-based manner. The rationale for this design decision is that constructive schema technologies better let us guide bloggers in writing articles.

This leaves a choice of XML DTDs, XML Schema and Relax NG [vdVlistSchemaTechnologies]. Looking at our class diagrams for documents (in the appendix), we notice that inheritance (is-a relationship) plays a prominent role. Therefore, we settle on XML Schema [XMLSchemaW3CRec] as the only one of the three technologies that supports inheritance natively.

The basic constituents of the document class diagrams are classes that are related either by inheritance or by composition. Some of the classes also carry attributes.

We see two alternative ways of representing classes in XML Schema: either as element names or as (simple or complex) types. The conceptual model requires the full form of sub-typing that is not just a refinement in naming but also extends parent types with attributes. For element names, XML Schema supports only the simple type of inheritance by way of substitution groups. Consequently, we represent the classes of the conceptual model with types in XML Schema. Inheritance in the conceptual model is mapped to extension of types in XML Schema.

Composition in the conceptual model is canonically mapped into the parent-child relationship of elements in XML documents. On the level of type definitions in XML Schema, this requires us to name the sub-elements themselves, not just their types. We adopt the following simple naming scheme: Type names are built from class names by contracting multi-word names camel-case fashion and by appending the suffix "Type"; element names are built from class names by contraction as above and by lowercasing the first character. We make one exception to these rules, reflecting common usage: elements of type ParagraphType are named p. Our simple scheme mainly follows the best-practice recommendation of Stephenson [StephensonBestPractices]. Technically, it works since class names are unique across our application and there is no need for context-dependent element names at this level. From the document author's point of view, it works since the classes in the conceptual model represent domain objects, so that the derived element names appear natural.
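The naming scheme is mechanical enough to be written down as two small functions. The following Python sketch is an illustration only; the function names contract, type_name and element_name are hypothetical, and the exception table holds just the one exception named above.

```python
# The one stated exception: elements of type ParagraphType are named p.
ELEMENT_NAME_EXCEPTIONS = {"Paragraph": "p"}


def contract(class_name: str) -> str:
    """Contract a multi-word class name camel-case fashion."""
    return "".join(word[:1].upper() + word[1:] for word in class_name.split())


def type_name(class_name: str) -> str:
    """Type name: contracted class name plus the suffix Type."""
    return contract(class_name) + "Type"


def element_name(class_name: str) -> str:
    """Element name: contracted class name with a lowercase first character."""
    if class_name in ELEMENT_NAME_EXCEPTIONS:
        return ELEMENT_NAME_EXCEPTIONS[class_name]
    contracted = contract(class_name)
    return contracted[:1].lower() + contracted[1:]


for name in ("Article", "Composite Snippet", "Paragraph"):
    print(name, "->", type_name(name), element_name(name))
```

For the class Composite Snippet, for instance, this yields the type name CompositeSnippetType and the element name compositeSnippet.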

Class diagrams impose no ordering on the sequence of their sub-components, whereas type definitions in XML Schema may do so. We make use of this facility for the Article type by imposing the traditional order in blog articles of Body followed by Comments followed by Trackbacks. All other composite classes are built from just one type of component, so ordering on the schema level does not apply. Note that this decision only constrains the coding of articles; articles may be ultimately presented with a different ordering of sub-constituents, if desired, e.g., after XSLT processing.
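As an illustration of how such an ordering might be expressed, the following Python sketch embeds a hypothetical fragment of the ArticleType definition and reads the prescribed order back from it. The element and type names are derived from the naming scheme and are assumptions, as are the default occurrence constraints; the metadata element is omitted. The actual schema is the one in the appendix.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical fragment: xs:sequence fixes the order of the children
# in document instances. Metadata is left out of this sketch.
ARTICLE_TYPE = """
<xs:complexType name="ArticleType"
                xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:sequence>
    <xs:element name="body" type="BodyType"/>
    <xs:element name="comments" type="CommentsType"/>
    <xs:element name="trackbacks" type="TrackbacksType"/>
  </xs:sequence>
</xs:complexType>
"""


def child_order(complex_type_source: str) -> list:
    """Return the child element names in the order the sequence prescribes."""
    complex_type = ET.fromstring(complex_type_source)
    sequence = complex_type.find(f"{{{XS}}}sequence")
    return [element.get("name") for element in sequence.findall(f"{{{XS}}}element")]


print(child_order(ARTICLE_TYPE))  # ['body', 'comments', 'trackbacks']
```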

XML Schema natively supports abstract types. When abstract types appear within a composite type definition, it is required that elements of the abstract type are also named; in this case, we make use of the XML Schema feature of abstract element names; concrete element names of concrete sub-types are put into the substitution group with the abstract elements as head elements. Document instances exclusively use the concrete element names for the concrete sub-types without having to resort to explicit typing with the xsi:type attribute.

We have mentioned in the modeling section that abstract classes serve as points of extension. Extending on the level of the schema requires schema authors to define a new type name and a new element name for a new kind of document constituent, deriving the type from a pre-defined type and putting the element into the substitution group of the corresponding pre-defined element. We call this operation, which simultaneously extends element names and types, double extension.

Double extension enables authors of schema extensions to provide apt element names for document authors to use directly, without obliging them to resort to explicit typing with the xsi:type attribute at document instance level.
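A double extension might look as follows. The Python sketch below embeds a small schema as a string and lists the substitutes of an abstract head element. It assumes, purely for illustration, that the abstract class General Information Item is extended with a quotation; the names QuotationType and quotation are hypothetical, the content models are left empty, and the schema has no target namespace. The helper substitution_group is likewise hypothetical and reports direct substitutes only.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Pre-defined point of extension: abstract type, abstract head element -->
  <xs:complexType name="GeneralInformationItemType" abstract="true"/>
  <xs:element name="generalInformationItem"
              type="GeneralInformationItemType" abstract="true"/>

  <!-- Double extension, step 1: derive a new type from the pre-defined type -->
  <xs:complexType name="QuotationType">
    <xs:complexContent>
      <xs:extension base="GeneralInformationItemType"/>
    </xs:complexContent>
  </xs:complexType>

  <!-- Double extension, step 2: put a new element into the substitution group -->
  <xs:element name="quotation" type="QuotationType"
              substitutionGroup="generalInformationItem"/>

</xs:schema>
"""


def substitution_group(schema_source: str, head: str) -> list:
    """List the globally declared elements that directly substitute for head."""
    schema = ET.fromstring(schema_source)
    return [
        element.get("name")
        for element in schema.findall(f"{{{XS}}}element")
        if element.get("substitutionGroup") == head
    ]


print(substitution_group(SCHEMA, "generalInformationItem"))  # ['quotation']
```

Wherever a type definition references generalInformationItem, a document instance can then use the element quotation directly.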

Element names are always on the "surface" of documents—in contrast to the names of objects that are instantiated from classes (types) in a program, which can and should be hidden when following proven programming practices of information hiding. Hence, we consider the two features of XML Schema that together enable double extension, namely type derivation and element name substitution, to be equally essential for designing extensible, application-specific schemas.

Translating from class diagram to XML Schema, we systematically map classes to global type definitions that, in the case of concrete child elements, name child elements and reference further type definitions as representatives of conceptual classes. In addition, we globally declare element article of type ArticleType, to be used as root element in document instances.

We need to break this translation scheme, though, for abstract elements which, according to the constraints of XML Schema, must be declared globally and whose global declarations are then referenced from within any pertinent complex type definitions. Global declaration of abstract elements is further enforced by the constraint of XML Schema that only globally declared elements may act as substitution group heads.

The resulting schema follows the familiar "striped" Venetian Blind pattern, breaking out into the Garden of Eden pattern for abstract elements and their substitutes [SchemaDesignPatterns].

This arrangement allows authors of schema extensions to extend globally-declared elements such as p or poll. XML Schema offers the option to prevent this with the final attribute. This attribute would also be put to use if we had opted to make non-abstract classes in the conceptual model final and if we wanted to reflect this in the schema.

Next, we look into the metadata attributes in the document class diagrams. Generally speaking, we have the choice of representing class attributes either as XML elements or as XML attributes. We choose to represent metadata as elements, grouped within a container element meta, so that we keep the option of eventually imposing element structure on their values, which is impossible when they are modeled as attributes.

For the three classes XXX in the document model that exhibit metadata, namely Article, Trackback and Comment, we locally declare the element meta to be of type metaXXXType. The advantage is that document authors have to deal only with a single element name, whereas schema authors may differentiate the type of metadata. The same strategy of universal element names and specific types is used when translating the metadata attributes of the document model. Consequently, this part of the schema makes full use of the Venetian Blind pattern [SchemaDesignPatterns] in its application of context-dependent element names, which were not necessary earlier.
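The effect of context-dependent typing can be illustrated with a small sketch. The Python code below embeds a hypothetical schema fragment in which each of the three types declares a local element meta with a type of its own, and then collects the resulting mapping. The position of meta within each type, the empty placeholder types and the helper meta_types are assumptions made for illustration; the other child elements are omitted.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="ArticleType">
    <xs:sequence>
      <xs:element name="meta" type="metaArticleType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="TrackbackType">
    <xs:sequence>
      <xs:element name="meta" type="metaTrackbackType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="CommentType">
    <xs:sequence>
      <xs:element name="meta" type="metaCommentType"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Empty placeholder types, to be fleshed out later -->
  <xs:complexType name="metaArticleType"/>
  <xs:complexType name="metaTrackbackType"/>
  <xs:complexType name="metaCommentType"/>
</xs:schema>
"""


def meta_types(schema_source: str) -> dict:
    """Map each type that declares a local meta element to the type of that element."""
    schema = ET.fromstring(schema_source)
    result = {}
    for complex_type in schema.findall(f"{{{XS}}}complexType"):
        for element in complex_type.iter(f"{{{XS}}}element"):
            if element.get("name") == "meta":
                result[complex_type.get("name")] = element.get("type")
    return result


for owner, meta_type in meta_types(SCHEMA).items():
    print(owner, "declares meta of type", meta_type)
```

Document authors see one element name, meta, in all three contexts, while the schema keeps three separately extensible types behind it.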

Adhering to the framework character, the schema in the appendix provides empty "placeholder" types for metadata constituents. They can be overwritten in a more elaborate version of the schema.

Text passages, particularly when interspersed with elements (mixed content), are objects sui generis in XML: they are unnamed and untyped and do not participate in inheritance relations. Hence, the class Text in the XBlog document model requires special treatment.

In the document model, instances of the Text class appear intermixed with instances of Reference and Composite Snippet, as parts of Paragraph or Composite Snippet instances. Thus, we map the Text class into mixed content for Paragraph and Composite Snippet. Generally speaking, if an element consists of alternative document constituents of which one is text, we set the attribute mixed of the element to true. In that case, no further mapping of the text constituent is required.
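The special status of text shows up as soon as a paragraph is parsed. The following Python sketch uses the standard library's ElementTree; the ParagraphType fragment, including the referenced element name snippet, is a hypothetical rendering of the mapping described above, and the helper text_runs is introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical type definition: mixed="true" admits text between the
# snippet elements, so the Text class needs no element of its own.
PARAGRAPH_TYPE = """
<xs:complexType name="ParagraphType" mixed="true"
                xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:sequence>
    <xs:element ref="snippet" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
"""

INSTANCE = "<p>Weblogs are <em>author-centered</em> by design.</p>"


def text_runs(element: ET.Element) -> list:
    """Collect the unnamed text constituents that sit directly inside element."""
    runs = [element.text] if element.text else []
    for child in element:
        if child.tail:
            runs.append(child.tail)
    return runs


paragraph = ET.fromstring(INSTANCE)
print(ET.fromstring(PARAGRAPH_TYPE).get("mixed"))  # true
print(text_runs(paragraph))                        # ['Weblogs are ', ' by design.']
print(paragraph.find("em").text)                   # author-centered
```

The text runs carry neither a name nor a type; only the em element does, which is why the Text class is absorbed into the mixed content model rather than mapped to a type.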

These principles make it straightforward to translate the document class diagrams into XML Schema. The appendix provides a graphical view and the source code of this schema.

Summary

In this paper, we have used the following proven principles, patterns and procedures of document and software engineering when putting together a document model:

  • Procedure: Classify document constituents into the categories metadata, organizational items, information items and information snippets.
  • Procedure: Think about introducing container constituents for repeatable items.
  • Principle: Mark those constituents as abstract that are intended to be points of extension.
  • Pattern: Look for opportunities to apply data modeling patterns, particularly the pattern Composite.

The two procedures are valuable tools in the requirement analysis phase of the building of a document system, since they guide document engineers in recognizing and identifying pertinent document structures.

The principle gives the model the character of a framework with built-in points of extension, namely abstract classes. We could explicitly limit extensions to these abstract classes by marking all concrete classes as final.

The pattern demonstrates the value of following established patterns, in this case for the structuring of data, rather than putting much mental effort into re-inventing the wheel.

Furthermore, when translating from the conceptual document model to a schema, we have applied the following principles, patterns and procedures:

  • Principle: Choose a schema language that natively supports the concepts that the document model uses. Since it was designed with document modeling in mind, XML Schema will be a good fit in all but the most simple cases. The following items, hence, apply mainly to XML Schema.
  • Principle: Map classes in the domain model to types in the schema and the part-of relationships into hierarchies of locally declared elements. As a consequence, the inheritance relation can also be expressed on the schema level by mapping sub-classes to derived types.
  • Principle: Decide on a naming scheme that deals not only with type names but also with element names within type definitions. Be careful to choose element names that are meaningful to document authors. In consequence, the schema follows the Venetian Blind pattern, offering types but not elements as extension points.
  • Pattern: Map abstract classes to abstract types and declare elements of abstract type also as abstract. Use double extension for extending the schema with new elements simultaneously with new types, thus avoiding the use of the awkward xsi:type attribute at document instance level. Technically, this pattern causes the Venetian Blind architecture of the schema to break out into a Garden of Eden design, but in a controlled manner that adds only abstract elements as further extension points.
  • Principle: Map attributes whose type has not been decided upon yet into elements, not attributes, in order to stay flexible with further modeling. Declare those elements in the Venetian Blind fashion, with globally defined "dummy" types. Context-dependent element names keep the vocabulary that document authors must know small and straightforward. Globally-defined "dummy" types can be overwritten later on when the schema is fleshed out. This principle supports the framework character of the schema.
  • Principle: Take care that vocabulary and structures that the schema defines are meaningful to document authors.

Conclusion and future work

Starting from a domain model, we have identified a number of principles, patterns and procedures for modeling documents with UML class diagrams and for translating these models into XML Schema. It remains as future work to translate relations between classes from model to schema.

We have translated classes into types, mapping the part-of relationships into element hierarchies. One idiosyncrasy of XML is that a type definition both names and types its constituents. To make this work in the context of abstract constituents and inheritance, we have relied on a combination of XML Schema's type inheritance and element substitution features, which we have called double extension.

XML Schema has proven itself a powerful target language that has made the translation from model to schema quite straightforward. Only in passing, for the typing of the metadata elements, have we used XML Schema's facility of context-dependent typing of element names.

One of us, Thomas Schöpf, has modeled wikis as part of his PhD work. We are planning on elaborating the document part of this model as well and on testing our schema generating methods in this more complex application domain.

Another valuable exercise would be to write an XSLT program that generates a schema from an XML encoding of the UML model.

Appendix

The appendix consists of the following documents:

  • The XBlog domain model
  • The outer XBlog document model
  • The inner XBlog document model
  • The composite pattern and an instance
  • A graphical view of the schema for XBlog articles
  • The source code of the schema for XBlog articles

These documents are available from http://www11.in.tum.de/~brueggem/papers/SchemaPrinciplesEML/.


Bibliography

[AsWeMayThink] V. Bush: As We May Think. The Atlantic Monthly. July 1945. http://www.theatlantic.com/doc/194507/bush.

[BargerWeblogHistory] J. Barger. Weblog Resources FAQ. http://www.robotwisdom.com/weblogs/.

[ConradEtAlXMLConceptualModeling] R. Conrad, D. Scheffner, J.-C. Freytag. XML Conceptual Modeling Using UML. In A.H.F. Laender, S.W. Liddle, V.C. Storey (eds), International Conference on Conceptual Modeling (ER 2000). LNCS 1920, pp. 558–571. Springer-Verlag 2000.

[EcksteinEtAlXMLDatenmodellierung] R. Eckstein, S. Eckstein. XML und Datenmodellierung. DPunkt-Verlag 2004.

[GammaEtAlDesignPatterns] E. Gamma, R. Helm, R. Johnson, J. Vlissides. Design Patterns. Addison-Wesley 1995.

[MalerEtAlDTDModeling] E. Maler, J. El Andaloussi. Developing SGML DTDs: From Text to Model to Markup. Prentice Hall 1995.

[PicotWeblogs] A. Picot, T. Fischer. Weblogs professionell. DPunkt-Verlag 2005.

[RenDAWeblogs] Z. Ren. Design und Implementierung eines Weblog-Hosting-Systems auf Basis von Apache Lenya/Cocoon. Diplomarbeit, Technische Universität München 2006.

[SchemaDesignPatterns] A. Khan, M. Sum. Introducing Design Patterns in XML Schemas. Sun Developer Network 2006.

[SchematronISO] International Organization for Standardization: Information Technology – Document Schema Definition Languages (DSDL) – Part 3: Rule-Based Validation – Schematron. ISO/IEC 19757-3:2006.

[StephensonBestPractices] D. Stephenson. XML Schema Best Practices. HP Dev Resource 2004. http://devresource.hp.com/drc/resources/xmlSchemaBestPractices.jsp

[vdVlistSchemaTechnologies] E. van der Vlist. Comparing XML Schema Languages. XML.com 2001. http://www.xml.com/lpt/a/884.

[vdVlistXMLSchema] E. van der Vlist. XML Schema. O'Reilly 2002.

[WeinerWeblogHistory] D. Winer. The History of Weblogs. http://www.userland.com/theHistoryOfWeblogs.

[WestnerWeblogServiceProviding] M. Westner. Weblog Service Providing. Master Thesis, Unitec Institute of Technology 2004.

[WikipediaWeblogs] Article Blog from Wikipedia. http://en.wikipedia.org/wiki/Blog.

[XMLPatterns] T. Lainevool. Develop effective XML documents using structural design patterns. http://www.xmlpatterns.com/.

[XMLSchemaW3CRec] World Wide Web Consortium: XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/


