Is it Possible to be Simple Without being Stupid?: Exploring the Semantics of Model-driven XML

Ann Wrightson

Abstract

Model-driven XML such as HL7 v3 [Health Level Seven Version 3] messages has a tendency to be verbose and to take much effort both to understand and to process (this is the principal criticism made in the Gartner Note on HL7v3 messaging published in mid-2006 [GartnerHL7v3]). There has been a sustained effort over the last two years to tackle the verbosity, complexity and opacity of HL7v3 messages. At the time of writing, the problem of opacity has been tackled fairly successfully; however, verbosity and complexity have proved much more resistant. This work has felt rather like handling a partly-inflated long balloon: squeeze here and it squashes OK here, but oops - it pops out elsewhere.

At this point it's tempting to make a plea for good old-fashioned hand-designed XML: let's go back to 1999! Not so. With ever more complex models of meaning needing to be shared to provide open standards in the context of progressive and ubiquitous adoption of automated information processing, that way is not open; there are not enough hands, and the work would be painfully repetitive and error-prone. The challenge is rather to find ways of using XML more compactly and efficiently as a companion technology to information model standards such as the HL7v3 Reference Information Model (RIM). There are two sides to this: gaining a more thorough understanding of the way(s) in which XML documents convey information, and getting better at using the full capability of XML for concise, meaningful and semantically rigorous expression within domain-specific data-oriented applications.

This paper is in three parts. The first outlines the practical problem from the perspective of the HL7v3 family of standards, and summarizes recent related work in HL7. The second uses game semantics as a viewpoint to explore some aspects of complexity and assumed knowledge in understanding an XML instance, using examples based on HL7v3. The concluding discussion includes suggestions for improving the expressive power of XML in this kind of communication.

Keywords: XSD/W3C Schema; Modeling

Ann Wrightson

Initially trained in Philosophy (specializing in logic). Following a varied and successful early career in electronic publishing, Ann spent ten years lecturing, researching, and consulting in an academic context, developing interests in formal methods, requirements modelling and system safety alongside continuing involvement in information systems theory and practice. Moving on from academia in 2000, for the last few years Ann has worked as an advisor, technical strategist and enterprise information architect, mainly in eGovernment and Healthcare. Her main area of expertise is interoperability over space and time, especially in the context of establishing and managing large scale long-lived content repositories for purposes including new media publishing, digital archiving and electronic health records. Ann is a member of the Board of the HL7 [Health Level Seven] UK affiliate, and has been working closely with the interoperability standards for the English health care records "Spine" since mid-2006.

Ann Wrightson [CSW Group Ltd]

Extreme Markup Languages 2007® (Montréal, Québec)

Copyright © 2007 Ann Wrightson. Reproduced with permission.

Introduction

The usual XML toolkit is strong on controlling structure, specifying simple content patterns and relatively small repertoires of literal content, and specifying transformations and cross-linkages between documents. Precision and disambiguation of the meaning conveyed by an XML document is often achieved informally in context through the natural language meaning of XML GIs, attribute names and values, and element content. Larger scale cross-community integration typically uses terminologies and ontologies to provide exact references for concepts and terms; for example, the use of ontologies has formed a principal thread within the public sector semantic interoperability community of practice sponsored by the Federal CIO Council in the US [SICoP].

Representing and referencing terminologies and ontologies (including simple relationships such as broader/narrower terms and classification) has been well addressed from the XML perspective, notably by Topic Maps, RDF and OWL. However, complex subject matter such as engineering and healthcare needs more in-depth modelling of information structure and organization, and there is a strong consensus amongst those attempting this difficult task that it is best achieved using techniques developed specifically for information modelling; OWL has found a niche in representing complex terminologies.

The examples in this paper are drawn from healthcare. Engineering information as in the ISO 10303 STEP family of standards would be an equally good example, and issues related to those discussed in this paper arise in the context of representing STEP models in SGML and XML. 1

In order to achieve accuracy and faithfulness to an information model, it is normal - and plausibly necessary - for the structural and content models of XML documents that convey information based on such models to be computationally derived from the models in some way. Furthermore, from the information modelling perspective, XML may also be just one of several "platforms" for implementing the model rather than being itself the principal means of achieving platform-independence. In practical terms, this means that a key concern is the ability to take a complex chunk of information from some other representation such as EXPRESS or UML into XML, and out of XML into another such representation, without harm to its intended meaning. 2

Model-driven XML such as HL7 v3 [Health Level Seven Version 3] messages has a tendency to be verbose and to take much effort both to understand and to process (this is the principal criticism made in the Gartner Note on HL7v3 messaging published in mid-2006 [GartnerHL7v3]) compared to XML that is hand-designed for a similar purpose. This is widely regarded as an inevitable cost of the semantic rigour provided by using a model. However, it is difficult to see why in principle XML instance documents with semantic rigour should be any more verbose, complex, repetitive, or difficult to read or process than well hand-designed XML carrying similar information. The tendency to exhibit one or all of these characteristics is plausibly a fixable problem.

Since 2005 there has been an active stream of work on this problem in HL7, with motivation that goes considerably beyond aesthetic judgement or theoretical exploration - this is a real practical (and costly) nuisance, and needs to be overcome in order for XML to continue to be a technology of choice to support information systems integration. Gaining a good understanding of this issue will also enable better judgement in the choice of particular XML data patterns as media for information systems integration, and may also encourage the development of additional markup technologies that are better suited to interoperability involving complex domain-specific information.

What's the problem?

The following example is a brief extract from a sample document (intended to be illustrative yet also plausible) in the published standard for CDA [HL7 Clinical Document Architecture v2] [CDA2]. A translation into English would be: This document concerns a patient with (unique nationally registered) ID 12345, named Henry Levin the 7th, Male, born 24th September 1932.

<recordTarget>
 <patientRole>
  <id extension="12345" root="2.16.840.1.113883.3.933"/>
  <patientPatient>
   <name>
    <given>Henry</given>
    <family>Levin</family>
    <suffix>the 7th</suffix>
   </name>
   <administrativeGenderCode code="M" codeSystem="2.16.840.1.113883.5.1"/>
   <birthTime value="19320924"/>
  </patientPatient>
 </patientRole>
</recordTarget>

This isn't too bad. However, consider the following example, an extract from a sample Discharge document constructed according to the standard published by NHS Connecting for Health [NHSMIM6-1] 3. A reasonable English translation would be: The principal recipient of this document is Dr John Jones who is on the (England) national register with practitioner ID 12345 and role profile ID 67890.

<informationRecipient typeCode="PRCP">
 <templateId root="2.16.840.1.113883.2.1.3.2.4.18.2"  extension="COCD_TP145004UK01.AssignedEntitySDS"/>
 <intendedRecipient classCode="ASSIGNED">
  <typeId root="2.16.840.1.113883.1.3" extension="COCD_TP145004UK01#AssignedEntitySDS"/>
  <id root="1.2.826.0.1285.0.2.0.65" extension="12345"/>
  <id root="1.2.826.0.1285.0.2.0.67" extension="67890"/>
  <informationRecipient classCode="PSN" determinerCode="INSTANCE">
   <typeId root="2.16.840.1.113883.1.3" extension="COCD_TP145004UK01#assignedPerson"/>
   <name>
    <prefix>Dr</prefix>
    <given>John</given>
    <family>Jones</family>
   </name>
  </informationRecipient>
 </intendedRecipient>
</informationRecipient>

You might think that the following was a more reasonable, tractable XML representation, in a context where all healthcare practitioners are on the same central register (known as SDS). It's not the only possibility, of course, nor necessarily the best for any one situation, but it will serve my purpose for now.

<SDSRecipient role="principal" practitioner-ID="12345" role-profile-ID="67890">
 <name>
  <prefix>Dr</prefix>
  <given>John</given>
  <family>Jones</family>
 </name>
</SDSRecipient>
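
The claim that the two representations carry the same information can be checked mechanically. The following is a minimal Python sketch, not part of either standard: it reads both fragments with the standard library's ElementTree and reduces each to the same small dictionary. The function names are illustrative assumptions; the meanings attached to "PRCP" and to the two OID roots are the ones given in the games later in this paper.

```python
import xml.etree.ElementTree as ET

VERBOSE = """<informationRecipient typeCode="PRCP">
 <templateId root="2.16.840.1.113883.2.1.3.2.4.18.2" extension="COCD_TP145004UK01.AssignedEntitySDS"/>
 <intendedRecipient classCode="ASSIGNED">
  <typeId root="2.16.840.1.113883.1.3" extension="COCD_TP145004UK01#AssignedEntitySDS"/>
  <id root="1.2.826.0.1285.0.2.0.65" extension="12345"/>
  <id root="1.2.826.0.1285.0.2.0.67" extension="67890"/>
  <informationRecipient classCode="PSN" determinerCode="INSTANCE">
   <typeId root="2.16.840.1.113883.1.3" extension="COCD_TP145004UK01#assignedPerson"/>
   <name><prefix>Dr</prefix><given>John</given><family>Jones</family></name>
  </informationRecipient>
 </intendedRecipient>
</informationRecipient>"""

COMPACT = """<SDSRecipient role="principal" practitioner-ID="12345" role-profile-ID="67890">
 <name><prefix>Dr</prefix><given>John</given><family>Jones</family></name>
</SDSRecipient>"""

# OID roots for the two identifiers, with the meanings stated in the games below.
PRACTITIONER_ROOT = "1.2.826.0.1285.0.2.0.65"
ROLE_PROFILE_ROOT = "1.2.826.0.1285.0.2.0.67"


def name_parts(name_element):
    """Map each child of <name> (prefix, given, family) to its text."""
    return {child.tag: child.text for child in name_element}


def from_verbose(text):
    """Reduce the model-driven fragment to a plain dictionary."""
    root = ET.fromstring(text)
    ids = {i.get("root"): i.get("extension") for i in root.iter("id")}
    return {
        "role": {"PRCP": "principal"}[root.get("typeCode")],
        "practitioner-ID": ids[PRACTITIONER_ROOT],
        "role-profile-ID": ids[ROLE_PROFILE_ROOT],
        "name": name_parts(root.find(".//name")),
    }


def from_compact(text):
    """Reduce the hand-designed fragment to the same dictionary shape."""
    root = ET.fromstring(text)
    return {
        "role": root.get("role"),
        "practitioner-ID": root.get("practitioner-ID"),
        "role-profile-ID": root.get("role-profile-ID"),
        "name": name_parts(root.find("name")),
    }


print(from_verbose(VERBOSE) == from_compact(COMPACT))
```

Note how much of from_verbose consists of knowledge that is not in the compact instance at all (the code "PRCP", the two OID roots); in the compact form that knowledge has moved into the names of the element and its attributes.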

What is HL7v3 trying to achieve?

As background for taking a deeper look at these examples, this section summarizes what HL7v3 is trying to achieve and the work in progress in HL7 to improve the usability of the XML layer.

Open Standards for Healthcare Interoperability

Exchanging clinical information is hard, partly because modelling healthcare information sufficiently accurately and consistently for full platform-independent interoperability is very hard indeed (there is an insightful discussion of several informational aspects of the overall problem in Alan Rector's paper Clinical Terminology: Why is it so hard? [Rector99]). Special purpose solutions to small to medium scale problems are relatively straightforward; it is the reusable, long-lived, community-scale solutions that are (very) hard. Why is this? Because they require solutions to a number of problems, each of which is pretty hard in itself:

  • Co-ordination of healthcare activities across different organizations (hard)
  • Interoperability of data structures across diverse information systems (medium hard)
  • Full coding of clinical data using clinical terminologies, sufficiently well for automated interpretation (hard)
  • Sustainability and preservation of electronic records (medium hard)

The HL7 v3 [Health Level Seven Version 3] family of standards [HL7v3] is one of the most elaborate domain-specific standardization schemes around today. It is a major collaborative achievement that is beginning to come good on its promise of providing practical industry standards for cross-platform interoperability of information systems that use some of the most complex and open-ended information on the planet.

The key features of HL7 the organization, and HL7v3 the standard, are:

  • A family of standards designed around a single logical information model, called the RIM [Reference Information Model]
  • Dependence on standardized terminologies (backed by considerable theoretical work in biomedical ontologies) to give precision to the information expressed
  • Close attention to method and process as well as content by the HL7 organization. Innovation is of course allowed, but each innovation is expected to become part of the overall scheme of things in due course, and alignment with the RIM is very closely watched.
  • XML is near-ubiquitous in use; however, in principle it is just one medium for defining an ITS [Implementation Technology Specification]. There is no design activity on specific XML representations (with a few exceptions, mainly for human-readable components), only on ways of generating classes of XML representations in a uniform way from classes of models.
  • Emerging consensus that the resulting XML is too large and too complex, and unnecessarily difficult to process.

There have been a number of efforts over the past two years to address this problem of the opacity, verbosity and complexity of the XML generated by the HL7v3 XML ITS [Implementation Technology Specification]. The verbosity and complexity of the XML have proved very difficult to address; however, opacity of the XML to a human reader (discussed in [Wrightson05]) has proved more tractable. Within the natural limits imposed by the inherent complexity of what is being represented, this is one of the key achievements of the new HL7v3 ITS [NewITS].

The HL7v3 New ITS

This section is based on the New ITS Guide [NewITS].

The new ITS specification describes a method for creating HL7 implementation models using UML diagrams, XML schemas, and (when desired) mappings between an alternative view of a message model and its parent RIM-Based Model (e.g. RMIMs [Refined Message Information Models] such as the message models published in the NHS CFH MIM [NHSMIM6-1]).

The new ITS supports two very different ways of providing better XML representations of RIM-based information structures:

  • More consistent serialization of RIM-based models into XML documents. The main practical benefits are more consistent naming of XML elements, and greater consistency of XML structure and content both across different versions of the same message definition, and between closely related information structures. In addition, the serialization method has been made easier to see and understand, with explicit rather than implicit rules.
  • The ability to transform a RIM-based model using a step by step, reversible set of rules. This can be used to create a view of a message model using business information class names (from domain analysis models) instead of RIM-based names. Such a simplified model can be used for example for business expert review / validation or to simplify a model to meet a specific local implementation requirement, whilst maintaining a well defined relationship to the base standard. 4
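
The idea of a step by step, reversible set of rules can be illustrated with a toy sketch in Python. This is not the New ITS transformation mechanism itself, only an illustration of what reversibility demands: every step must be undoable without loss. The model is reduced to a dictionary of class names, and the business names are hypothetical.

```python
# Toy view of a RIM-based model: class name -> attribute names.
# The class names follow the examples in this paper; the structure is an assumption.
rim_model = {
    "IntendedRecipient": ["classCode", "id"],
    "Person": ["classCode", "determinerCode", "name"],
}

# Hypothetical rename rules, applied in order: (current name, new name).
rules = [
    ("IntendedRecipient", "RegisteredPractitioner"),
    ("Person", "PractitionerDetails"),
]


def apply_rules(model, steps):
    """Apply each rename step in turn, refusing any step that would lose information."""
    result = dict(model)
    for old, new in steps:
        if old not in result:
            raise KeyError(f"no class named {old!r}")
        if new in result:
            # Renaming onto an existing class would merge two classes,
            # and the step could then not be undone.
            raise ValueError(f"{new!r} already exists; the step would not be reversible")
        result[new] = result.pop(old)
    return result


def invert(steps):
    """The inverse rule set: the same steps, swapped and in reverse order."""
    return [(new, old) for old, new in reversed(steps)]


business_view = apply_rules(rim_model, rules)
print(sorted(business_view))
print(apply_rules(business_view, invert(rules)) == rim_model)
```

The check inside apply_rules is the interesting part: a rule set is only reversible if no step collapses two distinct things into one, which is what maintains the well defined relationship to the base standard.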

An important aspect of the new ITS is that it uses standard UML, contributing to the slow movement away from the legacy of specific HL7v3 modelling notation and tools that arose because HL7 modelling requirements were for several years in advance of the development of UML and UML tools. UML tooling including HL7-specific capabilities for the new ITS is being developed within the Eclipse Open Healthcare Framework Project [OHF].

A matter that is still causing some controversy is whether only one serialization model for any given message type may be normative. HL7v3 tradition tends towards having only one normative serialization; however, there is also growing pressure towards making the semantic models independent in principle of serialization decisions, thereby allowing different implementation communities to use the same model with different serializations without (one or both) being automatically "non-conformant" - something that may be important for the public image of a high-profile standards-based interoperability initiative.

Design Principles

The following Design Principles were developed for the new ITS; in the context of this paper, they are also valuable as providing examples of general requirements for a model-driven XML standard. (The wording of some of these principles has been adapted from the wording in the original so as to be clearer to a more general XML readership).

  • The transform from the abstract model to the instance should be as simple as possible.
  • The instances should be pleasing to the eye of a human reader. (Of course, this does tend to mean different things to different readers, and this has emerged clearly in discussion, however the principle has survived...)
  • Names of XML elements and attributes should be determined in the underlying RIM-based model, not generated by the ITS.
  • Without compromising other principles listed here, the instances should be as small as possible.
  • The ITS should define the set of valid XML instances for an HL7v3 model artefact, and also support the generation of a W3C schema such that every valid instance is schema-valid against the defined W3C schema for that instance, and the schema serves as a useful filter for invalid instances. (Note that this is not intended to exclude other kinds of XML schemas from being used in implementations.)
  • The normative schema should use well supported parts of the W3C schema language.
  • Use attributes to support the population of the post schema validation infoset.
  • The ITS should support the reuse of components in implementations (eg using UML based technologies).
  • The depth of nesting in the instances should be as shallow as possible.
  • An additional layer of UML modelling (alongside the RIM-based model) should be used to express the structure of the set of XML instances, and this model should be fully compatible with UML-based case tools.
  • Boundaries between significant components in the HL7v3 model (such as common message components (called CMETs) and Wrapper boundaries) should not be transparent in the XML instance.
  • Choices in the underlying model should be transparent in the XML instance - thus a class should appear in the instance in the same form whether or not it is part of a choice. This is to allow for forwards compatibility when choices are introduced into models and to reduce the level of nesting.
  • The ITS should support modeling constructs provided by the HDF [HL7 Development Framework] including potential introduction of new wrappers, use of models for multiple purposes (as CMETs, payloads, templates), and other foreseen developments in HL7 modelling practice.

Models as Semantics for XML Instances - taking a closer look

In my 2005 Extreme paper Semantics of Well Formed XML as a Human and Machine Readable Language, I presented a way of understanding element GI [generic identifier] values, attribute names etc. as identifying appropriate resource situations (a precise notion of context from situation semantics) that enable a human to understand accurately a fragment of XML.

In the established practice of the HL7 standards community, the context against which an XML instance is intended to be understood is provided principally by UML5 models combined with term sets derived from clinical terminologies and other defined value sets. Furthermore, when I read an XML instance that bears the intended relationship to an HL7v3 UML model, it feels intuitively correct that I am understanding it through understanding how to read the element GIs, attribute names and values etc as implicit references into the UML model and onward into the referenced terminologies. This informal experience motivates the more formal exploration in the remainder of this section.6

A Simple Game Semantics for Understanding an XML Instance

Game semantics is based on the notion of a 2-player game concerning a sentence in some language, and some state of affairs, the "world" in which the sentence is understood. The players, called here A and B 7, work together using a defined set of moves. In classic game semantics on first-order logic, player A has the ultimate goal of showing that the sentence is true, player B that it is false. A starts with a sentence S in a language L that has a model M, where M being a model of L means that all the nonlogical constants of L are interpreted on M. Structural properties of the initial sentence and of the logical connectives are used to break down the initial goals of A and B into subgoals of determining the truth-value of subformulae. At any point, either player may have the subgoal of showing that a particular subformula 8 is true in M ("verifier" role) or false in M ("falsifier" role). Through the moves of the game, truth-functional properties of the logical connectives and interpretation of nonlogical constants as names of individuals in the domain of M are used systematically to determine a truth-value for subformulae, and thence step by step for the full sentence S. Player A has won the game if the truth-value for the whole of S in M is "true". A sentence S is logically true in M if (and only if) there is a winning strategy for player A, that is, a set of rules that player A can follow for sentence S and always win. (There is a fuller description on pp363-365 of Hintikka & Sandu [Game1].)
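
The verifier and falsifier roles can be made concrete with a small Python sketch. It is deliberately restricted to the propositional fragment (no quantifiers, so no choosing of individuals from the domain of M), and the tuple encoding of formulae is an assumption made for the illustration. The function computes whether the player currently in the verifier role has a winning strategy.

```python
# Formulae as nested tuples (an encoding assumed for this sketch):
#   ("atom", name), ("not", f), ("and", f, g, ...), ("or", f, g, ...)


def verifier_wins(formula, model):
    """Return True if the player in the verifier role has a winning strategy."""
    op = formula[0]
    if op == "atom":
        # End of play: the model decides who has won.
        return model[formula[1]]
    if op == "not":
        # The players swap roles: verifying "not f" means falsifying f.
        return not verifier_wins(formula[1], model)
    if op == "or":
        # The verifier chooses a disjunct, so one good choice is enough.
        return any(verifier_wins(f, model) for f in formula[1:])
    if op == "and":
        # The falsifier chooses a conjunct, so the verifier must survive every choice.
        return all(verifier_wins(f, model) for f in formula[1:])
    raise ValueError(f"unknown connective {op!r}")


M = {"p": True, "q": False}
S = ("or", ("atom", "q"), ("not", ("and", ("atom", "p"), ("atom", "q"))))
print("Player A wins" if verifier_wins(S, M) else "Player B wins")
```

Because these games are finite, exactly one player has a winning strategy at each subformula, which is why the role swap at negation can be computed with a plain "not".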

Another variant of game semantics is "back-and-forth" games, where the challenge is to construct an equivalent model or demonstrate equivalence of two existing models. Every move in such a game consists of player B choosing some item from one structure, and player A choosing a corresponding element from the other. Player A wins if, after a certain number of moves, the patterns of objects chosen are equivalent. (Hintikka & Sandu [Game1] p401).

The rest of this section develops a simple game semantics for successful interpretation of an XML document by a recipient. In this case, success or failure lies in interpreting the XML document against some model, rather than in determining truth or falsity. It is similar to a back-and-forth game, using an XML document X, and a model M. Player B presents items from the XML document (or, if you will, an XML Infoset or GODDAG structure [GODDAG06]) and player A interprets the XML document using the model. The model M is envisaged as including named items, structural constraints, datatype constraints, and one or more "oracles" capable of determining the validity of a value, for example conformance to a datatype or membership of a value set. 9 Player A may be envisaged as building a UML object (instance) model step by step from the XML document; however, this is specific to the model M being a UML class model, and is not a necessary aspect of the game semantics itself. If the model M were expressed in Prolog rather than UML, then player A might be evaluating one or more Prolog statements in response to each "move" from B. Player A might even be using an XML schema (the game works trivially for schema-validation by A repeatedly asking for more until the document is fully filled in, then schema-validating the document); however, the game semantics notion is considerably wider in scope than schema-validation.

Both players have knowledge of the XML document (so player B cannot win eg just by throwing elements at A in an unexpected order); player A understands the model and is able to invoke the "oracles". Player A (standing for a system receiving the XML document) wins if the XML document is OK according to the model (and bearing in mind that the relevant model may not be exhaustive, A may declare a win before B has covered the whole document); player A has lost if the document is not known to be OK at close of play. If player B has not yet given a full description of the XML document, and has not yet given A enough items for A to be able to make a decision, then A can force B to continue providing more items from the document.
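
Player B's side of the game can be sketched in Python as a generator that presents the document one item at a time. The move format (tuples beginning "element" or "attribute") is an assumption of this sketch, and items are presented in document order, although nothing in the game obliges B to do so. Using a lazy generator mirrors the rule that A can force B to continue: A simply asks for the next item.

```python
import xml.etree.ElementTree as ET


def moves(element):
    """Yield Player B's moves for an element and its descendants, in document order."""
    text = (element.text or "").strip()
    # An element move carries its text content, or None if it has none.
    yield ("element", element.tag, text or None)
    for name, value in element.attrib.items():
        yield ("attribute", element.tag, name, value)
    for child in element:
        yield from moves(child)


doc = ET.fromstring(
    '<SDSRecipient role="principal" practitioner-ID="12345" role-profile-ID="67890">'
    "<name><prefix>Dr</prefix><given>John</given><family>Jones</family></name>"
    "</SDSRecipient>"
)

for move in moves(doc):
    print("Player B: I have:", move)
```

Player A would consume these moves one by one, stopping as soon as a win can be declared or continuing to pull items until the document is exhausted.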

The following example shows a game on the short document in the third example above, with a (postulated) model that only has one namespace, requires the given structure, and handles value sets using oracles, with one oracle requiring two pieces of data. (Please remember that this is an abstract semantic game, NOT a practical strategy for either implementing an interface or evaluating XML against a model.)


Player B: I have: element "SDSRecipient"

Player A: Thinks: element OK... Continue

Player B: I have: attribute "role" on element "SDSRecipient", value "principal"

Player A: Thinks: attribute OK on "SDSRecipient", need to check value "principal"...
...ask the role oracle... OK... Continue

Player B: I have: attribute "role-profile-ID" on element "SDSRecipient", value "67890"

Player A: Thinks: attribute OK on "SDSRecipient", need to check value "67890"...
...ask the role-profile-ID oracle... oops, need the practitioner-ID attribute value to do that...
OK... Continue

Player B: I have: attribute "practitioner-ID" on element "SDSRecipient", value "12345"

Player A: Thinks: attribute OK on "SDSRecipient", need to check value "12345"...
...ask the practitioner-ID oracle... OK, now need to give the practitioner-ID attribute
value and the role-profile-ID value to the role-profile-ID oracle...OK... Continue

Player B: I have: element "name"

Player A: Thinks: element OK... Continue

Player B: I have: element "prefix", content "Dr"

Player A: Thinks: element OK... check content, ask the "prefix" oracle... OK... Continue

Player B: I have: element  "given", content "John"

Player A: Thinks: element OK ... check content, ask the "given" oracle... OK... Continue

Player B: I have: element "family", content "Jones"

Player A: Thinks: element OK... check content, ask the "family" oracle...
...Document now OK...Done! I win!
 

In this first game example, the role of the model has been downplayed so as to give front-of-stage to the game semantics principle itself. The next game is based on the second XML example in section 2 above, and looks more closely at the role of the model.
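
Player A's side of the first game can be sketched in Python as follows. In keeping with the caveat above, this is an illustration of the abstract game and not a processing strategy. The oracles are hypothetical stand-ins (a real one would consult a registry or value set), the per-element content oracles are reduced to a non-empty check, and structural checking is reduced to a list of permitted names. The point of interest is the oracle that needs two pieces of data: its check is deferred until both values have arrived.

```python
# Hypothetical oracles: each returns True when a value is acceptable.
def role_oracle(value):
    return value in {"principal"}  # assumed value set


def practitioner_oracle(value):
    return value.isdigit()  # stand-in for a registry lookup


def role_profile_oracle(practitioner_id, role_profile_id):
    # Needs two pieces of data, as in the game above (stand-in check).
    return practitioner_id.isdigit() and role_profile_id.isdigit()


EXPECTED_ATTRS = {"role", "practitioner-ID", "role-profile-ID"}
EXPECTED_NAME_PARTS = {"prefix", "given", "family"}

# Player B's moves, in the order used in the game above.
GAME = [
    ("element", "SDSRecipient", None),
    ("attribute", "SDSRecipient", "role", "principal"),
    ("attribute", "SDSRecipient", "role-profile-ID", "67890"),
    ("attribute", "SDSRecipient", "practitioner-ID", "12345"),
    ("element", "name", None),
    ("element", "prefix", "Dr"),
    ("element", "given", "John"),
    ("element", "family", "Jones"),
]


def play(game):
    """Player A's side: return True for a win, False otherwise."""
    attrs, name_parts, pair_checked = {}, set(), False
    for move in game:
        if move[0] == "attribute":
            _, owner, name, value = move
            if owner != "SDSRecipient" or name not in EXPECTED_ATTRS:
                return False
            attrs[name] = value
            if name == "role" and not role_oracle(value):
                return False
            if name == "practitioner-ID" and not practitioner_oracle(value):
                return False
            # Deferred check: only possible once both values are known.
            if not pair_checked and attrs.keys() >= {"practitioner-ID", "role-profile-ID"}:
                if not role_profile_oracle(attrs["practitioner-ID"], attrs["role-profile-ID"]):
                    return False
                pair_checked = True
        else:
            _, name, content = move
            if name in EXPECTED_NAME_PARTS:
                if not content:
                    return False
                name_parts.add(name)
            elif name not in {"SDSRecipient", "name"}:
                return False
    return pair_checked and "role" in attrs and name_parts == EXPECTED_NAME_PARTS


print("I win!" if play(GAME) else "Not known to be OK at close of play")
```

If B stops after the attributes, play returns False: the document is not known to be OK at close of play, which is exactly the losing condition described above.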

Using a Specific HL7v3 Model in the Game

In the following game Player B presents the second example in Section 2 above, and Player A has to hand the corresponding HL7v3 RIM-based model for this specific set of XML instances (such as the definition in the MIM [NHSMIM6-1] of the clinical document of which this is a fragment), together with some means of checking terms in the terminologies used. This corresponds to a familiar situation where a design-time human reviewer goes through an XML instance example item by item, checking everything against the model in order to understand it fully, for example in preparation for constructing software to process this kind of incoming XML and make the information it contains available in a local information system. A view of the model being used by Player A (which is published as a hyperlinked HTML package) is included in the presentation slides. This game is envisaged as a subgame of a larger game on the whole example message; it is also quite coarse-grained since otherwise it would be rather long.

Player B: I have: element informationRecipient with attribute typeCode, value "PRCP"

Player A: Thinks: There's informationRecipient in the model... typeCode is the
participation type, fixed at "PRCP" in this model, which (consult the typeCode oracle)
means "principal"... the other attribute is optional ...
...informationRecipient in the model links across to an entity with a choice of templates
for its content... I should get the templateId next so I know which one it is... OK. Continue

Player B: I have: element  templateId with attribute root,
value "2.16.840.1.113883.2.1.3.2.4.18.2", and attribute extension,
value "COCD_TP145004UK01.AssignedEntitySDS"

Player A: Thinks:  good, here comes the templateId...the root is just the OID for the
template identifier vocabulary, what I want is the extension (consult the templateId oracle) ...
...so this is a RecipientEntitySDS rather than a  RecipientOrganization. OK ... Continue

Player B: I have:  element intendedRecipient with attribute classCode, value "ASSIGNED"

Player A: Thinks:  OK, both templates have this element as the next layer of the XML structure;
I should get the typeId next and it should correspond to the templateId ... Continue

Player B: I have:   element typeId with attribute root, value "2.16.840.1.113883.1.3" and
attribute extension, value "COCD_TP145004UK01#AssignedEntitySDS"

Player A: Thinks:  Odd that it's a different OID...assume this is an error,
the extension looks OK... Continue

Player B: I have:   element id with attribute root, value "1.2.826.0.1285.0.2.0.65", and
attribute extension, value "12345"

Player A: Thinks:  which id is this? see what the model says...OK, the first one is the
SDS User ID, the second is the SDS User Role Profile ID ... Continue

Player B: I have:  element id with attribute root, value "1.2.826.0.1285.0.2.0.67",
and attribute extension, value "67890"

Player A: Thinks:  right, I have an SDS User ID of "12345" and an
SDS User Role Profile ID of "67890"... Continue

Player B: I have:  element informationRecipient with attribute classCode, value "PSN"
and attribute determinerCode, value "INSTANCE"

Player A: Thinks:  OK, this corresponds to the Person entity in the model, and the
attributes are fixed values... Continue

Player B: I have:    element typeId with attribute  root, value "2.16.840.1.113883.1.3" and
attribute extension, value "COCD_TP145004UK01#assignedPerson"

Player A: Thinks:  OK, fixed value needed for some HL7 modelling reason ... Continue

Player B: I have:  element name containing: element prefix with content "Dr";
element given with content "John"; element family with content "Jones".

Player A: Thinks:  OK, name has datatype PN... content OK for PN datatype...
Document OK... I win!
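
What the specific model contributes in this game can be sketched in Python. The sketch covers only the first six moves, and the data structures are assumptions made for the illustration; the values in them (the fixed typeCode, the template named by the templateId extension, and the meanings of the first and second id) are those stated in the game above. Note that the two identifiers are understood by position, because the model says which comes first.

```python
# Player B's first six moves from the game above: (element, attributes).
MOVES = [
    ("informationRecipient", {"typeCode": "PRCP"}),
    ("templateId", {"root": "2.16.840.1.113883.2.1.3.2.4.18.2",
                    "extension": "COCD_TP145004UK01.AssignedEntitySDS"}),
    ("intendedRecipient", {"classCode": "ASSIGNED"}),
    ("typeId", {"root": "2.16.840.1.113883.1.3",
                "extension": "COCD_TP145004UK01#AssignedEntitySDS"}),
    ("id", {"root": "1.2.826.0.1285.0.2.0.65", "extension": "12345"}),
    ("id", {"root": "1.2.826.0.1285.0.2.0.67", "extension": "67890"}),
]

# What the specific model tells Player A (structure assumed, values from the game).
FIXED_TYPE_CODE = "PRCP"
TEMPLATES = {"COCD_TP145004UK01.AssignedEntitySDS": "RecipientEntitySDS"}
ID_MEANINGS = ["SDS User ID", "SDS User Role Profile ID"]  # by position in the instance


def play_specific(moves):
    """Return what Player A has understood, or None if the instance does not fit."""
    understood, ids_seen = {}, 0
    for element, attrs in moves:
        if element == "informationRecipient" and "typeCode" in attrs:
            if attrs["typeCode"] != FIXED_TYPE_CODE:
                return None
            understood["participation"] = "principal"
        elif element == "templateId":
            template = TEMPLATES.get(attrs["extension"])
            if template is None:
                return None
            understood["template"] = template
        elif element == "id":
            if ids_seen == len(ID_MEANINGS):
                return None  # more ids than the model allows
            understood[ID_MEANINGS[ids_seen]] = attrs["extension"]
            ids_seen += 1
    return understood if ids_seen == len(ID_MEANINGS) else None


print(play_specific(MOVES))
```

Almost everything Player A "understands" here comes from the three model tables rather than from the instance, which is the sense in which the instance acts as a set of implicit references into the model.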

Using a Generic HL7v3 Model

One of the design decisions driving HL7v3 modelling and (the full) XML serialization is that HL7v3 instances should be understandable generically, that is, apart from the specific message model. The next game reprises the previous one, but with Player A initially only having the underlying international CDA standard, and relying on information given in the XML instance to identify other sources required to interpret the XML instance. The epistemological device of an "oracle" is used again here to shortcut the research needed in practice to look up OIDs and allowed values. The templating mechanism is ignored here, since it is still under review at HL7.org level, and does not appear in the generic CDA model.

Player B: I have: element informationRecipient with attribute typeCode,
value "PRCP"

Player A: Thinks: There's informationRecipient in the model... typeCode is
the participation type, "PRCP" (consult the typeCode oracle) means "principal"...
...informationRecipient in the model links across to IntendedRecipient ... OK. Continue

Player B: I have: element  templateId with attribute root,
value "2.16.840.1.113883.2.1.3.2.4.18.2", and attribute  extension,
value "COCD_TP145004UK01.AssignedEntitySDS"

Player A: Thinks:  I don't have any templating in my model... ignore and carry on ... Continue

Player B: I have:  element intendedRecipient with attribute classCode, value "ASSIGNED"

Player A: Thinks:  OK, this is the role acting as recipient, and the classCode names the
actual role, so this is an assigned recipient...Continue

Player B: I have:   element typeId with attribute root, value "2.16.840.1.113883.1.3" and
attribute extension, value "COCD_TP145004UK01#AssignedEntitySDS"

Player A: Thinks:  More templating stuff that I don't understand... Continue

Player B: I have:   element id with attribute root, value "1.2.826.0.1285.0.2.0.65",
and attribute extension, value "12345"

Player A: Thinks:  OK, I'm expecting any number of IDs here... (consult the OID oracle) this is an
SDS User ID... (consult the SDS User ID oracle) value OK ... Continue

Player B: I have:  element id with attribute root, value "1.2.826.0.1285.0.2.0.67", and
attribute extension, value "67890"

Player A: Thinks:  OK, I'm expecting any number of IDs here... (consult the OID oracle)
this is an SDS User Role Profile ID... (consult the SDS User Role Profile ID oracle)
value OK ... Continue

Player B: I have:  element informationRecipient with attribute classCode, value "PSN" and
attribute determinerCode, value "INSTANCE"

Player A: Thinks:  remaining attributes of the intendedRecipient entity are optional...
...OK, element appears to represent both the link in the model from the role
intendedRecipient to the role playing entity (Person) in the model, and the Person entity itself...
...attributes are the fixed values given in the model. OK... Continue

Player B: I have:    element typeId with attribute  root, value "2.16.840.1.113883.1.3" and
attribute extension, value "COCD_TP145004UK01#assignedPerson"

Player A: Thinks:  More templating stuff that I don't understand... Continue

Player B: I have:  element name containing: element prefix with content "Dr"; element given with
content "John"; element family with content "Jones".

Player A: Thinks:  OK, I'm expecting any number of names here... name has datatype PN...
content OK for PN datatype... no more names, that's the end of the document... Document OK... I win!
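
The moves of this game can be replayed mechanically. The following is a minimal sketch in Python, not an HL7 implementation: the XML instance is reconstructed from Player B's moves (the nesting shown and the omission of the HL7 namespace are assumptions made here), and the dictionaries TYPE_CODE_ORACLE and OID_ORACLE are hypothetical stand-ins for the oracles that Player A consults.

```python
import xml.etree.ElementTree as ET

# Instance reconstructed from Player B's moves; nesting is an assumption,
# and the HL7 namespace is left out to keep the sketch short.
INSTANCE = """<informationRecipient typeCode="PRCP">
  <templateId root="2.16.840.1.113883.2.1.3.2.4.18.2"
              extension="COCD_TP145004UK01.AssignedEntitySDS"/>
  <intendedRecipient classCode="ASSIGNED">
    <typeId root="2.16.840.1.113883.1.3"
            extension="COCD_TP145004UK01#AssignedEntitySDS"/>
    <id root="1.2.826.0.1285.0.2.0.65" extension="12345"/>
    <id root="1.2.826.0.1285.0.2.0.67" extension="67890"/>
    <informationRecipient classCode="PSN" determinerCode="INSTANCE">
      <typeId root="2.16.840.1.113883.1.3"
              extension="COCD_TP145004UK01#assignedPerson"/>
      <name><prefix>Dr</prefix><given>John</given><family>Jones</family></name>
    </informationRecipient>
  </intendedRecipient>
</informationRecipient>"""

# Hypothetical oracles: in practice these are look-ups against external sources.
TYPE_CODE_ORACLE = {"PRCP": "principal"}
OID_ORACLE = {
    "1.2.826.0.1285.0.2.0.65": "SDS User ID",
    "1.2.826.0.1285.0.2.0.67": "SDS User Role Profile ID",
}
# Elements the generic model gives Player A no way to interpret.
IGNORED = {"templateId", "typeId"}


def play(xml_text):
    """Visit elements in document order and record Player A's reading of each."""
    moves = []
    for el in ET.fromstring(xml_text).iter():
        if el.tag in IGNORED:
            moves.append((el.tag, "ignored: no templating in the generic model"))
        elif el.tag == "id":
            # An unknown OID raises KeyError: Player A cannot continue.
            moves.append((el.tag, OID_ORACLE[el.get("root")] + " " + el.get("extension")))
        elif "typeCode" in el.attrib:
            moves.append((el.tag, "participation: " + TYPE_CODE_ORACLE[el.get("typeCode")]))
        elif "classCode" in el.attrib:
            moves.append((el.tag, "class " + el.get("classCode")))
        elif el.tag == "name":
            moves.append((el.tag, " ".join(part.text for part in el)))
        # prefix, given and family are read as part of their parent name element.
    return moves


if __name__ == "__main__":
    for tag, reading in play(INSTANCE):
        print(tag, "->", reading)
```

Three of the nine moves are ignored outright, and the two id elements can only be read by taking an OID to an external source.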

So, is it possible to be simple without being stupid?

A closer look at simple XML

The full generality of the HL7v3 XML instance only really comes into its own to support the universal processor represented by Player A in the final game above. Comparing the first and second games, the additional pain felt by users of the generic model-driven XML compared to "ordinary" XML is clear, and this, together with the sheer size of the data, is what users of HL7v3 have been complaining about. However, the first game deserves a closer look to see where and how the extra information exposed in the later games is implicit in the (deceptively) simple XML. The following game reprises the first game above, this time making the semantics "known" by Player A more explicit. Once more, this game should be considered as a subgame of a larger game on a full clinical document.

Player B: I have: element "SDSRecipient"

Player A: Thinks: OK, this is the element I'm expecting. What does it mean?
(consult local element name oracle)... OK, SDSRecipient provides information
about the recipient of a clinical document, assumed to be a healthcare
practitioner who is on the electronic national registry for England ... Continue

Player B: I have: attribute "role" on element "SDSRecipient", value "principal"

Player A: Thinks: attribute OK on element "SDSRecipient"; is this the role of
the recipient or the role of the person?... value "principal", so looks like it's the
principal recipient of the document... need to check value "principal"...
(consult "role" oracle)... OK... Continue

Player B: I have: attribute "role-profile-ID" on element "SDSRecipient", value "67890"

Player A: Thinks: attribute OK on "SDSRecipient", this looks like the role of the person...
need to check value "67890"... (consult the "role-profile-ID" oracle)...
...oops, need the practitioner-ID attribute value too...  OK... Continue

Player B: I have: attribute "practitioner-ID" on element "SDSRecipient", value "12345"

Player A: Thinks: attribute OK on "SDSRecipient"; this looks like the practitioner identifier
on the registry... need to check value "12345"... ask the practitioner-ID oracle...
...OK, now need to give the practitioner-ID attribute value and the role-profile-ID value
to the role-profile-ID oracle...OK... Continue

Player B: I have: element "name"

Player A: Thinks: element OK, this must be the name of the person... Continue

Player B: I have: element "prefix", content "Dr"

Player A: Thinks: element OK... check content, name prefixes have a
set list of values... (consult the "prefix" oracle)... OK... Continue

Player B: I have: element  "given", content "John"

Player A: Thinks: element OK ... check content, given names just need to
have a name datatype, so check datatype... OK... Continue

Player B: I have: element "family", content "Jones"

Player A: Thinks: element OK... check content, family names just need to
have a name datatype, so check datatype...  Document now OK...Done! I win!
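
The same checks can be sketched in Python. The instance below is reconstructed from Player B's moves, and the four oracle sets are hypothetical stand-ins for the external look-ups; in particular, the role-profile-ID oracle is modelled as a set of (practitioner-ID, role-profile-ID) pairs, reflecting Player A's discovery that one value cannot be checked without the other.

```python
import xml.etree.ElementTree as ET

# Simple instance reconstructed from Player B's moves.
SIMPLE = (
    '<SDSRecipient role="principal" role-profile-ID="67890" practitioner-ID="12345">'
    "<name><prefix>Dr</prefix><given>John</given><family>Jones</family></name>"
    "</SDSRecipient>"
)

# Hypothetical oracles: none of this knowledge is carried in the instance.
ROLE_ORACLE = {"principal"}
PRACTITIONER_ORACLE = {"12345"}
ROLE_PROFILE_ORACLE = {("12345", "67890")}   # valid only as a pair
PREFIX_ORACLE = {"Dr", "Mr", "Mrs", "Ms"}


def check(xml_text):
    """Return a list of problems; an empty list means Player A wins."""
    el = ET.fromstring(xml_text)
    if el.tag != "SDSRecipient":
        return ["unexpected element " + el.tag]
    problems = []
    if el.get("role") not in ROLE_ORACLE:
        problems.append("unknown role")
    practitioner = el.get("practitioner-ID")
    if practitioner not in PRACTITIONER_ORACLE:
        problems.append("unknown practitioner")
    # This check needs both attribute values, whatever order they arrived in.
    if (practitioner, el.get("role-profile-ID")) not in ROLE_PROFILE_ORACLE:
        problems.append("role profile not valid for practitioner")
    name = el.find("name")
    if name is None:
        problems.append("missing name")
    else:
        if name.findtext("prefix") not in PREFIX_ORACLE:
            problems.append("unknown prefix")
        for part in ("given", "family"):
            if not (name.findtext(part) or "").strip():
                problems.append("empty " + part + " name")
    return problems


if __name__ == "__main__":
    print(check(SIMPLE))
```

The instance is short, but every line of the check depends on knowledge held outside it.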

What this game shows on one level is that - no surprise - domain knowledge in terms of both the subject matter and its encoding in XML is needed to interpret an XML instance accurately. However, it also suggests that what is needed in addition to the simple XML to ensure accurate (say, automated) interpretation is not so much additional detail in the instance, but ways to encode additional knowledge about a class of instances. This is exactly what one stream of the New ITS work is making possible for HL7v3 models, gaining simplicity in the XML by allowing simpler models with more user-friendly names to be derived (in a reversible way) from a fully-expressed HL7v3 model.
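
To illustrate what "reversible" means here, the following Python sketch expands the simple form used in the game above into the generic form and back again. It is an illustration only and not the New ITS algorithm: the functions expand and simplify, the mapping tables, and the decision to leave out the templating elements are all assumptions made for this example. The point is that the fixed values (classCode, determinerCode, the OID roots) live in the mapping rather than in the simple instance, and are regenerated on expansion.

```python
import copy
import xml.etree.ElementTree as ET

SIMPLE = (
    '<SDSRecipient role="principal" role-profile-ID="67890" practitioner-ID="12345">'
    "<name><prefix>Dr</prefix><given>John</given><family>Jones</family></name>"
    "</SDSRecipient>"
)

# Knowledge about the class of instances, held once rather than in every instance.
ROLE_TO_TYPECODE = {"principal": "PRCP"}
TYPECODE_TO_ROLE = {code: role for role, code in ROLE_TO_TYPECODE.items()}
ID_ROOTS = {
    "practitioner-ID": "1.2.826.0.1285.0.2.0.65",
    "role-profile-ID": "1.2.826.0.1285.0.2.0.67",
}
ROOT_TO_ATTR = {root: attr for attr, root in ID_ROOTS.items()}


def expand(simple):
    """Simple form -> generic form, regenerating the fixed values."""
    outer = ET.Element("informationRecipient",
                       typeCode=ROLE_TO_TYPECODE[simple.get("role")])
    role = ET.SubElement(outer, "intendedRecipient", classCode="ASSIGNED")
    for attr, root in ID_ROOTS.items():
        ET.SubElement(role, "id", root=root, extension=simple.get(attr))
    person = ET.SubElement(role, "informationRecipient",
                           classCode="PSN", determinerCode="INSTANCE")
    person.append(copy.deepcopy(simple.find("name")))
    return outer


def simplify(full):
    """Generic form -> simple form, dropping what expand() can regenerate."""
    role = full.find("intendedRecipient")
    attrs = {"role": TYPECODE_TO_ROLE[full.get("typeCode")]}
    for id_el in role.findall("id"):
        attrs[ROOT_TO_ATTR[id_el.get("root")]] = id_el.get("extension")
    simple = ET.Element("SDSRecipient", attrs)
    simple.append(copy.deepcopy(role.find("informationRecipient/name")))
    return simple


if __name__ == "__main__":
    full = expand(ET.fromstring(SIMPLE))
    print(ET.tostring(full, encoding="unicode"))
    print(ET.tostring(simplify(full), encoding="unicode"))
```

The round trip preserves the attribute values and the name content; attribute order is not significant in XML and is not preserved by this sketch.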

The HL7v3 New ITS work has provided a framework for working with complex models in the specific context of the HL7 RIM, with simplifications using model reshaping and model transforms that depend on ways that RIM-based models are often used. However, the underlying problem is not restricted to healthcare information, and the final section draws on experience with interoperability standards for cross-community processes in the public sector, and cross-territory processes in multinational companies, as well as healthcare.

What are the Implications for Markup Technologies?

How can generic XML technologies make it easier to handle semantically complex information in XML in a simpler way? This is a question that I hope this conference will be willing to discuss seriously. The following suggestions are offered as a starting point for discussion:

  • Enable specification of attribute values and element content as belonging to intensionally referenced, externally defined value sets (a general term intended to include terminologies, thesauri etc). Doing this at schema level, using a declared naming convention, will lessen the need for cumbersome coding schemes in instances.
  • Enable externally defined value sets (and named subsets thereof), referenced intensionally not extensionally, to be used as repertoires for GIs and attribute names. This does introduce a risk of a term not being suitable to use as a name in the XML; however, datatype constraints are not unknown in ontologies and terminologies, so this is not a fatal objection. For this and the preceding suggestion, it is essential that the means whereby an XML processor may access the named value sets is not specified.
  • Enable datatyping schemes that specialize the foundational XML data types to be defined externally (including by means other than XML technologies) and referenced intensionally.
  • Make it easy and straightforward to use XML and UML together to support semantic interoperability of complex information, in ways that use the strengths of both to best advantage. 10
  • Do all of this in a way that yields document-class-specifications (functional equivalents of current schemas) that are easy to use and acceptable to the general software development and enterprise integration community.
  • Find a way to represent XML structure and content that interoperates nicely with other software design notations, and does not need special tooling.
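
As a starting point for that discussion, the first suggestion might look something like the following Python sketch. Everything named here is hypothetical: the BINDINGS table stands in for a schema-level declaration, the urn:example names stand in for a declared naming convention, and demo_resolver is just one possible way of reaching a value set. The validator refers to value sets only by name and receives the resolver from outside, so the means of access is left unspecified, and a value set may be a membership test rather than an enumerated list.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

Membership = Callable[[str], bool]        # "is this value in the value set?"
Resolver = Callable[[str], Membership]    # value set name -> membership test

# Hypothetical schema-level bindings: (element, attribute) -> value set name.
BINDINGS: Dict[Tuple[str, str], str] = {
    ("SDSRecipient", "role"): "urn:example:valueset:recipient-role",
    ("SDSRecipient", "practitioner-ID"): "urn:example:valueset:numeric-id",
}


def validate(root: ET.Element, bindings: Dict[Tuple[str, str], str],
             resolve: Resolver) -> List[str]:
    """Check every bound attribute against its named value set."""
    failures = []
    for el in root.iter():
        for attr, value in el.attrib.items():
            name = bindings.get((el.tag, attr))
            if name is not None and not resolve(name)(value):
                failures.append(f"{el.tag}/@{attr}={value!r} not in {name}")
    return failures


def demo_resolver(name: str) -> Membership:
    """One possible resolver; a terminology service could sit here instead."""
    def is_recipient_role(value: str) -> bool:
        return value in {"principal"}
    table: Dict[str, Membership] = {
        "urn:example:valueset:recipient-role": is_recipient_role,
        "urn:example:valueset:numeric-id": str.isdigit,  # defined by rule, not by list
    }
    return table[name]


if __name__ == "__main__":
    doc = ET.fromstring('<SDSRecipient role="principal" practitioner-ID="12345"/>')
    print(validate(doc, BINDINGS, demo_resolver))
```

The instance carries no coding scheme at all; the binding between attribute and value set is stated once, for the whole class of documents.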

Acknowledgement

Many thanks to the Extreme peer reviewers who were invaluable in helping me realize just what it was I was trying to say.

Notes

1.

Some of the original SGML work can still be found, though it is probably in some need of curation [STEP-SGML]; there is also more recent work by Lubell et al, including the UML perspective [STEP-XML-UML].

2.

Note that this is a requirement for semantic interoperability via XML, rather than round-tripping, which is generally a minimal special case.

3.

Please note that no criticism is intended of this example as an HL7v3 based interoperability standard - it has been chosen as a real example that exemplifies properties of the HL7v3 XML that are of interest in this paper, whilst also having convenient characteristics for this exposition such as element names in the XML that correspond directly to class names in the model.

4.

Each simplifying rule depends on certain aspects of the full RIM-based models not being used to convey substantive information, for example, carrying only fixed default values, and therefore being able to be abstracted away (and algorithmically regenerated when round-tripping) without substantive loss of information.

5.

UML is by now well established as a complementary platform-independent medium to XML. This partnership is not without its problems; however, there is enough evidence of its practical utility that debating whether in principle UML should be used with XML is not, in my opinion, a very useful activity. The engineering challenge is to get the best out of each when used together.

6.

Informal rigour refers to a rigorous natural language exposition in the 2000+ year tradition of philosophy. These days, especially in computer science, the term also suggests an implicit or explicit rationale for formalization, where formalization means full modelling and verification in symbolic logic, preferably by automated theorem-proving - and there may indeed be potential in that direction. Mathematics can be very free with implied assertions of full formalizability, and the reader is left to make their own judgement on the possibility or desirability of formalizing this work. Implementation is (pace Carroll Morgan) in practice a different matter, and there some utility is claimed for the perspective elaborated here.

7.

In traditional game semantics parlance, player A is called "Myself", and player B, "Nature". The intended interpretation here is that A is a receiving system in a working interoperability channel (or a human looking at an example message), and B is the source of the message.

8.

Loosely used to include substitutions, reorderings etc as necessary.

9.

This could be quite complex, for example computation of a term's membership of a classification within a complex terminology, using a third party terminology service.

10.

This issue needs to be kept safe from being hijacked by debates about (whether to or) how best to use either or both technologies in specific situations.


Bibliography

[CDA2] Clinical Document Architecture release 2, Health Level Seven, http://www.hl7.org

[Game1] J Hintikka & G Sandu, Game-theoretical Semantics, Chapter 6 of J van Benthem & A ter Meulen (eds), Handbook of Logic and Language, Elsevier & MIT Press 1997.

[GartnerHL7v3] Wes Rishel, HL7 V3 Messages Need a Critical Midcourse Correction, Gartner Industry Research Publication ID Number: G0014095, 5 June 2006

[GODDAG06] C Huitfeldt & M Sperberg-McQueen, Representation and processing of Goddag structures, Extreme Markup Languages 2006

[HL7v3] Health Level Seven Version 3 standards are available for purchase (and freely to members) via the organization's website http://www.hl7.org accessed 21 June 2007.

[NewITS] New ITS Guide, HL7 Implementation Technology and Conformance Technical Committee. At the time of writing this is available via the HL7v3 Ballot pack, at http://www.hl7.org/v3ballot/html/welcome/environment/index.htm.

[NHSMIM6-1] Message Implementation Manual 6.1, issued by NHS Connecting for Health 30 Jan 2007, restricted circulation.

[OHF] Eclipse Open Healthcare Framework (OHF) Project http://www.eclipse.org/ohf/ accessed 21 June 2007.

[Rector99] Alan Rector, Clinical Terminology: Why is it so hard?, 1999 Methods of Information in Medicine 38(4):239-252 http://www.cs.man.ac.uk/mig/ accessed February 2004

[SICoP] Semantic Interoperability Community of Practice Wiki http://colab.cim3.net/cgi-bin/wiki.pl?SICoP accessed 15 June 2007

[STEP-SGML] Archive page of ISO/TC184/SC4/WG3/T14 at http://www.eccnet.com/step/ and the summary on Cover Pages at http://xml.coverpages.org/stepExpressXML.html, both accessed 15 June 2007

[STEP-XML-UML] J Lubell, R S Peak, V Srinivasan, S C Waterbury, STEP, XML, AND UML: Complementary Technologies; Proceedings of DETC 2004: ASME 2004 Design Engineering Technical Conferences and Computers and Information in Engineering Conference

[Wrightson05] A Wrightson, Semantics of Well Formed XML as a Human and Machine Readable Language, Extreme Markup Languages 2005

[Wrightson06] A Wrightson, Conveying Meaning through Space and Time using XML; Semantics of Interoperability and Persistence, Extreme Markup Languages 2006



Ann Wrightson [CSW Group Ltd]