Network-Oriented Document Abstraction Language: Structure and Reference for the Rest of the Web

Lee Iverson


The Web has amply demonstrated the benefits of an infrastructure that makes publishing and reference of semi-structured information easily accessible, but in many cases reference and reuse of such information is only at the level of complete files. The potential for greater benefits that may derive from sub-document structure and reference is currently being explored, but this exploration is limited by the fact that such references can only be used with a portion of the Web's content (i.e. that encoded in XML). We have developed a system, the Network-Oriented Document Abstraction Language (NODAL), that is designed to provide a common data model, schema language and sub-document reference system for web-accessible documents or databases encoded in any format (e.g. images, PDF or Word documents etc.). This system thus provides a common reference and access environment for all structured, semi-structured and unstructured data. In this paper, we describe the data model, schema language and URI-based reference language for NODAL and compare it with XML and other systems. Finally, we outline a number of ways that this system can be extended to allow for composition, synchronization and reuse of documents and databases and can thus form a hypertextual foundation for interactive application development without inhibiting interoperability with existing systems. In essence, with NODAL we can bring the benefits of markup and hypertext to all data formats.

Keywords: Publishing; Modeling

Lee Iverson

Lee Iverson is an Assistant Professor in the Department of Electrical and Computer Engineering's Software Engineering Program. His main research interests are in Knowledge Management, collaboration, digital libraries and museums and usable security and privacy systems.


Lee Iverson [University of British Columbia]

Extreme Markup Languages 2005® (Montréal, Québec)

Copyright © 2005 Lee Iverson. Reproduced with permission.



The Web has created an unprecedented ability to publish and distribute documents and other data and to augment this content with hypertextual references. However, the granularity of these references is, for the most part, limited to the file level. The combination of a need to reuse and/or quote sources smaller than a whole document (e.g. Nelson's transclusion [Nel99]) and the advent of new technologies built on hypertextual reference (e.g. RDF [RDF98] and the Semantic Web [SemWeb98]) has stimulated a desire to provide a means of referencing only parts of a document. In traditional HTML this was accomplished with named anchors, and was thus limited to those points in a document anticipated by the author. XPointer [XPointer], XPath [XPath] and XLink [XLink] provide sub-document linking for XML documents, but as yet only a small fraction of the Web is available as XML.

Instead of restricting sub-document reference to markup languages, we have designed NODAL, the Network-Oriented Document Abstraction Language, as a simple middleware layer that provides a means of defining URI-referenceable structure for all document formats and other structured data sources. It achieves this by defining a simple data model and schema language that can then be used to define sub-document structure for any document type. An extensible, plugin architecture is used to associate format encoder/decoder pairs with MIME types that in turn identify particular document formats and schemas. This system thus defines a common language for referencing and accessing the structured contents of arbitrary document formats at arbitrary granularity. Moreover, it can be used in conjunction with other sub-document reference schemes without interference.

Finally, the system provides a base for future development and research towards new software architectures and development paradigms that allow for a much more seamless interconnection between data and documents across applications, formats, database structures and distribution protocols.

The Problem: Granular Reference, Content Reuse and Annotation

Consider an engineering student working on a class project. The course has a website that contains notes from classroom sessions as well as a set of links to online resources to learn more about certain subjects. The student exchanges email with other students about his project and asks questions on an online forum and mailing list that help him successfully complete his project. At the end, he has a design document, a project report and an artifact that is a combination of a software system and some hardware described in a CAD model. Through this entire project, he has used a few dozen sources of information (most online), at least half a dozen different software systems, and between his project notes and reports he has written a few thousand lines of text to describe and justify his design. How then is this information structured to facilitate the work itself and to communicate its products? If he was using broadly accepted practices then his working environment consists of a project directory on his desktop system, project and class folders in his email client, and a set of bookmarks for selected information sources in his browser. But how are they connected? Where is the justification for a particular design feature in the CAD model? Where is the connection between his project document and the email conversation and forum entries in which some critical issues were explained?

This kind of scenario is familiar to most hypertext researchers, and some of these issues are well described by Ted Nelson in a recent paper [Nel99] summarizing his goals and motivations of the past 40 years and a prescription for “Xanalogical structure”. Unfortunately, as with many hypertext systems of the past (e.g. Microcosm [Mcsm90] and Intermedia [Imed88]), he suggests that the solution lies in an incompatible, new system and data model that will only act as its own kind of information silo. Instead we suggest that the only way that this can be made to work (and have any possibility of wide uptake) is to develop a system that interoperates seamlessly with existing environments and applications (e.g. as in Chimera [Chim00]).

In a separate paper [DKC05], we have analyzed this general class of issues and various kinds of "information silos" (e.g. applications, data formats, operating systems, database systems). We suggested that a change to the assumed application architecture would allow for the deep linking, content reuse and general annotation facilities necessary to build general personal (and group) information management environments. This application architecture (the DKC Model) has three layers: a persistent storage layer at its base (the Data layer) that forms the basis for sharing, linking, reuse and versioning of data and structure; a Knowledge layer that can be used to imbue this data with semantic meaning; and a final, independent Context layer that manages user interface, interaction and modelling. By advocating the independence of view and storage (in the Context and Data layers), we suggest that this model will not only allow personal and group information management to finally begin to deal with cross-platform, cross-application issues but will also enable greater innovation and adaptation in the user- and task-oriented spaces often considered by HCI and Hypertext researchers.

The NODAL Solution: Document Data Modeling and Reference

The approach we advocate is to separate data structure from syntax (just as pure XML approaches separate representation and presentation) by designing a system around a data model that can represent the data structures internal to a wide variety of file formats. In essence, we represent each data format as an encoding of a structured container and develop a schema language and system architecture to provide a standard means of referencing, querying, reusing and navigating through those data structures. When we combine these schemas with plugins to decode and encode data streams in a wide variety of formats, we have a complete system that can enable general markup-like capabilities for any data format. This is similar to the approach of Abiteboul in [Abit93] and [Abit95], which provided a relational-database-inspired query and update model for structured files. It can also be traced directly back to HyTime [ISON1920] and its “groves” data model, which was defined in order to provide a reference architecture for the contents of multimedia files. Unfortunately, we believe that the HyTime groves data model was better matched to representing markup-like files than other data formats, and was sometimes as unnatural to manipulate as relational data is in typical programming languages. Instead, with NODAL, we hope to demonstrate that we can derive and implement a simple data model that naturally represents the data structures and formats of modern programming languages and yet can still form a basis for granular reference and annotation inside existing file formats. Moreover, we have designed our system around the standard URI document/fragment model for reference and query. This emphasis on generality, simplicity and compatibility in the NODAL design is intended to provide the basis for the development of a standard model for data integration and reference.

The design requirements for the NODAL data model and path language can largely be taken from those defined for the XML Schema Language [XSchReq] and XPointer [XPtrReq], without assuming an XML framework. The model must be able to represent unstructured, semi-structured, and structured data in as wide a variety of formats as possible and provide a simple, natural path language that can be easily translated into fragment URIs. The NODAL system thus defines a data model that can be applied to modeling document formats in such a way as to provide a basis for URI-based fragment references for any kind of modeled document. It is also a superset of the relational model with object-relational (O/R) extensions (making such common data structures as sequences and dictionaries explicit), so it can also be used to provide a direct (and indirect) reference and integration architecture that encompasses both database and document-based information repositories. Moreover, we will show how this approach can also be extended to a wide variety of data access protocols, some database-like and some filesystem-like.

The NODAL Data Model

The challenges are to develop or adapt a data model that has clear application to distributed storage systems, provides a framework for both absolute and relative URI-based reference, and maps naturally to document formats, database schemata and application-level APIs. Moreover, for some of the more advanced functionality we will discuss later, it is important to distinguish which units of data will have metadata associated with them.

The fundamental design constraints we settled on are summarized as follows:

  • Use a type system to model the structural and value constraints that characterize the data encoded in a particular format. This means that each data format or database schema should be expressible in the NODAL data modelling language with a new schema or via reference to an existing schema.
  • Clearly separate data modeling from syntax. A common data model for a variety of data formats requires this independence, although there are certainly situations in which there must be a standard syntax for certain uses of the model (e.g. in coding references as URIs).
  • Don't invent new data types or structures. Compatibility of this model with existing programming languages and data storage models is a primary concern. We are primarily interested in providing a minimally sufficient model that can be adapted for a wide range of purposes using existing data sources.
  • Encourage standardization and reuse of schemas by designing a language that encourages granular reuse and composition of schemas.
  • Distinguish between immutable and mutable types. We need both, but immutable objects are more easily distributable (reference equals copying). The only objects that may need metadata such as change history and access control are the mutable ones.
  • Ensure that all objects have a simple, natural serialization to readable text. This way, the translation to URI references will be more natural.

We believe that these constraints follow naturally from both programming language and data modeling experience and from the overwhelming need for both backward compatibility and extensibility. If we ignore these requirements, our goal of providing a model that can be the basis for seamless information integration will fail.

Literal Data Types

The literal, immutable data types were chosen by combining the type systems of XML Schema: Part 2 [XSchema2], the SQL99 standard [SQL99], and modern programming languages. No significant explanation should be required to justify the set of literal data types shown in Table 1. Where appropriate, a reference is provided to a standard that describes the storage format and textual expression of each type. The Name type is the only one of these that may need explanation. It is an immutable sequence of characters with an optional namespace (specified by a URI as in XML namespaces). This type is distinguished from the mutable String type (a Sequence of characters).

Table 1: Literal data types for NODAL data model
| Type Name | Description | Standard |
|---|---|---|
| Boolean | A true or false value | |
| Character | A single character | ISO 10646 |
| Octet | An 8-bit unsigned integral value (0-255) | |
| Short | A 16-bit signed integral value | |
| Integer | A 32-bit signed integral value | |
| Long | A 64-bit signed integral value | |
| Float | A 32-bit floating point value | IEEE 754 [IEEE754] |
| Double | A 64-bit floating point value | IEEE 754 [IEEE754] |
| Name | An immutable character string with optional namespace | |
| Timestamp | A single moment in time | ISO 8601 [ISO8601] |

One characteristic of these literal types is that their identity is completely described by their content. This combination of content-defined identity and immutability is usually described as a "value" type in computer language design. Their advantages for distributed computing applications are well known: they are inherently distributable, since all copies are identical.
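The value semantics described above can be illustrated with a small sketch. The `Name` class and its field names are hypothetical and are not taken from the NODAL implementation; the sketch only assumes that a Name is a local string plus an optional namespace URI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Name:
    """Hypothetical value type: an immutable string with an optional namespace URI."""
    local: str
    namespace: Optional[str] = None


a = Name("title", "http://example.org/ns")
b = Name("title", "http://example.org/ns")

# Identity is defined entirely by content: separately built copies compare
# equal and hash alike, so either copy can stand in for the other.
same_value = (a == b) and (hash(a) == hash(b))

# Immutability: an attempt to modify a value is rejected.
try:
    a.local = "other"
    mutated = True
except AttributeError:
    mutated = False
```

Because equal values are interchangeable, a reference to such a value can be replaced by a copy of it without changing meaning, which is the property that makes these types easy to distribute.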

Structured Data: Collections as Nodes

However, recalling the Abiteboul approach (which actually addresses the issues of database update for structured files [Abit95]), we do not wish to restrict ourselves to static data and structures. To support dynamic, structured data, we add to these literals a system of structured, modifiable types that we refer to as Nodes. We define the node types $N$ such that for $t \in N$ we have $t = \{(a_i, v_i)\}$, a finite set of attribute/value pairs where $a_i \in A_N$ and $v_i \in V_i^N$. Thus, for any node type $N$, we have a domain $A_N$ that constrains the attributes of $t \in N$ and a domain $V_i^N$ that constrains the values $v_i$ that may be associated with attribute $a_i \in A_N$. By varying the constraints on $A_N$ and $V_i^N$ we can define a variety of different classes of node types, while still maintaining this common attribute/value model (in essence, we have defined a data type hierarchy that generalizes to HyTime's property sets [ISON1920] and specializes to common data type building blocks). Below, we describe these constraints with set functions that compose new domains (node types) using existing domains (types). See also the UML diagram in Fig. 1.

Figure 1: NODAL Data Model. UML inheritance diagram of the basic NODAL type system.
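As a rough illustration of the attribute/value model, the sketch below represents a node type by two membership tests, one for the attribute domain and one for the per-attribute value domain. The class `NodeType`, its method names and the two example types are assumptions made for illustration; in particular, treating a sequence as a node whose attributes are non-negative integer indices is one plausible reading, not a statement of how NODAL defines it.

```python
from typing import Any, Callable, Dict


class NodeType:
    """Hypothetical node type: an attribute domain plus a per-attribute value domain."""

    def __init__(self, attr_ok: Callable[[Any], bool],
                 value_ok: Callable[[Any, Any], bool]) -> None:
        self.attr_ok = attr_ok      # membership test for the attribute domain
        self.value_ok = value_ok    # membership test for the value domain of an attribute

    def accepts(self, node: Dict[Any, Any]) -> bool:
        """True when every attribute/value pair of the node satisfies both domains."""
        return all(self.attr_ok(a) and self.value_ok(a, v) for a, v in node.items())


# Assumed reading: a sequence constrains attributes to non-negative integer indices.
char_sequence = NodeType(
    attr_ok=lambda a: isinstance(a, int) and a >= 0,
    value_ok=lambda a, v: isinstance(v, str) and len(v) == 1,
)

# A record constrains attributes to a fixed set of names, each with its own value domain.
point_fields = {"x": float, "y": float}
point_record = NodeType(
    attr_ok=lambda a: a in point_fields,
    value_ok=lambda a, v: isinstance(v, point_fields[a]),
)
```

Both example types share the same `accepts` check; only the two domain tests differ, which is the sense in which different classes of node types arise from varying the constraints.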


A record $R = R(A, V_i)$ is defined by a fixed set of names $A_R = \{a_i \in \mathrm{Name}\}$ and the domains of the values associated with each name, $V_i^R \subseteq T$. We refer to the names $a_i$ as fields of the record $R$, and the domain $V_i^R$ is the field type of field $a_i$ in $R$. This is clearly compatible with the existing relational model as outlined in [Rama03], with our record mapping to a relation.

We further define an inheritance relationship for record types such that if $a \in A_{R'} \Rightarrow a \in A_R$ and $\forall a \in A_{R'}: V_{R,a} \subseteq V_{R',a}$ (where $V_{R,a}$ is the field type of field $a$ in $R$), then we say that $R$ is derived from $R'$. Clearly, these conditions ensure that $R \subseteq R'$ even if $R$ has extra fields $a \in A_R$ such that $a \notin A_{R'}$. We consider this model to capture the kind of data structure inheritance available in O/R systems such as PostgreSQL [Post87] and object-oriented languages such as Java and C++.

In essence, then, an inherited record is a restriction of its ancestors, possibly with new fields. To continue this inheritance-as-restriction model, we allow a property's value type to be redefined in an inherited record, as long as the newly defined type is a restriction of the inherited type. For example, a Document record has a field mime-type for the MIME type of the document; its value is a Name that matches a particular regular expression. The record type for XMLDocument includes an override of the mime-type field to a fixed value: "text/xml". Thus the Record types can be used to form an object inheritance hierarchy where kind-of restrictions can be maintained. An example of the declaration of Record types and field shadowing is shown in Fig. 2, an excerpt from the standard NODAL data model for the basic NODAL types.
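The derivation condition can be checked mechanically. The sketch below is an illustration under simplifying assumptions: a record type is encoded as a mapping from field names to finite sets of allowed values (standing in for arbitrary value domains such as regular-expression restrictions), and the `root-element` field is invented for the example.

```python
from typing import Dict, FrozenSet

# Hypothetical encoding: a record type maps each field name to a finite set of allowed values.
RecordType = Dict[str, FrozenSet[str]]


def is_derived_from(r: RecordType, base: RecordType) -> bool:
    """R derives from R' when R has every field of R' and each such field type is a restriction."""
    return all(a in r and r[a] <= base[a] for a in base)


document: RecordType = {
    "mime-type": frozenset({"text/xml", "text/plain", "image/png"}),
}
xml_document: RecordType = {
    "mime-type": frozenset({"text/xml"}),         # override fixed to a single value
    "root-element": frozenset({"html", "svg"}),   # extra field absent from the base
}
```

With these definitions `is_derived_from(xml_document, document)` holds, while the reverse does not, since the base type lacks the extra field and allows a wider set of MIME types.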

Restriction Types

One way to specify the constraints on a particular type is to use a restriction language to define a new type as having a limited value space with respect to an existing type. This is a very powerful concept and is the foundation for the XML Schema type system [XSchema2] and the ISO standard type system from which it is derived [ISO11404]. To this end, we provide just such a type restriction facility for atomic types based on a set of matching functions applied to a base type. Some of the restrictions available are: regular expressions for Name and String types; inclusive and exclusive minima and maxima for any ordered type; a namespace restriction for Name objects; and an enumeration list for any atomic type (including the single-valued enumeration or fixed value).
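A few of these restrictions can be sketched as matching functions over a base type. The function names (`pattern`, `bounds`, `enumeration`) and the example restrictions are hypothetical; they are meant to show the idea of a restricted value space, not the syntax of the NODAL schema language.

```python
import re
from typing import Any, Callable, Iterable, Optional

Restriction = Callable[[Any], bool]


def pattern(regex: str) -> Restriction:
    """Accept strings that match the whole regular expression."""
    compiled = re.compile(regex)
    return lambda v: isinstance(v, str) and compiled.fullmatch(v) is not None


def bounds(min_inclusive: Optional[Any] = None,
           max_exclusive: Optional[Any] = None) -> Restriction:
    """Accept ordered values with min_inclusive <= v < max_exclusive (either bound optional)."""
    def check(v: Any) -> bool:
        if min_inclusive is not None and v < min_inclusive:
            return False
        if max_exclusive is not None and v >= max_exclusive:
            return False
        return True
    return check


def enumeration(allowed: Iterable[Any]) -> Restriction:
    """Accept only the listed values; a one-element list gives a fixed value."""
    values = list(allowed)
    return lambda v: v in values


mime_name = pattern(r"[a-z]+/[a-z0-9.+-]+")
octet_range = bounds(min_inclusive=0, max_exclusive=256)
xml_only = enumeration(["text/xml"])
```

Each restriction narrows the value space of its base type, so a value accepted by `xml_only` is also accepted by `mime_name`, mirroring the way a fixed mime-type field restricts a pattern-constrained one.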

Documents and Node Graphs

Given these building blocks, a schema, or set of types, can be considered as a model of the constraints on a set of interconnected, structured values. Since this model is clearly a superset of the relational model, it can obviously be used to describe structured data stored in a relational database. How then is it also useful for modelling documents in a filesystem? Simply by associating a particular document format with a root node type in some schema that models the structure of the data contained within those documents. In the Web-based NODAL system, this is done by associating a MIME type [RFC2045] that identifies the format with a DocumentFormat class that defines the type of the root node and encoder and decoder methods that translate between a bitstream and instances of an associated schema (see Fig. 2). A file then corresponds to the graph of the nodes reachable from the root node. In this way, we can integrate document and database accesses with a common API and, given the reference architecture we describe next, a common reference language.
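The association between a MIME type and a format plugin might look roughly like the following. Only the name DocumentFormat comes from the text; its fields, the `FORMATS` registry, the `load` helper and the `LineSequence` root type are assumptions, and modelling a plain-text file as a sequence of lines is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class DocumentFormat:
    """Hypothetical plugin record: a root node type plus a decoder/encoder pair."""
    root_type: str
    decode: Callable[[bytes], Any]   # bitstream -> root node
    encode: Callable[[Any], bytes]   # root node -> bitstream


def decode_text(data: bytes) -> List[str]:
    """Model a plain-text file as a sequence of lines, each a string."""
    return data.decode("utf-8").split("\n")


def encode_text(lines: List[str]) -> bytes:
    """Inverse of decode_text: join the lines back into a bitstream."""
    return "\n".join(lines).encode("utf-8")


# Registry keyed by MIME type; the single entry is illustrative only.
FORMATS: Dict[str, DocumentFormat] = {
    "text/plain": DocumentFormat("LineSequence", decode_text, encode_text),
}


def load(mime_type: str, data: bytes) -> Any:
    """Look up the plugin for a MIME type and decode a bitstream into its root node."""
    return FORMATS[mime_type].decode(data)


root = load("text/plain", b"first line\nsecond line")
round_trip = FORMATS["text/plain"].encode(root)
```

Keeping the decoder and encoder together under one MIME type key means a consumer needs only the type of a resource to obtain a navigable root node, and a writer can reverse the process with the same entry.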

The choice of the term Node for these collection objects should now be clear. A document in this kind of model consists of a set of Node objects with properties that are either literals or references to other Node objects. These Node-Node properties can be considered to be the labelled edges of a graph that we refer to as the document graph. Unlike XML, though, this graph is not restricted to being a hierarchy. Parent-child relations are one-way, although a structural query can be used to recover the many possible parents (and document containments) of a particular Node.
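The one-way nature of parent-child relations, and the structural query that recovers parents, can be sketched as follows. The `Node` class and both helper functions are hypothetical; the example deliberately shares one node between two parents to show that the graph need not be a hierarchy.

```python
from typing import Any, Dict, List, Tuple


class Node:
    """Hypothetical node: a mutable bag of named properties (literals or other nodes)."""

    def __init__(self, **props: Any) -> None:
        self.props: Dict[str, Any] = dict(props)


def reachable(root: Node) -> List[Node]:
    """Collect every node reachable from the root by following node-valued properties."""
    seen: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if any(node is s for s in seen):
            continue  # already visited; also guards against cycles
        seen.append(node)
        stack.extend(v for v in node.props.values() if isinstance(v, Node))
    return seen


def parents_of(target: Node, root: Node) -> List[Tuple[Node, str]]:
    """Structural query: every (node, property label) whose edge points at the target."""
    return [(n, label) for n in reachable(root)
            for label, v in n.props.items() if v is target]


shared = Node(text="a shared paragraph")
chapter1 = Node(body=shared)
chapter2 = Node(quote=shared)
book = Node(first=chapter1, second=chapter2)
```

Here `shared` stores no reference to either chapter, yet `parents_of(shared, book)` finds both incoming edges by scanning the nodes reachable from the document root.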

The NODAL Path Language

In order to provide an external reference system for this data model, we adopt the path language approach of XPath [XPath] and define URI references based on a navigational path through a document graph. A path $p \in P$ in NODAL is a chain of path components $p = (p_1, \ldots, p_n)$. Using the concatenation operator $p = p' / p_n = (p_1, \ldots, p_{n-1}) / p_n$, we can define the tail of a path $p$ as the final component $p^* = p_n$ and the parent of a path $p$ as the path $p'$ such that $p = p' / p^*$. Note that neither of these exists if $p$ is the empty path $\emptyset$, but that the parent of a single-component path is the empty path.

To interpret paths, we need some way of interpreting the path components individually and then in sequence. In this formulation, a path component has two aspects: its path normalization function and its binding function. Each component $\mathbf{p} \in P$ (written in bold to distinguish it from a general path $p$) has a path normalization function $N_{\mathbf{p}} : P \to P$ that takes a path and produces a normalized path, that is, a path $p$ for which $\mathbf{p} = p^* \Rightarrow N_{\mathbf{p}}(p') = p$. Thus, any path component $\mathbf{p}$ that appears as the tail of a normalized path has the property that $N_{\mathbf{p}}(p) = p / \mathbf{p}$, and concatenation is the standard normalizing function. An example of a component that does not always concatenate is the parent operator "..", which extracts the parent path. The normalizing function for the parent component is

$$N_{..}(p) = \begin{cases} .. & \text{if } p = \emptyset \text{ (the empty path)}, \\ p' & \text{otherwise.} \end{cases}$$

So, the parent operator can only appear as the first component of a normalized path.
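Normalization can be sketched by folding the components of a path through their normalization functions. The representation of a path as a tuple of strings and the function names are assumptions for illustration; only relative components are handled, with ordinary components concatenating and ".." following the rule given above.

```python
from typing import Tuple

Path = Tuple[str, ...]
PARENT = ".."


def parent(path: Path) -> Path:
    """The path p' such that p = p' / p*; undefined for the empty path."""
    if not path:
        raise ValueError("the empty path has no parent")
    return path[:-1]


def append_normalized(path: Path, component: str) -> Path:
    """Apply the normalization function of one component to an already normalized path."""
    if component == PARENT:
        # The parent component yields '..' on the empty path, and the parent path otherwise.
        return (PARENT,) if not path else parent(path)
    return path + (component,)  # ordinary components simply concatenate


def normalize(components: Path) -> Path:
    """Normalize a whole path by applying each component's function in sequence."""
    result: Path = ()
    for component in components:
        result = append_normalized(result, component)
    return result
```

For example, `normalize(("a", "b", "..", "c"))` yields `("a", "c")`, while a leading parent operator survives, as in `normalize(("..", "foo"))`, which stays `("..", "foo")`.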

But paths are defined to provide a means of accessing values within a node graph, so we need a means to determine the target value of a path, $V(p)$. To evaluate this target value, we consider the other aspect of a path component $\mathbf{p}$ (written in bold to distinguish it from a general path $p$): its binding function $V_{\mathbf{p}} : P \to T$, which returns the value that a path with $\mathbf{p}$ at its tail refers to. We then define the target of a path with respect to its tail as $V(p) = V_{p^*}(p)$. We distinguish two kinds of path components: the absolute components have values that are independent of the containing path $p$, while the relative components have dependent values. A relative path is then a path that contains only relative path components, whereas an absolute path contains at least one absolute component. To reduce path redundancy we require that the normalization function of every absolute path component produce a single-component path containing itself: if $\mathbf{p}$ is absolute then $N_{\mathbf{p}}(p) = \mathbf{p}$. Thus an absolute path has exactly one absolute component, at its head. Examples of some of the available components are shown in Table 2.

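Table 2: NODAL Path Components
| Component $\mathbf{p}$ | $N_{\mathbf{p}}$ | $V_{\mathbf{p}}$ | Functional Form | Shorthand |
|---|---|---|---|---|
| document $\mathit{doc}$ | $p \mapsto \mathit{doc}$ | $V(\mathit{doc})$ | | $\mathit{doc}$ URI |
| node $\mathit{id}$ | $p \mapsto \mathit{id}$ | node $\mathit{id}$ | nid($\mathit{id}$) | |
| parent | $p \mapsto p'$ | none | parent() | .. |
| fragment root | $\mathit{doc}/p \mapsto \mathit{doc}/\mathbf{p}$ | $V(\mathit{doc})$.root | root() | #/ |
| property $g$ | $p \mapsto p/\mathbf{p}$ | $V(p).g$ | property($g$) | $g$ |
| range of $i$ to $j$ | $p \mapsto p/\mathbf{p}$ | $V(p)$.range($i$, $j$) | range($i$, $j$) | |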

So far, these paths are independent of the data model outlined above. To provide some grounding, it will be useful to consider paths within documents. Remember that a document is modelled as the graph of nodes reachable from a root node (see Sec. “Documents and Node Graphs”). Thus, a reference to a document determines the starting point for navigation within the node graph. An absolute path thus has two parts, the document part and the fragment part. If the document part is not empty, then the fragment part is evaluated relative to the specified document's root node. We can now appreciate one of the main advantages of the unification of the node/collection data model in terms of attribute/value pairs: a homogenization of the standard component for selecting a value in a collection, namely property($g$). From a single absolute root path, we can create paths to all reachable nodes and values with only chains of these property components. The property component is thus the fundamental building block of the relative reference mechanisms in the NODAL path language.
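The role of the property component as a uniform selector can be sketched by evaluating a chain of relative components from a root node. The sketch reads the binding of a component as a lookup on the value bound to the preceding path; the names `prop`, `rng` and `bind`, the use of Python dicts and lists as nodes, and the end-exclusive interpretation of a range are all assumptions.

```python
from typing import Any, Callable, Sequence

# A relative component is modelled as a function from the value bound to the
# preceding path to the value bound once the component is appended.
Component = Callable[[Any], Any]


def prop(g: Any) -> Component:
    """property(g): select the value stored under attribute g (a field, key or index)."""
    return lambda value: value[g]


def rng(i: int, j: int) -> Component:
    """range(i, j): select a sub-sequence; end-exclusive slicing is an assumption here."""
    return lambda value: value[i:j]


def bind(root: Any, components: Sequence[Component]) -> Any:
    """Evaluate a chain of relative components starting from a document's root node."""
    value = root
    for component in components:
        value = component(value)
    return value


# A toy document graph: a record whose 'lines' field holds a sequence of strings.
document_root = {"title": "notes", "lines": ["alpha beta", "gamma delta"]}
```

The same `prop` selector works on the record (by field name) and on the sequence (by index), so `bind(document_root, [prop("lines"), prop(1)])` reaches the second line using nothing but property components.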

Finally, each path component is expressible as a function or shorthand in text, and a path URI is simply the concatenation of these component expressions with an appropriate separator (in the fragment part of a path, the '/' character is used as a separator). As with XPointer [XPointer] fragment URIs, we use a context frame #ndl(...) to enclose NODAL fragment expressions. Other fragment expressions are passed to the plugin responsible for the MIME type of the document addressed. Thus, HTML and XML documents can properly handle #id fragment ids and even xpointer(...) expressions without interfering with the NODAL references.

So, given the components listed in Table 2 (a subset of the component operators available in NODAL), we can describe a number of NODAL path expressions as examples:

Table 3: NODAL Path Examples
| URI | Description |
|---|---|
| #ndl(/) | The root node of the document |
| #ndl(../foo) | The property named foo of the parent of the path to which the reference is applied |
| file:/doc.txt#ndl(15/range(4,16)) | The characters between index 4 and 16 on line 15 of the local text file /doc.txt |
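Splitting such a reference into its document part and its fragment components can be sketched as follows. The function names and the parsing strategy are assumptions; the sketch handles only the forms shown in the table, and leaves any fragment that is not wrapped in ndl(...) untouched, on the understanding that it belongs to the format's own plugin.

```python
import re
from typing import List, Optional, Tuple

NDL_FRAME = re.compile(r"ndl\((.*)\)")


def split_components(expr: str) -> List[str]:
    """Split a fragment expression on '/' separators that are not inside parentheses."""
    parts: List[str] = []
    depth = 0
    current = ""
    for ch in expr:
        if ch == "/" and depth == 0:
            parts.append(current)
            current = ""
            continue
        depth += (ch == "(") - (ch == ")")
        current += ch
    parts.append(current)
    return [p for p in parts if p]  # drop empty pieces, e.g. for the bare root '/'


def parse_reference(uri: str) -> Tuple[str, Optional[List[str]]]:
    """Return (document part, components); components is None without an ndl(...) frame."""
    document, _, fragment = uri.partition("#")
    match = NDL_FRAME.fullmatch(fragment)
    if match is None:
        return document, None   # e.g. a plain #id, left to the format's own plugin
    return document, split_components(match.group(1))
```

Under these assumptions, `parse_reference("file:/doc.txt#ndl(15/range(4,16))")` separates the document part `file:/doc.txt` from the two components `15` and `range(4,16)`, and an empty document part signals a reference relative to the current context.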


As in HTML and the Dexter Reference Model [Dxtr94] , indirect references to Nodes in other documents or repositories are enabled by the creation of a special kind of Node, the Anchor. An Anchor is essentially a proxy for another node, and is completely specified by a path. When the Anchor is encountered, it evaluates the binding of the Anchor's path in the context of the path to the Anchor (to handle relative references) and then acts as the node bound to this path. These anchors are thus of the style of Nelson's transclusion operators [Nel99] . We also provide a facility (as a path operator) to interpret any String or Name as a path URI and then extract the binding. This is how we implement HTML-style hyperlinks.

Future Directions and Potential

We are currently testing the system in this data consumption mode and formulating a number of pilot projects to assess the ability to build working applications on top of the NODAL APIs. The NODAL type system is already implemented based on an interpretation of the data model described in the baseTypes.nls schema excerpted in Fig. 2. One of the most interesting possibilities is to create a personal information management environment that can operate by building semantic indices not only to local and remote files but also to their contents to provide granular annotation.

Future work is planned on the following items:

  • Database interaction: Automatically extracting database schema and allowing the NODAL APIs to access relational and object database systems was one of the design goals and must now be implemented and tested.
  • Read/write functionality for all data sources: Currently we can consume and reference data from a wide variety of sources but can as yet only write to local file systems.
  • Query: It is important to provide mechanisms for data discovery that go beyond the simple exploratory metaphor provided by filesystem and link following. We must be able to search data based on content and structure.
  • Versioning: As was stated above, the best unit for attaching metadata and managing version control is the Node. Each node has a specific structure and a limited number of possible changes. We can already generate change records and associate node versions with each transaction. The difficulty comes in layering this functionality on top of inextensible data stores such as local filesystems.
  • Access control: If versioning is best done at the node level, then perhaps access control is too.
  • Synchronization: Once version management is available, it should be possible to enable CVS-like [CVS90] synchronization between local copies and remote repositories. We suggest that it will be easier to automatically extract difference charts between local and remote modifications, since most of the node types have a very simple set of modification operators. In fact, the Simias Collection Store [Sim04] being used by the iFolder project has a very similar structure to this one and is designed entirely for synchronization.

With the NODAL data model and path language, we have demonstrated a new paradigm for opening up all Web-accessible content (not just HTML and XML) to the advantages of hypertextual information management. We hope to extend it so that it can become a generic Data layer that can be the foundation for a next generation of fully interoperable, collaborative end-user applications.


[Abit93] Abiteboul, S., Cluet, S., Milo, T.: Querying and updating the file. In: Proceedings of the 19th International Conference on Very Large Data Bases, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (1993) 73--84

[Abit95] Abiteboul, S., Cluet, S., Milo, T.: A database interface for file update. In: SIGMOD '95: Proceedings of the 1995 ACM SIGMOD international conference on Management of data. Volume 24, New York, NY, USA, ACM Press (1995) 386--397

[Chim00] Anderson, K., Taylor, R., Whitehead, E.: Chimera: Hypermedia for heterogeneous software development environments. ACM Transactions on Information Systems 18 (2000) 211--245

[CVS90] Berliner, B.: CVS II: Parallelizing software development. In: Proceedings of the USENIX Winter 1990 Technical Conference, Berkeley, CA, USENIX Association (1990) 341--352

[DKC05] Iverson, L.: Data-Knowledge-Context: An Application Model for Collaborative Work. In: Proceedings of the IEEE International Conference on Information Reuse and Integration, Las Vegas, NV, (2005)

[Dxtr94] Halasz, F., Schwartz, M.: The Dexter hypertext reference model. Communications of the ACM 37 (1994) 30--39

[IEEE754] IEEE Standard 754-1985: Binary Floating-Point Arithmetic. IEEE Computer Society (1985)

[Imed88] Yankelovich, N., Haan, J.B., Meyrowitz, N., Drucker, S.: Intermedia: The concept and the construction of a seamless information environment. IEEE Computer 21 (1988) 81--96

[ISO11404] ISO/IEC 11404:1996: Language-independent Datatypes. International Organization for Standardization (1996)

[ISO8601] ISO/IEC 8601:2000: Representations of Dates and Times. International Organization for Standardization (2000)

[ISON1920] ISO/IEC JTC1/SC18/WG8 N1920:1997: Hypermedia/Time-based Structuring Language (HyTime). International Organization for Standardization (1997)

[Mcsm90] Fountain, A.M., Hall, W., Heath, I., Davis, H.: MICROCOSM: An open model for hypermedia with dynamic linking. In: European Conference on Hypertext. (1990) 298--311

[Nel99] Nelson, T.H.: Xanalogical structure: Needed now more than ever: Parallel documents, deep links to content, deep versioning, and deep re-use. ACM Computing Surveys 31 (1999)

[Post87] Rowe, L.A., Stonebraker, M.: The POSTGRES data model. In: The 13th VLDB Conference, Brighton, UK (1987)

[Rama03] Ramakrishnan, R., Gehrke, J.: Database Management Systems. 3rd edn. McGraw-Hill (2003)

[RDF98] Lassila, O., Swick, R.: Resource description framework (RDF) model and syntax specification. The World Wide Web Consortium (1998)

[RFC2045] Freed, N., Borenstein, N.: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. The Internet Society (1996)

[SemWeb98] Berners-Lee, T.: The semantic web roadmap. (1998)

[Sim04] Lasky, M.: The Simias collection store model. Novell Corporation (2004)

[SQL99] ISO/IEC 9075:1999(E): Information technology - Database languages - SQL. International Organization for Standardization (1999)

[XLink] DeRose, S., Maler, E., Orchard, D.: XML linking language (XLink). The World Wide Web Consortium (2001)

[XPath] Clark, J., DeRose, S.: XML path language (XPath). The World Wide Web Consortium (1999)

[XPointer] DeRose, S., Maler, E., Daniel, R., Jr.: XML pointer language (XPointer). The World Wide Web Consortium (2000)

[XPtrReq] DeRose, S.: XML XPointer Requirements. The World Wide Web Consortium (1999)

[XSchema2] Biron, P.V., Malhotra, A.: XML Schema part 2: Datatypes. The World Wide Web Consortium (2001)

[XSchReq] Malhotra, A., Maloney, M.: XML Schema Requirements. The World Wide Web Consortium (1999)
