The Topic Maps paradigm allows people and organizations to use a specific piece of addressable information, often called a "published subject indicator," as a "binding point" for information about a specific subject. Information about subjects could be collated by merging sets of such bindings. A given subject may be classified according to multiple classification schemes simultaneously. For example, different authorities could create classifications in the same subject area; the purpose of one scheme might be to restrict access to subjects, while another might be to enhance their findability. Each classification scheme itself consists of some set of subjects. Examples in XTM syntax show how the Electronic Commerce Code Management Association's Universal Standard Products and Services Classification Code can be a set of published subjects. Emerging issues and standards are discussed, including the OASIS Published Subject activity.
There are many initiatives in Schlumberger and in other corporations to begin to sort out and classify the wealth of information within the corporate intranet. In our own company, much work has gone into the classification of technical information by engineers for use by help desks to support our oilfield services operations. The classification has made it easier to find information within the support portal, and an improved organization has helped support staff communicate better with field locations. The next step is to extend the internal classification to enable sharing of technical information with clients and also incorporate information from partners. In the case of a corporate merger, the combination of the knowledge base would also be necessary.
A topic map of a general classification, if published, could serve as an intermediary among all the information sources to be merged, provided that the classification contains subjects they have in common. These subjects could serve as binding points between groups of resources.
The UNSPSC [Universal Standard Products and Services Classification Code] of the ECCMA [Electronic Commerce Code Management Association] is an "open global coding system that classifies products and services. It is used extensively around the world in electronic catalogs, search engines, procurement application systems, and accounting systems [ECCMA a]." It comprises a rich diversity of commodities; you can find a segment for "Live Plant or Animal Material and Accessories and Supplies," and another for "Musical Instruments and Games and Toys, etc."
Segment 71, Mining and Oil and Gas Services, of the UNSPSC was chosen as the source of general subjects to represent in an XML topic map [XTM]. The method of modeling described here can also be applied to any of the other UNSPSC segments. In the near future, I can envision not only the sharing of information with our partners and customers, but also the sharing of how certain aspects of this information are to be represented, in this case, in topic maps. If we can define a "binding point" between these external and internal subjects when they are in fact the same subject, then we can merge the topics and collate our information with that of others.
"In the most generic sense, a 'subject' is any thing whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever [ISO]." Topics in a topic map make subjects real to a computer system when they have names, occurrences, and associations. Occurrences are information about subjects. Associations establish relationships and define the roles in the relations between the subjects.
The subjects may have become "real" within a computer as represented in a topic map, but the subjects are not yet binding points; in other words, we must strictly identify subjects if we would like to merge information about them. A subject indicator is defined in the XTM Topic Maps 1.0 Specification as, "a resource that is intended by the topic map author to provide a positive, unambiguous indication of the identity of the subject," and a PSI [published subject indicator] is defined as "a subject indicator that is published and maintained at an advertised address for the purpose of facilitating topic map interchange and mergeability [XTM]."
For the creation of Published Subject Indicators to be used in a Topic Map representation, I initially followed what is described by Bernard Vatant in "Binding Points for Subject Identity," in his "Requirements for Standard Published Subjects Indicators [Vatant a]," and also included parts of the "Draft Proposal for Recommendations and Requirements for Published Subjects [Vatant b]" that are being discussed by the Topic Map Published Subjects Technical Committee of OASIS [Organization for the Advancement of Structured Information Standards].
In "Binding Points for Subject Identity," these requirements were proposed
The OASIS Committee is discussing the value of having published subject indicators "accessible by both humans and machines."
The two requirements that I will address here are
As we are all well aware, domain persistence is not common on the World Wide Web. For corporations, it is particularly difficult in these times of mergers, acquisitions, and spin-offs to maintain a stable domain name. Sometimes the domain may be stable but the location of resources within the domain is not. How do we publish our resources on the Web so they may be retrieved months later, even years later? There is really no clear-cut solution yet, but one solution has been offered by the OCLC [Online Computer Library Center], a nonprofit computer library service and research organization linking more than 21,000 libraries in 63 countries and territories. The OCLC Office of Research is an active participant in the IETF [Internet Engineering Task Force] Uniform Resource Identifier working groups. It has developed and recommended the use of a PURL [Persistent Uniform Resource Locator]. It explains the approach here:
To aid in the development and acceptance of Uniform Resource Name (URN) technology, OCLC has deployed a naming and resolution service for general Internet resources. The names, which can be thought of as Persistent URLs (PURLs), can be used in documents, Web pages, and in cataloging systems. PURLS increase the probability of correct resolution over that of URLs, and thereby reduce the burden and expense of maintaining viable, long-term access to electronic resources. Functionally, a PURL is a URL. However, instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. The PURL Resolution Service associates the PURL with the actual URL and returns the URL to the client [Shafer et al.].
Major organizations such as the DCMI [Dublin Core Metadata Initiative] are already using PURLs (http://purl.org/dc) for their resources. You have the option of using the OCLC PURL service, as the DCMI does, or you can set up your own. However, the approach has not caught on widely, possibly because of how browsers support redirects and partial redirects: users must not use relative links on their Web pages, and not all clients will resolve fragment identifiers.
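The indirection that a PURL provides can be sketched in a few lines of Python. This is only a toy, in-memory stand-in for a resolution service, not the OCLC implementation; the class name, the example.org addresses, and the choice to carry a fragment identifier across the redirect are all assumptions made for illustration.

```python
from urllib.parse import urldefrag


class PurlResolver:
    """Toy in-memory stand-in for a PURL resolution service (hypothetical)."""

    def __init__(self):
        self._targets = {}  # persistent name -> current URL

    def register(self, purl, target_url):
        """Create or update the mapping from a persistent name to a location."""
        self._targets[purl] = target_url

    def resolve(self, purl):
        """Return (HTTP-style status, current URL), as a redirect would."""
        base, fragment = urldefrag(purl)
        target = self._targets.get(base)
        if target is None:
            return 404, None
        # Assumption: the fragment is re-attached to the redirect target.
        if fragment:
            target = f"{target}#{fragment}"
        return 302, target


resolver = PurlResolver()
# Hypothetical names and locations, for illustration only.
resolver.register("http://purl.example.org/unspsc/71",
                  "http://www.example.com/old/71.htm")
# The resource moves; only the registry entry changes, not the published name.
resolver.register("http://purl.example.org/unspsc/71",
                  "http://www.example.net/new/71.htm")
status, location = resolver.resolve("http://purl.example.org/unspsc/71#009026")
```

The published name stays the same while the registered location changes, which is the property a published subject indicator needs.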
Any organization that wanted to publish subjects for use in topic maps might want to consider using the PURL service. For example, the ECCMA could register
The Topic Map Published Subjects Technical Committee is in discussions to recommend a URL such as http://psi.mydomain.org, for the location of published subject indicators, and my example here would follow the proposed recommendation [Vatant c].
The OASIS Topic Map Published Subjects Technical Committee is discussing whether to include metadata with published subject indicator resources. Dublin Core elements have been suggested. If resources were HTML documents, then metadata could be added in meta tags in the head of HTML files. I followed the committee's proposed recommendation [Vatant b] and Encoding Dublin Core Metadata in HTML [Kunze]. Here is an example:
<meta name="DC.Source" content="ECCMA">
<meta name="DC.Date" content="2002-01-01">
<meta name="DC.Language" scheme="rfc1766" content="en-US">
<meta name="DC.Identifier" content="http://psi.eccma.org/unspsc/71.htm">
<meta name="DC.Relation.IsPartOf" content="http://psi.eccma.org/unspsc/71.xtm">
The use of PURL for the location of the resource containing the published subject indicators would solve the persistence issue, but we still should consider how individual subjects within a file can be read by humans and resolved by computers. The OASIS Published Subjects Technical Committee is discussing the kinds of URL strings that could possibly be used to resolve to a published subject indicator [Vatant c] such as
The philosophy of Aesthetic Realism has been indispensable to me as I try to solve complex problems, or create or critique any kind of work. Eli Siegel, poet, critic, and founder of Aesthetic Realism, states that "The world, art, and self explain each other: each is the aesthetic oneness of opposites."
In Self and World, he writes:
When things are well or beautifully arranged, in every instance, the side of them which can be seen as separate goes along rightly with the side of them which can be seen as together.... No matter how many objects are concerned, two, and only two, opposite things are involved. When we talk of the composition of materials, say, the problem is how to place these materials so that their separateness does not conflict with their togetherness. For all objects can be seen as being away from other objects, discordant with them; or as close to them, mingling with them serenely [Siegel].
The UNSPSC classification is well arranged, and its design lends itself to easy updating. It is a commodity classification schema. Its structure is hierarchical with four levels. Each level contains a two-character numerical value and a textual description. The levels include Segment, the logical aggregation of families for analytical purposes; Family, a commonly recognized group of interrelated commodity categories; Class, a group of commodities sharing a common use or function; and Commodity, a group of substitutable products or services. Here is an example taken from Segment 71, Mining and Oil and Gas Services.
Title -[71.00.00.00.00] Mining and Oil and Gas Services
-Family-[71.11.00.00.00] Oil and gas exploration services
-Class-[71.11.21.00.00] Open hole well logging services
-Commodity-[71.10.15.01.00] Formation testing sampling services
There are two extra zero digits shown for each EGCC in the HTML file.
The eight-digit hierarchical code called the EGCC [ECCMA Global Commodity Classification] helps users avoid duplication, identify the code in the table, and differentiate titles and definitions. The UNSPSC also includes the EGCI [ECCMA Global Commodity Identifier], a sequence identifier that is linked to the title and definition of the classification and is never changed.
The ECCMA stresses that the EGCC may change if a commodity is reclassified, but the EGCI never does [ECCMA b].
The ECCMA provides the public with one HTML file for each segment containing the levels, EGCCs, and textual descriptions, which is updated every 3 months. Its Segment Web pages have content similar to that shown in the code example above, which usually spans hundreds of lines. In addition to these, it also provides its members with the EGCI and version control information, which is updated every 30 days.
The ECCMA has defined the EGCI as a fixed identifier throughout the lifetime of each classification; it is only natural to use this number to establish the "identity" of each of the subjects in the topic map. In contrast, the EGCC positions each title and definition within the classification, and the EGCC can be updated to reflect new positions. The fixed aspect of the classification versus the changeable aspect is important to keep in mind when deciding how to model later on.
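The distinction between the fixed EGCI and the changeable EGCC can be made concrete with a small Python sketch. It assumes that the level of an entry can be read from the last non-zero pair of its EGCC and that the extra trailing pair shown in the HTML file can be ignored; the function names and the reclassified code "71.11.21.01" are hypothetical.

```python
from dataclasses import dataclass, replace

LEVELS = ("Segment", "Family", "Class", "Commodity")


@dataclass(frozen=True)
class Entry:
    egci: str   # fixed identifier, never changed
    egcc: str   # positional code, may change on reclassification
    title: str


def level_of(egcc):
    """Name the level of an EGCC from its last non-zero two-digit pair."""
    pairs = egcc.split(".")[:4]  # drop the extra pair shown in the HTML file
    depth = max((i + 1 for i, pair in enumerate(pairs) if pair != "00"),
                default=0)
    if depth == 0:
        raise ValueError(f"no level encoded in {egcc!r}")
    return LEVELS[depth - 1]


def reclassify(entry, new_egcc):
    """Return a copy at a new position; the EGCI is carried over unchanged."""
    return replace(entry, egcc=new_egcc)


sampling = Entry("013401", "71.10.15.01", "Formation testing sampling services")
moved = reclassify(sampling, "71.11.21.01")  # hypothetical new position
```

Because only `egcc` changes in `reclassify`, anything keyed on `egci` (such as a subject identity) survives the move.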
I would like to use Segment 71 as the source of published subject indicators for my XTM representation, but the public web pages lack something I need: the fixed and stable EGCIs. The EGCIs (the six-digit anchor names, such as 009026) could be added as named anchors to the Segment 71 web page, as in this example,
-<A name=71>[71.00.00.00.00]</A> <A name=009026> Mining and Oil and Gas Services</A>
-family- <A name=7111>[71.11.00.00.00]</A> <A name=009253> Oil and gas exploration services</A>
-class-<A name=711121>[71.11.21.00.00]</A> <A name=013392> Open hole well logging services</A>
-commodity-<A name=71101501>[71.10.15.01.00]</A> <A name=013401> Formation testing sampling services</A>
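To suggest how a program might read such a page, here is a minimal Python sketch that collects the named anchors and indexes each title by its EGCI. It assumes the anchors come in pairs as in the example above (a bracketed EGCC anchor followed by an EGCI anchor around the title); the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (name, text) for every named anchor in an HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._name = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # tag names are lowercased by the parser
            self._name = dict(attrs).get("name")
            self._text = []

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._name is not None:
            self.anchors.append((self._name, "".join(self._text).strip()))
            self._name = None


def egci_index(html):
    """Map each EGCI anchor name to (EGCC, title), pairing adjacent anchors."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    index, pending_code = {}, None
    for name, text in parser.anchors:
        if text.startswith("["):
            pending_code = text.strip("[]")   # the EGCC anchor
        else:
            index[name] = (pending_code, text)  # the EGCI anchor
            pending_code = None
    return index


SAMPLE = (
    "-<A name=71>[71.00.00.00.00]</A> "
    "<A name=009026> Mining and Oil and Gas Services</A>\n"
    "-family- <A name=7111>[71.11.00.00.00]</A> "
    "<A name=009253> Oil and gas exploration services</A>\n"
)
index = egci_index(SAMPLE)
```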
For a representation of Segment 71 in XTM, it would be good to include all of the semantics, such as general definitions, that are supplied with the classification. One core XTM file could contain all the topics that could be used for the typing of Segment topics in the UNSPSC, and this file could be merged into a main XTM file that contains all of the specific topics for a Segment.
The topics of "Segment," "Family," "Class," and "Commodity," could be used for typing the specific instances in the classification. These in turn could be typed as levels, so "Level" was also made a topic. "Contains," "Container," and "Content" were created for the association typing and assignment of association roles. The topic "UNSPSC" was created to be used exclusively as a scope for base names. Here is an example of how a topic used for typing could be coded.
<topic id="Segment">
  <instanceOf>
    <topicRef xlink:href="#Level"/>
  </instanceOf>
  <subjectIdentity>
    <subjectIndicatorRef xlink:href="#segment-desc"/>
  </subjectIdentity>
  <baseName>
    <scope>
      <topicRef xlink:href="#UNSPSC"/>
    </scope>
    <baseNameString>Segment</baseNameString>
  </baseName>
  <occurrence>
    <resourceData id="segment-desc">The logical aggregation of families for analytical purposes.</resourceData>
  </occurrence>
</topic>
Each topic in Segment 71 is typed by Segment, Family, Class or Commodity. Here is one example.
<topic id="egci-009026">
  <instanceOf>
    <topicRef xlink:href="http://psi.eccma.org/unspsc/core.xtm#Segment"/>
  </instanceOf>
Each of the segment topics gets subject identity from the URL and fragment identifier, which resolves to the published subject indicator as in this example,
  <subjectIdentity>
    <subjectIndicatorRef xlink:href="http://psi.eccma.org/unspsc/71.htm#009026"/>
  </subjectIdentity>
  <baseName>
    <scope>
      <topicRef xlink:href="http://psi.eccma.org/unspsc/core.xtm#UNSPSC"/>
    </scope>
    <baseNameString>009026</baseNameString>
  </baseName>
The subject indicator and the scoped base name are both used to ensure that the identities of the subjects are unique. Within a controlled vocabulary, it is best to look for the aspect of the classification that is the most stable and least likely to change. In the UNSPSC, this is the EGCI. The most rigorous binding point so far is the URL string with a fragment identifier that points to the named anchor in the HTML file.
Topic map subject identity can also be conferred through the base name, especially a scoped base name. Under the topic naming constraint rule, if two topics have the same base name within the same scope, they are considered to be the same subject. The single topic that results will have all of the names and all of the occurrences of both original topics, and it will play all of their roles in all of the same associations. We need to consider whether a word within a particular scope can be used for subject identity. It is possible for an ECCMA member to submit a name change for a textual description of a classification. The UNSPSC may decide to remove the "and" and change the name of "Mining and Oil and Gas Services" to "Mining Oil Gas Services." They might even want to drop the word "Services." There have been other problems with the topic naming constraint rule, and it could be the subject of a paper by itself, but I think that the problem is always one of identity and the need for a standard way of expressing this identity. If the topic naming constraint rule remains for merging topic maps, it might be advisable to recommend not a name (character string) but some unique and fixed identifier within a precise domain, such as the EGCI defined by the ECCMA.
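Merging on a shared subject indicator can be sketched as follows. This is a simplified Python model, not an XTM processor: topics are plain dictionaries of sets, only subject indicators trigger a merge (the topic naming constraint is deliberately left out), and the internal name and occurrence addresses are hypothetical.

```python
KEYS = ("indicators", "names", "occurrences")


def merge_topics(topics):
    """Merge topics that share at least one subject indicator."""
    merged = []
    for topic in topics:
        combined = {key: set(topic[key]) for key in KEYS}
        remaining = []
        for existing in merged:
            if existing["indicators"] & combined["indicators"]:
                # Same subject: pool names, occurrences, and indicators.
                for key in KEYS:
                    combined[key] |= existing[key]
            else:
                remaining.append(existing)
        remaining.append(combined)
        merged = remaining
    return merged


PSI = "http://psi.eccma.org/unspsc/71.htm#013401"

unspsc_topic = {
    "indicators": {PSI},
    "names": {"013401"},
    "occurrences": set(),
}
internal_topic = {  # hypothetical internal topic bound to the same PSI
    "indicators": {PSI},
    "names": {"Formation Sampling"},
    "occurrences": {"http://intranet.example.com/docs/sampling.htm"},
}
other_topic = {
    "indicators": {"http://psi.eccma.org/unspsc/71.htm#013392"},
    "names": {"013392"},
    "occurrences": set(),
}

result = merge_topics([unspsc_topic, internal_topic, other_topic])
```

A renamed textual description would not affect this merge, because the binding point is the indicator built on the EGCI rather than the name.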
Since I am using a number for the base name, it is not easy to follow the semantics of the classification. People must see names to discern information. The display name is used to contain the textual descriptions of the classification. In XTM, display name is a variant of the base name.
<variant>
  <parameters>
    <subjectIndicatorRef xlink:href="http://www.topicmaps.org/xtm/1.0/core.xtm#display"/>
  </parameters>
  <variantName>
    <resourceData>Mining and Oil and Gas Services</resourceData>
  </variantName>
</variant>
<variant>
  <parameters>
    <subjectIndicatorRef xlink:href="http://www.topicmaps.org/xtm/1.0/core.xtm#sort"/>
  </parameters>
  <variantName>
    <resourceData>71.00.00.00</resourceData>
  </variantName>
</variant>
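Since every topic in a segment follows the same pattern, the topics could plausibly be generated rather than written by hand. The Python sketch below assembles one topic from the pieces shown above using the standard library. The function names are hypothetical, the elements are left without an XTM namespace to match the fragments in this paper, and the variants are placed inside the base name on the understanding that XTM 1.0 defines them there.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
HREF = f"{{{XLINK}}}href"
PSI_BASE = "http://psi.eccma.org/unspsc"
CORE = f"{PSI_BASE}/core.xtm"
XTM_CORE = "http://www.topicmaps.org/xtm/1.0/core.xtm"

ET.register_namespace("xlink", XLINK)


def ref(parent, tag, href):
    """Append a reference element carrying an xlink:href attribute."""
    return ET.SubElement(parent, tag, {HREF: href})


def build_topic(segment, egci, level, title, sort_code):
    """Build one UNSPSC topic: typing, subject identity, and names."""
    topic = ET.Element("topic", {"id": f"egci-{egci}"})
    ref(ET.SubElement(topic, "instanceOf"), "topicRef", f"{CORE}#{level}")
    ref(ET.SubElement(topic, "subjectIdentity"), "subjectIndicatorRef",
        f"{PSI_BASE}/{segment}.htm#{egci}")
    base = ET.SubElement(topic, "baseName")
    ref(ET.SubElement(base, "scope"), "topicRef", f"{CORE}#UNSPSC")
    ET.SubElement(base, "baseNameString").text = egci
    # Display name carries the textual description; sort name carries the EGCC.
    for kind, value in (("display", title), ("sort", sort_code)):
        variant = ET.SubElement(base, "variant")
        ref(ET.SubElement(variant, "parameters"), "subjectIndicatorRef",
            f"{XTM_CORE}#{kind}")
        name = ET.SubElement(variant, "variantName")
        ET.SubElement(name, "resourceData").text = value
    return topic


topic = build_topic("71", "009026", "Segment",
                    "Mining and Oil and Gas Services", "71.00.00.00")
xml_text = ET.tostring(topic, encoding="unicode")
```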
There was still some information within the classification that had not been expressed. Each of the instances was typed as being a Segment, Family, Class or Commodity, but how would the relationships between the instances be described? We must define an association type and association roles for the members participating in the association. For example, we can say that "Open hole well logging services (egci-013392)" (role of container) contains "Formation testing sampling services (egci-013401)" (role of content). Here is a code example.
<association id="asoc3">
  <instanceOf>
    <topicRef xlink:href="http://psi.eccma.org/unspsc/core.xtm#contains"/>
  </instanceOf>
  <member id="assoc3role1">
    <roleSpec>
      <topicRef xlink:href="http://psi.eccma.org/unspsc/core.xtm#container"/>
    </roleSpec>
    <topicRef xlink:href="#egci-013392"/>
  </member>
  <member id="assoc3role2">
    <roleSpec>
      <topicRef xlink:href="http://psi.eccma.org/unspsc/core.xtm#content"/>
    </roleSpec>
    <topicRef xlink:href="#egci-013401"/>
  </member>
</association>
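The same association could be produced for every container and content pair in a segment. The following Python sketch builds one such association with the standard library. The function name and the generated member ids are assumptions, and, as in the fragment above, the elements carry no XTM namespace.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
HREF = f"{{{XLINK}}}href"
CORE = "http://psi.eccma.org/unspsc/core.xtm"

ET.register_namespace("xlink", XLINK)


def build_contains(assoc_id, container_href, content_href):
    """Build a 'contains' association between a container and its content."""
    assoc = ET.Element("association", {"id": assoc_id})
    type_ref = ET.SubElement(assoc, "instanceOf")
    ET.SubElement(type_ref, "topicRef", {HREF: f"{CORE}#contains"})
    players = (("container", container_href), ("content", content_href))
    for number, (role, player) in enumerate(players, start=1):
        member = ET.SubElement(assoc, "member",
                               {"id": f"{assoc_id}role{number}"})
        role_spec = ET.SubElement(member, "roleSpec")
        ET.SubElement(role_spec, "topicRef", {HREF: f"{CORE}#{role}"})
        # The topic playing the role is a direct child of the member.
        ET.SubElement(member, "topicRef", {HREF: player})
    return assoc


assoc = build_contains("assoc3", "#egci-013392", "#egci-013401")
roles = [m.find("roleSpec/topicRef").get(HREF) for m in assoc.findall("member")]
players = [m.find("topicRef").get(HREF) for m in assoc.findall("member")]
```

Here "Open hole well logging services" (egci-013392) plays the container role and "Formation testing sampling services" (egci-013401) plays the content role, as in the hand-written example.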
This completes one modeling approach to represent any of the UNSPSC segments in XTM. UNSPSC codes can be included in a catalog of products of a particular company. It would be interesting to also have the products of any company classified and represented in XTM, and consider how to merge the information.
Working in the Topic Maps Published Subjects Technical Committee, I wondered if topic maps and published subjects could be used to further refine our internal corporate classification. Schlumberger does have an extensive classification for products and services, stored within an Oracle database with a unique ID assigned to each instance. It is quite extensive and descriptive and is continually updated. Many experts contribute to the process, and anyone can submit a change request.
The classification of products and services is more changeable than some other areas. The general modeling of the UNSPSC Classification in XTM could be done for other classification schemes such as the one used in Schlumberger. I have studied how topic maps might extend the classification that we already have. This work is in progress and I hope to report on it at a later date. Ultimately, it would be useful to have the entire internal corporate classification in an XTM representation. If we did represent it in XTM syntax, and other classifications such as the UNSPSC were also in XTM, then we could merge the classifications for various purposes.
The UNSPSC is designed to discover resources (locate potential suppliers) and analyze expenditures (report on the use of funds). When it is used to find potential suppliers, it would be good to consider possible scenarios where suppliers had their own classification schema detailing products and services, especially if their classifications were in XTM syntax. The detailed products and services information could be merged with the more general UNSPSC classification.
If Schlumberger would like to merge its information with others, it might want to see what parts of its own classification would map to the UNSPSC. For example, our MDT* [Modular Formation Dynamics Tester] tool [Schlumberger] could be classified in the UNSPSC as
-[71.00.00.00.00] Mining and Oil and Gas Services
-Family-[71.11.00.00.00] Oil and gas exploration services
-Class-[71.11.21.00.00] Open hole well logging services
-Commodity-[71.10.15.01.00] Formation testing sampling services.
Internally, we use two paths in our classification to describe the MDT tool. One stems from an equipment classification and the other from a Product and Services classification. Our internal classification of "Formation Sampling" could be assigned the same subject identity as that of "Formation testing sampling services" of the UNSPSC. I could also define an association where Formation Sampling (container role) contains MDT Modular Dynamics Tester (content role) in my XTM file. All of the occurrences (content classified as being about the subject MDT Tool) could be included, too.
Even though our internal classification does not entirely match the UNSPSC classification, it is close enough to establish a binding point between the two to allow merging the subjects.
Another way of doing this would be to use the UNSPSC as the upper level ontology, and then create associations between the upper levels of its classification and our more specific ones. For example, I could define separate subject identities for the Schlumberger "Formation Sampling" and "Formation Testing" categories. I could define one association where the UNSPSC "Formation Testing Sampling Services" (container role) contains the Schlumberger "Formation Sampling" (content role) category, and I could define another association where "Formation Testing Sampling Services" contains "Formation Testing." This comparative modeling analysis would not be an easy task, and I am only looking at one oilfield services tool.
Suppose Schlumberger merged with a company that had similar products. We could assign the same binding points to the resources for the information merging. This would allow for an easier analysis of the resources of the combined products and services.
From another perspective, let's say that the MDT Tool includes a part from a third-party vendor, and we want to include the maintenance manual of that part with our MDT Tool documentation. We could merge this information with our own using the same method.
In the modeling of an XTM representation, one published subject indicator could be provided for each instance in the UNSPSC classification, plus a base name, the EGCI, scoped by the topic, UNSPSC. Having these published subject indicators defined for a generalized upper level classification such as the UNSPSC is useful because when represented in topic maps in XTM syntax, the topic maps can be used as an upper ontology of binding points for information that can be shared between multiple suppliers. In another application, the more general levels of the UNSPSC as a commodity classification could also be merged with a detailed products classification such as the one that Schlumberger has, through associations between the general and specific levels of the classifications.
The modeling that is described here may also be applicable towards other general use classification schemes. It would be interesting to see this applied to classifications for government or library use; in particular, it would be very exciting to see this applied to the Library of Congress Subject Headings. If published subject indicators could be provided for these, the library community would have an easier way to merge resources of common subjects. Publishers would have an easier time creating anthologies on the internet. Teachers could use topic maps to merge lesson plans in a common directory on the Web.
Schlumberger K. K. does not endorse or recommend any products, processes, services, or suppliers as described herein. The views and opinions of the author expressed in this paper do not necessarily state or reflect those of the corporation.
*Mark of Schlumberger
I would like to thank all members of the OASIS Published Subjects Technical Committee, especially Lars Marius Garshol for his modeling advice and technical review of this paper; Murray Altheim for his review and practical approach to implementations; Steve Pepper, Bernard Vatant, Thomas Bandholtz, and Motomu Naito for all kinds of encouragement; and Suellen Stringer-Hye for pressing the issues that the library community faces. I would also like to thank Steven R. Newcomb and Liam Quin for their critical reviews, and finally, all of the people in Schlumberger who support and encourage my experimentation with new ideas.
[ECCMA a] Electronic Commerce Code Management Association Technical Secretariat. UNSPSC Technical Manual. http://www.eccma.org/download/Tech-Manual-UNSPSC.zip (8 June 2002).
[ECCMA b] Electronic Commerce Code Management Association Technical Secretariat. UNSPSC Implementation Guide. http://www.eccma.org/download/Imp-Guide-UNSPSC.zip (4 June 2002).
[ISO] International Organization for Standardization. ISO/IEC 13250 Topic Maps. Second Edition. 19 May 2002. http://www.y12.doe.gov/sgml/sc34/document/0322_files/iso13250-2nd-ed-v2.pdf
[Shafer et al.] Shafer, Keith; Weibel, Stuart; Jul, Erik; and Fausey, Jon. "Introduction to Persistent Uniform Resource Locators." OCLC Online Computer Library Center, Inc. http://purl.oclc.org/OCLC/PURL/INET96 (4 June 2002).
[Siegel] Siegel, Eli. Self and World. New York: Definition Press. 1981.
[Vatant a] Vatant, Bernard. "Binding Points for Subject Identity." Proceedings of Extreme Markup Languages 2001 Conference. August 2001.
[Vatant b] Vatant, Bernard. "Draft Proposal for Topic Map Published Subjects Requirements and Recommendations." May 2002. http://www.oasis-open.org/committees/tm-pubsubj/docs/recommendations/psdoc.htm
[Vatant c] Vatant, Bernard. "OASIS Topic Maps Published Subjects Technical Committee Meeting Minutes: 24 May 2002." Released 27 May 2002. http://www.oasis-open.org/committees/tm-pubsubj/meetings/2002-05-24.htm