Taking Topic Maps to the Nth dimension

Eric Freese
eric.freese@lexisnexis.com

Abstract

A topic map-based system is being developed that will allow users to narrow a catalog of over 36,000 different sources to a manageable level by navigating over 9 separate dimensions (or metadata axes). This system will enhance searching capabilities and help users to find the best sources for their particular information needs more efficiently. This paper discusses the business case for developing such a system and describes the implementation and design decisions made. A demonstration of the system will be included as part of the conference presentation.

Keywords: Topic Maps; Metadata; Querying

Eric Freese

Eric Freese is a consulting software engineer with LexisNexis. He has 15 years of experience in the area of document, information, and knowledge management with specific expertise in the development and implementation of XML [Extensible Markup Language] technologies. His experience includes research, analysis, specification, design, development, testing, implementation, integration, and management of information systems in a wide range of environments. He has significant research experience in human interface design, graphics interface development, and artificial intelligence. Freese is a founding member of TopicMaps.Org, the organization that developed the XTM [XML Topic Maps] specification, and currently serves as the chairman of this group. He is also the chief architect and developer of SemanText, an open source application that uses topic maps to harvest and manage knowledge.

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 Eric Freese. Reproduced with permission.

The problem

LexisNexis publishes content provided by over 36,000 different sources from around the world. A great deal of time, effort, and money is spent designing user interface screens that allow users, whatever their experience level, to efficiently find the most appropriate sources for the information they need. In cases where screens do not provide direct access to a source, a full-text lookup on the name of the source is provided. In many cases, specialized screens are developed for specific types of users, such as journalists or attorneys.

The source information is currently stored in a database that is not directly accessible by users. This information includes such metadata as:

  • name
  • publisher
  • description
  • subject matter (news, market and industry, public records, etc.)
  • topics covered (business, news, legal material, etc.)
  • type of publication (books, court filings, reference materials, etc.)
  • region (the world, North America, United States, etc.)
  • jurisdiction (federal and state courts, etc.)
  • update frequency (annually, monthly, as needed, etc.)
  • language
  • data format (full-text, multimedia, briefs, etc.)

Many of these metadata types are organized into complex hierarchies that narrow broad categories down to specific ones. This leads to a number of possible source classifications totalling more than 388,800,000,000,000,000,000 (3.888 x 10^20)! Naturally, there are not sources that fit each possible combination. A user interface that guides the user away from invalid combinations is also needed in order for the system to be usable.

The inability of users to quickly find the best sources of information has been a recognized problem for quite some time, and several attempts have been made to solve it. Figure 1 shows a lexis.com screen where specific sources are listed in menus which help to guide users to select an appropriate source. This screen allows users to start a search once they have selected a source or group of sources. However, when selecting group searches, the interface controls which sources can be grouped together.

Figure 1: Lexis.com source selection screen

Figure 2 shows another screen available from http://www.lexisnexis.com that allows users to do more complex searches of the source collection by using some of the metadata to narrow in on specific sources. However, a user cannot run a search from this system. It is intended only to find appropriate sources that can then be used in the Lexis or Nexis systems. Also, it is impossible to further refine the search once the results are displayed without going back and re-running the search.

Figure 2: Searchable directory of online sources

Figure 3 shows the result screen from the search in Figure 2. The number of hits is shown next to each metadata value. This is a rolling combination of values. In other words, there are 3034 sources that are newspapers; 20 of these come from Iowa; all 20 happen to be published in English; but only 19 are individual sources. In this case, the remaining entry is a group file set up to allow searching across all Iowa newspapers.
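
The rolling combination can be pictured as a chain of filters, each applied to whatever the previous filter left behind. The following is only a minimal Java sketch of that idea; the class name, method name, and the use of in-memory sets are assumptions made for illustration and do not describe how the existing directory is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of "rolling" hit counts. */
final class RollingCounts {

    /**
     * Applies each filter in order to the remaining sources and records
     * how many sources are left after each step.
     */
    static List<Integer> rollingCounts(Set<String> allSources, List<Set<String>> filters) {
        Set<String> remaining = new LinkedHashSet<>(allSources);
        List<Integer> counts = new ArrayList<>();
        for (Set<String> filter : filters) {
            remaining.retainAll(filter);   // keep only sources that also match this value
            counts.add(remaining.size());  // count shown next to this metadata value
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented source identifiers, not taken from the real catalog.
        Set<String> all = Set.of("s1", "s2", "s3", "s4");
        List<Set<String>> filters = List.of(
                Set.of("s1", "s2", "s3"),  // e.g. newspapers
                Set.of("s1", "s2"),        // e.g. a particular region
                Set.of("s1"));             // e.g. individual sources only
        System.out.println(rollingCounts(all, filters));
    }
}
```

Because each count depends on all the values chosen before it, changing an early value changes every later count, which is why the directory has to re-run the search to refine a result.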

Figure 3: Results from online search

Problem analysis

Further analysis of the situation led members of the data architecture group to consider using topic maps to model the information and drive an enhanced user interface.

Benefits

The anticipated benefits include:

  • better organization of the data
  • increased ease of use
  • more flexible searching of the sources
  • increased revenues from more focused searching
  • reduced cost for custom navigation screen development

Currently, the source selection data is stored in a variety of places and formats including a master SGML file and several relational databases. These data sources could be tied together to provide a unified master index that also allows users to select one or more sources on the fly without the need to pregroup the sources. This master index could then be tied into the permissions system so that users are presented with only the sources included in their subscription agreement. Another possibility would be to allow them to see all the sources and prompt them with cost information for searching in the sources that are outside their current subscription. These actions could then be recorded and sent back to the sales force when they meet with customers to discuss subscription options.

A user interface that guides a user to a list of appropriate sources would reduce the time spent by users, especially novice users, in finding the sources they are looking for. This will increase the likelihood that they will continue using the system. The increased flexibility would also aid more experienced users by making them aware of new sources coming online that resemble the sources they are already familiar with. The cost of training users will drop since they will no longer need to be familiar with the way that menus are organized within the system in order to locate a source.

By allowing users to select groups of sources to search, revenues can be increased. This will occur as they spread their searches across a wider variety of sources. This will also allow more internet-based (non-subscription) searches.

By shifting users from the current menu-based system to a selection system, the cost of creating and maintaining the menu screens will drop dramatically. This will need to be a slow migration since many Lexis.com users are accustomed to the menu interface.

Costs

The associated costs include:

  • conversion of the source database into topic map form
  • application development
  • user interface design and coding
  • procurement of appropriate software and equipment

The cost of converting the source data to topic map format should be minimal. The data is already highly organized, so the conversion to XTM should be a rather straightforward process. It might also be possible to use functionality available in some topic map software that allows database tables to be accessed as if they were topic maps. This would further reduce the effort required to get the data into a form that topic map software can process.

Many topic map tools come with an API and user interface toolset that allow applications to be developed on top of the core engine. Development staff would require training in order to most efficiently use the tools available to develop the system in a timely manner. Interfaces to other systems, such as subscription control and billing, might need to be developed, depending on the functionality defined in the final design.

Several commercial topic map tool suites are available. In order to support the anticipated level of usage of this interface, several servers would need to be set up to access the topic map. In time, it may be possible to switch some machines away from the menu-based web interface to run the topic map application if user preference moves in that direction.

The proposed solution

Development overview

The initial task was to select a topic map API from which to work. Initially the tm4j open source API was used. However, its mapping of the topic map structures within memory is such that out-of-memory errors occurred even on small test sets. It was then decided to use the Ontopia knowledge suite for development of the prototype and as a platform for delivery of the prototype topic map.

The initial development task centered on the construction of the base topic map containing the selection hierarchies. The nine (9) hierarchies were most readily available in HTML format on the company intranet. A Java program was written to create a well-formed XML file containing all the hierarchical information for each type of metadata. Another Java program converted the XML file into a base topic map that could later be merged with a topic map containing the source information. This has allowed the base topic map to be managed separately so that it does not need to be reconstructed each time the source list is updated. This provides savings in maintenance costs and time. It also allows the base topic map to be enhanced as necessary without needing to modify the topic map containing the source information.

Initially the TNC [topic naming constraint] was used to allow items with the same name to be merged into single topics. This immediately caused problems when items occurred in multiple places within the same hierarchy. The result of the merge caused several items to have more than one direct ancestor. Ontologically speaking, this was very messy. It also made intuitive navigation very challenging, since a user could be presented with multiple paths when trying to backtrack out of a hierarchy. It was determined that although these items may share a name, they are, in fact, different by virtue of their location within the hierarchy and should not be merged simply because they have the same name. Therefore, a decision was made to employ variant names for the specific display name of the item, while defining the basenames as the entire contextual string of the item in order to guarantee uniqueness, thus preventing merging based on the TNC.
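
The naming scheme can be illustrated with a minimal Java sketch. It assumes XTM 1.0 element names and the XTM 1.0 "display" published subject indicator; the class name, the " / " separator, the topic id, and the sample "Georgia" paths are all invented for illustration, since the paper does not give the actual string format used in the prototype.

```java
import java.util.List;

/** Hypothetical helper: unique base name from the full path, display name as a variant. */
final class ContextualNames {
    // Assumed separator; any string that cannot occur inside an item name would do.
    static final String SEPARATOR = " / ";
    // XTM 1.0 published subject indicator for display variants.
    static final String DISPLAY_PSI = "http://www.topicmaps.org/xtm/1.0/core.xtm#display";

    /** Base name: the whole contextual string, unique within a hierarchy. */
    static String baseName(List<String> path) {
        return String.join(SEPARATOR, path);
    }

    /** Display name: only the last step of the path, which may repeat elsewhere. */
    static String displayName(List<String> path) {
        return path.get(path.size() - 1);
    }

    /** Escapes the characters that are significant in XML character data. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one topic; the xlink prefix is assumed to be declared on the enclosing topicMap. */
    static String toXtmTopic(String id, List<String> path) {
        return "<topic id=\"" + id + "\">\n"
             + "  <baseName>\n"
             + "    <baseNameString>" + escape(baseName(path)) + "</baseNameString>\n"
             + "    <variant>\n"
             + "      <parameters>\n"
             + "        <subjectIndicatorRef xlink:href=\"" + DISPLAY_PSI + "\"/>\n"
             + "      </parameters>\n"
             + "      <variantName><resourceData>" + escape(displayName(path))
             + "</resourceData></variantName>\n"
             + "    </variant>\n"
             + "  </baseName>\n"
             + "</topic>\n";
    }

    public static void main(String[] args) {
        // Two invented items that share a display name but sit in different places.
        List<String> a = List.of("Region", "North America", "United States", "Georgia");
        List<String> b = List.of("Region", "Europe", "Georgia");
        System.out.print(toXtmTopic("region-1", a));
        System.out.print(toXtmTopic("region-2", b));
    }
}
```

Since the two base names differ, name-based merging leaves the topics separate, while the user interface can still show the short variant name.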

The next task was the conversion of the source data to XML. Since users didn’t have direct access to the source database, it was necessary to develop a program which could harvest the appropriate information from a web-based browser screen. Written in Java, this program creates a simple well-formed XML file containing not only the metadata information, but other things such as descriptions and membership of the sources within libraries. It is anticipated that a production system would have full access to the database. Access to the source databases would greatly simplify the conversion process.

The final, and most important, task was to create the topic map containing the source information. The XML created in the previous step for each source was processed to create associations between the source and the ontologies. Each source was initially associated to a specific point within the ontology. However, it was determined that to more easily drive the user interface (and perhaps more accurately model the information) it would be more appropriate to also associate all the hierarchical ancestors of the initial points to the source. For example, a source covering Iowa should also be considered in the groups of sources that cover the Midwest, the United States and North America. Initially, this was done by including the additional members in the original associations. However, the design was changed so that only binary associations were used to connect the sources to the ontology. This was done to ease maintenance of the topic map if the ontologies are changed in the future.
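
The ancestor expansion can be sketched as a walk up the hierarchy that emits one binary (source, item) pair per step. This is only an illustrative Java sketch: the class and method names are hypothetical, the source identifier is invented, and the prototype itself creates topic map associations through the topic map engine rather than string pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: one binary association per hierarchy level. */
final class AncestorAssociations {
    // child -> parent within one hierarchy; top-level items have no entry.
    // The hierarchy is assumed to be a tree (no cycles).
    private final Map<String, String> parentOf = new HashMap<>();

    void addChild(String parent, String child) {
        parentOf.put(child, parent);
    }

    /** Returns a {source, item} pair for the given item and for each of its ancestors. */
    List<String[]> associate(String source, String item) {
        List<String[]> pairs = new ArrayList<>();
        for (String node = item; node != null; node = parentOf.get(node)) {
            pairs.add(new String[] {source, node});
        }
        return pairs;
    }

    public static void main(String[] args) {
        AncestorAssociations region = new AncestorAssociations();
        region.addChild("North America", "United States");
        region.addChild("United States", "Midwest");
        region.addChild("Midwest", "Iowa");
        // "source-1" is an invented identifier for a source covering Iowa.
        for (String[] pair : region.associate("source-1", "Iowa")) {
            System.out.println(pair[0] + " -- covers --> " + pair[1]);
        }
    }
}
```

Keeping each pair as its own association means that moving an item within the hierarchy only requires regenerating the affected pairs, not editing a single large association.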

Prototype

The topic map paradigm was used because topic maps provide a multi-dimensional data organization capability out of the box. Because a particular source can be associated with several separate ontological hierarchies at the same time, the topic map model seemed to be a natural fit. In navigating the topic map, a user is able to seamlessly navigate up to nine (9) separate dimensions at the same time, finding sources that appear at the intersections of these dimensions. Topic maps also provide the expressive power to easily add more dimensions as future needs warrant.

Figure 4: Base user interface screen

The overall design of the topic map based system uses a web-based interface (shown in Figure 4) written using JSP [Java Server Pages]. Each type of metadata creates an axis (or dimension) on which users can narrow the source set down to a reasonable number from which to choose. Each axis starts at the top level of the ontology and allows a user to drill down as desired. For example, a user may navigate the region axis starting at North America and then drill down to United States, to West, and then to Texas. Users are able to use any combination of the nine (9) metadata types to focus their search to identify the best sources of the information needed. For example, a user is able to specify a search for full-text (data format) general news (topic) articles from newspapers (publication type), published daily (publication frequency), that cover Texas (region). Figure 5 shows the results of this search within the prototype test set.
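
Conceptually, a multi-axis selection is the intersection of the source sets attached to each chosen value. The Java sketch below shows that idea with a plain in-memory index; the class, its methods, the "axis: value" key format, and the second source are assumptions made for illustration, and the prototype obtains these sets from the topic map engine rather than from hash maps.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for the source topic map. */
final class AxisIntersection {
    // axis value (e.g. "region: Texas") -> sources associated with it
    private final Map<String, Set<String>> sourcesByValue = new HashMap<>();
    private final Set<String> allSources = new TreeSet<>();

    /** Records that a source is associated with each of the given axis values. */
    void tag(String source, String... values) {
        allSources.add(source);
        for (String value : values) {
            sourcesByValue.computeIfAbsent(value, k -> new TreeSet<>()).add(source);
        }
    }

    /** Returns the sources associated with every chosen value (all sources if none chosen). */
    Set<String> select(Collection<String> chosenValues) {
        Set<String> result = new TreeSet<>(allSources);
        for (String value : chosenValues) {
            result.retainAll(sourcesByValue.getOrDefault(value, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        AxisIntersection index = new AxisIntersection();
        index.tag("Austin American-Statesman",
                "format: full-text", "topic: general news", "type: newspapers",
                "frequency: daily", "region: Texas");
        // Invented second source that differs on two axes.
        index.tag("Hypothetical Weekly Review",
                "format: full-text", "topic: general news", "type: newspapers",
                "frequency: weekly", "region: Iowa");
        System.out.println(index.select(List.of(
                "format: full-text", "topic: general news", "type: newspapers",
                "frequency: daily", "region: Texas")));
    }
}
```

Adding a tenth dimension in this picture only means tagging sources with values from a new axis; the selection logic does not change.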

Figure 5: Multi-dimensional search result

By examining the results shown in Figure 5, a user receives a great deal of information. Most important, he sees that the Austin American-Statesman is probably the best source in which to search (within the test set), based on the parameters provided. If the user decides that this is not the source he wants, he can see that there are 144 newspapers, 150 general news sources, 38 Texas sources, 98 daily sources, and 1510 full-text sources and continue refocusing his search by moving around within the hierarchies. But only one (1) source, within the prototype test set, is located at the intersection of the five (5) axes.

The system has been designed in such a way that it is impossible for a user to get to the point where no sources are available for the parameters given. This has been done by re-adjusting the menu options each time a parameter is changed. For example, in Figure 4, there are 16 top-level choices available under the subject hierarchy. However, these have been narrowed down to two (2) by the time the user gets to Figure 5. The user is also able to back out of any hierarchy at any time. This allows the search parameters to be rebroadened at any point if the user decides they have followed a wrong path.
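
One way to picture the menu re-adjustment is to test every candidate option against the current result set and offer only those that would leave at least one source. The following Java sketch makes that assumption explicit; the class and method names and the sample data are hypothetical, and the paper does not describe the prototype's actual pruning code.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical illustration of pruning menu options that would lead to zero sources. */
final class MenuPruning {

    /**
     * For each candidate option, counts the currently matching sources that are
     * also associated with that option. Options with a count of zero are dropped,
     * so the user can never select a dead end from the menu.
     */
    static Map<String, Integer> viableOptions(Set<String> currentSources,
                                              Map<String, Set<String>> sourcesByOption) {
        Map<String, Integer> viable = new TreeMap<>();
        for (Map.Entry<String, Set<String>> option : sourcesByOption.entrySet()) {
            int count = 0;
            for (String source : option.getValue()) {
                if (currentSources.contains(source)) {
                    count++;
                }
            }
            if (count > 0) {
                viable.put(option.getKey(), count);
            }
        }
        return viable;
    }

    public static void main(String[] args) {
        // Invented identifiers: sources s1 and s2 match the parameters chosen so far.
        Set<String> current = Set.of("s1", "s2");
        Map<String, Set<String>> subjectOptions = Map.of(
                "News", Set.of("s1", "s2"),
                "Legal", Set.of("s1", "s3"),
                "Public Records", Set.of("s3"));
        System.out.println(viableOptions(current, subjectOptions));
    }
}
```

The counts returned alongside each surviving option are the same numbers the interface can display as feedback next to the menu entries.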

As the user maneuvers through the set of hierarchies, the system provides feedback on the number of sources matching the current set of search criteria. When the number of matches reaches a specified number of sources, a list is provided from which the user can select one or more sources to begin searching. The user can also click on a link to a source to receive more information about it to determine its appropriateness for the research being conducted.

Full-text searching capabilities are available to allow users to enter search terms within a given hierarchy. These search terms allow more experienced users to jump directly to the desired location within the ontology. For example, searching the subject axis using the term “military” narrows the choices from over 5,000 to three (3) as shown in Figure 6. It is possible that there are more hits for the full-text search, but only those subjects that have sources attached to them will be presented.

Figure 6: Full-text axis search

The full-text searching also allows a user to jump directly to a point within the source set. The full-text string is analyzed to find its component pieces. The pieces are then applied to the hierarchies to determine if any hits occur. If so, that level of the hierarchy is chosen. This allows the user to enter a focused search and be taken directly to the place in the ontology that best represents that search. Figure 7 shows the results of entering the string “New York daily newspapers”. Notice how the three (3) most appropriate axes have been selected in order to return two (2) appropriate sources in which to search. Of course, the user can then continue maneuvering within the ontology from this point to either broaden or narrow their search.
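
The paper does not say how the search string is split into pieces. As one plausible approach, the Java sketch below uses a greedy longest-phrase match against an index of item names; this technique, the class and method names, and the axis labels are all assumptions made for illustration, not the prototype's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration: greedy longest-phrase matching of a query against item names. */
final class QueryMatcher {
    // lower-cased item name (or alternate name) -> "axis: item"
    private final Map<String, String> index = new HashMap<>();

    void addName(String name, String axis, String item) {
        index.put(name.toLowerCase(Locale.ROOT), axis + ": " + item);
    }

    /** Scans the query left to right, always preferring the longest phrase that is a known name. */
    List<String> match(String query) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int consumed = 1; // skip a single word when nothing matches at this position
            for (int len = words.length - i; len >= 1; len--) {
                String phrase = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                String hit = index.get(phrase);
                if (hit != null) {
                    hits.add(hit);
                    consumed = len;
                    break;
                }
            }
            i += consumed;
        }
        return hits;
    }

    public static void main(String[] args) {
        QueryMatcher matcher = new QueryMatcher();
        matcher.addName("New York", "region", "New York");
        matcher.addName("daily", "update frequency", "Daily");
        matcher.addName("newspapers", "publication type", "Newspapers");
        System.out.println(matcher.match("New York daily newspapers"));
    }
}
```

Each hit names an axis and an item on it, so the interface can select those items and then compute the matching sources as it would for menu-driven navigation.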

Figure 7: Full-text search result

It is possible for the full-text search to return a set of zero available sources. This has been done specifically to allow a user to determine the most appropriate way to back out of the search. For example, within the test set (see Figure 8), a full-text search for “Iowa newspapers” returns no sources. The user is notified of the condition, and he can then decide whether moving back up to “Midwest” or “News” is the best method of recovery.

Figure 8: Full-text search returning no sources

The results

At this point (mid-June 2003), a small sample set (2055 sources) is being used to test the prototype system. Users who have tested the system are able to quickly narrow the source catalog down to a manageable number of choices in a few mouse clicks. All demonstrations of the prototype to user representatives, designers, and management within LexisNexis have received favorable comments. It is anticipated that the project will be formalized and the capability described herein made available to the user community (over 2 million users).

Some simple enhancements have been made to the base topic map by defining alternate names for some of the hierarchy items. Most of the candidates were within the region hierarchy. Since many of the items in this hierarchy can have multiple names, it was decided to add additional naming capabilities to make the full-text search work better. This included the addition of AKA names and abbreviations for regions. For example, “United States” was given alternate names of “US” and “USA”. Also, some regions within the U.S. were given additional names in adjective format (e.g., “northeastern” for “Northeast”). This provides a more natural language capability to search within the topic map.

It is also anticipated that other enhancements to the source ontologies will be made. These enhancements may conceivably contain domain knowledge which will make the source selection process more powerful. One such example is the connection of the region and the jurisdiction ontologies in such a way that if a certain jurisdiction is chosen, the choices within the region ontology are narrowed accordingly. Another possible enhancement is the definition of non-English names for all the items in the base topic map to support international users.

