On-the-fly Clustering As A Novel RDF Query Mechanism

Thomas B. Passin
tpassin@comcast.net

Abstract

On-the-fly clustering has developed in recent years as one of the most promising ways to organize search results. This technique groups search results into categories that are derived from the search results themselves. The categories can be constructed as the results stream in. With the right algorithms, the clusters can be remarkably useful in helping a person make sense of a large set of search hits.

Clustering techniques normally try to use as much information as possible, which in the case of typical search results on the Web means the title of the retrieved work along with whatever text snippet the search engine returns. By contrast, this work examines the effectiveness of the clustering when only the title is available. This is the case when the search occurs in a browser bookmark collection, for example. Even with this small amount of information, on-the-fly clustering can be surprisingly useful.

An RDF data set consists of a number of simple three-part "triples". A triple, being a kind of simplified sentence, is much like the title of a bookmark, though usually less informative. This similarity raises the interesting question of whether the same clustering technique used for bookmark clustering could be useful for analysis of an RDF data set. This question is investigated experimentally. It turns out that this version of clustering amounts to a remarkably powerful parallel query mechanism for RDF data. Among other characteristics, data about each of the key resources is grouped together.

The clustered results can be invaluable as an aid to investigating a collection of RDF data. The results include portions that are similar to Patrick Stickler's "Concise Bounded Description", except that all the important resources are so clustered in a single run of the algorithm. This approach has good potential to help a person trying to analyze and understand a small to moderate size RDF data set.

The algorithm has been implemented in Javascript and runs in a Web standard browser. Actual results are presented, and possible future development is discussed.

Keywords: RDF; Querying; Knowledge Management

Thomas B. Passin

Thomas Passin has been working with XML-related technologies since 1998. He helped to create the XML version of the message set in SAE J2354 Advanced Traveler Information Systems, currently in balloting, and has created a number of demonstration applications that use XML, XSLT, and Python technologies together. He also consults at work about XML and XSLT matters, and is active on a number of related discussion lists.

He is the author of the book "Explorer's Guide To The Semantic Web".

His interest in Topic Maps developed naturally from past experience with data modeling. He is currently finishing a manuscript about the Semantic Web.

Mr. Passin studied physics at the Massachusetts Institute of Technology and the University of Chicago.

On-the-fly Clustering As A Novel RDF Query Mechanism

Thomas B. Passin [Principal Systems Engineer; Mitretek Systems]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004 Thomas B. Passin. Reproduced with permission.

1.0 Introduction

The results returned by a search engine can be difficult to work with, because the results are usually not differentiated by categories. For example, a query for "swing" will return hits for the Java Swing GUI library, childrens' swings, golf swings, porch swings, big band music, and so on. When there are many hits, the user can be hard-pressed to find the ones of interest. By contrast, a few search sites display the hits grouped into categories. In the past, this was done by developing an hierarchy of categories for the site, and labeling the groups with the corresponding category labels. However, this approach is often not satisfactory for various reasons. A newer approach uses so-called "on-the-fly" clustering.

1.1 On-the-fly Clustering

In this approach, the search results are clustered based on a dynamic analysis of the actual corpus returned, rather than by using static, predetermined categories. Clustering is often based primarily on matching portions of strings. Common sequences suggest a cluster. A particular document may end up in more than one cluster. This is usually considered to be a desirable characteristic, one that reflects the complexity of real documents and the fact that real-life classifications are rarely totally exclusive. It is also possible to combine dynamic clustering with pre-established categories in various ways. One challenge is to do the large amount of processing in a short enough time to make it acceptable to users. Another is to select appropriate labels for each cluster.

1.1.1 Search Sites That Cluster Results.

Some sites cluster results to good effect. Notable are Vivisimo, Kartoo, Mooter, and IBoogie, with Vivisimo consistently returning the most useful results. With any of these sites, scanning the hit list is much easier than for a site that returns a simple flat list. There are two reasons for this. First, the user can select or reject whole groups of results because they fall into the wrong category. With flat results, the user must read the text snippet or open the linked page to see if it is what is desired. Second, sometimes the user notices unexpected topics, which can turn out to be very rewarding.

1.1.2 Clustering With Minimal Information

Typical sites that cluster request data from several established search engines. They combine the results and remove duplicates, and attempt to compute the most appropriate clusters and cluster labels. The more information available, the better this process is likely to work. Usually the main information is the title and the text snippet supplied for each hit.

The original motivation for this work was an investigation into clustering the results of queries into a large collection of browser bookmarks. In this case, only the title is available for dynamic clustering. This raised the question of just how little information is needed to get useful clustering results. The inquiry was moderately successful, as well be discussed in section 3.

1.2 Application To RDF Data

RDF data consists of a number of short "statements" that link concepts and other "resources" together. Each statement by itself contains very little information, a circumstance reminiscent of the bookmark titles. Is it possible that dynamic clustering techniques can be applied to RDF?

1.2.1 A Triple Is A Kind of Sentence

RDF data sets are organized into three-part statements, or triples, These triples are in effect short sentences, consisting as they do of a subject, a predicate, and an object. A list of triples is therefore similar to a list of bookmark titles. This raises the possibility that useful clustering of a set of triples might be possible.

1.2.2 N-Triple-like String syntax

To serve as source data for the existing clustering implementation reported here, a collection of RDF triples would have to be written as a string, one triple per line. To simplify a bit, a triple in this form would resemble the following:

book_21 title "A Field Guide To The Semantic Web"

This form is very similar to the N-Triples text format for RDF. It is not hard to convert N-Triples data to this form, and with some additional coding, it might be feasible to use the N-Triples format directly.

1.2.3 Matches Produce Interesting RDF Subgraphs

A collection of RDF statements can be considered to form a graph, where the predicates correspond to edges between nodes, which represent subjects and objects. The more edges that connect to a given node, the more information that is available about that node (that is, about the resource that the node stands for). Thus an RDF data set is likely to contain a number of connected graph components, sometimes called subgraphs. A query over an RDF data set would return a subgraph of data relating to the conditions set by the query.

Flexible string matches of the sort found by the clustering algorithms can identify subgraphs. These subgraphs should be interesting because they collect information in depth about one or many resources, and also because there can be matches among any literal values that are present. This latter motion raises the possibility of discovering related information that would be unlikely to be found by standard query languages.

1.2.4 Clustering And Queries

It turns out that the effect of the clustering algorithm is much like running an extensive set of wild-carded queries in parallel. For example, triples that contain a particular resource end up grouped together whether the resource occurs as the subject or object. This means that everything known about a resource (to a depth of one relationship, and more when the groupings around bodes are included) is collected together in one place. The groupings provide a practical alternative to graphical visualizations of RDF data, which become progressively harder to read as the amount of data becomes larger. The clustered results also can show groupings of related information that could not be shown on a typical graphic display. Thus, on-the-fly clustering can act as a powerful tool for people who need to inspect and analyze sets of RDF data.

Since clustering algorithms are fairly computationally intensive, queries can take some time, especially for larger data sets. They make up for this by being like multiple - even dozens - of queries all at once.

1.3 How This Paper Is Organized

The rest of this paper is organized as follows. Section 2 discusses on-the-fly clustering as it operates on typical (non-RDF) search results. The discussion is motivated by a desire to cluster search results when only a minimum of information is available, typically just the titles of the works returned. One algorithm for dynamic clustering is described; it uses suffix trees to build up a matching structure. The advantages of suffix trees are summarized, and the remaining parts of the algorithm are sketched out.

Section 3 discusses a Javascript implementation of the algorithm that uses a simplified suffix tree. Typical results are presented, and some interesting characteristics of the clustered results are pointed out.

Section 4 focuses on the fact that RDF data sets are organized into triples, which are in essence short sentences. If each triple is written out as a one-line string, it would be not unlike a short title of the type clustered in section 3. An RDF data set in this form can therefore be submitted to the very same algorithm discussed previously.

Section 5 presents examples of RDF data clustered by the algorithm. The results are quite striking, and their useful characteristics are discussed. Section 6 discusses promising extensions of the work, which are based in part on how suffix trees are used in the analysis of strings of DNA. This section also speculates on ways to adapt the approach to cover data contained in Topic Maps and Conceptual Graphs. Section 7 summarizes the paper.

2.0 Dynamic Clustering

Automated categorization of a collection of documents has been offered by commercial software for a number of years, with various degrees of success. The rise of large, capable search engines has led to their widespread use. Yet search engine results can be difficult to work with because they are usually presented in a flat, undifferentiated list. The user must skim both the title and the snippet of returned text for each hit to decide whether to follow its link and read more. Searching in this manner can become time-consuming and tedious. On-the-fly, or dynamic, clustering refers to a group of attempts to improve the lot of the searcher by grouping the search results based on their contents.

2.1 Characteristics Of Dynamic Clustering

Dynamic clustering usually uses the returned search results themselves as the source of the information needed to cluster the results. For a given set of results, this is often more useful then using precomputed categories, because such categories may not fit the actual results well, since they are but a subset of a larger collection. Furthermore, for a collection that is a substantial subset of the set of pages available on the Web, keeping precomputed categories up to date is a large and time-consuming task.

Another significant point is that many times, a given search result does not fit neatly into a single category. Thus a clustering algorithm should be capable of placing a document into more than one category. Typically, precomputed categorization software attempts to find a single "best" category for each document.

A third consideration is that it is often necessary to combine search results from several sites or databases. Many current search sites operate in this way. No one site would be able to precompute a set of categories to cover all the contributing sites, nor to tag all of their documents with the categories. Even if the individual sites did so, and encoded their categories into their result streams, the categories from one site would not match those from the others. Thus, pre-established categories are impractical for search results, especially for aggregated results.

Dynamic clustering generally operates separately on each set of (aggregated) results. Indeed, certain algorithms can operate on the search hits as they stream in. This is a challenging task, since the delay from receipt of the result streams to presentation to the users must be small, perhaps on the order of a few seconds or less. To increase the difficulty of the task, it is normally infeasible for the software to analyse all the complete documents found by the search. Retrieving and analyzing them all would take much too long.

Thus clustering software must usually, as a practical matter, operate on just the retrieved titles and text snippets, or summaries, that are included in the stream of results. Consequently, on-the-fly clustering cannot be expected to be as precise (a least in a library science sense) as categorizations produced by a human librarian who is able to take a considerable length of time, and who has available the entire work to be classified. On the other hand, because the dynamic categories are adjusted to fit the actual results, they can turn out to be more cogent than preset categories, for the purpose at hand.

2.2 Examples Of Clustered Search Results

Examples will help make clear the benefits of clustering. We search for the word "swing". This word is a good choice because there are many possible senses with which it can be used. "Swing" can mean the Java GUI library, a child's swing, a form of dance, a golf swing, etc. Listing 1 depicts unclustered results of a search on Google, which does not currently cluster its data. Only a portion of the results can be shown here, but it will serve.

Listing 1 - Unclustered results of a search for "swing".
Creating a GUI with JFC/Swing
Any Swing Goes - swing music dancing lindy hop
Total Swing Online
Swing: A Quick Tutorial for AWT Programmers
Swing 46 - Dinner Dance Club with Live Jazz
Welcome to the Swiss Swing Dance Society
The World Swing Dance Council
Reader adventure "a swing through a rain forest"
Sonny Watson West Coast Swing Master Classes,
SWING PLANS - Woodworking plans: Porch
Swings & Gliders ...
Swing State Project "... swing states ..."
The Swing of a Cricket Ball
[... more entries omitted ...]

Figure 1 gives clustered results from three sites that cluster results. Vivisimo, Mooter, and iBoogie. The latter's basic technology is also used in the desktop search application, Grokker.

Figure 1: Clustered results of a search for "swing".
[Link to open this graphic in a separate page]

Search results are for three different search sites. Note that only Vivisimo found results for Java (the Java Swing GUI library). The results depict only the categories, not individual hits. The numbers after each category in the VIvisimo column indicate the number of hits for each category.

Figure 2 shows a screen capture of Vivisimo results for "blogging".

Figure 2: Vivisimo results of a search for "blogging"
[Link to open this graphic in a separate page]

Screen capture of typical Vivisimo clustered search results. The graphic has been clipped at the bottom. Note that some categories may be expanded to reveal subcategories (not shown here).

When the number of search results is large, clustering can clearly help a person to choose which returns to sample, provided the clustering is of sufficient quality, as is apparent from figure 2.

2.3 Motivation For This Work

The present work grew out of an experiment in applying dynamic clustering to a set of bookmark titles. The author's bookmark manager application [Passin 2003] can search through bookmark titles and return a list of matching resources. Currently, the list is grouped according to the top-level folders the bookmarks are filed under. Because titles are short, they do not offer much information for a clustering algorithm to work with, but it seemed possible that the clustering results might be useful nevertheless.

Accordingly, a clustering procedure was developed following the algorithm discussed in [Zaimir, Etzioni 1998]. Later it became clear that using the algorithm to cluster RDF data sets was feasible and interesting.

2.4 Characteristics Of On-the-fly (Dynamic) Clustering

Dynamic clustering has been characterised as follows by Vivisimo -

Automatically categorizes search results on-the-fly into meaningful hierarchical folders.

and by iBoogie -

Clusterizer puts documents with similar content or with related topics into the same group. Each group is assigned a label based on the content of the documents.

Dynamic clustering typically uses just the information in the returned results to perform the clustering, although it is possible to incorporate pre-defined categories to a degree as well. Usually the goal is to group the results a Web searches, with the aim of improving the utility of the results. Therefore, a practical algorithm must return quickly after all the search results have arrived, generally within a few seconds. This will be easiest to accomplish if the clustering can be performed as the results stream in, rather than in a batch process afterwards.

It is also very desirable for the execution time to scale linearly with the number of documents, to allow the system to scale effectively. Note that a naive clustering algorithm is likely to require (N2) time, where N is the number of documents.

Dynamic clustering algorithms attempt to find similar works based on matching phrases in their titles and text. In some cases, works may be clustered into a kind of super cluster or meta cluster. If document D1 shares a phrase with D2, and D2 shares a different phrase with document D3, all three may be grouped together, depending on how the algorithm scores the similarities and the significance of the phrases that match. It is also possible to develop subcategories within higher-level clusters, as depicted in figure 2.

It is not uncommon for a given document to be clustered into more than one category. This is desirable, as long as documents do not get clustered into too many places, since a given document is likely to contain information about more than one subject or theme.

Each cluster must be given a title, of course. It can be challenging to compute a suitable title for a given cluster. Preferably, the titles will seem to the reader to have been written by a person rather than computed. This ideal is reached to greater or lesser degrees in practice, depending on the design and complexity of the algorithm for selecting a title.

2.5 Titles Only Vs. Text Snippets

Titles, being short summaries of a document's intent, contain much less information than even the text snippet returned by a typical search engine, let alone the entire document. Can titles alone contain enough information for provide useful clustering? This question formed the original motivation for the work reported here, as explained in section 2.3. [Zaimir, Etzioni 1998] reports on experiments that compared the use of snippets (averaging 20 words) vs. entire Web documents (averaging 760 words, or 220 words after preprocessing that removed certain non-significant words). There was little difference in the precision of the clustering. This data holds out some hope that titles alone might be sufficient to support a useful clustering quality.

2.6 Examples Of Clustered Bookmark Titles

The algorithm described in [Zaimir, Etzioni 1998] was implemented (rather crudely) in Javascript, and is described further in section 3. A search was made with the bookmark manager for titles that contained the string "rdf". Figure 3 shows part of the results, grouped under the top-level folders they are filed under (in other words, the clustering procedure has not been applied).

Figure 3: Results of a search for titles containing "rdf".
[Link to open this graphic in a separate page]

The search results were not clustered, but have been grouped under the top-level folders they are filed under in the bookmark collection. The entire result set contains approximately 150 titles. The results have been clipped on the right side so as not to have excessive width.

The groupings depicted in figure 3 seem fairly sensible, as they ought since they were chosen by the owner of the bookmarks. Figure 4 shows the same search results, clustered by the procedure described in section 3.

Figure 4: Clustered results of a search for titles containing "rdf".
[Link to open this graphic in a separate page]

The search results shown in figure 3 were clustered by the procedure described in this paper. The entire result set contains approximately 150 titles. The underlined, boldfaced words or phrases are the titles of the corresponding groups.

The clusterings depicted in figure 4 are completely unlike the groupings in figure 3, but nevertheless seem reasonable. One major difference is that the cluster headings are more closely tied to the text of the titles, as might be expected. In figure 3, in which the headings were created by a person, the heading often does not even appear in the documents' title. For example, in figure 3 there is an entry "DAML+OIL Definition in RDF", listed under the heading Ontology. To a person, it is not surprising to see this relationship, but since the title does not contain the word ontology, the dynamic clustering algorithm used here is not capable of making the connection.

We conclude from these results that clustering on the title seems promising, although naturally not of the quality of results achieved by, for example, Vivisimo.

2.7 The Clustering Algorithm

As mentioned in section 2.3, the procedure used for this work was developed based on work reported in [Zaimir, Etzioni 1998]. The core of this work is an adaptation of the so-called suffix tree. The design and use of suffix trees is covered thoroughly in [Gusfield 1997]. Suffix trees are used for string matching and partial matching problems. They can be constructed in linear time, and matching can also be performed in linear time (linear in the number of test strings, and linear in the size of the suffix tree). Because of their good performance, they have been used extensively for matching DNA fragments, which can be represented as very long strings. In addition, The suffix tree approach has been adapted for matching against multiple test strings simultaneously, which is of considerable interest in the present context.

We briefly describe a suffix tree, and then describe its adaptation for dynamic clustering. The description will lack detail since the thrust of this work is the clustering results, not this particular implementation. A classic suffix tree contains all suffixes of a string of letters. A suffix of a string is defined as the substring from a given letter to the end of the string. The prefix is the rest of the string (the "front end", where the suffix would be the "rear end")

Figure 5: A simple suffix tree.
[Link to open this graphic in a separate page]

The figure shows all the suffixes of the string abcabf, arranged in a suffix tree. The path from each leaf node to the root is labeled such that the concatenation of the edge labels equals one suffix of the string, without duplication.

As depicted in figure 5, every suffix of the string is represented once in the tree, as the concatenation of the edge labels from a leaf node to the root. Interior nodes are inserted where two suffixes share a common portion. Although not shown here, the nodes are labeled with pointers into the string. Once the tree is built, it can be reused for as many string matches as desired. Typically, the string to be matched against the tree is much smaller than the length of the string encoded in the tree. There are several implementations that allow a suffix tree to be built in linear time.

The suffix tree as described so far is made up of letters. One of the central adaptations for dynamic clustering is to change from letters to arbitrary tokens; in this case, the words of the documents are taken to be the tokens. Each document to be clustered is broken up into words, and then its suffixes are added to the tree. In this way, the tree can be build incrementally as the results stream in. Information on which strings match which prefixes is added to the appropriate nodes during construction.

Clusters are found by examining the strings (of words) that share a node. There is a weighting procedure that eliminates words or phrases that are too common or not common enough. After all, if a word were to appear in every cluster, the clustering would be of no value for discriminating that word from others. Similarly, if a word were not shared by other documents, there would be little point in clustering it.

Next, The clusters are compared pairwise to see which ones share documents. Again, there is a scoring procedure. Cluster that survive this rating are considered "connected", and these connected components of the entire graph of clusters form the superclusters that are to be presented to the user. Again, once enough of the tree has been built, the clusters and superclusters can be formed, and simply modified slightly as more results stream in. However, in the work reported here, streaming is not used.

The final step is to compute a good title for each of the superclusters. This is not a trivial procedure, and the referenced material does not give details of how it should be accomplished. The title is selected from the matching phrases of a supercluster, but because a supercluster can include documents that do not share a phrase with one or more of the other members, there may be a number of candidate phrases.

3.0 Javascript Implementation

The implementation used for this work was written in Javascript, because the original intent was to integrate it into the bookmark manager, which itself is written in Javascript (since it runs in a Web browser). The implementation strategy emphasized ease of development rather than sophistication, since the intent was to explore clustering titles and to evaluate the value of the results. Thus, a number of shortcuts were taken that reduced performance, especially scaling, but also reduced development time.

In particular, a full suffix tree is not actually built, and matching is not done by traversing the tree. Instead, the suffixes are inserted into an associative array (which we will call a dictionary from here on). In effect, each suffix is attached to the root. This approach allows testing for a match by a simple and fast dictionary lookup instead of by a tree traversal. The dictionary entry for each key also contains a list of all documents that have a phrase (sequence of one or more words) that matches the key.

However, since the tree is not actually traversed, arbitrary subphrases cannot be found. To make up for this, all possible substrings of each title are inserted into the tree. Thus matches can be found by using each phrase as a key into the dictionary. This design is easy to code but does not scale well. However, it is adequate for the exploratory work reported here.

Some preprocessing of the titles is also done before it is added to the dictionary. Unwanted special characters such as "/", ":", etc., are translated to spaces or underscores, or simply removed. After this stop words are removed; to date the stop-list has been hand-written and contains some 80 words. The list is extended when necessary based on source text samples. Finally, the remaining words are stemmed using a standard Porter stemming routine. The stemming routine was ported to Javascript from a Python version written by Vivake Gupta, released in 2001.

Apart from the changes discussed above, the implementation follows the description in [Zaimir, Etzioni 1998] as closely as possible.

4.0 RDF and Clustering

Data modeled with the Resource Description Framework (RDF) follows a simple pattern. This paper does not attempt to give a comprehensive introduction to RDF. If one is needed, see [RDF Primer]. Here we summarize a few features that are essential for our theme.

4.1 Essential Features of RDF

The main essential feature for our purposes is that each bit of information in RDF consists of three parts, a subject, a property or predicate, and an object (also called a value or a property value). The three-part structure is known as a statement or a triple.

The subject and predicate of a statement are abstractions called resources. The object is either a resource or a literal value. A resource may stand for virtually anything, whether retrievable over a network or not. Resources are normally identified using a URI reference as a global identifier; the URI reference (we will just say URI even when we mean URI Reference) does not need to point to an actual retrievable resource. In some cases a resource will not have a global identifier. Such resources are known as blind nodes, or simply bnodes. The node terminology comes from viewing a collection of triples as an edge-labeled directed graph, where predicates are the edges, and an edge connecting a subject with an object points from the former to the latter.

One node may have more than one statement where it is the subject. For example, bnodes will commonly be the subject of numerous statements; they typically represent an unnamed collection point. For example, a row in a relational table might be modeled by a bnode; each column would become an edge, and each cell value would become an object.

4.2 Similarities Between Titles and RDF Statements

Titles are normally phrases or short sentences. Here are some typical examples:

An Axiomatic Semantics for RDF, RDF Schema, and DAML+OIL
An RDF view of REST
ARP Another RDF Parser
Automatic RDF Metadata Generation for Resource Discovery
Basic XML and RDF techniques for knowledge management

An RDF statement is essentially a simple sentence, and even short that most titles. Here are some examples, with URIs replaced by simple words to make them more readable:

John knows Sara
John livesIn Montreal
Montreal type city
Montreal locatedIn Canada
Canada type country
We see that an RDF statement is similar to a short - very short - title. Seen in this way, it is a short step to the question of whether a collection of RDF statements can be successfully clustered, and if so, whether the results would be of use.

4.3 Possible Benefits From Clustering RDF Statements

Natural language titles are subject to all the foibles of any natural language utterance, including variations in word order that would prevent clustering of titles that would otherwise be put together. RDF statements do not have such a problem. In addition, each part of a statement is but a single piece - either a URI or a literal value (although some literal values may contain several words). Various techniques are available to make URIs more readable, such as replacing the base of a URI with an XML entity or an XML namespace prefix.

So it seems reasonable that an attempt to cluster a collection of RDF statements might cluster rather effectively. If so, we could expect that statements relating to each subject, each predicate, and each object would be collected together. In addition, statements for each combination of (subject+predicate), etc., ought to cluster as well. In other words, the clustering algorithm ought to pull together everything directly known about any resource. This would amount to performing a whole series of queries like the following:

  • Who does John know?
  • What cities are locatedIn Canada?
  • Tell me everything directly relating to Sara.

Bnodes would also be queried, with results something like the following:

There is something (the bnode) having
- type car
- color white
- make "Mazda"
- owner John

It is true that an ordinary RDF query language can pose and answer such questions, with the help of wildcards, but the results would not be grouped in useful ways. In addition, such a query language requires a query syntax, a query parser, and other complex components, none of which are needed with the dynamic clustering approach.

5.0 Examples of Clustered RDF

This section presents some preliminary results of applying the clustering procedure to RDF data. The URIs have been replaced by more readable words. The RDF fragment is based on a real RDF document; the readable words are generally taken from the final step of actual URIs. Bnodes have been given temporary identifiers like "gen002" for readability.

5.1 The Source RDF

A portion of the RDF is shown below in the Javascript form that the procedure consumes:

var orig_sources=[
'gw1 type gateway',
'gw1 operated-by mitretek',
'gw1 gets-atis-info gen001',
'gen001 atis-provider california',
'gen001 atis-info-type transit',
'gen001 atis-info-type incident',
'gen001 wsdl-location "http://www.example.org/calits/wsdl/xml"',
'gen002 atis-provider oregon',
'gen002 atis-info-type 511',
'gen002 atis-info-type incident',
'gen002 atis-info-schedule "hourly"',
'gw1 atis-info-format gen003',
'gen003 atis-provider oregon',
'gen003 display-format "map"',
'gen003 display-format "HTML table"',
...]

It is easy to obtain data in this format by starting with the so-called N-Triples format. A number of RDF tools can output N-Triples data, and some simple processing with regular expressions will produce data in the format shown above.

5.2 Clustered RDF Results.

Figure 6 depicts a portion of the results.

Figure 6: Clustered RDF results
[Link to open this graphic in a separate page]

A portion of the clustered RDF. The underlined, boldfaced words are the titles of the corresponding groups.

An examination of these results shows that the expectations of section 4.3 as to the characteristics of the results are largely met. Take, for instance, the group with the title "display-format". The five statements in this group represent everything known to this dataset about the property "display-format". Notice that "display-format" appears in the predicate position as well as the subject position in this group. It would be hard to write a wild-carded query with a conventional query language that would produce such a result. This type of grouping lets one rapidly scan the data set in a particularly efficient way.

Similarly, note the group with the title "gen-001". Notice that in the last of the five statements, "gen-001" appears in the object (last) position while it appears in the subject (first) position in the others. An English rendering of the meaning of these statements would say that "gw1" (which happens to be a gateway) receives a collection of specialized information, which collection is represented by the bnode labeled "gen-001". The gen-001 collection includes data of types incident and transit, is provided by the state of California, and is available in a manner described in a WSDL file located at the given URL.

Other clustering runs have been made using other RDF datasets (not shown here). It has proved useful to use entity-style abbreviations for the base URI of resource identifiers, just to make the clustered results ore readable.

5.3 Clustered Results as Parallel Queries

The results of this particular example described here are somewhat as if many queries had been run simultaneously. The results were clustered automatically into 16 fairly cogent groups. The cluster procedure in effect ran queries similar to the following:

Select triple where subj = gen002

as well as queries similar to the following:

Select triple where subj = gen002 and pred = atis-info-type

(plus obvious permutations of the above).

5.4 Algorithm Modifications For Use With RDF Source

To process RDF, only minor changes were needed to the Javascript algorithm described earlier. Stemming was eliminated, as it no longer made much sense (though it is possible that stemming would still be useful for data that had a significant amount of natural language stored in literal values). Also, most of the character translations were removed, since symbols like "/" and "#" are now significant. No other changes were necessary. Of course, an improved implementation of the algorithm would be desirable so that much larger data sets would be practical to process.

5.5 Immediate Uses

In its current form, the clustering procedure is valuable when one wants to analyze a smallish collection of RDF. One can rapidly discover which resources are in use and their main properties. It would be well-suited for clustering query results, especially when the queries contain many wildcards. Doing so would help the organize the query results in particularly useful ways.

For example, one could easily

  • Review all range/domain assignments.
  • Find all resources related to a specific bnode.
  • Find resources with duplicated properties.

It would also be easy to format the results as XML for further analysis. Clearly, such clustering can produce well-focused subgraphs, reminiscent of Stickler's URIQA protocol [URIQA]. URIQA returns a subgraph of information for a specified subject to a greater depth where bnodes are involved, but much of the additional information is included in the clustered results, although not organized the same way.

6.0 Possible Extensions

Potential extensions exist, some fairly obvious and others more speculative and more ambitious.

6.1 Relatively Straightforward Extensions

There are a variety of ways one could tune the algorithm. One could save the tree and later match it against specific triples or sets of triples to see if they are included in the data set. One could filter the triples during tree construction, to reduce its size and restrict it to certain areas of interest. The rules for scoring meta clusters could be tuned to achieve specific goals or effectively.

And, of course, one could upgrade the implementation to use a proper suffix tree. Although obvious, this probably should not be counted as "straightforward".

6.2 Visualizations

One can imagine applications where the user clicks on a cluster title and the corresponding subgraph lights up in a display of the graph. Alternatively, only the subgraph might be shown. No doubt there are many other visualizations for which the clustering technique would be useful.

6.3 Extension To Topic Maps And Conceptual Graphs

It is natural to try to apply the technique to other kinds of data sources, such as Topic Maps or Conceptual Graphs. It is not immediately clear how to proceed. The algorithm depends on regarding sentences, phrases, titles, or triples (all being essentially similar) as strings of tokens. Clearly, any tuple can be easily transformed into this form as long as its items do not contain internal structure. However, when the data of interest are structured, as is generally the case with Topic Maps and Conceptual Graphs, one needs to have a canonical way to serialize the structures of interest into strings of tokens.

No canonical serializations are currently available for either Topic Maps or Conceptual Graphs, at least none that would be guaranteed to produce a specific, repeatable order, to the best of the author's knowledge. Once such a serialization method has been defined, dynamic clustering could be used with such data sources.

6.4 Sheer Speculation

A properly implemented suffix tree algorithm is known to be highly efficient for a wide range of difficult string matching problems, including partial matches. A range of variations allow one to make useful trade offs between time and memory required to build or store the tree, and time or memory needed to perform different kinds of queries. The suffix tree family of algorithms is known to be capable of handling the very large string matching problems presented by DNA research.

It seems reasonable to wonder, then, whether a suffix tree might make a suitable primary data store for a large collection of RDF data. If this should prove feasible, then the matching, querying, and clustering benefits of the suffix tree would be available all the time, with no need to build a special tree when needed. While a suffix tree can be much larger than the strings that go into its construction, it is not obvious that such a tree would be larger than a more conventional triple store including all indexes necessary for good performance.

7.0 Summary

This paper has shown how a dynamic clustering algorithm, originally developed to help organize the results of Web searches, can be easily adapted to be useful for organizing RDF data. The value of the procedure is that it automatically groups together a great deal of information about each RDF resource in the dataset. An overview of the algorithm was discussed. An example of clustered data was shown, using a set of bookmark titles as the source.

The rationale for applying the algorithm to RDF data was explained, and an example of clustered RDF data was presented. Possible extensions of the method were discussed, leading to a speculation that the core data structure, the suffix tree, might prove to be suitable as the primary data store for a large RDF dataset.


Bibliography

[Gusfield 1997] Gusfield, Dan, "Algorithms on Strings, Trees, and Sequences", Cambridge, ISBN 0-521-58519-8

[Passin 2003] Passin, Thomas B., "Browser Bookmark Management With Topic Maps", http://www.mulberrytech.com/Extreme/Proceedings/html/2003/Passin01/EML2003Passin01.html

[RDF Primer] Minola, F., and Miller, E., "RDF Primer", The World Wide Web Consortium, http://www.w3.org/TR/rdf-primer

[URIQA] URIQA! The URI Query Agent Model - http://sw.nokia.com/uriqa/URIQA.html.

[Zaimir, Etzioni 1998] Zamir, O. and Etzioni, O., "Web Document Clustering: A Feasibility Demonstration", http://www.cs.washington.edu/homes/etzioni/papers/sigir98.pdf



On-the-fly Clustering As A Novel RDF Query Mechanism

Thomas B. Passin [Principal Systems Engineer, Mitretek Systems]
tpassin@comcast.net