On-the-fly clustering has developed in recent years as one of the most promising ways to organize search results. This technique groups search results into categories that are derived from the search results themselves. The categories can be constructed as the results stream in. With the right algorithms, the clusters can be remarkably useful in helping a person make sense of a large set of search hits.
Clustering techniques normally try to use as much information as possible, which in the case of typical search results on the Web means the title of the retrieved work along with whatever text snippet the search engine returns. By contrast, this work examines the effectiveness of the clustering when only the title is available. This is the case when the search occurs in a browser bookmark collection, for example. Even with this small amount of information, on-the-fly clustering can be surprisingly useful.
An RDF data set consists of a number of simple three-part "triples". A triple, being a kind of simplified sentence, is much like the title of a bookmark, though usually less informative. This similarity raises the interesting question of whether the same clustering technique used for bookmark clustering could be useful for analysis of an RDF data set. This question is investigated experimentally. It turns out that this version of clustering amounts to a remarkably powerful parallel query mechanism for RDF data. Among other characteristics, data about each of the key resources is grouped together.
The clustered results can be invaluable as an aid to investigating a collection of RDF data. The results include portions that are similar to Patrick Stickler's "Concise Bounded Description", except that all the important resources are so clustered in a single run of the algorithm. This approach has good potential to help a person trying to analyze and understand a small to moderate size RDF data set.
The results returned by a search engine can be difficult to work with, because the results are usually not differentiated by categories. For example, a query for "swing" will return hits for the Java Swing GUI library, children's swings, golf swings, porch swings, big band music, and so on. When there are many hits, the user can be hard-pressed to find the ones of interest. By contrast, a few search sites display the hits grouped into categories. In the past, this was done by developing a hierarchy of categories for the site, and labeling the groups with the corresponding category labels. However, this approach is often not satisfactory for various reasons. A newer approach uses so-called "on-the-fly" clustering.
In this approach, the search results are clustered based on a dynamic analysis of the actual corpus returned, rather than by using static, predetermined categories. Clustering is often based primarily on matching portions of strings. Common sequences suggest a cluster. A particular document may end up in more than one cluster. This is usually considered to be a desirable characteristic, one that reflects the complexity of real documents and the fact that real-life classifications are rarely totally exclusive. It is also possible to combine dynamic clustering with pre-established categories in various ways. One challenge is to do the large amount of processing in a short enough time to make it acceptable to users. Another is to select appropriate labels for each cluster.
Some sites cluster results to good effect. Notable are Vivisimo, Kartoo, Mooter, and IBoogie, with Vivisimo consistently returning the most useful results. With any of these sites, scanning the hit list is much easier than for a site that returns a simple flat list. There are two reasons for this. First, the user can select or reject whole groups of results because they fall into the wrong category. With flat results, the user must read the text snippet or open the linked page to see if it is what is desired. Second, sometimes the user notices unexpected topics, which can turn out to be very rewarding.
Typical clustering sites request data from several established search engines. They combine the results, remove duplicates, and attempt to compute the most appropriate clusters and cluster labels. The more information available, the better this process is likely to work. Usually the main information is the title and the text snippet supplied for each hit.
The original motivation for this work was an investigation into clustering the results of queries into a large collection of browser bookmarks. In this case, only the title is available for dynamic clustering. This raised the question of just how little information is needed to get useful clustering results. The inquiry was moderately successful, as will be discussed in section 3.
RDF data consists of a number of short "statements" that link concepts and other "resources" together. Each statement by itself contains very little information, a circumstance reminiscent of the bookmark titles. Is it possible that dynamic clustering techniques can be applied to RDF?
RDF data sets are organized into three-part statements, or triples. These triples are in effect short sentences, consisting as they do of a subject, a predicate, and an object. A list of triples is therefore similar to a list of bookmark titles. This raises the possibility that useful clustering of a set of triples might be possible.
To serve as source data for the existing clustering implementation reported here, a collection of RDF triples would have to be written as a string, one triple per line. To simplify a bit, a triple in this form would resemble the following:
book_21 title "A Field Guide To The Semantic Web"
This form is very similar to the N-Triples text format for RDF. It is not hard to convert N-Triples data to this form, and with some additional coding, it might be feasible to use the N-Triples format directly.
A collection of RDF statements can be considered to form a graph, where the predicates correspond to edges between nodes, which represent subjects and objects. The more edges that connect to a given node, the more information that is available about that node (that is, about the resource that the node stands for). Thus an RDF data set is likely to contain a number of connected graph components, sometimes called subgraphs. A query over an RDF data set would return a subgraph of data relating to the conditions set by the query.
Flexible string matches of the sort found by the clustering algorithms can identify subgraphs. These subgraphs should be interesting because they collect information in depth about one or many resources, and also because there can be matches among any literal values that are present. This latter notion raises the possibility of discovering related information that would be unlikely to be found by standard query languages.
It turns out that the effect of the clustering algorithm is much like running an extensive set of wild-carded queries in parallel. For example, triples that contain a particular resource end up grouped together whether the resource occurs as the subject or object. This means that everything known about a resource (to a depth of one relationship, and more when the groupings around bnodes are included) is collected together in one place. The groupings provide a practical alternative to graphical visualizations of RDF data, which become progressively harder to read as the amount of data becomes larger. The clustered results also can show groupings of related information that could not be shown on a typical graphic display. Thus, on-the-fly clustering can act as a powerful tool for people who need to inspect and analyze sets of RDF data.
Since clustering algorithms are fairly computationally intensive, queries can take some time, especially for larger data sets. They make up for this by acting like multiple queries, even dozens, run all at once.
The rest of this paper is organized as follows. Section 2 discusses on-the-fly clustering as it operates on typical (non-RDF) search results. The discussion is motivated by a desire to cluster search results when only a minimum of information is available, typically just the titles of the works returned. One algorithm for dynamic clustering is described; it uses suffix trees to build up a matching structure. The advantages of suffix trees are summarized, and the remaining parts of the algorithm are sketched out.
Section 4 focuses on the fact that RDF data sets are organized into triples, which are in essence short sentences. If each triple is written out as a one-line string, it would be not unlike a short title of the type clustered in section 3. An RDF data set in this form can therefore be submitted to the very same algorithm discussed previously.
Section 5 presents examples of RDF data clustered by the algorithm. The results are quite striking, and their useful characteristics are discussed. Section 6 discusses promising extensions of the work, which are based in part on how suffix trees are used in the analysis of strings of DNA. This section also speculates on ways to adapt the approach to cover data contained in Topic Maps and Conceptual Graphs. Section 7 summarizes the paper.
Automated categorization of a collection of documents has been offered by commercial software for a number of years, with various degrees of success. The rise of large, capable search engines has led to their widespread use. Yet search engine results can be difficult to work with because they are usually presented in a flat, undifferentiated list. The user must skim both the title and the snippet of returned text for each hit to decide whether to follow its link and read more. Searching in this manner can become time-consuming and tedious. On-the-fly, or dynamic, clustering refers to a group of attempts to improve the lot of the searcher by grouping the search results based on their contents.
Dynamic clustering usually uses the returned search results themselves as the source of the information needed to cluster the results. For a given set of results, this is often more useful than using precomputed categories, because such categories may not fit the actual results well, since they are but a subset of a larger collection. Furthermore, for a collection that is a substantial subset of the set of pages available on the Web, keeping precomputed categories up to date is a large and time-consuming task.
Another significant point is that many times, a given search result does not fit neatly into a single category. Thus a clustering algorithm should be capable of placing a document into more than one category. Typically, precomputed categorization software attempts to find a single "best" category for each document.
A third consideration is that it is often necessary to combine search results from several sites or databases. Many current search sites operate in this way. No one site would be able to precompute a set of categories to cover all the contributing sites, nor to tag all of their documents with the categories. Even if the individual sites did so, and encoded their categories into their result streams, the categories from one site would not match those from the others. Thus, pre-established categories are impractical for search results, especially for aggregated results.
Dynamic clustering generally operates separately on each set of (aggregated) results. Indeed, certain algorithms can operate on the search hits as they stream in. This is a challenging task, since the delay from receipt of the result streams to presentation to the users must be small, perhaps on the order of a few seconds or less. To increase the difficulty of the task, it is normally infeasible for the software to analyze all the complete documents found by the search. Retrieving and analyzing them all would take much too long.
Thus clustering software must usually, as a practical matter, operate on just the retrieved titles and text snippets, or summaries, that are included in the stream of results. Consequently, on-the-fly clustering cannot be expected to be as precise (at least in a library science sense) as categorizations produced by a human librarian who is able to take a considerable length of time, and who has available the entire work to be classified. On the other hand, because the dynamic categories are adjusted to fit the actual results, they can turn out to be more cogent than preset categories, for the purpose at hand.
Examples will help make clear the benefits of clustering. We search for the word "swing". This word is a good choice because there are many possible senses with which it can be used. "Swing" can mean the Java GUI library, a child's swing, a form of dance, a golf swing, etc. Listing 1 depicts unclustered results of a search on Google, which does not currently cluster its data. Only a portion of the results can be shown here, but it will serve.
Figure 1 gives clustered results from three sites that cluster results: Vivisimo, Mooter, and iBoogie. The latter's basic technology is also used in the desktop search application, Grokker.
Search results are shown for three different search sites. Note that only Vivisimo found results for Java (the Java Swing GUI library). The results depict only the categories, not individual hits. The numbers after each category in the Vivisimo column indicate the number of hits for each category.
Figure 2 shows a screen capture of Vivisimo results for "blogging".
Screen capture of typical Vivisimo clustered search results. The graphic has been clipped at the bottom. Note that some categories may be expanded to reveal subcategories (not shown here).
When the number of search results is large, clustering can clearly help a person to choose which returns to sample, provided the clustering is of sufficient quality, as is apparent from figure 2.
The present work grew out of an experiment in applying dynamic clustering to a set of bookmark titles. The author's bookmark manager application [Passin 2003] can search through bookmark titles and return a list of matching resources. Currently, the list is grouped according to the top-level folders the bookmarks are filed under. Because titles are short, they do not offer much information for a clustering algorithm to work with, but it seemed possible that the clustering results might be useful nevertheless.
Accordingly, a clustering procedure was developed following the algorithm discussed in [Zamir, Etzioni 1998]. Later it became clear that using the algorithm to cluster RDF data sets was feasible and interesting.
Dynamic clustering has been characterized as follows by Vivisimo:
Automatically categorizes search results on-the-fly into meaningful hierarchical folders.
and by iBoogie:
Clusterizer puts documents with similar content or with related topics into the same group. Each group is assigned a label based on the content of the documents.
Dynamic clustering typically uses just the information in the returned results to perform the clustering, although it is possible to incorporate pre-defined categories to a degree as well. Usually the goal is to group the results of Web searches, with the aim of improving the utility of the results. Therefore, a practical algorithm must return quickly after all the search results have arrived, generally within a few seconds. This will be easiest to accomplish if the clustering can be performed as the results stream in, rather than in a batch process afterwards.
It is also very desirable for the execution time to scale linearly with the number of documents, to allow the system to scale effectively. Note that a naive clustering algorithm is likely to require O(N²) time, where N is the number of documents.
Dynamic clustering algorithms attempt to find similar works based on matching phrases in their titles and text. In some cases, works may be clustered into a kind of super cluster or meta cluster. If document D1 shares a phrase with D2, and D2 shares a different phrase with document D3, all three may be grouped together, depending on how the algorithm scores the similarities and the significance of the phrases that match. It is also possible to develop subcategories within higher-level clusters, as depicted in figure 2.
It is not uncommon for a given document to be clustered into more than one category. This is desirable, as long as documents do not get clustered into too many places, since a given document is likely to contain information about more than one subject or theme.
Each cluster must be given a title, of course. It can be challenging to compute a suitable title for a given cluster. Preferably, the titles will seem to the reader to have been written by a person rather than computed. This ideal is reached to greater or lesser degrees in practice, depending on the design and complexity of the algorithm for selecting a title.
Titles, being short summaries of a document's intent, contain much less information than even the text snippet returned by a typical search engine, let alone the entire document. Can titles alone contain enough information to provide useful clustering? This question formed the original motivation for the work reported here, as explained in section 2.3. [Zamir, Etzioni 1998] reports on experiments that compared the use of snippets (averaging 20 words) vs. entire Web documents (averaging 760 words, or 220 words after preprocessing that removed certain non-significant words). There was little difference in the precision of the clustering. This data holds out some hope that titles alone might be sufficient to support a useful clustering quality.
The search results were not clustered, but have been grouped under the top-level folders they are filed under in the bookmark collection. The entire result set contains approximately 150 titles. The results have been clipped on the right side so as not to have excessive width.
The groupings depicted in figure 3 seem fairly sensible, as they ought to, since they were chosen by the owner of the bookmarks. Figure 4 shows the same search results, clustered by the procedure described in section 3.
The search results shown in figure 3 were clustered by the procedure described in this paper. The entire result set contains approximately 150 titles. The underlined, boldfaced words or phrases are the titles of the corresponding groups.
The clusterings depicted in figure 4 are completely unlike the groupings in figure 3, but nevertheless seem reasonable. One major difference is that the cluster headings are more closely tied to the text of the titles, as might be expected. In figure 3, in which the headings were created by a person, the heading often does not even appear in the document's title. For example, in figure 3 there is an entry "DAML+OIL Definition in RDF", listed under the heading Ontology. To a person, it is not surprising to see this relationship, but since the title does not contain the word ontology, the dynamic clustering algorithm used here is not capable of making the connection.
We conclude from these results that clustering on the title seems promising, although naturally not of the quality of results achieved by, for example, Vivisimo.
As mentioned in section 2.3, the procedure used for this work was developed based on work reported in [Zamir, Etzioni 1998]. The core of this work is an adaptation of the so-called suffix tree. The design and use of suffix trees is covered thoroughly in [Gusfield 1997]. Suffix trees are used for string matching and partial matching problems. They can be constructed in linear time, and matching can also be performed in linear time (linear in the number of test strings, and linear in the size of the suffix tree). Because of their good performance, they have been used extensively for matching DNA fragments, which can be represented as very long strings. In addition, the suffix tree approach has been adapted for matching against multiple test strings simultaneously, which is of considerable interest in the present context.
We briefly describe a suffix tree, and then describe its adaptation for dynamic clustering. The description will lack detail since the thrust of this work is the clustering results, not this particular implementation. A classic suffix tree contains all suffixes of a string of letters. A suffix of a string is defined as the substring from a given letter to the end of the string. The prefix is the rest of the string (the "front end", where the suffix would be the "rear end").
The figure shows all the suffixes of the string abcabf, arranged in a suffix tree. The path from each leaf node to the root is labeled such that the concatenation of the edge labels equals one suffix of the string, without duplication.
As depicted in figure 5, every suffix of the string is represented once in the tree, as the concatenation of the edge labels from a leaf node to the root. Interior nodes are inserted where two suffixes share a common portion. Although not shown here, the nodes are labeled with pointers into the string. Once the tree is built, it can be reused for as many string matches as desired. Typically, the string to be matched against the tree is much smaller than the length of the string encoded in the tree. There are several implementations that allow a suffix tree to be built in linear time.
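The set of suffixes such a tree encodes can be enumerated directly. The following Python sketch is purely illustrative (it is not part of the implementation described here); it lists the suffixes of the example string from figure 5:

```python
def suffixes(s):
    """Return every suffix of s, from the full string down to the last letter."""
    return [s[i:] for i in range(len(s))]

# All suffixes of the example string from figure 5.
print(suffixes("abcabf"))
```

Each of these six strings corresponds to one leaf-to-root path in the tree of figure 5.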
The suffix tree as described so far is made up of letters. One of the central adaptations for dynamic clustering is to change from letters to arbitrary tokens; in this case, the words of the documents are taken to be the tokens. Each document to be clustered is broken up into words, and then its suffixes are added to the tree. In this way, the tree can be built incrementally as the results stream in. Information on which strings match which prefixes is added to the appropriate nodes during construction.
Clusters are found by examining the strings (of words) that share a node. There is a weighting procedure that eliminates words or phrases that are too common or not common enough. After all, if a word were to appear in every cluster, the clustering would be of no value for discriminating that word from others. Similarly, if a word were not shared by other documents, there would be little point in clustering it.
Next, the clusters are compared pairwise to see which ones share documents. Again, there is a scoring procedure. Clusters that survive this rating are considered "connected", and these connected components of the entire graph of clusters form the superclusters that are to be presented to the user. Again, once enough of the tree has been built, the clusters and superclusters can be formed, and simply modified slightly as more results stream in. However, in the work reported here, streaming is not used.
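The merging step can be sketched as a connected-components computation. The Python below is an illustrative rendering only; the function names and the simple document-sharing threshold are invented for the example, and the actual scoring procedure is more elaborate:

```python
def superclusters(clusters, min_shared=1):
    """Merge clusters that share at least min_shared documents into
    connected components (superclusters), using a simple union-find."""
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect every pair of clusters that shares enough documents.
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if len(clusters[i] & clusters[j]) >= min_shared:
                parent[find(i)] = find(j)

    # Collect the documents of each connected component.
    groups = {}
    for i in range(len(clusters)):
        groups.setdefault(find(i), set()).update(clusters[i])
    return list(groups.values())

# Clusters {d1, d2} and {d2, d3} share document d2, so they merge;
# cluster {d9} stands alone.
print(superclusters([{"d1", "d2"}, {"d2", "d3"}, {"d9"}]))
```

Note that the pairwise comparison is quadratic in the number of base clusters, which is normally far smaller than the number of documents.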
The final step is to compute a good title for each of the superclusters. This is not a trivial procedure, and the referenced material does not give details of how it should be accomplished. The title is selected from the matching phrases of a supercluster, but because a supercluster can include documents that do not share a phrase with one or more of the other members, there may be a number of candidate phrases.
In particular, a full suffix tree is not actually built, and matching is not done by traversing the tree. Instead, the suffixes are inserted into an associative array (which we will call a dictionary from here on). In effect, each suffix is attached to the root. This approach allows testing for a match by a simple and fast dictionary lookup instead of by a tree traversal. The dictionary entry for each key also contains a list of all documents that have a phrase (sequence of one or more words) that matches the key.
However, since the tree is not actually traversed, arbitrary subphrases cannot be found. To make up for this, all possible subphrases of each title are inserted into the dictionary. Thus matches can be found by using each phrase as a key into the dictionary. This design is easy to code but does not scale well. However, it is adequate for the exploratory work reported here.
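A minimal Python sketch of this dictionary design follows. The names are invented, and the scoring and label-selection steps are omitted:

```python
def build_phrase_index(titles):
    """Map every contiguous word sequence (phrase) of every title to the
    set of document ids whose title contains that phrase."""
    index = {}
    for doc_id, title in enumerate(titles):
        words = title.lower().split()
        # Insert every subphrase, not just the suffixes, so that any
        # phrase can be found by a direct dictionary lookup.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = " ".join(words[start:end])
                index.setdefault(phrase, set()).add(doc_id)
    return index

index = build_phrase_index(["RDF Primer", "RDF Query Languages"])
print(index["rdf"])   # both titles share the one-word phrase "rdf"
```

Because every subphrase of every title becomes a key, the index grows quadratically with title length; this is the scaling limitation mentioned above.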
Apart from the changes discussed above, the implementation follows the description in [Zamir, Etzioni 1998] as closely as possible.
Data modeled with the Resource Description Framework (RDF) follows a simple pattern. This paper does not attempt to give a comprehensive introduction to RDF. If one is needed, see [RDF Primer]. Here we summarize a few features that are essential for our theme.
The main essential feature for our purposes is that each bit of information in RDF consists of three parts, a subject, a property or predicate, and an object (also called a value or a property value). The three-part structure is known as a statement or a triple.
The subject and predicate of a statement are abstractions called resources. The object is either a resource or a literal value. A resource may stand for virtually anything, whether retrievable over a network or not. Resources are normally identified using a URI reference as a global identifier; the URI reference (we will just say URI even when we mean URI Reference) does not need to point to an actual retrievable resource. In some cases a resource will not have a global identifier. Such resources are known as blank nodes, or simply bnodes. The node terminology comes from viewing a collection of triples as an edge-labeled directed graph, where predicates are the edges, and an edge connecting a subject with an object points from the former to the latter.
One node may have more than one statement where it is the subject. For example, bnodes will commonly be the subject of numerous statements; they typically represent an unnamed collection point. For example, a row in a relational table might be modeled by a bnode; each column would become an edge, and each cell value would become an object.
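For instance (the table and its values are invented for illustration), a single row of an employee table could be modeled with a bnode labeled gen001 as the collection point:

gen001 name "Alice"
gen001 department "Sales"
gen001 badge-number "1234"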
Titles are normally phrases or short sentences. Here are some typical examples:
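"A Field Guide To The Semantic Web"
"DAML+OIL Definition in RDF"
"Browser Bookmark Management With Topic Maps"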
An RDF statement is essentially a simple sentence, and even shorter than most titles. Here are some examples, with URIs replaced by simple words to make them more readable:
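gw1 type gateway
gw1 operated-by mitretek
gen001 atis-provider california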
Natural language titles are subject to all the foibles of any natural language utterance, including variations in word order that would prevent clustering of titles that would otherwise be put together. RDF statements do not have such a problem. In addition, each part of a statement is but a single piece - either a URI or a literal value (although some literal values may contain several words). Various techniques are available to make URIs more readable, such as replacing the base of a URI with an XML entity or an XML namespace prefix.
So it seems reasonable that an attempt to cluster a collection of RDF statements might cluster rather effectively. If so, we could expect that statements relating to each subject, each predicate, and each object would be collected together. In addition, statements for each combination of (subject+predicate), etc., ought to cluster as well. In other words, the clustering algorithm ought to pull together everything directly known about any resource. This would amount to performing a whole series of queries like the following:
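Select triple where subj = gw1
Select triple where pred = atis-info-type
Select triple where obj = california
Select triple where subj = gw1 and pred = atis-info-type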
Bnodes would also be queried, with results something like the following:
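gw1 gets-atis-info gen001
gen001 atis-provider california
gen001 atis-info-type transit
gen001 atis-info-type incident
gen001 wsdl-location "http://www.example.org/calits/wsdl/xml"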
It is true that an ordinary RDF query language can pose and answer such questions, with the help of wildcards, but the results would not be grouped in useful ways. In addition, such a query language requires a query syntax, a query parser, and other complex components, none of which are needed with the dynamic clustering approach.
This section presents some preliminary results of applying the clustering procedure to RDF data. The URIs have been replaced by more readable words. The RDF fragment is based on a real RDF document; the readable words are generally taken from the final step of actual URIs. Bnodes have been given temporary identifiers like "gen002" for readability.
var orig_sources = [
    'gw1 type gateway',
    'gw1 operated-by mitretek',
    'gw1 gets-atis-info gen001',
    'gen001 atis-provider california',
    'gen001 atis-info-type transit',
    'gen001 atis-info-type incident',
    'gen001 wsdl-location "http://www.example.org/calits/wsdl/xml"',
    'gen002 atis-provider oregon',
    'gen002 atis-info-type 511',
    'gen002 atis-info-type incident',
    'gen002 atis-info-schedule "hourly"',
    'gw1 atis-info-format gen003',
    'gen003 atis-provider oregon',
    'gen003 display-format "map"',
    'gen003 display-format "HTML table"',
    ...]
It is easy to obtain data in this format by starting with the so-called N-Triples format. A number of RDF tools can output N-Triples data, and some simple processing with regular expressions will produce data in the format shown above.
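As a sketch of that conversion, the following Python fragment reduces one N-Triples line to the short form shown above. It is a simplification only: it keeps just the final path segment of each URI, and it ignores escaping, bnode syntax, and literals that contain angle brackets:

```python
import re

def simplify_ntriple(line):
    """Reduce a full N-Triples line to the short one-triple-per-line form:
    strip the trailing dot and replace each <URI> with its final segment
    (the part after the last '/' or '#')."""
    line = line.strip().rstrip(".").strip()
    return re.sub(r"<[^>]*[/#]([^/#>]+)>", r"\1", line)

nt = '<http://example.org/book_21> <http://example.org/terms/title> "A Field Guide To The Semantic Web" .'
print(simplify_ntriple(nt))
```

Running this on the line above yields the short form used earlier, book_21 title "A Field Guide To The Semantic Web".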
Figure 6 depicts a portion of the results.
A portion of the clustered RDF. The underlined, boldfaced words are the titles of the corresponding groups.
An examination of these results shows that the expectations of section 4.3 as to the characteristics of the results are largely met. Take, for instance, the group with the title "display-format". The five statements in this group represent everything known to this dataset about the property "display-format". Notice that "display-format" appears in the predicate position as well as the subject position in this group. It would be hard to write a wild-carded query with a conventional query language that would produce such a result. This type of grouping lets one rapidly scan the data set in a particularly efficient way.
Similarly, note the group with the title "gen001". Notice that in the last of the five statements, "gen001" appears in the object (last) position while it appears in the subject (first) position in the others. An English rendering of the meaning of these statements would say that "gw1" (which happens to be a gateway) receives a collection of specialized information, which collection is represented by the bnode labeled "gen001". The gen001 collection includes data of types incident and transit, is provided by the state of California, and is available in a manner described in a WSDL file located at the given URL.
Other clustering runs have been made using other RDF datasets (not shown here). It has proved useful to use entity-style abbreviations for the base URI of resource identifiers, just to make the clustered results more readable.
The results of the example described here are much as if many queries had been run simultaneously. The results were clustered automatically into 16 fairly cogent groups. The cluster procedure in effect ran queries similar to the following:
Select triple where subj = gen002
as well as queries similar to the following:
Select triple where subj = gen002 and pred = atis-info-type
(plus obvious permutations of the above).
In its current form, the clustering procedure is valuable when one wants to analyze a smallish collection of RDF. One can rapidly discover which resources are in use and their main properties. It would be well-suited for clustering query results, especially when the queries contain many wildcards. Doing so would help organize the query results in particularly useful ways.
For example, one could easily
It would also be easy to format the results as XML for further analysis. Clearly, such clustering can produce well-focused subgraphs, reminiscent of Stickler's URIQA protocol [URIQA]. URIQA returns a subgraph of information for a specified subject to a greater depth where bnodes are involved, but much of the additional information is included in the clustered results, although not organized the same way.
Potential extensions exist, some fairly obvious and others more speculative and more ambitious.
There are a variety of ways one could tune the algorithm. One could save the tree and later match it against specific triples or sets of triples to see if they are included in the data set. One could filter the triples during tree construction, to reduce its size and restrict it to certain areas of interest. The rules for scoring meta clusters could be tuned to achieve specific goals more effectively.
And, of course, one could upgrade the implementation to use a proper suffix tree. Although obvious, this probably should not be counted as "straightforward".
One can imagine applications where the user clicks on a cluster title and the corresponding subgraph lights up in a display of the graph. Alternatively, only the subgraph might be shown. No doubt there are many other visualizations for which the clustering technique would be useful.
It is natural to try to apply the technique to other kinds of data sources, such as Topic Maps or Conceptual Graphs. It is not immediately clear how to proceed. The algorithm depends on regarding sentences, phrases, titles, or triples (all being essentially similar) as strings of tokens. Clearly, any tuple can be easily transformed into this form as long as its items do not contain internal structure. However, when the data of interest are structured, as is generally the case with Topic Maps and Conceptual Graphs, one needs to have a canonical way to serialize the structures of interest into strings of tokens.
No canonical serializations are currently available for either Topic Maps or Conceptual Graphs, at least none that would be guaranteed to produce a specific, repeatable order, to the best of the author's knowledge. Once such a serialization method has been defined, dynamic clustering could be used with such data sources.
A properly implemented suffix tree algorithm is known to be highly efficient for a wide range of difficult string matching problems, including partial matches. A range of variations allow one to make useful trade-offs between time and memory required to build or store the tree, and time or memory needed to perform different kinds of queries. The suffix tree family of algorithms is known to be capable of handling the very large string matching problems presented by DNA research.
It seems reasonable to wonder, then, whether a suffix tree might make a suitable primary data store for a large collection of RDF data. If this should prove feasible, then the matching, querying, and clustering benefits of the suffix tree would be available all the time, with no need to build a special tree when needed. While a suffix tree can be much larger than the strings that go into its construction, it is not obvious that such a tree would be larger than a more conventional triple store including all indexes necessary for good performance.
This paper has shown how a dynamic clustering algorithm, originally developed to help organize the results of Web searches, can be easily adapted to be useful for organizing RDF data. The value of the procedure is that it automatically groups together a great deal of information about each RDF resource in the dataset. An overview of the algorithm was discussed. An example of clustered data was shown, using a set of bookmark titles as the source.
The rationale for applying the algorithm to RDF data was explained, and an example of clustered RDF data was presented. Possible extensions of the method were discussed, leading to a speculation that the core data structure, the suffix tree, might prove to be suitable as the primary data store for a large RDF dataset.
[Gusfield 1997] Gusfield, Dan, "Algorithms on Strings, Trees, and Sequences", Cambridge University Press, ISBN 0-521-58519-8
[Passin 2003] Passin, Thomas B., "Browser Bookmark Management With Topic Maps", http://www.mulberrytech.com/Extreme/Proceedings/html/2003/Passin01/EML2003Passin01.html
[Zamir, Etzioni 1998] Zamir, O. and Etzioni, O., "Web Document Clustering: A Feasibility Demonstration", http://www.cs.washington.edu/homes/etzioni/papers/sigir98.pdf