From Metadata to Personal Semantic Webs

Eric Freese
eric.freese(a)lexisnexis.com

Abstract

This paper uses common applications to demonstrate how small-scale, or personal, semantic webs of information can be constructed by harvesting and grouping certain metadata. The web is grown as more metadata is added to it. Personal tagging schemes can be used to customize the web for each user. This paper will also suggest the use of larger taxonomies in the creation and interchange of the personal tagging. In doing so, these personal webs will be able to merge into larger webs as well as increase the semantic searching capabilities.

A larger scale creation of semantic webs will also be proposed in an environment where web service interfaces are used to gather and organize metadata. These large scale webs would demonstrate the scalability of the personal webs as well as the importance of uniform URIs that allow data to be grouped in a consistent manner.

Keywords: Metadata; Semantic Web

Eric Freese

Eric Freese is a consulting software engineer with LexisNexis. He has 18 years of experience in the areas of document, information, and knowledge management with specific expertise in the development and implementation of XML technologies. His experience includes research, analysis, specification, design, development, testing, implementation, integration and management of information systems in a wide range of environments. He has significant research experience in human interface design, graphics interface development and artificial intelligence. Freese was a founding member of TopicMaps.Org, the organization that developed the XML Topic Maps (XTM) specification, and served as the chairman of this group. He is also the chief architect and developer of SemanText, an open source application that uses topic maps to harvest and manage knowledge.

Extreme Markup Languages 2006® (Montréal, Québec)

Copyright © 2006 Eric Freese. Reproduced with permission.

Introduction

In Tim Berners-Lee's 2001 article on the Semantic Web which appeared in Scientific American, he described an environment where devices throughout a home or office are interconnected and aware of each other, online software agents are able to find very specific information on the Internet, and users are able to collaborate based on the information being presented. While parts of the scenario are possible today, others do not seem any closer than they did when the article was published.

For example, Bluetooth technology has made it possible to design devices that can become aware of each other to form a "personal area network". Therefore the part of the scenario where the volume on the entertainment system is reduced when the phone rings is possible right now.

However, from that point in the scenario we seem to move rapidly from science reality to science fiction. The use, and possibly the very existence, of software agents seems very limited at this point in time. Also limited is the amount of uniformly encoded information such agents could operate on. Without a mass of data to process, the utility of the agents is limited. Without agents, the impetus to semantically mark data is reduced. While this could rapidly degenerate into a chicken and egg argument, one can be reminded of the famous line from the movie "Field of Dreams" - "If you build it, they will come".

Berners-Lee also states that for the Semantic Web to work, data needs to be identified semantically. Users need to be able to express the meaning of the data they are putting on the web in order to enable computers and people to work in cooperation. While this may be a true statement, how many users will take the time to do this? How many of those users will construct and publish complete ontologies that might allow agents to be able to use the data?

HTML helped launch the web explosion because it is simple to use and requires little training in order to successfully produce a product. For the Semantic Web to experience such success, the creation and organization of information must not only be as easy as writing HTML but must also give users a reason to expend the extra effort.

Several W3C recommendations are available that would seem to enable the creation and organization of the information needed to develop agents, including RDF and related standards such as RSS, OWL, and SKOS. However, there hasn't been a groundswell of web sites marking up their data using these standards. Granted, there are thousands of RSS feeds available and even a few aggregators that are able to process these feeds. However, these still seem very specialized. Might the case be different if semantic markup were easier to create?

We know people are willing to tag their information in a way that is useful for them to find it. This tagging also makes it findable by others. Millions of people work collectively to tag pictures uploaded to Flickr or videos uploaded to YouTube. Del.icio.us does the same thing for bookmarks.

Perhaps the key is in taking advantage of user generated content. Many common computer users might be intimidated by the prospect of creating a markup scheme and applying it in a consistent manner. However, if people are simply asked to label things so that these things will be easy to find in the future, they might be more comfortable. By getting a community of users to buy into an idea, sites such as those mentioned previously have grown to tremendous proportions. But it doesn't stop there. Once users generate the content, they are also asked to organize it. Flickr and del.icio.us allow users to invent tags, usually common words or phrases, and apply them to their contributions. When other users use the same tags, items start becoming connected based on the common tagging. People use the tags they like and ignore the tags they don't. As more and more people use common tags, entire webs of tags start to appear. These collections of tags are known as "folksonomies". All this happens without a bunch of computer scientists and knowledge engineers getting together to define ontologies. Sites such as these have demonstrated that the web tends to bow to the wisdom of crowds. That's why many people believe that an army of bloggers can provide an alternative to mainstream journalists, and that if millions of eyes monitor encyclopedia entries that anyone can write and rewrite (i.e. Wikipedia), the result could very well take on Britannica.

What if we took user generated content one step further? What if, in addition to user created content, it was possible to grab the metadata that is hidden inside files already and add it to the mix? This would provide the beginnings of personal semantic webs. A personal semantic web (PSW) applies the same principles of Tim Berners-Lee's concepts, but on a local scale. Metadata is identified semantically in such a way as to allow computer applications to recognize and process it. Files on a local computer or web site could be organized along with the information on the Internet, such as Flickr or YouTube, in order for users to manage their own pieces of the World Wide Web. Their personal semantic webs grow by adding new bits of information as the users encounter it.

Applications work with metadata constantly. Multimedia files such as MP3 or MPEG contain information about the creator, title, source, length, or genre of particular files. RSS feeds contain title, author and categorization info, brief synopses, as well as links to the full text of documents. Tags can be applied to digital images.

The next section of this paper discusses a system that allows users to create their own semantic webs based on fairly common applications. In addition, larger data sets can be used to enable more semantically accurate matching to occur, further enhancing the semantic quality of the user created tagging being employed.

The final portion of this paper takes the previously introduced concepts and applies them to a more industrial weight application. It is based on a legal research scenario combined with other commercially available web service interfaces. This is intended to show the scalability and applicability of such a solution to a wider arena.

Semetag - Semantic Metadata Aggregator

Semetag is a Swing-based application written in Java. It uses the Jena Semantic Web framework to manage the metadata stored within the application. A JDBC-enabled database (MySQL) is used to allow persistence of the data models. Jena includes an RDF API, an OWL API, query capabilities using the RDQL language and a rule-based inference engine. In addition to Jena, an AIML (Artificial Intelligence Markup Language) engine is integrated into the system allowing users to interact with the system via natural language conversations. Future updates will also employ speech recognition and text-to-speech abilities in order to provide information through voice interaction.

The basic Semetag application consists of a set of common desktop tools including an MP3 player, an MPEG player, a POP3 email client, an RSS reader, an image slideshow viewer, a Jabber instant messaging client and a web browser. Each of these tools is open source software written in Java. They have each been modified to work in an integrated desktop environment and have been RDF enabled. In this environment, "RDF enabled" means that when the application is made aware of a file or web page or RSS channel, RDF statements are made about the new resource and added to the collected knowledge base. The integrated environment allows the applications to run separately or in conjunction with each other. For example, when the user is listening to an audio file, the image viewer can be set to show the album art of the album from which the song came or pictures of the performer, the web browser can be requested to download the lyrics for the song from a particular website or be pointed to a website for tickets to a concert. In another example, the user can set up the RSS reader with a set of keywords. The reader software can then monitor the incoming RSS feeds and alert the user when new items appear that are relevant. The email client and Jabber client can also alert the user based on settings such as subject keyword or sender.

The AIML engine, when activated, allows the interaction with the system to occur in natural language. So a user can ask the chatbot to check the latest sports news and the system would know to look for an RSS feed on sports. If new items were located and the speech engine is turned on, the chatbot would ask the user whether they would like to hear the stories or read them, and would present the information based on the user's response.

Using Metadata to Seed a Personal Semantic Web

As described previously, a set of common desktop applications has been integrated into the Semetag application. Each of these applications uses metadata associated with data files in order to build RDF statements that are combined into a PSW. In addition, users have the ability to create custom statements for each resource as desired.

The following sections will explain the metadata that is harvested from different types of files. Within Semetag, an overall OWL ontology is defined to provide the background knowledge and connectivity for the resources added to the PSW. Special subclasses are defined to allow applications using a specific resource to automatically access other resources and applications based on user preferences. A portion of the main ontology is included in Appendix A.

Slideshow Viewer

The slideshow viewer is able to process JPEG and GIF files. Playlists can be built that allow users to group and play sets of files.

When the user selects a file or URL or adds it to a playlist, it becomes a resource in the PSW. The URI for the file becomes its identifier. The snippet of RDF below illustrates the statements made about an image file when added to the Semetag application.

<Image rdf:about="file:/U:/My%20Pictures/caguilera_13.jpg">
  <altLocation rdf:resource="http://www.christinaaguilera.com"/>
  <Subject>
    <foaf:Person rdf:ID="ChristinaAguilera">
      <foaf:name>Christina Aguilera</foaf:name>
    </foaf:Person>
  </Subject>
  <Format>JPEG</Format>
  <Description>This is a picture of Christina Aguilera.</Description>
  <Keyword>Latina</Keyword>
  <Keyword>singer</Keyword>
  <Keyword>blonde</Keyword>
  <Keyword>Mickey Mouse Club</Keyword>
  <partOfPlaylist rdf:resource="file:/U:/My%20Pictures/xtina.pix"/>
</Image>

The statements shown above say the following about an image file resource: the file is an image, the file has a specific URI, it can also have an alternate location, the main subject of the image is a specific person, the file has a specific format, the image has a description and a set of keywords, and the file is part of a specific playlist. Because of the inverse properties defined in the main ontology, we can also make the following statement: the playlist includes the file.
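
The effect of an inverse property can be illustrated with a minimal sketch. It is written in Python with a plain set of triples rather than Jena and OWL, which Semetag actually uses, and the property name includesFile is a hypothetical inverse of partOfPlaylist chosen only for this illustration.

```python
# Minimal in-memory triple set; Semetag itself relies on Jena and an OWL ontology.
# "includesFile" is a hypothetical name for the inverse of partOfPlaylist.
INVERSE_OF = {
    "partOfPlaylist": "includesFile",
    "includesFile": "partOfPlaylist",
}


def add_statement(triples, subject, predicate, obj):
    """Add a statement and, if the predicate has a declared inverse, the inverse too."""
    triples.add((subject, predicate, obj))
    inverse = INVERSE_OF.get(predicate)
    if inverse is not None:
        triples.add((obj, inverse, subject))


psw = set()
image = "file:/U:/My%20Pictures/caguilera_13.jpg"
playlist = "file:/U:/My%20Pictures/xtina.pix"

add_statement(psw, image, "Format", "JPEG")
add_statement(psw, image, "partOfPlaylist", playlist)
```

After the two calls, the set holds three statements: the format, the playlist membership, and the derived statement that the playlist includes the file.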

The playlist can include local files as well as URLs, so personal slide shows can include web pictures without uploading everything to Flickr. These statements can be extended by users. One possible scenario includes using the keywords to link to Flickr tags.

Audio Player

The audio player is able to process MP3, AU and AIFF files. Playlists can be built that allow users to play sets of files.

When the user plays a file or adds it to a playlist, the file and the playlist become resources in the PSW. The URIs for the files become their identifiers. The following statements can be made about the resource:

  • file has title,
  • file has performer (and performer performs song),
  • file appears on album (and album includes song),
  • file has genre,
  • file has release date,
  • file is part of playlist (and playlist includes file),
  • file has specific format,
  • file has alternate location,
  • file has keywords,
  • performer participates in album,
  • album has release date.

If resources do not exist for the performer, album or genre, resources are created for them also. This is done to allow for additional statements to be made about these resources. Users can also add other statements such as ratings, and other albums on which the file might have appeared. The statements above can be used to connect to other music information sources such as MusicBrainz or AudioScrobbler. These sources can provide additional background information about songs, artists and albums.
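
The following sketch shows one way the statements listed above could be generated from tag metadata, including the creation of performer, album and genre resources when they do not yet exist. It is an illustration in Python, not the Semetag implementation; the urn:example URI scheme, the predicate names, and the sample tag values are all assumptions.

```python
from urllib.parse import quote


def resource_uri(kind, name):
    # Hypothetical URI scheme for resources that have no file or web address.
    return "urn:example:" + kind + ":" + quote(name)


def audio_statements(file_uri, tags, known):
    """Build statements for one audio file.

    `tags` is a dict of metadata read from the file; `known` is the set of
    resource URIs that already exist in the PSW and is updated in place.
    """
    statements = []
    uris = {}
    for kind in ("performer", "album", "genre"):
        name = tags.get(kind)
        if not name:
            continue
        uri = resource_uri(kind, name)
        uris[kind] = uri
        if uri not in known:
            # Create the resource so further statements can be made about it.
            known.add(uri)
            statements.append((uri, "name", name))
        statements.append((file_uri, kind, uri))
    if tags.get("title"):
        statements.append((file_uri, "title", tags["title"]))
    if "performer" in uris and "album" in uris:
        statements.append((uris["performer"], "participatesIn", uris["album"]))
    return statements


known = set()
tags = {"title": "Example Song", "performer": "Example Artist", "album": "Example Album"}
first = audio_statements("file:/music/track01.mp3", tags, known)
second = audio_statements("file:/music/track02.mp3", tags, known)
```

The second call produces fewer statements than the first because the performer and album resources already exist by then.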

Special RDF statements can be made about audio files and playlists that connect them to other modules within Semetag. For example, if such a statement were made for an audio playlist, a slideshow playlist could be started within the slideshow application under the control of the audio module. In addition to being RDF-enabled, the audio player has other modifications to allow it to participate in the integrated Semetag environment. If desired, each track can be introduced before it is played via a text-to-speech module. Also, if the video player module is playing when the audio module is started, the video player can be paused or muted, based on user preferences.

Video Player

The video player is able to process MPEG files. When the user plays a file it becomes a resource in the PSW. The URI for the file becomes its identifier. The following statements are made about the resource:

  • file has specific format,
  • file has alternate location,
  • file has keywords,
  • file has subject,
  • file has description.

Users can also add other statements such as ratings, links to the Internet Movie Database, etc. As mentioned above, this module is also integrated into the overall application and can cooperate with the other modules to avoid possible conflicts and to enhance the overall user experience.

RSS Reader

The RSS reader allows RSS and Atom news channels to be collected within the system.

When the user adds a new channel into the system, it becomes a resource in the PSW. The URI for the channel becomes its identifier. Additionally, each article within each feed has a URI which can then be used as an identifier. The following statements are made about the resources:

  • channel has name,
  • channel has description,
  • channel has language,
  • channel contains news items,
  • news item has title,
  • news item is from a specific news feed,
  • news item was sent on a particular date,
  • news item has description,
  • news item has publisher,
  • news item has source.
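
As a rough illustration, most of the statements listed above could be derived from an RSS 2.0 document as sketched below. The sketch uses Python's standard XML parser; the channel content is made up, the predicate names are assumptions, and publisher and source are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Made-up channel used only for this sketch.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Sports News</title>
    <link>http://example.org/sports</link>
    <description>Example channel description.</description>
    <language>en</language>
    <item>
      <title>Example headline</title>
      <link>http://example.org/sports/1</link>
      <pubDate>Mon, 07 Aug 2006 09:00:00 GMT</pubDate>
      <description>Example synopsis.</description>
    </item>
  </channel>
</rss>"""

CHANNEL_FIELDS = (("title", "name"), ("description", "description"), ("language", "language"))
ITEM_FIELDS = (("title", "title"), ("pubDate", "date"), ("description", "description"))


def rss_statements(channel_uri, rss_text):
    """Return (subject, predicate, object) statements for a channel and its items."""
    channel = ET.fromstring(rss_text).find("channel")
    statements = []
    for tag, predicate in CHANNEL_FIELDS:
        value = channel.findtext(tag)
        if value is not None:
            statements.append((channel_uri, predicate, value))
    for item in channel.findall("item"):
        item_uri = item.findtext("link")  # the item's link serves as its identifier
        if item_uri is None:
            continue
        statements.append((channel_uri, "containsItem", item_uri))
        statements.append((item_uri, "fromFeed", channel_uri))
        for tag, predicate in ITEM_FIELDS:
            value = item.findtext(tag)
            if value is not None:
                statements.append((item_uri, predicate, value))
    return statements


channel_uri = "http://example.org/sports/rss.xml"
stmts = rss_statements(channel_uri, SAMPLE_RSS)
```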

Within the news items a natural language processor can be used to detect keywords which can be associated with the articles. These keywords can be added much like the tags in Flickr. Users can also add other statements as desired.

As with many of the other modules, the text-to-speech module can be used to announce the retrieval of news items that contain specific keywords identified by the text processor. The integration can be set to interrupt immediately if another audio enabled module is running, or to wait until the next available opening.

Instant Messenger (Jabber) Client

The Jabber client allows users to communicate with other users on the Internet by typing messages back and forth to each other in real time.

When the user adds a contact to the Jabber client, the contact becomes a resource in the PSW. The contact's chat name becomes the resource's identifier. The following statements are made about the resource:

  • contact has chat name,
  • chat name is used on service,
  • contact was last online.

A natural language processor can be used to detect keywords within the conversation which can be associated with the contact. These can be added much like the tags in Flickr. Users can also add other statements such as contact information, etc.

Email Client

The Email client allows users to read email messages from a POP3 mailbox. The module does not allow email messages to be sent.

When the user reads an email message, the sender and other recipients become resources in the PSW. The contact's email address becomes the resource's identifier. The following statements are made about the resource:

  • message has sender,
  • message also sent to,
  • message was sent on a particular date,
  • message has subject.

A natural language processor can be used to detect keywords within the messages which can be associated with the message. These can be added much like the tags in Flickr. Users can also add other statements as desired. More advanced processing could allow summaries of email messages to be stored in the PSW.
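
The header-based statements listed above can be sketched with Python's standard email package. The message below is invented for the example, the predicate names are assumptions, and using a mid: URI built from the Message-ID header as the message identifier is a choice made for this sketch rather than something Semetag is known to do.

```python
from email import message_from_string
from email.utils import getaddresses

# Invented message used only for this sketch.
RAW = """From: Alice Example <alice@example.org>
To: Bob Example <bob@example.org>, carol@example.org
Subject: Example subject
Date: Mon, 07 Aug 2006 09:00:00 -0400
Message-ID: <1234@example.org>

Body text.
"""


def email_statements(raw):
    """Return statements about a message; contacts are identified by mailto: URIs."""
    msg = message_from_string(raw)
    msg_uri = "mid:" + msg["Message-ID"].strip("<>")
    statements = []
    for _, address in getaddresses(msg.get_all("From", [])):
        statements.append((msg_uri, "sender", "mailto:" + address))
    recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
    for _, address in getaddresses(recipients):
        statements.append((msg_uri, "alsoSentTo", "mailto:" + address))
    statements.append((msg_uri, "date", msg["Date"]))
    statements.append((msg_uri, "subject", msg["Subject"]))
    return statements


mail_stmts = email_statements(RAW)
```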

Web Browser

The web browser is able to process HTML files.

When the user browses to a URL, it becomes a resource in the PSW. The URL becomes the identifier. The following statements are made about the resource:

  • page contains links to other URLs,
  • page contains links to images,
  • page was target of link from another page.

A natural language processor can be used to detect keywords within the web pages which can be associated with the pages. These can be added much like the tags in Flickr. Users can also add other statements as desired.
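
A minimal sketch of the link harvesting described above, using Python's standard HTML parser, is given below. The predicate names linksTo, linkedFrom and linksToImage are assumptions, as are the example URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkHarvester(HTMLParser):
    """Collect link and image statements for a single page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.statements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL so identifiers are absolute.
            target = urljoin(self.page_url, attrs["href"])
            self.statements.append((self.page_url, "linksTo", target))
            self.statements.append((target, "linkedFrom", self.page_url))
        elif tag == "img" and attrs.get("src"):
            image = urljoin(self.page_url, attrs["src"])
            self.statements.append((self.page_url, "linksToImage", image))


harvester = LinkHarvester("http://example.org/index.html")
harvester.feed('<p><a href="about.html">About</a> <img src="/img/logo.gif" alt=""></p>')
page_statements = harvester.statements
```

Recording the linkedFrom statement at the same time as linksTo means the "target of link" statement is available even before the target page itself has been visited.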

Other Possibilities

Any number of other sources of data are candidates for inclusion in an application such as Semetag. For example, the XML format within the OpenOffice standard could be used to add data to a PSW. Web service interfaces to other data collections can also be used to gather information to be added to the PSW. This will be discussed later in this paper.

Adding Ontological References to the Personal Semantic Web

This section will discuss how the PSW can be expanded by using standard URIs from larger ontologies based on WordNet, Wikipedia, or Cyc. The hook into the larger ontologies could be used as the Rosetta Stone for many PSWs when they are combined. These combined PSWs can then become the beginnings of the larger Semantic Web as described by Berners-Lee.

WordNet

WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller.

Alvaro Graves has developed an RDF representation of WordNet as part of his work toward a Master of Computer Science degree at the University of Chile.

The RDF version of WordNet can be used to apply the semantic definition of words to the keywords identified in the various modules. Recall the example provided in the description of the slideshow module. Several keywords including "blonde", "singer", and "Latina" were identified. By connecting these keywords with WordNet definitions we are able to alleviate any ambiguity.

The markup below shows the entries for the keyword "blonde":

  
  <rdf:Description rdf:about="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#blonde">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#Word"/>
      <wns:hasWordForm>blonde</wns:hasWordForm>
  </rdf:Description>
  <rdf:Description rdf:about="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#blond">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#Word"/>
      <wns:hasWordForm>blond</wns:hasWordForm>
  </rdf:Description>

  <rdf:Description rdf:about="#109231893blonde">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#blonde" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109231893" />
  </rdf:Description>
  <rdf:Description rdf:about="#109231893blond">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#blond" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109231893" />
  </rdf:Description>

  <rdf:Description rdf:about="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109231893">
      <wns:hasGloss>(a person with fair skin and hair)</wns:hasGloss>
  </rdf:Description>

  <rdf:Description rdf:about="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109231893">
      <wns:hyponymOf rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>
  <rdf:Description rdf:about="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109741776">
      <wns:hyponymOf rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109231893" />
  </rdf:Description>

  <rdf:Description rdf:about="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026">
    <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#NounSynSet" />
  </rdf:Description>

  <rdf:Description rdf:about="#100006026person">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#person" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>
  <rdf:Description rdf:about="#100006026individual">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#individual" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>
  <rdf:Description rdf:about="#100006026someone">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#someone" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>
  <rdf:Description rdf:about="#100006026somebody">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#somebody" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>
  <rdf:Description rdf:about="#100006026mortal">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#mortal" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>
  <rdf:Description rdf:about="#100006026human">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#human" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>
  <rdf:Description rdf:about="#100006026soul">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#soul" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#100006026" />
  </rdf:Description>

  <rdf:Description rdf:about="#109741776peroxide_blond">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#peroxide_blond" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109741776" />
  </rdf:Description>
  <rdf:Description rdf:about="#109741776peroxide_blonde">
      <rdf:type rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/schema#WordSense" />
      <wns:hasWord rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#peroxide_blonde" />
      <wns:hasSynSet rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/synsets#109741776" />
  </rdf:Description>

This markup shows that there are two spellings for the word. It defines the word as "a person with fair skin and hair". It also states that more general terms for a blonde include: person, individual, someone, somebody, mortal, human, and soul. A more specific term is "peroxide blonde". By modifying the line

  <Keyword>blonde</Keyword>

to

  <Keyword rdf:resource="http://wordnet.princeton.edu/~agraves/wordnet/0.9/words#blonde"/>

we will be able to incorporate all the semantic knowledge contained within the WordNet system into a PSW. This will also facilitate the combination of PSWs by creating common merge points that are semantically consistent.
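
To make the benefit concrete, the sketch below walks from a keyword's WordNet URI to its more general terms. It is a Python illustration over a handful of the statements shown in the markup above, reduced to plain triples with shortened predicate names; only two of the seven word senses of the parent synset are included to keep it short, and the helper functions are hypothetical.

```python
WN = "http://wordnet.princeton.edu/~agraves/wordnet/0.9/"

# A subset of the statements from the markup, as (subject, predicate, object).
TRIPLES = [
    ("#109231893blonde", "hasWord", WN + "words#blonde"),
    ("#109231893blonde", "hasSynSet", WN + "synsets#109231893"),
    (WN + "synsets#109231893", "hyponymOf", WN + "synsets#100006026"),
    ("#100006026person", "hasWord", WN + "words#person"),
    ("#100006026person", "hasSynSet", WN + "synsets#100006026"),
    ("#100006026human", "hasWord", WN + "words#human"),
    ("#100006026human", "hasSynSet", WN + "synsets#100006026"),
]


def objects(subject, predicate):
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def broader_words(word_uri):
    """Return the words of every synset that the word's synsets are hyponyms of."""
    broader = set()
    for sense in subjects("hasWord", word_uri):
        for synset in objects(sense, "hasSynSet"):
            for parent in objects(synset, "hyponymOf"):
                for parent_sense in subjects("hasSynSet", parent):
                    broader.update(objects(parent_sense, "hasWord"))
    return broader


broader = broader_words(WN + "words#blonde")
```

With such a lookup, a search for images of a "person" could also match the image tagged with the WordNet URI for "blonde", which a plain string keyword would not allow.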

Wikipedia

Wikipedia³ is a conversion of the English Wikipedia into RDF done by Danny Ayers at System One in Vienna, Austria. It is a dataset containing around 47 million triples and will be updated monthly at some point in the future. The creation of the dataset was motivated by several factors, one being the desire to have more real-world RDF datasets of reasonable size. Wikipedia assembles a wealth of information created and maintained by people all over the globe - opening up that rich pool of data or even only a small part of it to the semantic web seems like a worthy pursuit. The dataset currently combines structural information like link and category relationships with basic per-page metadata. At the moment, only Wikipedia pages in the Article (NS 0) and Category (NS 14) namespaces are extracted. Further, "internal link" and "redirects to" relations are limited to targets in those two namespaces. Page text is not extracted at the moment, as this severely increases the dataset's size, but the intent is to provide a separate dataset containing only text triples in the future.

Wikipedia contains an entry for Christina Aguilera at http://en.wikipedia.org/wiki/Christina_aguilera. The statements shown below are a portion of the entry for this performer.

  
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Christina_Aguilera">
        <rdf:type rdf:resource="http://www.systemone.at/2006/03/wikipedia#Article"/>
        <dc:title>Christina_Aguilera</dc:title>
        <dc:contributor>YurikBot</dc:contributor>
        <dc:modified>2006-06-25 01:14:07.0</dc:modified>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AArticles_with_unsourced_statements"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3A1980_births"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AAmerican_child_singers"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AAmerican_female_singers"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AAmerican_pop_singers"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AAmerican_singer-songwriters"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3ABlue_eyed_soul"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AChristina_Aguilera"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3ADisney_child_actors"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AEcuadorian-Americans"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AGrammy_Award_winners"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3ALiving_people"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AMouseketeers"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AMTV_Music_Award_Winners"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3APeople_from_Pittsburgh"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3ARCA_Records_musicians"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3ARhythmic_Top_40_acts"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AStaten_Islanders"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3ASuper_Bowl_halftime_performers"/>
        <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AWhistle_register_singers"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/1980"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/1990"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/Staten_Island"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/New_York"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/Pittsburgh%2C_Pennsylvania"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/Wexford%2C_Pennsylvania"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/Texas"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/Spanish_language"/>
        <wiki:internalLink rdf:resource="http://en.wikipedia.org/wiki/March_15"/>
  </rdf:Description>

By adding a new statement hooking the Wikipedia entry to the resource, a larger set of information can be added to the PSW.
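
A minimal sketch of this hook follows. It links the person resource from the slideshow example to the Wikipedia article and then reads the article's categories from a two-entry excerpt of the markup above. The namespace URIs bound to the rdf: and skos: prefixes are assumed to be the standard ones, since the excerpt in this paper does not show its namespace declarations, and the use of rdfs:seeAlso as the linking property is likewise an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"  # assumed binding for the skos: prefix

FRAGMENT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Christina_Aguilera">
    <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AGrammy_Award_winners"/>
    <skos:subject rdf:resource="http://en.wikipedia.org/wiki/Category%3AMouseketeers"/>
  </rdf:Description>
</rdf:RDF>"""


def categories(rdf_xml, article_uri):
    """Return the decoded category names attached to one article."""
    root = ET.fromstring(rdf_xml)
    found = []
    for description in root.findall("{%s}Description" % RDF):
        if description.get("{%s}about" % RDF) != article_uri:
            continue
        for subject in description.findall("{%s}subject" % SKOS):
            uri = subject.get("{%s}resource" % RDF)
            # Keep the last path segment and decode %3A back to a colon.
            found.append(unquote(uri.rsplit("/", 1)[-1]))
    return found


article = "http://en.wikipedia.org/wiki/Christina_Aguilera"
# Hypothetical hook from the PSW resource to the Wikipedia entry.
hook = ("#ChristinaAguilera", "http://www.w3.org/2000/01/rdf-schema#seeAlso", article)
cats = categories(FRAGMENT, article)
```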

TAP

TAP is a distributed project involving researchers from the Knowledge Systems Laboratory at Stanford, Knowledge Management Group at IBM Almaden and W3C's Semantic Web Advanced Development Initiatives. TAP is a shallow but broad knowledge base containing basic lexical and taxonomic information about a wide range of popular objects. The goal is to bootstrap the Semantic Web by providing a comprehensive source of basic information about popular objects.

The KB currently includes knowledge about:

  • Music: Popular music musicians & groups, instruments, styles, composers
  • Movies: Top Movies, actors, television shows
  • Authors: Top book authors, classic books
  • Sports: Athletes, sports, sports teams, equipment
  • Autos: Auto & motorcycle types and models
  • Companies: Fortune 500 companies
  • Home Appliances: Different types of appliances and most well known brands
  • Toys: Different types of toys and most well known brands
  • Baby products: Different types of baby products and most well known brands
  • Places: Countries, states, cities, tourist attractions
  • Consumer electronics: Audio/Video, Communication, game equipment and titles and brands
  • Health: Diseases and common Drugs

This knowledge base is intended to complement, not replace, systems like Cyc, which have deep knowledge about basic, common-sense phenomena but lack knowledge about particulars. For example, Cyc knows a lot about what it means to be a Musician. If it is told that Yo-Yo Ma is a Cellist, it can infer that he probably owns one or more Cellos, plays the Cello often, etc. But it might not know that there is a famous Cellist called Yo-Yo Ma (it certainly does not know all the famous composers, classical instrumentalists and opera singers).

TAP is also not intended to replace the so-called "Upper Ontologies". It provides the other 95% that the upper ontologies don't provide. It should be possible to hoist the TAP on top of any of those other systems.

The TAP knowledge base happens to contain an entry for Christina Aguilera. The markup is shown below:

  <tap:Musician rdf:about="http://tap.stanford.edu/data/MusicianAguilera,_Christina">
    <rdfs:label xml:lang="en">Christina Aguilera</rdfs:label>
    <tap:genre rdf:resource="http://tap.stanford.edu/data/1990sMusicGenre"/>
  </tap:Musician>

  <rdfs:Class rdf:ID="http://tap.stanford.edu/data/Musician">
    <rdf:type rdf:resource="http://tap.stanford.edu/data/ProfessionalType"/>
    <tap:plural>Musicians</tap:plural>
    <rdfs:label xml:lang="en">Musician</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://tap.stanford.edu/data/Person"/>
    <rdfs:subClassOf rdf:resource="http://tap.stanford.edu/data/Artist"/>
    <tap:ebayMap>1049</tap:ebayMap>
    <tap:amazonMap>5174</tap:amazonMap>
  </rdfs:Class>

  <rdfs:Class rdf:ID="http://tap.stanford.edu/data/Person">
    <tap:plural>People</tap:plural>
    <rdfs:label xml:lang="en">Person</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://tap.stanford.edu/data/Agent"/>
  </rdfs:Class>

This markup classifies Christina Aguilera as a musician and associates her with the 1990s music genre. It further defines Musician as a subclass of both Person and Artist. It also provides additional metadata on the Musician class (the tap:ebayMap and tap:amazonMap values) so that it becomes possible to search more accurately within eBay and Amazon using their classification systems. By adding the statement:

  
  <rdf:Description rdf:about="#ChristinaAguilera">
    <owl:sameAs rdf:resource="http://tap.stanford.edu/data/MusicianAguilera,_Christina"/>
  </rdf:Description>
to the PSW, the local resource can be associated with the corresponding resource in the wider TAP knowledge base.

OpenCyc

OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine. OpenCyc can be used as the basis of a wide variety of intelligent applications such as:

  • rapid development of an ontology in a vertical area
  • email prioritizing, routing, summarization, and annotating
  • expert systems
  • games
to name just a few.

Release 0.9 of OpenCyc includes:

  • 47,000 concepts: an upper ontology whose domain is all of human consensus reality.
  • 306,000 facts about the 47,000 concepts, interrelating them, constraining them, in effect (partially) defining them.
  • A compiled version of the Cyc Inference Engine and the Cyc Knowledge Base Browser.
  • Documentation and self-paced learning materials to help users achieve a basic- to intermediate-level understanding of the issues of knowledge representation and application development using Cyc.
  • A specification of CycL, the language in which Cyc (and hence OpenCyc) is written.
  • A specification of the Cyc API for application development.

Release 1.0 (announced for April 2006 release, but not available as of the writing of this paper) will additionally include:

  • The entire 300,000+ term Cyc ontology, with over 1 million facts relating the terms to each other.
  • English strings corresponding to all concept terms, to assist with search and display.
  • English translations (strings) of all facts that ship with OpenCyc (the ability to automatically generate English translations for newly added facts is not included).

The markup below shows several Cyc statements that could be connected to Christina Aguilera. A Cyc engine would be able to use them to make statements and inferences about her.

    <owl:Class rdf:ID="Singer">
        <rdfs:comment></rdfs:comment>
        <guid>bd58e2cf-9c29-11b1-9dad-c379636f7270</guid>
        <rdf:type rdf:resource="#PersonTypeByOccupation"/>
        <rdf:type rdf:resource="#ConventionalClassificationType"/>
        <rdfs:subClassOf rdf:resource="#MusicalPerformer"/>
    </owl:Class>
    <owl:Class rdf:ID="MusicalPerformer">
        <rdfs:comment>A specialization of #$Artist-Performer, #$Musician and #$MusicPerformanceAgent. 
            Each instance of this collection is an individual person performing music for an audience. 
            For groups or bands, see #$MusicPerformanceAgent.</rdfs:comment>
        <guid>bd58ea5e-9c29-11b1-9dad-c379636f7270</guid>
        <rdf:type rdf:resource="#EntertainerTypeByActivity"/>
        <rdf:type rdf:resource="#ConventionalClassificationType"/>
        <rdfs:subClassOf rdf:resource="#SocialBeing"/>
        <rdfs:subClassOf rdf:resource="#Musician"/>
        <rdfs:subClassOf rdf:resource="#Artist-Performer"/>
    </owl:Class>
     <owl:Class rdf:ID="Musician">
        <rdfs:comment>A specialization of #$Artist. Each instance of this collection is an artist 
            whose medium is #$Sound.  Notable specializations of this collection include #$MusicalComposer 
          and #$MusicalPerformer.</rdfs:comment>
        <guid>bd58ac07-9c29-11b1-9dad-c379636f7270</guid>
        <rdf:type rdf:resource="#PersonTypeByOccupation"/>
        <rdf:type rdf:resource="#PersonTypeByActivity"/>
        <rdfs:subClassOf rdf:resource="#Artist"/>
    </owl:Class>
    <owl:Class rdf:ID="Artist-Performer">
        <rdfs:comment>A specialization of #$Artist and of #$Entertainer.  Each instance of this 
          collection is a person watched by the audience during some instance of 
            #$EntertainmentPerformance.  Note that this collection includes only individual performers 
            and not, for instance, musical groups (for that, see #$MusicPerformanceAgent).</rdfs:comment>
        <guid>bd58e310-9c29-11b1-9dad-c379636f7270</guid>
        <rdf:type rdf:resource="#PersonTypeByActivity"/>
        <rdfs:subClassOf rdf:resource="#Entertainer"/>
        <rdfs:subClassOf rdf:resource="#Artist"/>
        <conceptuallyRelated rdf:resource="#EntertainmentPerformance"/>
        <definingMt rdf:resource="#HumanSocialLifeMt"/>
        <genls rdf:resource="#Entertainer"/>
        <genls rdf:resource="#Artist"/>
    </owl:Class>
    <owl:Class rdf:ID="Entertainer">
        <rdfs:comment>#$Entertainer is a specialization of #$EntertainmentOrArtsProfessional.  Each 
            instance of #$Entertainer is a person whose job it is to entertain people, i.e., to perform 
            for an audience (live or via a recording) in an attempt to distract them from their worries 
          and make them laugh, cry, smile, get excited, etc.</rdfs:comment>
        <guid>c10ae4b8-9c29-11b1-9dad-c379636f7270</guid>
        <rdf:type rdf:resource="#PublicConstant-CommentOK"/>
        <rdf:type rdf:resource="#PublicConstant-DefinitionalGAFsOK"/>
        <rdf:type rdf:resource="#PersonTypeByOccupation"/>
        <rdf:type rdf:resource="#PublicConstant"/>
        <rdfs:subClassOf rdf:resource="#EntertainmentOrArtsProfessional"/>
        <rdfs:subClassOf rdf:resource="#Individual"/>
        <definingMt rdf:resource="#HumanSocialLifeMt"/>
        <facets-Generic rdf:resource="#EntertainerTypeByActivity"/>
        <genls rdf:resource="#Person"/>
        <genls rdf:resource="#EntertainmentOrArtsProfessional"/>
    </owl:Class>
    <owl:Class rdf:ID="Artist">
        <rdfs:comment>A specialization of #$Person. Each instance of this collection is a person 
            who produces or performs works of art.  This includes performing artists (whose works are 
           transitory unless recorded) as well as visual artists, literary writers, and composers 
           (whose works are intended to last for a significant length of time and be viewed or 
           otherwise appreciated after the artist finishes them). A notable specialization of this 
             collection is #$Artist-Visual. See also #$ArtObject, #$artisticWorksCreated.</rdfs:comment>
        <guid>c0c71923-9c29-11b1-9dad-c379636f7270</guid>
        <rdf:type rdf:resource="#ProposedPublicConstant-CommentOK"/>
        <rdf:type rdf:resource="#ProposedPublicConstant-DefinitionalGAFsOK"/>
        <rdf:type rdf:resource="#PersonTypeByActivity"/>
        <rdfs:subClassOf rdf:resource="#EntertainmentOrArtsProfessional"/>
        <definingMt rdf:resource="#HumanSocialLifeMt"/>
        <genls rdf:resource="#Person"/>
        <genls rdf:resource="#EntertainmentOrArtsProfessional"/>
        <keStrongSuggestionInverse rdf:resource="#createdBy"/>
    </owl:Class>
    <owl:Class rdf:ID="Blonde-HairColor">
        <rdfs:comment>The human hair color blonde, a kind of yellow or gold or very light 
          brown.</rdfs:comment>
        <guid>2c881676-74bd-11d6-8000-00a0c99cc5ae</guid>
        <rdf:type rdf:resource="#Color"/>
        <rdfs:subClassOf rdf:resource="#Hair-Stuff"/>
        <rdfs:subClassOf rdf:resource="#HairOnHead-Human"/>
        <rdfs:subClassOf rdf:resource="#ColoredThing"/>
    </owl:Class>
    <owl:Class rdf:ID="BlondeHairedHuman">
        <rdfs:comment>The collection of all #$HomoSapiens with #$Blonde-HairColor evident on 
          the head.  The color may be natural or dyed, and it may include woven hair.  Removable 
          wigs are excluded from consideration in determining this class of people.</rdfs:comment>
        <guid>c156bb71-9c29-11b1-9dad-c379636f7270</guid>
        <rdf:type rdf:resource="#ConventionalClassificationType"/>
        <rdf:type rdf:resource="#FirstOrderCollection"/>
        <rdfs:subClassOf rdf:resource="#HomoSapiens"/>
    </owl:Class>

This markup provides several opportunities to connect the simple statements made previously to an extremely powerful artificial intelligence tool. For example, by adding the statement:

  
  <rdf:Description rdf:about="#ChristinaAguilera">
    <rdf:type rdf:resource="http://www.cyc.com/2002/04/08/cyc#Singer"/>
  </rdf:Description>
or by modifying the line
<Keyword>blonde</Keyword>
to
<Keyword rdf:resource="http://www.cyc.com/2002/04/08/cyc#BlondeHairedHuman">blonde</Keyword>
Cyc will be able to make statements and inferences about the resource representing Christina Aguilera. These new statements and inferences can then be used to allow more intelligent searching and retrieval of the resources within the PSW. They could also allow intelligent merging and discovery of resources across many disparate PSWs.
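
As a rough illustration of the simplest kind of inference involved, the sketch below follows only the rdfs:subClassOf links that appear in the OpenCyc excerpt above. It is not the Cyc inference engine, and the dictionary and function names are assumptions made for this example. Typing the resource as a Singer lets an application conclude that it is also a Musician, an Artist, an Entertainer, and so on.

```python
# Subclass links copied from the rdfs:subClassOf elements in the excerpt above.
# The genls, rdf:type and other Cyc predicates are deliberately ignored here.
SUBCLASS_OF = {
    "Singer": ["MusicalPerformer"],
    "MusicalPerformer": ["SocialBeing", "Musician", "Artist-Performer"],
    "Musician": ["Artist"],
    "Artist-Performer": ["Entertainer", "Artist"],
    "Entertainer": ["EntertainmentOrArtsProfessional", "Individual"],
    "Artist": ["EntertainmentOrArtsProfessional"],
}


def inferred_types(asserted_type, subclass_of=SUBCLASS_OF):
    """Return every class reachable from asserted_type via subClassOf links."""
    seen = set()
    stack = [asserted_type]
    while stack:
        cls = stack.pop()
        for parent in subclass_of.get(cls, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    # Classes implied by the single statement "ChristinaAguilera rdf:type Singer"
    print(sorted(inferred_types("Singer")))
```

A search for all Artist resources in the PSW could then also return resources that were only ever tagged as Singer.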

Taking the Personal Semantic Web Further

The previous sections have focused on using common applications to build semantic webs on a personal scale and discussed potential methods for providing semantic merge points that allow those PSWs to be combined into a larger web. These applications are not generally viewed as business applications, nor are they sources of information commonly used in a business setting. However, the underlying technology for creating PSWs from common applications is directly applicable to larger environments. One possible source of semantic information is the growing number of web service based applications. For example, Amazon, Google, Yahoo, and LexisNexis have published web service interfaces that allow users to access their vast stores of information. The next sections will discuss how this can be done.

LexisNexis Web Services Kit

The LexisNexis Web Services Kit (WSK) can be used to integrate the full range of LexisNexis content (over 32,000 sources) into custom applications. The WSK is an XML based application programming interface (API) that follows industry accepted standards to provide complete integration and presentation control.

The WSK is used to control those portions of the online research process that interact directly with users. For example, the developer can design the mechanism for users to request information. The request is packaged in a SOAP message to the WSK API. Once the request is processed, a response message containing the XML documents or other information requested is returned. The documents can then be processed for inclusion into a PSW.

Once a user has been authenticated into the system, they can select from any of the sources to which they have been entitled. The sources and their attributes could be used to populate a PSW application devoted to legal or business research. After the sources have been selected, search parameters are used to build a search that will be sent to the LexisNexis search system.

The search operation is used to locate documents that satisfy the parameters specified in the search request. The search specifies the words or phrases that should or should not appear within the documents. In addition, other parameters allow the search to be further restricted by specifying the source(s) to search, the terms to look for within the candidate documents, and a date range or other restrictions that must be observed. The search operation also allows the application to specify how the search result set is delivered. Delivery options include display format (cite list or full document text), markup method, sort order (date, relevance or source specific), and the range of documents desired from within the set. There are two markup method options: display and semantic. When the display markup option is selected, the returned documents are marked up using XHTML. When the semantic markup option is selected, the returned documents are marked up in a semantically richer scheme: news documents are returned in NITF, and other types of documents are returned in LexisNexis-specific semantic markup. Other public interchange markup standards may be adopted in the future based on market conditions.

The response message contains the search results in the format requested. Each item includes a unique document identifier that can be used to retrieve the requested document in whatever format is desired. Once a search set has been returned, it is also possible to narrow the results set to more relevant documents by applying additional search restrictions to the set.

Results can be retrieved by specifying a range of documents. This returns a subset of the entire search result set. Each time a request is made for a new range of documents, it is possible to specify a different format or type of markup. The response message for this type of retrieval contains the collection of documents in the format and markup requested. It also includes a document identifier for each document delivered that can be stored and used to retrieve specific documents individually.

Applications can also retrieve documents based on their document identifiers. In the case mentioned above where a list of documents is retrieved, the user can specify that the full text of a particular document be retrieved. Some documents also have attachments associated with them. The WSK provides a mechanism for retrieving the attached documents for display or storage.

Many times users develop a search to retrieve specific information about a topic and then want to re-run the search at a later time. The WSK provides a method for saving searches and managing the saved searches. Any number of searches can be saved, each with a user specified name for later use. The names can then be used to recall and rerun the searches from time to time to obtain the most current information available about the topic. Depending on the sources used to query the WSK, any number of possible statements can be made.

Amazon E-Commerce Service

The Amazon E-Commerce Service (ECS) is an API that allows you to access Amazon data and functionality through a Web site or Web-enabled application. The ECS follows the standard Web services model: users of the service request data through XML over HTTP using REST or SOAP and data is returned by the service as an XML-formatted stream of text.

Through ECS, the following types of data can be accessed:

  • Product data including information about product availability and pricing for items in the Amazon catalog.
  • Content from customers including reviews, wish lists, and Listmania lists.
  • Seller information including general information and customer feedback about the wide range of vendors on the Amazon site.
  • Third-party product data, including products sold by smaller vendors on the Amazon Web sites.
  • Shopping carts of products for purchase through Amazon. This allows an application developer to receive commissions on sales that originate with a Web site or application.

ECS provides two types of inquiries: search and lookup. A search is a request that returns information matching specified criteria. Searches can return no data (if nothing matches the criteria specified) or multiple objects that match the search criteria. An example of a search might be a request to retrieve all books about constitutional law. A lookup is a request for a specific object or set of objects, specified by one or more unique identifiers. An example of a lookup might be to retrieve information about a book by its Amazon Standard Identification Number (ASIN).

The search operation in ECS uses keywords or other criteria to search for products. This operation combines several of the searches that might be familiar from use of the amazon.com website, including keyword search, power search, author search, artist search, actor search, director search, manufacturer search, and text stream search. Setting up a search operation consists of three steps: choosing the Amazon store to search, specifying search parameters, and requesting the desired output.

The text stream search retrieves products based on a block of text specified in the request. The text block could be a search term, a paragraph from a blog, an article excerpt, or any other text for which product matches are to be retrieved. When Amazon receives the request, it parses out recognized keywords and returns ten products in total, divided equally among the recognized keywords. For example, if a request is sent with five recognized keywords, Amazon will return two products matching each recognized keyword. This functionality is available only on the US store.
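
The allocation described above amounts to a small piece of arithmetic, sketched here. The function name is an assumption made for this example, and so is the handling of keyword counts that do not divide ten evenly (the remainder is simply dropped); the text above only states the five-keyword case.

```python
TOTAL_PRODUCTS = 10  # "ten total", as described in the paragraph above


def products_per_keyword(recognized_keywords, total=TOTAL_PRODUCTS):
    """Split the total number of products evenly across recognized keywords.

    Assumption: when the keywords do not divide the total evenly, the
    remainder is dropped. The service's actual behavior is not stated here.
    """
    if not recognized_keywords:
        return {}
    share = total // len(recognized_keywords)
    return {keyword: share for keyword in recognized_keywords}


if __name__ == "__main__":
    # Five recognized keywords: two products for each keyword
    print(products_per_keyword(["court", "judge", "statute", "appeal", "ruling"]))
```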

The power search is used to perform book searches on Amazon using a complex query string. Complex query strings are of the format: key:value where keys include ASIN, author, author-exact, author-begins, keywords, keywords-begin, language, publisher, subject, subject-words-begin, subject-begins, title, title-words-begin, and title-begins. For example the query "author:ambrose" returns a list of books that include "Ambrose" in the author name. A query of "subject:history and (spain or mexico) and not military and language:spanish" would return a list of books in the Spanish language on the subject of either Spanish or Mexican history, excluding all items with military in their subject.
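
A small helper can assemble such query strings and reject keys that are not in the list above. This is only a sketch: the helper names are hypothetical, and the boolean grammar (and, or, not, parentheses) is inferred from the two example queries rather than taken from the ECS documentation.

```python
# Keys listed in the paragraph above
POWER_KEYS = {
    "ASIN", "author", "author-exact", "author-begins", "keywords",
    "keywords-begin", "language", "publisher", "subject",
    "subject-words-begin", "subject-begins", "title",
    "title-words-begin", "title-begins",
}


def term(key, value):
    """Build a single key:value term, rejecting unknown keys."""
    if key not in POWER_KEYS:
        raise ValueError(f"unknown power search key: {key}")
    return f"{key}:{value}"


def any_of(*values):
    """Parenthesized alternatives, e.g. (spain or mexico)."""
    return "(" + " or ".join(values) + ")"


def exclude(value):
    """Negated term, e.g. not military."""
    return "not " + value


def all_of(*parts):
    """Join terms that must all hold."""
    return " and ".join(parts)


if __name__ == "__main__":
    query = all_of(
        term("subject", "history"),
        any_of("spain", "mexico"),
        exclude("military"),
        term("language", "spanish"),
    )
    print(query)
```

Building the string from parts keeps the judge and headnote lookups discussed later from depending on hand-written query text.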

Constructing a Semantic Web using Web Services

The LexisNexis WSK allows application developers to access a wide range of the full LexisNexis collection (within subscription limitations). While the semantic markup provides more detailed markup, the display markup can be more consistent across a wider range of document types. This allows an application to process a greater variety of the data with the same general rules. The example application proposed in this paper is based on caselaw data retrieved using the display markup scheme, but could easily be extended to include news or financial data.

Caselaw Metadata

Several items appear consistently throughout US caselaw. These include court information (docket number, court name, etc.), case names, judge information, parties in the case, core terms within the case, headnote information that discusses the most salient points of law within the case, opinions, dissents, concurrences, and cited cases and statutes. Some of these pieces of information provide descriptive metadata that can be readily represented within a PSW. Other pieces that are primarily large areas of flowing text would not map as well. In addition, certain ontologies might be defined, such as the US court structure or the taxonomy used to classify headnote topics. These items might also be stored in separate ontologies and then linked with the cases as needed. This would allow the hierarchies to be maintained and updated without directly affecting the overall PSW.

Within the caselaw documents, the following resources could be harvested from each case retrieved and loaded into the PSW: court name, case information (including long name, short name, docket number, cite IDs, posture, overview, outcome), judges hearing the case, core terms from within the case, headnote classification and referenced cases and statutes.

Each retrieved case might result in the addition of dozens of new resources that are instances of the types previously mentioned. The resource represents the case, instead of directly representing the document. The statements harvested from the document would then be attached to the case itself. In order to have a persistent identifier for the documents, the LexisNexis Identifier (LNI) is used. The LNI is stored in a <META> tag in the header of the XHTML document. In order to re-retrieve the document a new search must be submitted using the LNI.
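
Harvesting the identifier from the document header could look roughly like the following sketch. The meta name "LNI" and the sample document are assumptions made for illustration; the actual attribute names used in WSK display markup are not given here.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from meta elements in a document header."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META> also matches
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content


def extract_lni(xhtml, meta_name="LNI"):
    """Return the identifier stored under meta_name, or None if absent."""
    collector = MetaCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.meta.get(meta_name.lower())


# Hypothetical document; the identifier value is a placeholder, not a real LNI
SAMPLE = (
    "<html><head><title>Sample case</title>"
    '<meta name="LNI" content="EXAMPLE-LNI-0001" />'
    "</head><body></body></html>"
)

if __name__ == "__main__":
    print(extract_lni(SAMPLE))
```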

A case is identified in many different ways. Within LexisNexis each file has a unique LNI. This value can be used to create a LexisNexis-specific Uniform Resource Identifier (URI) for the case. However, such a value would most likely not be appropriate for publicly exchanged URIs. Each case has a long (full) name. Most cases also have short names that are used when the case is cited. The case is assigned a docket number when it is heard in court. Case reporters also assign citations to the case, and these differ from one reporter to the next. These names, docket numbers and citations can be used to build URIs for the case. By building URIs using these identifiers, new cases can be automatically connected into the PSW as they are retrieved.

As a case progresses through the court system, it is assigned a different docket number by each court that hears it. It also receives a new citation from each of the case reporters. Special statements can be created that track a case through the court system. This allows the user to see what points of law become instrumental in an argument or what areas are overruled.

Cases and statutes referenced from within the case are also represented as RDF statements about the referencing case. An application might be able to determine if and when to retrieve the referenced documents. In some cases, it might be appropriate to automatically retrieve cases to complete the database. However, doing so could result in a large number of retrievals, each with a cost. The most likely scenario would have the documents being retrieved on an on-demand basis. Since the referenced documents are not retrieved, there is no LNI available from which to construct a URI. For this reason, the best decision would be to use each unique case citation to build a URI to allow merging. It is likely that a case will have multiple URIs, so the capability to use the OWL sameAs property might be useful.
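
One way to derive mergeable URIs from citations is sketched below. The base namespace, the normalization rules and the function names are all assumptions made for this example, and the equivalence statements are written with owl:sameAs, the property used earlier for the TAP link. The parallel citations shown are those of Brown v. Board of Education.

```python
import re
from urllib.parse import quote

BASE = "http://example.org/case/"  # hypothetical namespace for citation URIs


def citation_uri(citation, base=BASE):
    """Normalize a reporter citation and turn it into a URI.

    Assumed normalization: trim, lowercase, collapse whitespace runs into
    single underscores, then percent-encode anything unsafe.
    """
    slug = re.sub(r"\s+", "_", citation.strip().lower())
    return base + quote(slug, safe="._")


def same_as_statements(citations):
    """Link every citation URI of one case to the first one via owl:sameAs."""
    uris = [citation_uri(c) for c in citations]
    if not uris:
        return []
    primary = uris[0]
    return [(primary, "owl:sameAs", other) for other in uris[1:]]


if __name__ == "__main__":
    parallel = ["347 U.S. 483", "74 S. Ct. 686", "98 L. Ed. 873"]
    for statement in same_as_statements(parallel):
        print(statement)
```

Because the URI depends only on the citation text, two PSWs that harvest the same citation independently arrive at the same URI and can be merged on it.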

The judges hearing the case can be represented as separate resources that can then be associated with each case they hear. Several different types of statements can be developed to describe how they ruled or who was the primary author of a portion of a decision. Statements might also be made associating the judges with a specific court and assigning them a specific role, such as justice or chief justice.

Each core term discovered within the case is represented as a resource and linked to each case in which it occurs. Each headnote item can be used to connect the case to a legal taxonomy. Since the headnote information is stored as a full taxonomy, it would become possible to group cases by areas of law, in either a more granular or less granular fashion.
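
Grouping at different levels of granularity can be sketched by treating each headnote classification as a path through the taxonomy and truncating that path to the desired depth. The taxonomy labels and case identifiers below are invented placeholders, not entries from the actual LexisNexis headnote taxonomy.

```python
from collections import defaultdict


def group_by_area(case_headnotes, depth):
    """Group case identifiers by the first `depth` levels of each headnote path."""
    groups = defaultdict(set)
    for case_id, paths in case_headnotes.items():
        for path in paths:
            groups[path[:depth]].add(case_id)
    return dict(groups)


# Hypothetical data: each case maps to one or more taxonomy paths
CASES = {
    "case-A": [("Constitutional Law", "Equal Protection", "Education")],
    "case-B": [("Constitutional Law", "Freedom of Speech")],
    "case-C": [("Contracts", "Formation")],
}

if __name__ == "__main__":
    print(group_by_area(CASES, depth=1))  # coarse: broad areas of law
    print(group_by_area(CASES, depth=2))  # finer: second-level topics
```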

Lexis.com allows searches to be run periodically using the ECLIPSE feature. An application could also be set up through WSK to run saved searches periodically. By adjusting the date range applied to the search, only new results will be retrieved. Any new answer sets can be then added into the database. Depending on the presentation system being used, new statements and resources can be identified to the user as they are added to the database.
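
The date-range adjustment might be sketched as follows. The function names are hypothetical and no WSK call is shown. The sketch assumes that dates have day granularity, so the window starts on the day of the previous run and documents already harvested are filtered out by their document identifiers.

```python
from datetime import date


def rerun_window(last_run, today):
    """Date range for the next run of a saved search.

    The range starts on the day of the previous run (inclusive) so that
    documents added later that same day are not missed.
    """
    if last_run > today:
        raise ValueError("last_run cannot be later than today")
    return (last_run, today)


def only_new(result_ids, seen_ids):
    """Drop documents whose identifiers were harvested on an earlier run."""
    return [doc_id for doc_id in result_ids if doc_id not in seen_ids]


if __name__ == "__main__":
    start, end = rerun_window(date(2006, 5, 1), date(2006, 5, 8))
    print(start.isoformat(), end.isoformat())
    # Placeholder identifiers standing in for document identifiers
    print(only_new(["doc-1", "doc-2", "doc-3"], seen_ids={"doc-1"}))
```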

Adding Amazon Links to the Caselaw PSW

As resources are added to the database, searches can be run against ECS to determine whether any items for sale on Amazon might be related to the resource. If ECS returns an empty set, no link is created. For example, by knowing that a judge is a person, the application could run a power search on the judge's name using an author query to determine whether the judge is the author of any books, as opposed to being the subject of any books. Another example might be checking whether a headnote subject occurs within the title of a publication or as one of the keywords associated with it.

As the user manipulates the database, materials that are related to the subjects covered in the case can be presented as items available for purchase. This model could be further extended as other web service APIs are added to the application.

It is important to note that as resources are added to the database and ECS is queried for the availability of materials that might be appropriate to the subject, the materials returned from Amazon should not be added into the database. Only links that perform the appropriate searches should be added to the database. This is because the Amazon inventory, as well as the suppliers and prices, changes constantly. It would be a daunting, if not impossible, task to keep the database synchronized with the Amazon databases. By only including specific searches through links, the application would be able to provide the most current information without having to store it.
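
The idea of storing a search rather than its results can be sketched as a statement whose object is an encoded query. The predicate name, the "ecs-search:" link scheme and the parameter names are all hypothetical; they stand in for whatever link form a real application would use to re-issue the request.

```python
from urllib.parse import parse_qs, urlencode

RELATED_SEARCH = "psw:relatedProductSearch"  # hypothetical predicate name


def search_link(store, power_query):
    """Encode a search as a link instead of storing what it returns."""
    return "ecs-search:?" + urlencode({"store": store, "power_query": power_query})


def link_statement(resource_uri, store, power_query):
    """Statement attaching the stored search to a resource in the database."""
    return (resource_uri, RELATED_SEARCH, search_link(store, power_query))


def decode_link(link):
    """Recover the search parameters when the user follows the link."""
    params = parse_qs(link.split("?", 1)[1])
    return {key: values[0] for key, values in params.items()}


if __name__ == "__main__":
    # Placeholder resource and author name
    statement = link_statement("#JudgeExample", "books", "author:example")
    print(statement)
    print(decode_link(statement[2]))
```

Only the query is persisted, so prices and availability are fetched fresh each time the link is followed.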

This same concept could also be applied to the storage of the documents retrieved from LexisNexis. While the application as described would store the documents, doing so would also open the possibility of having outdated documents within the database. If the sample application were taken to a production state, the conventional wisdom would be to only manage the metadata in the PSW and reference the source documents to allow retrieval if the full text was needed at a later date.

Acronyms

API - Application Programming Interface

ASIN - Amazon Standard Identification Number

ECS - E-Commerce Service

HTTP - Hypertext Transfer Protocol

LNI - LexisNexis Identifier

NITF - News Industry Text Format

OWL - Web Ontology Language

RDF - Resource Description Framework

REST - Representational State Transfer

RSS - RDF Site Summary

SOAP - Simple Object Access Protocol

SKOS - Simple Knowledge Organization System

URI - Uniform Resource Identifier

URL - Uniform Resource Locator

WSK - Web Services Kit

XHTML - Extensible Hypertext Markup Language

XTM - XML Topic Maps

Conclusion

This paper has discussed how common applications can be enabled to create small-scale, or personal, semantic webs of information by collecting and grouping certain metadata. As more metadata is gathered, more and more connections are made, building the web. By allowing people to provide their own tagging schemes, the personal semantic webs can be further grown. If these personal tags or folksonomies are connected to larger ontologies, a common platform is defined that allows the personal semantic webs to be connected into a yet larger web.

The scalability of the PSWs is presented in a discussion of an environment where web service interfaces are used to gather and organize metadata in such a way as to create larger-scale semantic webs. These large scale webs also demonstrate the importance of uniform URIs that allow data to be grouped in a consistent manner. Semantics identified in a consistent manner will allow agents to be developed for wider use.

Appendix A

<?xml version="1.0"?>
<rdf:RDF 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xml:base="http://www.semetag.com/#">

<owl:Ontology rdf:about="">
  <owl:versionInfo>semetag.owl</owl:versionInfo>
  <rdfs:comment>Ontology for resources managed within Semetag</rdfs:comment>
</owl:Ontology>

<owl:Class rdf:ID="Image">
  <rdfs:label xml:lang="en">Image</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://xmlns.com/foaf/0.1/Image"/>
</owl:Class>

<owl:Class rdf:ID="Playlist">
  <rdfs:label xml:lang="en">Playlist</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
</owl:Class>

<owl:Class rdf:ID="ImagePlaylist">
  <rdfs:label xml:lang="en">Image Playlist</rdfs:label>
  <rdfs:subClassOf rdf:resource="#Playlist"/>
</owl:Class>

<owl:Class rdf:ID="AudioPlaylist">
  <rdfs:label xml:lang="en">Audio Playlist</rdfs:label>
  <rdfs:subClassOf rdf:resource="#Playlist"/>
</owl:Class>

<owl:ObjectProperty rdf:ID="altLocation">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#FunctionalProperty"/>
  <rdfs:label xml:lang="en">Alternate File Location</rdfs:label>
  <rdfs:comment>Provides an alternate (either online or local) copy of the file</rdfs:comment>
  <rdfs:domain rdf:resource="#Image"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#uriReference"/>
  <owl:equivalentProperty rdf:resource="http://www.w3.org/2002/07/owl#sameAs"/>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:ID="subject">
  <rdfs:label xml:lang="en">Image Subject</rdfs:label>
  <rdfs:comment>Lists the subject shown in the image</rdfs:comment>
  <rdfs:domain rdf:resource="#Image"/>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:ID="fileFormat">
  <rdfs:label xml:lang="en">File Format</rdfs:label>
  <rdfs:comment>Gives the file format of the resource</rdfs:comment>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:ID="location">
  <rdfs:label xml:lang="en">Location shown</rdfs:label>
  <rdfs:comment>Lists the location shown in the image</rdfs:comment>
  <rdfs:domain rdf:resource="#Image"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#uriReference"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:ID="keyword">
  <rdfs:label xml:lang="en">Keyword</rdfs:label>
  <rdfs:comment>Lists user defined keywords to be associated with the image</rdfs:comment>
  <rdfs:domain rdf:resource="#Image"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#uriReference"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:ID="partOfPlaylist">
  <rdfs:label xml:lang="en">Part of playlist</rdfs:label>
  <rdfs:comment>A resource is part of a playlist</rdfs:comment>
  <rdfs:domain rdf:resource="#Image"/>
  <rdfs:range rdf:resource="#Playlist"/>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:ID="playlistIncludes">
  <rdfs:label xml:lang="en">Playlist includes</rdfs:label>
  <rdfs:comment>A playlist includes a set of resources</rdfs:comment>
  <owl:inverseOf rdf:resource="#partOfPlaylist"/>
</owl:ObjectProperty>

</rdf:RDF>


Bibliography

[AMAZON] Amazon E-Commerce Service Developer Guide http://www.lexisnexis.com/webserviceskit/developers

[JAVAXML] Professional Java XML

[LNWSK] LexisNexis Web Services Kit Developers Guide http://www.lexisnexis.com/webserviceskit/developers

[NEWSWEEK] The New Wisdom of the Web

[SCIAM] The Semantic Web

[WEBSERV] Web Services: A Technical Introduction


