Thesaurus Progress

I’ve been working on a thesaurus.

I’ve always been fascinated by thesauri. I used to read Roget’s when I was a teenager (although “surf” is probably a more accurate description than “read”). I have also had many occasions to use a thesaurus in my programming work, when I need to name something.

I’ve decided to create a programmer’s thesaurus. I looked around on the web, and while there are quite a number of computer dictionaries, there was no “computer thesaurus” that I could find.

The project actually has two sides. The first is the construction of the thesaurus database itself. The other is the creation of a software application for creating and maintaining thesauri. I looked at some of the commercial and academic thesaurus-editing applications, and while I learned quite a bit about the technology, I could see that none of them was the right solution for my needs.

The current implementation is an AJAX application written using TurboGears, a very sophisticated Python-based web application framework. The editor currently supports the following features:

  • Search by word or by description.
  • Assign parts of speech (noun, verb, adjective, adverb) to each definition.
  • Expanding tree-view of definitions using dynamic HTML.
  • Descriptions can contain links to other terms.
  • Editor panel allows modification of term attributes, add / remove synonyms, link to other terms, etc.
  • Supports a broad set of inter-term linkages: BroaderTerm, NarrowerTerm, SiblingTerm, Antonym, PartOf, HasPart, AspectOf, HasAspect, OperatesOn, OperandOf, Entails, EntailedBy, and the generic “Related” link.
  • Supports version history of changes to each term.
  • Various kinds of analysis queries: Orphan words, Popular (Most Frequently Referred), Islands (terms with no links), Top Level (terms with no parent), etc. I’m also working on various kinds of cluster analysis queries.
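
To make the analysis queries concrete, here is a minimal Python sketch of how they might be computed over an in-memory set of terms. It is not the editor's actual code: the `Term` and `Thesaurus` classes, their method names, and the convention that a child term carries a `BroaderTerm` link pointing at its parent are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Term:
    """A hypothetical term record: a name, a part of speech, and typed links."""
    name: str
    part_of_speech: str = "noun"
    # Each link is a (link_type, target_name) pair, e.g. ("HasPart", "element").
    links: list = field(default_factory=list)


class Thesaurus:
    """Hypothetical in-memory store used only to illustrate the queries."""

    def __init__(self):
        self.terms = {}

    def add_term(self, name, part_of_speech="noun"):
        self.terms[name] = Term(name, part_of_speech)

    def link(self, source, link_type, target):
        self.terms[source].links.append((link_type, target))

    def islands(self):
        """Terms that neither link to anything nor are linked to."""
        linked = set()
        for term in self.terms.values():
            for _, target in term.links:
                linked.add(term.name)
                linked.add(target)
        return sorted(set(self.terms) - linked)

    def top_level(self):
        """Terms with no parent, i.e. no outgoing BroaderTerm link (assumed convention)."""
        return sorted(
            name
            for name, term in self.terms.items()
            if not any(link_type == "BroaderTerm" for link_type, _ in term.links)
        )

    def popular(self, n=3):
        """The n terms most frequently referred to by other terms' links."""
        counts = Counter(
            target for term in self.terms.values() for _, target in term.links
        )
        return counts.most_common(n)


thesaurus = Thesaurus()
for word in ["collection", "array", "vector", "element", "gizmo"]:
    thesaurus.add_term(word)
thesaurus.link("array", "BroaderTerm", "collection")
thesaurus.link("vector", "BroaderTerm", "collection")
thesaurus.link("array", "HasPart", "element")

print(thesaurus.islands())    # ['gizmo']
print(thesaurus.top_level())  # ['collection', 'element', 'gizmo']
print(thesaurus.popular(1))   # [('collection', 2)]
```

Note that under this definition every island is also a top-level term, since a term with no links at all has no parent either. A database-backed version would more likely express these as queries over a links table.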

On the content side, I am currently up to 600 words, covering 400 discrete “meanings” or definitions.

One thing that I have observed in doing this is that the more I work on it, the less it resembles a conventional thesaurus. I knew going in that there are many terms that are synonyms in the software world that are not so in English, “array” and “vector” being an example. In addition to that, however, I find that the structure of the links between terms is starting to look more like UML, since I have added “part of”, “aspect of” and “operates on” relationships.
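
One way to picture this UML-like structure is as a set of directed, typed edges in which most link types come in inverse pairs. The sketch below is only an illustration in Python: the pairing of link types is inferred from their names, and `INVERSE` and `add_link` are hypothetical helpers, not part of the editor.

```python
# Link types that appear to come in inverse pairs (an assumption based on their names).
PAIRS = [
    ("BroaderTerm", "NarrowerTerm"),
    ("PartOf", "HasPart"),
    ("AspectOf", "HasAspect"),
    ("OperatesOn", "OperandOf"),
    ("Entails", "EntailedBy"),
]
# Link types assumed to be symmetric, so each is its own inverse.
SYMMETRIC = ["SiblingTerm", "Antonym", "Related"]

INVERSE = {}
for forward, backward in PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward
for link_type in SYMMETRIC:
    INVERSE[link_type] = link_type


def add_link(links, source, link_type, target):
    """Record a typed link and its inverse as (source, type, target) triples."""
    links.add((source, link_type, target))
    links.add((target, INVERSE[link_type], source))


links = set()
add_link(links, "array", "HasPart", "element")
add_link(links, "sort", "OperatesOn", "array")

for triple in sorted(links):
    print(triple)
```

Storing both directions keeps queries such as "what operates on an array?" as simple as "what does sort operate on?", at the cost of having to keep each pair consistent whenever a link is edited.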
