Charles Petzold



Roget’s Hierarchical Thesaurus in a Silverlight App

August 24, 2009
Roscoe, N.Y.

To most people these days, a Thesaurus (from the Greek for "treasury") is something invoked from a word processor to suggest synonyms and help prevent one's prose from appearing tired and repetitive. But the original conception of English scientist and physician Peter Mark Roget (1779 – 1869) was an assembly of all the words of the English language into 1000 categories, which are then grouped into various classes, sections, sub-sections, and sub-sub-sections: in short, a hierarchy.

I have not been able to find an 1852 first edition of Roget's Thesaurus of English Words and Phrases, Classified and Arranged so as to Facilitate the Expression of Ideas and Assist in Literary Composition on Google Book Search, but here's the 1853 second edition, which is probably close enough.

I thought it might be a fun challenge to write a Silverlight application that allows navigating through Roget's hierarchy. On the Project Gutenberg site I found what I needed: an ancient ASCII text file containing the bulk of a 1911 edition of Roget's Thesaurus in the public domain. This file was originally assembled by a company called MICRA, Inc. in 1991, and it is Gutenberg E-Text Number 22, which means it originates from the very early days of Project Gutenberg. The 1911 American edition of Roget's Thesaurus used to create this text file is also available on Google Book Search. MICRA then added a bunch of words to this file and (apparently) ran a spell check on it and flagged a bunch of words as "obsolete". By 1911, several more categories had been added to bring the total past 1000.

The first step for me was to write a program that converted this ASCII text file to XML, or actually to two XML files. The first one contains the hierarchy up to but not including the ~1000 categories. This came out to be 17.5K. The second XML file weighs in at 2.26M and contains all the word groups associated with the ~1000 categories. Both XML files required some hand cleaning.
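To give a rough idea of what such a conversion involves, here is a minimal sketch in Python. It is not the program I actually wrote. The input markers (`CLASS`, `SECTION`, `#`), the placeholder words, and the XML element names are all assumptions invented for illustration, and the real Gutenberg file is laid out differently. The sketch shows only the general idea of splitting one text file into a small hierarchy tree and a separate, much larger tree of word groups.

```python
import xml.etree.ElementTree as ET

# Hypothetical line-oriented input. The real Gutenberg file is laid out
# differently; these markers and placeholder words exist only for this sketch.
SAMPLE = """\
CLASS Space
SECTION Motion
#278 Sample category A
N: alpha, beta, gamma
V: delta, epsilon
#279 Sample category B
N: zeta, eta
"""


def convert(text):
    """Return (hierarchy, categories) as two separate XML trees."""
    hierarchy = ET.Element("Hierarchy")    # classes and sections only
    categories = ET.Element("Categories")  # numbered categories and words
    current_class = None
    current_category = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("CLASS "):
            current_class = ET.SubElement(hierarchy, "Class", Name=line[6:])
        elif line.startswith("SECTION "):
            ET.SubElement(current_class, "Section", Name=line[8:])
        elif line.startswith("#"):
            number, name = line[1:].split(" ", 1)
            current_category = ET.SubElement(
                categories, "Category", Number=number, Name=name)
        else:
            # "N: a, b" becomes <Group Part="N"><Word>a</Word>...</Group>
            part, words = line.split(":", 1)
            group = ET.SubElement(current_category, "Group", Part=part.strip())
            for word in words.split(","):
                ET.SubElement(group, "Word").text = word.strip()
    return hierarchy, categories


hierarchy, categories = convert(SAMPLE)
print(ET.tostring(hierarchy, encoding="unicode"))
print(ET.tostring(categories, encoding="unicode"))
```

The sketch does not record which categories belong to which section; any linkage between the two files (a range of category numbers stored on each section, for example) would be a further design decision.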

The Silverlight app accesses these two XML files and presents the hierarchy in a way that seemed to me to be simple, useful, and intuitive, but which also allowed for some fun animations. Total coding time for producing the XML files and the Silverlight app: 30 hours.

You can run the program from this link:


Roget1911Experiment1.html

I targeted a browser client area of 1200 pixels wide and 960 pixels high. If you can't get your browser window that large, scroll bars will be displayed. Otherwise, scrolling has been almost entirely eliminated, but there is still a little that sometimes occurs in the word-list area at the bottom (the part that's colored Alice-blue).

Roget began his hierarchy with six classes: Abstract Relations, Space, Matter, Intellect, Volition, and Affections. (Obviously words like Volition and Affections betray the Victorian origins of this thing.) My program displays these six classes at the top of the screen. You click one to see the further hierarchy, and you keep clicking until you get a list of numbered categories in the ListBox at the lower left. (Some of these will have tooltips associated with them.) Click one of these categories to get the parts of speech (Noun, Verb, etc.), and then click one of those to get the groupings of words in the Alice-blue area.

Often the groupings of words have cross-references to other categories. These function as normal hyperlinks. Obsolete words are flagged with daggers. I have removed the additional flags for obsolete words added in 1991 and discussed in the roget15a.txt file available at the above Project Gutenberg link. Considering the state of the original file and the limited time I spent trying to fix it, it is extremely likely you'll find some errors or oddities in my program.

Getting a grip on Roget's hierarchy was the first challenge. It starts out simple with the six classes, but right away any chance that this is a nice balanced hierarchy with a lot of parallel structure entirely collapses. Four of the classes are divided into sections (as few as 3, as many as 8) but two of the classes (Intellect and Volition) are first divided into two divisions each, and then into sections.

Sometimes the sections are then composed of categories, but often the hierarchy goes further. The longest is: Matter, Organic Matter, Sensation, Special, and then either Sound (which has four nested sections) or Light (which has three). In some parts of the tree, only two categories are listed in the ListBox in the lower left; but the hierarchy Space, Motion, Motion with Reference to Direction has 38 Categories, numbered 278 through 315. I agonized over how I would display such a wide range of categories, and I settled upon using a WrapPanel with a Vertical orientation as the ItemsPanel, so the ListBox actually widens to display multiple columns of up to 14 categories each.
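The arithmetic behind that column layout can be sketched in a few lines of Python. This is only an illustration of the fill order, assuming every item has the same height and that exactly 14 fit in a column; the real WrapPanel measures pixels rather than counting items, and the function name `wrap_into_columns` is made up for this sketch.

```python
def wrap_into_columns(items, max_rows):
    """Fill a column top to bottom, then start the next column to the right."""
    return [items[i:i + max_rows] for i in range(0, len(items), max_rows)]


# The 38 categories under Space, Motion, Motion with Reference to Direction.
category_numbers = list(range(278, 316))
columns = wrap_into_columns(category_numbers, 14)

for column in columns:
    print(column[0], "through", column[-1], "-", len(column), "items")
```

With 38 items and 14 rows, the list needs three columns, the last one only partly filled.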

The WrapPanel also came to the rescue in the part of the program that implemented the hyperlinks from one category to another. Something like this is fairly easy in WPF, but not in Silverlight. Of course I wanted to use a TextBlock to get text wrapping for multi-line entries, and it was easy enough putting multiple Run objects in a TextBlock, some of which were underlined and colored blue. But when the TextBlock is clicked, there is no easy way to determine which Run in the TextBlock got the mouse click. I ended up creating a TextBlock for each word (including each hyperlink), and then using a WrapPanel to do the text wrapping.
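The idea of one element per word can be modeled outside Silverlight. The following Python sketch is not my actual code: the `Token` class, the fixed character width, and the placeholder words are assumptions for illustration. It shows a greedy left-to-right wrap of the kind a horizontal WrapPanel performs, and why hit-testing becomes trivial once every word is its own object that knows whether it is a link.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """One word, standing in for one TextBlock (hypothetical model)."""
    text: str
    link: Optional[int] = None  # target category number, or None for plain text


def wrap(tokens, line_width, gap=1):
    """Greedy wrap: start a new line when the next token will not fit.

    Widths are measured in characters, an assumption standing in for
    the pixel measurements a real layout pass would use.
    """
    lines, current, used = [], [], 0
    for token in tokens:
        width = len(token.text)
        needed = width if not current else used + gap + width
        if current and needed > line_width:
            lines.append(current)
            current, needed = [], width
        current.append(token)
        used = needed
    if current:
        lines.append(current)
    return lines


tokens = [Token("alpha,"), Token("beta,"), Token("see"),
          Token("category", link=279), Token("gamma")]
lines = wrap(tokens, line_width=16)
for line in lines:
    print(" ".join(token.text for token in line))

# A click lands on exactly one token, so finding the link target is a lookup.
clicked = lines[1][0]
print("navigate to category", clicked.link)
```

The cost of this approach is many more elements in the visual tree, one per word rather than one per paragraph.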

The code is a little too chaotic — disorderly, untidy, anarchical, disjointed — to post at this time, but I'll try to get it in shape if there's sufficient demand.