New Suite of Lexical Tools Developed

The collection of texts, morphological analysis tools, and reference works in the Perseus Digital Library provides an excellent testbed for examining how techniques from corpus linguistics and information retrieval can be adapted for the study of ancient Greek and Latin texts. Work in these fields has been the focus of much recent research at the Perseus Project, and these explorations have resulted in three new tools for the Perseus Digital Library.
Synonym Tool for Greek and Latin
The first tool is based on a program that analyzes dictionary entries in the Lewis and Short Latin Dictionary and the Liddell, Scott, and Jones Greek Lexicon in Perseus and suggests possible synonyms. For example, this program suggests that words such as "polis" ("city") and "dêmokratia" ("democracy") have definitions that are similar to the definition for "dêmos" ("people," etc.). Likewise, the program suggests "canto" ("to sing, etc.") and "sono" ("to sound, etc.") as possible synonyms for "cano" ("to sing, etc.").
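The announcement does not say how the program compares dictionary entries, so the following is only a minimal sketch of one plausible approach: treat each definition as a set of content words and rank other headwords by word overlap (Jaccard similarity). The glosses, the stopword list, the threshold, and the function names are all assumptions made for illustration; they are not the Lewis and Short entries or the Perseus implementation.

```python
import re

# Toy glosses invented for illustration; NOT the actual Lewis and Short entries.
DEFINITIONS = {
    "cano": "to sing, to sound, to play",
    "canto": "to sing, to play, to recite",
    "sono": "to sound, to resound, to make a noise",
    "curro": "to run, to move quickly, to hasten",
}

# Assumed stopword list: function words that carry no meaning in a gloss.
STOPWORDS = {"to", "a", "an", "the", "of", "or", "and"}


def content_words(definition):
    """Return the set of lowercase non-stopword tokens in a definition."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_synonyms(headword, definitions, threshold=0.1):
    """Rank other headwords by definition overlap with `headword`."""
    target = content_words(definitions[headword])
    scored = [
        (other, jaccard(target, content_words(text)))
        for other, text in definitions.items()
        if other != headword
    ]
    kept = [(word, score) for word, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for word, score in suggest_synonyms("cano", DEFINITIONS):
        print(word, round(score, 2))
```

With these toy glosses, "canto" and "sono" are suggested for "cano" while "curro" is not, which mirrors the Latin example above. A real system would need to cope with much longer entries, citations, and multiple senses.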
Greek Word Collocation Tool
The second tool shows users which words are likely to appear within five words of each other in Perseus Greek texts. Collocation data of this sort can reveal common patterns of language usage.
For example, in English, collocation data shows that the mutual information score for the words "strong" and "tea" is much higher than the score for "powerful" and "tea." This suggests that it is much more common to speak of "strong tea" than "powerful tea."
Collocation data can also provide a quick overview of the sense in which an author uses a word. For example, if the most common collocates of the word "bank" in a collection of texts were words such as "water," "shade," or "cool," we would know that the author was probably writing about rivers rather than financial institutions.
The same holds for Greek texts. Just as in English, commonly used word pairs have a high mutual information score; the score for the Greek words "agathos" and "kalos," for example, is quite high. It is also possible to use collocation data to determine the semantic range of a word in a Greek text. For example, the most common collocates of the word "thuô" ("to sacrifice") are, as one might expect, the implements, objects, and personnel associated with sacrifice.
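To make the mutual information score concrete, here is a minimal sketch that counts word pairs occurring within a five-word window and scores each pair with pointwise mutual information, log2(P(x, y) / (P(x) P(y))). This is one common way of estimating the score and is not necessarily the formula the Perseus tool uses; the function name, the probability estimates, and the toy English corpus are assumptions.

```python
import math
from collections import Counter


def collocation_scores(tokens, window=5):
    """Map each unordered word pair seen within `window` words to its PMI."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead at the next `window` tokens only, so each
        # co-occurrence of two positions is counted exactly once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pair_counts[tuple(sorted((word, other)))] += 1

    total_words = len(tokens)
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / total_pairs          # estimated P(x, y)
        p1 = word_counts[w1] / total_words    # estimated P(x)
        p2 = word_counts[w2] / total_words    # estimated P(y)
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores


if __name__ == "__main__":
    # Toy corpus invented for illustration.
    text = ("strong tea is good strong tea is hot "
            "powerful engine is loud powerful tea is rare")
    scores = collocation_scores(text.split())
    print(scores[("strong", "tea")] > scores[("powerful", "tea")])
```

In this toy corpus "strong" and "powerful" are equally frequent, but "strong" falls near "tea" more often, so the pair ("strong", "tea") receives the higher score. On real texts, scores for very rare words are unreliable, and tools typically apply a minimum frequency cutoff.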
Both of these tools are fully integrated with the Greek and Latin lexica in the Perseus Digital Library. To see the possible synonyms and the most common collocates of a word, simply look up that word in the Perseus lexica. The possible synonyms and collocation information will appear in tables at the top of each dictionary entry.
Words in Context Search Tool
The third tool allows users to form complex queries about the Greek and Latin texts in the Perseus Digital Library. Users can enter any number of query words in their inflected forms. The search program will parse the query words and find sentences that contain words derived from the same lexical forms. For example, if "pempousin" and "angelon" are entered as search terms, the program will find all sentences that contain words derived from "pempô" and "angelos." It is also possible for users to limit their searches by specifying words that should or should not be included in the search results. Searches can also be limited to individual authors or general categories such as prose or poetry.
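The search described above can be sketched as follows: reduce each query word and each word of a sentence to its dictionary form, then keep the sentences that contain every required form and none of the excluded ones. The small lookup table is a hypothetical stand-in for the Perseus morphological analyzer, and the transliterated sentences, function names, and parameters are invented for illustration.

```python
# Hypothetical form-to-lemma table standing in for a real morphological analyzer.
LEMMAS = {
    "pempousin": "pempô",
    "pempei": "pempô",
    "angelon": "angelos",
    "angelous": "angelos",
    "angelos": "angelos",
    "basileus": "basileus",
}

# Invented toy sentences in transliteration; not drawn from any Perseus text.
SENTENCES = [
    "ho basileus pempei angelous",
    "pempousin angelon",
    "ho angelos legei",
]


def lemmatize(form):
    """Return the dictionary form of a word, or the word itself if unknown."""
    form = form.lower()
    return LEMMAS.get(form, form)


def search(sentences, include, exclude=()):
    """Find sentences containing all `include` lemmas and no `exclude` lemmas."""
    wanted = {lemmatize(word) for word in include}
    banned = {lemmatize(word) for word in exclude}
    hits = []
    for sentence in sentences:
        lemmas = {lemmatize(token) for token in sentence.split()}
        if wanted <= lemmas and not (banned & lemmas):
            hits.append(sentence)
    return hits


if __name__ == "__main__":
    print(search(SENTENCES, ["pempousin", "angelon"]))
    print(search(SENTENCES, ["pempousin", "angelon"], exclude=["basileus"]))
```

The first query matches both sentences that contain a form of "pempô" and a form of "angelos," even though neither query word appears verbatim in the first sentence. A real analyzer may return several candidate dictionary forms for an ambiguous word, which this one-to-one table ignores.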
Other research in these areas continues at Perseus. We are currently working on ways to identify fixed phrases and idioms in Greek and Latin texts, ways to cluster the definitions in the Greek and Latin lexica conceptually into a thesaurus or WordNet, and ways to identify possible Greek synonyms for Latin words and Latin synonyms for Greek words. Try these tools and let us know what you think!


document placed on-line 12/29/99, LMC