Thesaurus (information retrieval)
In the context of information retrieval, a thesaurus is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object.
A thesaurus serves to guide both an indexer and a searcher in selecting the same preferred term or combination of preferred terms to represent a given subject. ISO 25964, the international standard for information retrieval thesauri, defines a thesaurus as a “controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms.”
A thesaurus is composed of at least three elements: (1) a list of words, (2) the relationships amongst the words, indicated by their relative position in a hierarchy, and (3) a set of rules on how to use the thesaurus.
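The three elements can be illustrated with a minimal Python sketch. The class name `Thesaurus`, its methods, and the sample terms are hypothetical and chosen only for illustration; they are not taken from ISO 25964, ANSI/NISO Z39.19, or any existing library. The sketch also stores lead-in entries (synonyms that point to a preferred term), as described in the ISO 25964 definition quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Thesaurus:
    # Element 1: the list of preferred terms.
    terms: set = field(default_factory=set)
    # Element 2: hierarchical relationships, stored as term -> broader term.
    broader: dict = field(default_factory=dict)
    # Lead-in entries: non-preferred synonym -> preferred term.
    use: dict = field(default_factory=dict)

    def add_term(self, term, broader=None):
        # Element 3 (a usage rule): a broader term must already be listed.
        if broader is not None and broader not in self.terms:
            raise ValueError(f"unknown broader term: {broader}")
        self.terms.add(term)
        if broader is not None:
            self.broader[term] = broader

    def add_lead_in(self, synonym, preferred):
        # Usage rule: lead-in entries may only point at preferred terms.
        if preferred not in self.terms:
            raise ValueError(f"unknown preferred term: {preferred}")
        self.use[synonym] = preferred

    def preferred(self, word):
        """Return the preferred term for a word, or None if it is unknown."""
        if word in self.terms:
            return word
        return self.use.get(word)

    def narrower(self, term):
        """Return the sorted list of terms directly below `term`."""
        return sorted(t for t, b in self.broader.items() if b == term)


# Hypothetical sample vocabulary.
thesaurus = Thesaurus()
thesaurus.add_term("vehicles")
thesaurus.add_term("cars", broader="vehicles")
thesaurus.add_term("bicycles", broader="vehicles")
thesaurus.add_lead_in("automobiles", "cars")
```

With this sample data, `thesaurus.preferred("automobiles")` resolves to `"cars"`, and `thesaurus.narrower("vehicles")` lists `"bicycles"` and `"cars"`. A real thesaurus would typically also record associative (related-term) relationships and scope notes, which this sketch omits.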
History
Wherever there have been large collections of information, whether on paper or in computers, scholars have faced a challenge in pinpointing the items they seek. The use of classification schemes to arrange the documents in order was only a partial solution. Another approach was to index the contents of the documents using words or terms, rather than classification codes. In the 1940s and 1950s some pioneers, such as Calvin Mooers, Charles L. Bernier, and Hans Peter Luhn, collected their index terms in various kinds of list that they called a “thesaurus”. The first such list put seriously to use in information retrieval was the thesaurus developed in 1959 at the E. I. du Pont de Nemours Company. The first two of these lists to be published were the Thesaurus of ASTIA Descriptors and the Chemical Engineering Thesaurus of the American Institute of Chemical Engineers, a descendant of the DuPont thesaurus. More followed, culminating in the influential Thesaurus of Engineering and Scientific Terms (TEST), published jointly by the Engineers Joint Council and the US Department of Defense in 1967. TEST did more than just serve as an example; its Appendix 1 presented Thesaurus rules and conventions that have guided thesaurus construction ever since.
Hundreds of thesauri have been produced since then, perhaps thousands. The most notable innovations since TEST have been:
- Extension from monolingual to multilingual capability; and
- Addition of a conceptually organized display to the basic alphabetical presentation.
Here we mention only some of the national and international standards that have built steadily on the basic rules set out in TEST:
- UNESCO Guidelines for the establishment and development of monolingual thesauri. 1970
- DIN 1463 Guidelines for the establishment and development of monolingual thesauri. 1972
- ISO 2788 Guidelines for the establishment and development of monolingual thesauri. 1974
- ANSI American National Standard for Thesaurus Structure, Construction, and Use. 1974
- ISO 5964 Guidelines for the establishment and development of multilingual thesauri. 1985
- ANSI/NISO Z39.19 Guidelines for the construction, format, and management of monolingual thesauri. 1993
- ISO 25964 Thesauri and interoperability with other vocabularies. Part 1 published 2011; Part 2 published 2013.
Purpose
In information retrieval, a thesaurus can be used as a form of controlled vocabulary to aid in the indexing of appropriate metadata for information-bearing entities. A thesaurus helps to express the manifestations of a concept in a prescribed way, which improves precision and recall. Because indexers and searchers use the same language, the concepts expressed by information-bearing entities are easier to locate. Additionally, a thesaurus maintains a hierarchical listing of terms, usually single words or bound phrases, that helps the indexer narrow the terms and limit semantic ambiguity. The Art & Architecture Thesaurus, for example, is used by many museums around the world to catalogue their collections. AGROVOC, the thesaurus of the UN's Food and Agriculture Organization, is used to index and/or search its AGRIS database of worldwide literature on agricultural research.
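How a shared preferred term helps both indexer and searcher can be sketched in Python. Everything here is an assumption made for illustration: the vocabulary, the document identifiers, and the function names `to_preferred`, `build_index`, and `search` are hypothetical and do not describe how the Art & Architecture Thesaurus, AGROVOC, or AGRIS are implemented. For simplicity, a word with no lead-in entry is treated as its own preferred term.

```python
# Hypothetical lead-in entries: non-preferred word -> preferred term.
USE = {"automobiles": "cars", "autos": "cars", "bikes": "bicycles"}

# Hypothetical hierarchy: preferred term -> narrower preferred terms.
NARROWER = {"vehicles": ["cars", "bicycles"]}


def to_preferred(word):
    """Map a word to its preferred term; other words map to themselves."""
    word = word.lower()
    return USE.get(word, word)


def build_index(documents):
    """Index each document under the preferred form of its keywords."""
    index = {}
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index.setdefault(to_preferred(keyword), set()).add(doc_id)
    return index


def search(index, query, include_narrower=False):
    """Look up a query term, optionally expanding it to its narrower terms."""
    term = to_preferred(query)
    terms = [term]
    if include_narrower:
        terms += NARROWER.get(term, [])
    hits = set()
    for t in terms:
        hits |= index.get(t, set())
    return hits


# Hypothetical documents, each described with a different synonym.
documents = {
    "doc1": ["automobiles"],
    "doc2": ["Cars"],
    "doc3": ["bikes"],
}
index = build_index(documents)
```

With this data, `search(index, "autos")` finds both `doc1` and `doc2`, although neither was described with the word "autos": indexer and searcher converge on the preferred term "cars", which improves recall. Calling `search(index, "vehicles", include_narrower=True)` uses the hierarchy to retrieve all three documents, whereas the same query without expansion finds none.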