This is Part 3 in our continuing series on controlled vocabularies and faceted classification. Previous parts in the series include:
- All About Facets and Controlled Vocabularies (series introduction)
- 1. What is a Controlled Vocabulary?
- 2. Creating a Controlled Vocabulary
As any connoisseur of duct tape knows, when you need to get a job done, the simplest tool is often your best friend. This is as true for controlled vocabularies (CVs) as it is for home repair. Remember that our goal for CVs is to “impose some order to facilitate agreement between the concepts within the site and the vocabulary of the person [natural language] using it.”
But that doesn’t mean the CV has to be complicated. Resources do not always allow for a full-fledged thesaurus, and often such a large undertaking is not necessary. Synonym rings and authority files are simple tools that can bridge the gap between natural language and complex controlled vocabularies (taxonomies and thesauri) quite nicely. We can explain how synonym rings work by way of an example.
International SEMATECH, the semiconductor research consortium, had a searching problem. Documents were uploaded to a private research website in a highly decentralized manner. Member company employees from all over the world had the ability to upload their own research documents and meeting presentations to the website.
A look at the search logs, however, revealed that people were entering search terms that yielded only a fraction of the documents they were trying to find. The problem was inconsistent terminology. A review of the metadata found that people uploading documents were as likely to call silicon “Si” as to spell out the whole name, “silicon.” There were many similar examples. Besides chemical symbols, users were both searching for and uploading documents with acronyms (“PSM” vs. “Phase Shift Mask”) and simple spelling variants (“low K dielectrics” vs. “low-K dielectrics” vs. “lowk dielectrics”).
The way the system previously worked, a user who searched for “Si,” “PSM,” or “low K dielectrics” would get only exact matches. In other words, they would miss documents that had “Silicon,” “Phase Shift Mask,” or “low-K dielectrics” in their metadata. Furthermore, they would get enough hits that they might not realize some relevant documents were missing (had they gotten zero hits, they might have suspected something was wrong and tried another term).
It was our assumption that when users searched one term, they intended to find the entire set of documents related to that concept. But trying to get such an organization to adopt a style guide for metadata was not viable. The solution was to install a synonym ring into our search engine, Oracle Text.
What the synonym ring does
A synonym ring connects a series of terms together and treats them all as equivalent for search purposes. When a user enters “PSM,” for instance, the search term will be sent through the synonym ring to see if there are any equivalent terms. For “PSM” we would find “Phase Shift Mask” as a synonym. The search engine would then retrieve all documents with either “PSM” or “Phase Shift Mask” in their metadata. The searcher would get the complete set of relevant documents as though they had searched both terms (something few people would think to do).
If there is no match in the synonym list, the search is simply sent through the index as usual and any documents with “PSM” are returned. The synonym ring goes into effect only when there is a matching synonym for the term entered into the search box by the user.
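The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Oracle Text configuration used at International SEMATECH: the ring contents come from the examples in this article, and the names `SYNONYM_RINGS`, `normalize`, `expand_query`, and `search` are hypothetical.

```python
# Hypothetical synonym rings built from the examples in this article.
# Every term in a ring is treated as equivalent to the others.
SYNONYM_RINGS = [
    {"si", "silicon"},
    {"psm", "phase shift mask"},
    {"low k dielectrics", "low-k dielectrics", "lowk dielectrics"},
]

# Map each term to the ring that contains it, for quick lookup.
RING_INDEX = {term: ring for ring in SYNONYM_RINGS for term in ring}


def normalize(term):
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def expand_query(query):
    """Return every term that should be searched for the given query."""
    term = normalize(query)
    ring = RING_INDEX.get(term)
    if ring is None:
        # No synonym ring matches: search the term as entered.
        return [term]
    return sorted(ring)


def search(query, documents):
    """Return ids of documents whose keywords match any expanded term.

    `documents` maps a document id to its list of metadata keywords.
    """
    terms = set(expand_query(query))
    return sorted(
        doc_id
        for doc_id, keywords in documents.items()
        if terms & {normalize(keyword) for keyword in keywords}
    )
```

With documents tagged “PSM” and “Phase Shift Mask,” a call such as `search("PSM", documents)` would return both, while a term that belongs to no ring is searched exactly as entered.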
Although getting a synonym ring up and running sounds pretty simple, the difficulties often come from trying to answer a deceptively simple question: “What is a synonym?” The example above was a clear case of synonyms: an acronym and the full name of the object. It is not always this simple. Synonyms are generally words or phrases with the same or very similar meanings. But how similar is similar enough? True synonyms are a rare thing.
What is a synonym?
Some synonyms may appear to be pretty straightforward. These include:
- Acronyms: BBC, British Broadcasting Corporation; MPG, miles per gallon
- Variant spellings: cancelled, canceled; honor, honour
- Scientific terms versus popular use terms: acetylsalicylic acid, aspirin; Lilioceris, lily beetle
But synonyms, in general, quickly become more difficult. Are “medicine” and “drugs” synonyms? Are “fired” and “laid off”? What about “forest” and “woods” or “arid” and “dry”? With these examples, it is more difficult to say for sure. To answer the question about whether two terms are synonyms, you often have to consider the overall content of your site, as well as the site’s context and its users.
In our first article, we gave the following example of a synonym (which demonstrates the equivalence relationship):
But one could easily argue that these are not true synonyms. You may be looking for information about Elizabeth Taylor only during the time she was married to Larry Fortensky. In this case “Elizabeth Fortensky” might be the only part of the ring you would be interested in. Expanding the results by including results for both “Elizabeth Warner” and “Elizabeth Fortensky” would reduce the precision of the search results.
When creating a synonym ring, or any controlled vocabulary, you will spend a lot of time evaluating near synonyms. What guidelines should you use for making these decisions?
Recall and precision
Information architecture in the real world is all about the tradeoffs, right? Librarians have long been aware of the tradeoffs one makes between a search system that is broad and one that is specific. A search system that is broad is one with high recall, while one that is very narrow is one with high precision. Let’s look at these two terms a little more closely.
Recall is often represented as a ratio:
number of retrieved relevant documents / all relevant documents in a collection
Recall measures how many of the relevant documents are returned to the user. When you are searching a system with high recall, you are able to get a comprehensive set of documents returned, but you increase the possibility that less relevant documents will also get returned. This is great when you want to look through a large number of documents to make sure you have seen everything on a certain topic. Techniques for increasing recall include a synonym ring, stemming (some search engines will automatically return “jumping” and “jumps” when someone searches “jump”), and wildcards.
Precision, like recall, is often represented as a ratio:
number of retrieved relevant documents / total number of documents retrieved
You want to return all relevant documents to each user. So why not return all documents in your system for every search? That way you can be sure that every single relevant document is returned to the user, right? Well, true, but you’re also returning many irrelevant documents at the same time, making it harder for users to find what they want.
Precision measures how many of the returned documents are relevant. When you are searching a system with high precision, your results are specific to your search. This suits a known-item search, where you want only relevant results and are less tolerant of irrelevant results mixed in.
You can increase search precision by using specific indexing terms (“Ferrari” and not “sports car”), little or no stemming, word proximity operators (how closely words appear next to each other), and search zones.
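To make the two ratios concrete, here is a small worked example in Python. The document ids and relevance judgments are invented for illustration, and the function names `recall` and `precision` are our own; the sketch assumes you already know which documents are relevant, which is rarely true in practice.

```python
def recall(retrieved, relevant):
    """Retrieved relevant documents / all relevant documents in the collection."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Retrieved relevant documents / total number of documents retrieved."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Invented example: four documents in the collection are relevant.
relevant = {1, 2, 3, 4}

# A narrow search returns two documents, both relevant.
narrow = {1, 2}

# A broad search returns all four relevant documents plus four others.
broad = {1, 2, 3, 4, 5, 6, 7, 8}

print(recall(narrow, relevant), precision(narrow, relevant))  # 0.5 1.0
print(recall(broad, relevant), precision(broad, relevant))    # 1.0 0.5
```

The narrow search is precise but misses half of the relevant documents; the broad search finds them all but buries them among an equal number of irrelevant ones.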
Measuring the recall and precision of a particular search engine with hard numbers is difficult, and the results are questionable. Relevance is hard to quantify because it is subjective and inconstant (it may change even during the course of a single search).
A better way to get a handle on precision and recall is to collect responses from your users. What do people complain about? Do they say, “I get too many results”? This really means, “I get too many irrelevant results,” and is a sign your recall might be too high. Do people say, “I know it is in there, but I can’t find it,” or “I get no hits for too many searches”? If so, your precision might be too high. Just remember, recall and precision tend to be inversely related: as one goes up, the other usually goes down. You will need to strike a balance.
Authority files
So now that we know what a synonym ring is, we can define an authority file. An authority file is similar to the synonym ring, with the addition of one type of term relationship. Instead of all of the terms being equal, one term is identified as the preferred term and the others are considered variant terms.
Authority files help with tagging content consistently. Catalogers for large library collections have long used authority files to find approved terms for describing an item. When they get a book about the Italian city of Firenze and another one about the Italian city of Florence, they use one of the names (based on prescribed rules) and describe all books in the collection about the city using a single, consistent term.
Similarly, in most major academic libraries, all books about “Native Americans” and “American Indians” are described with the term “Indians of North America.” When someone performs a subject search on “Native Americans” they get a note that says something like “This term is indexed as INDIANS OF NORTH AMERICA.” The authority file is the place you go to find which term is the heading (the main term) and which term is the cross reference (the variant term).
A more typical example on a website might work like this: Let’s say you have a website devoted to comic books. It would be great if when someone typed “Caped Crusader” or “Dark Knight” into the search box, they got results for “Batman.” In this example, “Batman” and “Caped Crusader” would not be considered equivalent terms; the authority file would explain their relationship. You would not want to identify each Batman comic book with all three terms, just the main term. But when a user entered “Caped Crusader,” you would want the system to convert their term to “Batman” and return the appropriate results.
The relationships among the terms could be expressed like this:
Or in the language of a controlled vocabulary, like this:
Batman
USED FOR: Dark Knight, Caped Crusader
Caped Crusader
USE Batman
Dark Knight
USE Batman
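The USE relationships above could be applied at search time roughly as follows. This is a minimal Python sketch under our own assumptions: the `AUTHORITY_FILE` structure and the names `build_use_index` and `resolve` are hypothetical, and a real search engine would have its own way of storing preferred and variant terms.

```python
# Hypothetical authority file: each preferred term lists its variant terms.
AUTHORITY_FILE = {
    "Batman": ["Dark Knight", "Caped Crusader"],
}


def build_use_index(authority_file):
    """Map every term, preferred or variant, to its preferred term."""
    index = {}
    for preferred, variants in authority_file.items():
        index[preferred.lower()] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


USE_INDEX = build_use_index(AUTHORITY_FILE)


def resolve(term):
    """Return (term_to_search, was_changed) for a user's search term."""
    key = " ".join(term.lower().split())
    preferred = USE_INDEX.get(key)
    if preferred is None:
        # Not in the authority file: search the term as entered.
        return term, False
    return preferred, preferred.lower() != key
```

Unlike a synonym ring, the lookup runs in one direction only: `resolve("Caped Crusader")` yields “Batman,” so content needs to be tagged with the preferred term alone. The second value in the result tells the interface whether to show the user that their term was converted.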
Another way that people use authority files is to reinforce a correct term and to discourage an incorrect term. The Polar Bear Book uses the example of how drugstore.com corrects the spelling of Tylenol using an authority file. If you enter “tilenol” into the site’s search box, you get the results for “Tylenol.” Users will see the correct spelling prominently displayed, which will remind them how the word is really spelled. Maybe they will remember the correct spelling in the future.
Guidelines for implementation
When putting a synonym ring or authority file in place, consider the following guidelines:
- Show users how their search term was changed or added to by the system and exactly what was searched. At International SEMATECH, a search for Silicon would look like this:
The line under the search box tells users exactly how the search was submitted. When users can see how their term was expanded to include synonyms, they have a better understanding of how the site works. When done well, explanations can also increase users’ confidence in the system, since they show that the system understands what users are looking for.
- Keep the display simple. Include a search box at the top of the page so users can edit their terms if they see they have made a mistake. Try to follow the prescient words of the old poem:
Give me a look, give me a[n inter]face,
That makes simplicity a grace;
Ben Jonson (slightly modified) [http://www.bartleby.com/100/146.9.html]
- Try to characterize your content and the way your users understand it. At International SEMATECH, the majority of the synonym ring we use is made up of acronyms, since the scientific community seems to love creating and communicating with them. The content is also very narrow and scientific. There is not a great deal of the mushy language that comes from the general culture; most of it is very well defined. A general rule: The broader the content your site covers, the more you will find yourself dealing with near synonyms. Try to make similar evaluations of the content you are searching.
- Review search logs every day to look for new terms and synonyms. Is someone looking for an acronym that is not on your list? Try to find out what it means and make sure the next person looking for it gets the correct results.
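The log review in the last guideline can be partly automated. The sketch below, in Python, assumes the search log is available as a plain list of query strings and that you keep a set of the terms already covered by your synonym ring or authority file; the function name `unknown_terms` and its parameters are hypothetical.

```python
from collections import Counter


def unknown_terms(logged_queries, known_terms, top_n=10):
    """Return the most frequent logged queries that are not yet known.

    `known_terms` is a set of lowercase terms already covered by the
    synonym ring or authority file.
    """
    counts = Counter(" ".join(query.lower().split()) for query in logged_queries)
    unknown = [
        (term, count)
        for term, count in counts.most_common()
        if term and term not in known_terms
    ]
    return unknown[:top_n]


# Invented log entries for illustration.
log = ["PSM", "EUV", "euv", "Si", "CMP", "EUV"]
known = {"psm", "phase shift mask", "si", "silicon"}

print(unknown_terms(log, known))  # [('euv', 3), ('cmp', 1)]
```

A report like this only surfaces candidates. Someone still has to find out what each term means and decide whether it belongs in the vocabulary.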
Conclusions
Synonym rings and authority files are simple, common-sense ways to help users connect the various semantic concepts that are inherently intertwined with the term they choose. They are particularly good for large decentralized sites that are search dominant and have little centralized control over content.
Most of us know by now that users tend to use a small number of words for each search. They should not be forced to consider all the synonyms their search terms might have. Tim Bray said it well: “If you need to know about cow farming, you’re probably also searching for cattle ranching, beef (or dairy) production, and Kuhbauernhof, whether you know it or not.”
- All About Facets and Controlled Vocabularies (series introduction)
- What is a Controlled Vocabulary?
- Creating a Controlled Vocabulary
- An Annotated Bibliography
Karl Fast is a PhD student in library and information science at the University of Western Ontario. He also has a master’s in LIS. His graduate work has included courses on organization of information, subject analysis, thesaurus construction, and facet analysis.
Fred Leise, president of ContextualAnalysis, LLC, is an information architecture consultant providing services in the areas of content analysis and organization, user experience design, taxonomy and thesaurus creation, and website and back-of-book indexing.
Mike Steckel is an Information Architect/Technical Librarian for International SEMATECH in Austin, TX.
If you are interested in this article, Peter Van Dijck has just written on some of the same ideas, including some great ideas for implementation using MySQL:
Better Search Engine Design: Beyond Algorithms
http://www.onlamp.com/pub/a/onlamp/2003/08/21/better_search_engine.html
“Most of us know by now that users tend to use a small number of words for each search.”
2.5, in fact. http://www.personal.psu.edu/faculty/a/h/ahs4/Docs/IEEE2002.pdf