The most effective communication occurs when all parties involved agree on the meaning of the terms being used. Consequently, finding the right words to communicate the message of your website can be one of the most difficult parts of developing it.
When we converse, we speak in “natural language.” This is language in all its raw, rich, gooey glory. When we organize and label our information, however, there is so much richness, variance, and confusion in terminology that we often need to impose some order to facilitate agreement between the concepts within the site and the vocabulary of the person using it.
This order can come through a controlled vocabulary. Amy Warner defines controlled vocabularies (CVs) as “organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.” This means that a CV is a type of metadata that functions as a “subset of natural language” (Wellisch); it is not how we normally speak. Using a CV is also a way to overtly display relationships among the various concepts that your site covers in order to increase findability. The most basic, and most often overlooked, form of controlled vocabulary is a consistent labeling system. If you are careful to call the same thing, or the same concept, by the same name everywhere on your site, you are using a very simple controlled vocabulary. You are also helping your users start to develop a mental model of the information they can find.
A controlled vocabulary is a way to insert an interpretive layer of semantics between the term entered by the user and the underlying database to better represent the original intention of the terms of the user. Consider what happens when you do not use a controlled vocabulary. An uncontrolled vocabulary simply uses the natural language of the documents and matches that with the natural language of the user. This is extremely specific, and it gives users exactly what they ask for. Sounds great, right? Consider, however, a site about chemistry, where many of the documents use the chemical name of the element (“iron”), and many use the chemical symbol of the element (“Fe”). Using an uncontrolled vocabulary, the results will include only documents that contain the exact term the user entered. A user who enters “Fe” in the search box will not get any of the documents that use the term “iron.” There is a good chance that user is missing some documents they would like to have. Very few users will enter both terms, and many will review their results believing they are seeing all of the relevant documents.
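The “Fe” versus “iron” problem can be sketched in a few lines of Python. This is only an illustration of exact-term matching, not a real search engine; the document ids and texts are invented for the example.

```python
# Hypothetical mini-collection; ids and texts are invented for illustration.
DOCUMENTS = {
    "doc1": "Fe rusts when exposed to moist air",
    "doc2": "iron is the most common element on Earth by mass",
    "doc3": "Smelting iron ore produces pig iron",
}


def uncontrolled_search(query, documents):
    """Return ids of documents containing the query as a whole word (case-insensitive)."""
    wanted = query.lower()
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if wanted in text.lower().split()
    )


print(uncontrolled_search("Fe", DOCUMENTS))    # ['doc1']
print(uncontrolled_search("iron", DOCUMENTS))  # ['doc2', 'doc3']
```

Each query returns only the documents that use that exact term, so a user who searches for “Fe” never sees the two documents about iron.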
The equivalence relationship
You probably are aware of certain categories or items on your site that might go by multiple names. You realize that if you said “automobiles” on your homepage and “cars” on the next page, users might get confused. Users will start to wonder if there is a difference between the two terms. Instead you choose “automobiles” and don’t use “cars” at all. In this case “automobiles” is the term you prefer to use throughout your site. We call this the “preferred term.” “Cars” is a variant term, a different word representing the same concept. Or, consider this example:
Here, each term refers to the same concept, Elizabeth Taylor (your preferred term). We could tell our system, “when people ask for ‘Elizabeth Burton,’ use ‘Elizabeth Taylor.’” This is more traditionally expressed using standard CV notation as:
Elizabeth Fortensky USE Elizabeth Taylor
Elizabeth Taylor UF Elizabeth Fortensky (UF = Used For)
Or even this:
Liz Taylor USE Elizabeth Taylor
Elizabeth Taylor UF Liz Taylor
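One way to picture the USE and UF notation is as two views of the same lookup table. The sketch below is a minimal illustration in Python; the function names are hypothetical, and it assumes each variant term points to exactly one preferred term.

```python
from collections import defaultdict

# USE entries from the notation above: variant term -> preferred term.
USE = {
    "Elizabeth Fortensky": "Elizabeth Taylor",
    "Liz Taylor": "Elizabeth Taylor",
}


def build_used_for(use_entries):
    """Invert USE entries into UF entries: preferred term -> sorted list of variants."""
    used_for = defaultdict(list)
    for variant, preferred in use_entries.items():
        used_for[preferred].append(variant)
    return {preferred: sorted(variants) for preferred, variants in used_for.items()}


def preferred_term(term, use_entries):
    """Resolve a term to its preferred form; terms with no USE entry come back unchanged."""
    return use_entries.get(term, term)


UF = build_used_for(USE)
print(preferred_term("Liz Taylor", USE))  # Elizabeth Taylor
print(UF["Elizabeth Taylor"])             # ['Elizabeth Fortensky', 'Liz Taylor']
```

Because the UF entries are derived from the USE entries, the two directions cannot drift out of sync when a new variant is added.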
Think about Gap’s web page (http://www.gap.com). We already know what they sell (they have excellent branding), and most of their content is generally referred to by the same terms as used in our general culture. In other words, people consistently say “jeans,” “pants,” and “shirts.” Even though you might get the occasional person using the word “dungarees” or “slacks,” nearly everyone would see “jeans” and know what the category referred to (the visuals help support this too). Furthermore, Gap does not carry hundreds of pairs of jeans that must somehow be distinguished from one another. If you examine the natural language people use when talking about Gap’s products, there’s an unusually small amount of term variance. Content like this works great in the very simple organization system used on the Gap site. It works so well that they do not even need to offer search; this is very unusual for an ecommerce site. What they have is a system in which all of the concepts are consistently labeled using language familiar to their users. They’re lucky. Few sites have the option to work in this way.
Let’s say, however, that gap.com decided to offer search. Then they would somehow need to translate the natural language of search into the controlled language of the website. People search in the same language they speak, natural language, so a more advanced controlled vocabulary needs to take the concepts of your users (natural language) and match them to the concepts expressed in the language of your website (controlled vocabulary). That means if the developers of the site began to see that people were searching for “dungarees” and getting zero hits, they would need to create a way to tell the system, “when someone searches for ‘dungarees,’ give them the results for ‘jeans.’” In the language of a controlled vocabulary, “jeans” becomes the preferred term and “dungarees” is a variant term, and they have an equivalence relationship. This can be a powerful tool for increasing findability.
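The “dungarees” scenario might look something like the following Python sketch. It is not how any real site implements search; the catalog contents, the `search` function, and the zero-hit counter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical catalog indexed by preferred term; product names are invented.
CATALOG = {
    "jeans": ["Classic fit jeans", "Boot cut jeans"],
    "shirts": ["Oxford shirt"],
}

# Variant term -> preferred term (the equivalence relationship).
VARIANTS = {"dungarees": "jeans"}

# Queries that returned nothing: candidates for new variant terms.
zero_hit_queries = Counter()


def search(query):
    """Translate the query to its preferred term, then look it up in the catalog."""
    entered = query.strip().lower()
    term = VARIANTS.get(entered, entered)
    results = CATALOG.get(term, [])
    if not results:
        zero_hit_queries[entered] += 1
    return results


print(search("Dungarees"))  # ['Classic fit jeans', 'Boot cut jeans']
print(search("slacks"))     # []
print(zero_hit_queries)     # Counter({'slacks': 1})
```

Reviewing the zero-hit counter from time to time is one way to discover which variant terms the vocabulary is still missing.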
There are many examples of the situations that alternate terms cover. Here are a few:
- synonyms (two words with the same meaning, like “jeans” and “dungarees”)
- homonyms (words that sound the same, but have different meanings, like “bank” the financial institution and “bank” the side of a stream or river)
- common misspellings
- changes in content (e.g., countries that change their name or have multiple spellings)
- identifying “Best Bets” or the most popular pages associated with a certain term (http://www.BBC.com is great at this)
- connecting a woman’s married name to her maiden name
- connecting abbreviations to the full word (e.g., NY and New York, the chemical symbol Si with the element Silicon)
There are two types of synonym equivalence lists: synonym rings and authority files. Synonym rings are generally used behind the scenes in search to connect the various terms for a concept. A ring can be used to say, “when someone searches for ‘Si,’ give them all documents containing either ‘Si’ or ‘Silicon.’” However, what happens when you want to display one of these terms in your navigation? Then you will need to pick one to be your preferred term, and now you have an authority file. In each of the above examples, different terms may be used, but each one represents the same concept. They are tied together and given meaning by making their equivalence relationship explicit.
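The difference between a synonym ring and an authority file can be shown with two small data structures. This is a rough Python sketch under simple assumptions: terms are compared in lowercase, each term belongs to at most one concept, and the function names are hypothetical.

```python
# Synonym rings: sets of equivalent terms with no preferred member.
SYNONYM_RINGS = [
    {"si", "silicon"},
    {"ny", "new york"},
]

# Authority file: the same equivalences, plus one preferred term per concept.
AUTHORITY_FILE = {
    "Silicon": {"si", "silicon"},
    "New York": {"ny", "new york"},
}


def expand_query(term, rings):
    """Return every term in the ring containing `term`, or just the term itself."""
    key = term.lower()
    for ring in rings:
        if key in ring:
            return sorted(ring)
    return [key]


def display_label(term, authority_file):
    """Return the preferred term to show in navigation, or None if the term is unknown."""
    key = term.lower()
    for preferred, variants in authority_file.items():
        if key in variants:
            return preferred
    return None


print(expand_query("Si", SYNONYM_RINGS))    # ['si', 'silicon']
print(display_label("NY", AUTHORITY_FILE))  # New York
```

The ring is enough to widen a search, but only the authority file can answer the question “which label do we show?”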
Hierarchical relationships: broader and narrower terms
If your content is more complex, for instance if you sold only pants and you had hundreds of types, you might require more from your controlled vocabulary. The natural language we use to describe the concept of “pants” quickly enlarges as “pants” becomes more specific. In other words, “slacks,” “khakis,” “jeans,” “trousers,” “corduroys,” and other kinds of pants will all need to be differentiated so users don’t have to rummage through pages and pages of search results for the word “pants,” when pants are your whole inventory.
What will help is a systematic way to map out the different terms so people quickly find the specific kinds of pants they are interested in. What you need is a hierarchy showing the broader terms (BTs), the narrower terms (NTs), and the variant terms (most often displayed as “USE” and “UF” for Used for). These will show which terms are subsets of larger, broader concepts. You are starting off with a jumble of words that are all related to “pants” in some way. It might look something like Figure 2.
We have a bucket we can call “Pants” and inside are a lot of terms with a relationship to the concept of pants. In this example, “pants” is the broader term, and the kinds of pants refer to subsets of the whole universe of pants. In a controlled vocabulary, we might reconfigure the chart above to look like this:
This is what people are increasingly calling a taxonomy. This term makes traditional librarians a little uncomfortable, but we are learning to live with it. Originally it was a term for biological classifications (genus, species, etc.), but it has quickly become a standard word for describing hierarchies.
The standard CV notations used to express hierarchical relationships are NT (narrower term) and BT (broader term). Using this notation, the term “Women’s Pants” would be expressed like this:
Women’s Pants
NT Casual Pants
NT Dress Pants
NT Sports Pants
There is a lot you can do with this hierarchical arrangement. It can help you formulate your homepage navigation. It could improve your searching and browsing. It can help users broaden and narrow their search results quickly by showing them where each set of results fits into the site’s hierarchy (see Keith Instone’s “attribute breadcrumbs” for more examples). Generally, few sites need to go beyond the level of a taxonomy, but it might be useful to see the next level of complexity in controlled vocabularies.
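The NT and BT notation, and the broadening and narrowing it makes possible, can be sketched as a small tree in Python. This is an illustration only: the “Pants” parent of “Women’s Pants” is an assumption, the function names are hypothetical, and the sketch assumes each term has a single broader term.

```python
# Narrower-term (NT) entries. "Women's Pants" and its NTs come from the text;
# the "Pants" parent is assumed for illustration.
NARROWER = {
    "Pants": ["Women's Pants"],
    "Women's Pants": ["Casual Pants", "Dress Pants", "Sports Pants"],
}


def build_broader(narrower):
    """Derive BT entries by inverting NT entries (one broader term per term)."""
    return {nt: bt for bt, nts in narrower.items() for nt in nts}


def all_narrower(term, narrower):
    """Return every term below `term` in the hierarchy, depth-first."""
    found = []
    for child in narrower.get(term, []):
        found.append(child)
        found.extend(all_narrower(child, narrower))
    return found


def breadcrumb(term, broader):
    """Walk BT links upward and return the path from the top term down to `term`."""
    path = [term]
    while path[-1] in broader:
        path.append(broader[path[-1]])
    return list(reversed(path))


BROADER = build_broader(NARROWER)
print(all_narrower("Pants", NARROWER))
# ["Women's Pants", 'Casual Pants', 'Dress Pants', 'Sports Pants']
print(" > ".join(breadcrumb("Dress Pants", BROADER)))
# Pants > Women's Pants > Dress Pants
```

Walking down the NT links narrows a result set, and walking up the BT links gives the breadcrumb trail that shows users where they are.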
Associative relationships: related terms
How far can I extend the pants example? Oh, quite far. Let’s say that you are a research institute that studies pants (ridiculous, I know, but stay with me). You study not only pants themselves, but also the materials they are made from, their history, how they are manufactured, and more. Your institute might do well to take the time to develop what Peter Morville has called the “Rolls Royce of controlled vocabularies”: a thesaurus. A thesaurus shows all of the relationships described so far (BT, NT, and UF), but will also include related terms (RT). This is an associative relationship. It shows how one term is associated with another.
If a user looked to your institute for research on jeans, you would be able to give them that term embedded in a rich series of relationships. An example of the range of relationships would be expressed like this using the standard format for thesauri:
Jeans
UF Waist Overalls
RT Denim
Denim is related to Jeans, but not hierarchically. It is not a type of jeans, nor is one a subset of the other. Yet someone interested in one term might be interested in the other because they are related concepts. In the interface, you might identify “Denim” as a “see also” option for “Jeans.” If users looked for the term “Denim” in the thesaurus they might see something like this:
Denim
NT Ring Spun
NT Dark Indigo
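A thesaurus entry that carries all four relationship types could be modeled roughly as follows. This Python sketch uses only the terms mentioned in the text; the `Entry` class and function names are hypothetical, and it assumes related-term links are kept reciprocal, so that if “Jeans” lists “Denim” as an RT, “Denim” also lists “Jeans.”

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One thesaurus term with its UF, BT, NT, and RT relationships."""
    term: str
    used_for: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


THESAURUS = {
    "Jeans": Entry("Jeans", used_for=["Waist Overalls"], related=["Denim"]),
    "Denim": Entry("Denim", narrower=["Ring Spun", "Dark Indigo"]),
}


def make_related_symmetric(thesaurus):
    """Ensure that if A lists B as a related term, B also lists A."""
    for entry in list(thesaurus.values()):
        for other in entry.related:
            target = thesaurus.get(other)
            if target is not None and entry.term not in target.related:
                target.related.append(entry.term)


def format_entry(entry):
    """Render an entry in the UF / BT / NT / RT notation used above."""
    lines = [entry.term]
    lines += [f"UF {t}" for t in entry.used_for]
    lines += [f"BT {t}" for t in entry.broader]
    lines += [f"NT {t}" for t in entry.narrower]
    lines += [f"RT {t}" for t in entry.related]
    return "\n".join(lines)


make_related_symmetric(THESAURUS)
print(format_entry(THESAURUS["Denim"]))
# Denim
# NT Ring Spun
# NT Dark Indigo
# RT Jeans
```

The `related` list is what an interface would draw on to offer “see also” links for a term.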
The Denim example alone could be filled with many additional terms, and it is easy to see how well this would accommodate user browsing (and “berrypicking”). This is also one of the dangers of creating associative relationships: knowing when to stop. The associative relationship is also the most difficult and subjective of all the relationships in a CV, because you are identifying a connection between two concepts that may not be immediately obvious. When an Amazon product page identifies an item that others have purchased along with the one being displayed, Amazon is identifying a potentially useful associative relationship.
To push the concept a little further, if a user is interested in a paper from your pants institute on the “Hemingway wore khakis” advertisement from Gap, they might also be interested in a paper you have on how it was really Rock Hudson’s subtle use of khakis that made “A Farewell to Arms” such a great movie. The connection between the two documents, the intersection of the concepts of “Hemingway” and “khakis,” is less direct than the Denim example above. This is expanding the concept of “related terms” further than many would be prepared to go, but it is an option.
Internal uses of controlled vocabularies
So far, we have focused on how controlled vocabularies help the user, but there are also benefits to the organization using the CV. Here are a few:
- CVs can help with category analysis or keeping your categories distinct.
- CVs can help establish a site’s navigation.
- CVs can be the basis for personalization features.
- CVs can help with preparation for CMS or knowledge management projects, since many of these require this sort of structure to your content to do their magic.
- CVs get the organization using the same language as the users (which should result in better communication with them).
- CVs can help the organization (and the user) understand what concepts your site covers. Your controlled vocabulary is in reality a “concept map” of what is on your site.
While controlled vocabularies can be powerful, by themselves they are not the magic pill that will cure what ails your site. CVs are a lot of work, they are often difficult and time consuming to maintain, and they can be very political. Some skepticism toward all metadata is a healthy thing (everyone still reading this should see Metacrap). As with anything important, there are a lot of people who are doing it loudly and badly.
Human beings are natural makers of patterns. That is how we understand what our senses are taking in. When people visit your site, they will immediately begin trying to understand what they see. A well-designed and regularly updated controlled vocabulary can help connect the concepts your users have in their heads to the concepts you present on your site. That is when real communication will occur.
Next in the series: How to create a controlled vocabulary.
- Wellisch, Hans. Indexing from A to Z. New York: H.W. Wilson, 1995. p.214
- Amy J. Warner Taxonomy Primer
- Keith Instone’s attribute breadcrumbs
- ASIS Summit 2001
- An Annotated Bibliography
Karl Fast is a PhD student in library and information science at the University of Western Ontario. He also has a master’s in LIS. His graduate work has included courses on organization of information, subject analysis, thesaurus construction, and facet analysis.
Fred Leise, president of ContextualAnalysis, LLC, is an information architecture consultant providing services in the areas of content analysis and organization, user experience design, taxonomy and thesaurus creation, and website and back-of-book indexing.
Mike Steckel is an Information Architect/Technical Librarian for International SEMATECH in Austin, TX.