You have probably heard IAs discussing the benefits of their latest taxonomy project and how you should be implementing one. But how, you might wonder, can you get started?
This article describes a process for building your own controlled vocabulary (CV). A previous article discussed the concept of a CV—the “what.” This article focuses on the “how.”
In this article we are looking at a process for creating any kind of controlled vocabulary. While our ultimate goal in this series is to explain facets, the details of facet analysis will be described in a future article. At this point, we are still exploring fundamental concepts and techniques.
There are many ways to create a controlled vocabulary. What follows is just one methodology. Also, keep in mind that many of the steps described here are not discrete units. When you actually create a CV, some steps may overlap.
Now, let’s get started. Imagine we are a company that sells camping gear, and we want to create a controlled vocabulary for our ecommerce site.
1. Develop a strategy. What do you want your controlled vocabulary to do?
The natural inclination when developing a CV is to start by gathering potential terms. But first, you need to consider a wide range of questions. Creating a clear plan early on can save you a lot of trouble down the road and minimize unwelcome surprises. The broad strokes of CV design are like any other type of design: planning and preparation are essential, fundamental steps in producing a good design.
First, what kind of CV do you need? The answer depends on a variety of issues. Start by thinking about some general questions such as these:
- What do you want your CV to accomplish?
- Do you want the CV to integrate with your navigation system?
- Are you planning on using the CV to improve searching? To improve browsing? Both?
- Are you planning to show term relationships in your search results?
- How much vocabulary control do you want to provide? Synonym ring? Facets? What level of vocabulary control is appropriate?
Second, think about your dependencies:
- Content – Consider this in two parts: specificity and stability.
Specificity: If you are selling camping gear, are you selling 7-10 styles of tent or 100 styles? If you are selling 100 styles, you will need terms that are more specific and more exhaustive. This is because you will need to further differentiate among tents that are similar. The more items that are similar, the more specific you need to be.
Stability: Do the concepts and names for them change often? Do people generally call the same concept (or item or product) by the same name? In our example, we would ask if there are a lot of variant terms for the kinds of items we’re selling. What will be your method for keeping up with changing terminology?
- Technology – There are two pieces to this one: tools and integration. Each will help you think about implementation early on.
Tools: Think about where the CV will ultimately sit. Do you have a CMS that will be involved? Will you be uploading your CV into a search engine? What software will you use to hold your terms: a thesaurus maintenance program like MultiTes, Term Tree, or Lexico? Or will you be creating it in Excel? Also consider tools you might use while gathering your terms. Many people collect their terms in a large Excel spreadsheet, others on Post-it Notes; sometimes even a wiki works nicely.
Integration: How will your CV be integrated with the other pieces of your system? If the CV is going to be used in multiple applications, you need to consider the requirements of each. Be sure you talk to someone in IT and outline what your goals are.
- Users – CV design is a user-centered process. You must understand the target audience before setting down your terms. Who is the target audience for the site? The general public? Experienced campers? Are they web-savvy? How do they shop? Do they tend to buy one item at a time or several items at once? Do they need to do a lot of research before they buy? In other words, good, standard user-centered design methods, such as interviews and observation, are appropriate.
- Maintenance – Who from the organization will maintain your controlled vocabulary? What amount of time can they spend on this task? What is their training? If you decide to create a highly complex controlled vocabulary that your high school intern is going to maintain, you will have to provide additional training for that person. This is also a user-centered design issue, but along a different axis. Above we talked about a process that is extroverted: it looks towards the external users of the system. Here our axis is introverted: it looks towards the internal users of the system, the creators and maintainers of the vocabulary.
At this point, any normal person will say to himself, “Geez! Enough with the questions! Let me get on with creating my controlled vocabulary!” Resist this urge and stick with the discovery process; developing a strategy is important. You will probably change some of your answers as the project develops, but considering these questions up front will prevent you from wasting time later on.
2. Start gathering terms. What are the terms used to describe your content?
Now you are ready to start gathering your terms. Your goal here, considering the constraints and strategies that came out of Step 1, is to identify the terms that will bring the most success to your user population, enabling them to find exactly the information they need.
This is where the process becomes a little bit like “The Newlywed Game.” In this TV game show, the contestants are newly married couples. While one half of the couple is in a soundproof room, the host asks the remaining partner some intimate questions (often about “making whoopee”). Later, they reunite the couple and ask the other partner the same questions to see how well their answers match up. The couple with the most matched answers wins the big prize. The underlying questions for this game include “How well do the two sides of this relationship know each other?” and “How well can one half of the couple guess the answer the other half will give?”
To win the big prize of increased content findability, your site must describe your content in the terms that best match those terms the users are likely to use. When your partner (the user) comes out of that soundproof booth, you want to feel confident that you have provided the terms he will use on your site.
There are lots of great ways to get started with this process.
A. Look inward. What are the terms you already use to describe items on your site? If you are selling something, what are you selling? Look at each item and start generating terms to describe the object. What are the concepts the terms cover? List them. If we were doing a thesaurus for camping gear, we might start with something like: backpacks, tents, bug spray, etc. Then consider alternative terms you might use for each item.
Consider the level of granularity you want to use to reach your target audience or need to use based on the number of similar items you sell. If the target audience for your camping gear CV is beginning campers, you might distinguish thick sleeping bags from your thinner options by making a distinction by season (as in “winter” and “summer” bags). However, if you are targeting expert campers, you may need to describe your bags as “2-season” or “3-season” bags, in terms of insulating material (goose down, Polarguard, PrimaLoft), or by the temperature ratings. You don’t need to describe the entire field of camping gear; you need only describe your content in terms that will resonate with your target audience.
There is a danger here, however. Don’t look inward and exclude the additional options for gathering terms described below. It is important to get outside of your own understanding of terms and their concepts. Be sure to follow the next steps as well.
B. Look outward. Where are people using terms related to your content? You might review competitors’ sites, journals or magazines on your subject matter, or discussions by subject experts on the web. For example, if you are looking for terms about camping gear, you might look here:
Look at the sites on the list and note how they describe items that you also sell. Are there relevant variant terms you didn’t include from the looking inward step?
Consider the differences between REI and MEC (Mountain Equipment Co-op, a Canadian outdoor equipment store). Note the differences and similarities between the terms they use. In this example, we have shown only the terms for their top-level categories; you should dig deeper and find out what terms they use for sub-categories and individual items.
Sometimes, someone may already have developed a similar controlled vocabulary that you can use or modify. When this happens, we recommend that you perform an exuberant dance of joy. This won’t work for our camping gear example, but if you are building a large controlled vocabulary on another topic, you might want to see if you could borrow from one of the controlled vocabularies here:
More than likely, you will need to simplify anything you use from one of these lists, but they might be worth reviewing. Often, just the exercise of reviewing other CVs can be helpful in discovering ways to improve your own.
But be careful. Borrowing terms from other sites can muddy your own particular site’s strategy. Don’t borrow so much that your message gets confused or loses distinction.
C. Log files. If you already offer search, an easy option is to review your log files. Log files are goldmines of valuable customer information. They will give you an idea of what people think they might find on your site, as well as the words they use to describe what they are looking for. If you can get the file to display search results (as in 8 hits, 0 hits, etc.), you can see how successful people are. Or, reproduce the searches yourself to determine if people are getting relevant hits. See how Nordstrom’s benefited from this technique.
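As a minimal sketch of this kind of log review, assume your search log has already been exported as pairs of query and hit count. That format is hypothetical; real logs vary by search engine. The snippet tallies how often each query appears and which queries return zero hits.

```python
from collections import Counter

# Hypothetical sample: (query, number of hits) pairs exported from a search log.
# Real log formats differ by search engine; adapt the parsing to yours.
SAMPLE_LOG = [
    ("sleeping bag", 8),
    ("Sleeping Bag", 8),
    ("mummy bag", 0),
    ("bug spray", 3),
    ("insect repellent", 0),
    ("mummy bag ", 0),
]

def normalize(query):
    """Lowercase and collapse whitespace so trivial variants count together."""
    return " ".join(query.lower().split())

def summarize(log):
    """Return (query counts, zero-hit query counts) for a list of (query, hits)."""
    counts = Counter()
    zero_hits = Counter()
    for query, hits in log:
        q = normalize(query)
        counts[q] += 1
        if hits == 0:
            zero_hits[q] += 1
    return counts, zero_hits

counts, zero_hits = summarize(SAMPLE_LOG)
for query, n in zero_hits.most_common():
    print(f"{query!r}: {n} zero-hit searches")
```

Frequent zero-hit queries are candidate variant terms: people are asking for something in words your site does not yet recognize.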
D. Ask people. Is there a way to ask users what they look for on your site? How would they describe your site’s contents?
Throughout Step 2, you are building into your CV what librarians call “user warrant.” This means that a term “is justified for inclusion in an index (or CV) only if it is of interest to the users of the information service.” (Lancaster, 26). Your CV will have high user warrant if the terms you include are real terms that people use to describe your content. If you include a lot of terms you suspect people might use, but that did not actually show up during your research, you will lower the user warrant. You are taking a risk: You may be unnecessarily muddying your CV.
At the end of this process you should have a large number of terms describing your site’s content.
3. Establish preferred terms, variants and hierarchies. How do the pieces fit together?
After Step 2, we are left with what is essentially a big bucket of unrelated terms. Now we start to put like terms together and identify each one’s relationships. For each term, ask: What is the broader (more general) term? What are the narrower (more specific) terms? If you are using terms to establish a navigation system, is this a preferred term or a variant? Your controlled vocabulary will start to come together as context is added to each term.
Using our camping gear example, a traditional CV notation for the terms we have collected about sleeping bags might look like this:
Sleeping Bags
BT Camping Equipment
NT Down Sleeping Bags
NT Synthetic Sleeping Bags
NT Family Sleeping Bags
NT Cold Weather Sleeping Bags
NT 2-Season Sleeping Bags
NT 3-Season Sleeping Bags
NT Ultralight Sleeping Bags
(BT = broader term; NT = narrower term)
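If you end up holding terms in code rather than a spreadsheet, the same notation maps onto a small data structure. The sketch below is only one possible shape, with "Sleeping Bags" assumed as the head term of the record above; the `Vocabulary` class and its method names are made up for illustration and do not come from any thesaurus tool.

```python
from collections import defaultdict

class Vocabulary:
    """Tiny in-memory store for broader/narrower term relationships."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms
        self.narrower = defaultdict(set)  # term -> its narrower terms

    def add_broader(self, term, broader_term):
        # Record both directions so BT and NT always stay reciprocal.
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def entry(self, term):
        """Return the term's record as lines of BT/NT notation."""
        lines = [term]
        lines += [f"BT {t}" for t in sorted(self.broader[term])]
        lines += [f"NT {t}" for t in sorted(self.narrower[term])]
        return lines

cv = Vocabulary()
cv.add_broader("Sleeping Bags", "Camping Equipment")
for nt in ["Down Sleeping Bags", "Synthetic Sleeping Bags", "Ultralight Sleeping Bags"]:
    cv.add_broader(nt, "Sleeping Bags")

print("\n".join(cv.entry("Sleeping Bags")))
```

Because each term keeps a set of broader terms, this sketch also allows a term to sit under more than one parent, should you decide that is what you want.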
Some in your group might say, “Hey, sleeping bags should go under Backpacking Equipment, not Camping Equipment.” A perfectly good assertion. Somehow, you will need to decide this issue. Can “Sleeping Bags” be in both places? Should the term live in one place in the CV with a cross-reference from the other location? Maybe there is a distinction among different kinds of sleeping bag that you had not previously considered.
It might be a good time to do some research. For instance, ask yourself, “How do REI and MEC describe their sleeping bags?”
MEC does it like this:
REI takes a completely different approach:
The differences are striking. The main ones include the following:
- Depth: The most obvious distinction is how REI goes for increased depth, whereas MEC uses a shallower category set.
- Term Choice: REI uses the general term “Sleeping Gear,” whereas MEC uses “Sleeping Bags.” What’s interesting is that both sites classify terms for related materials—pillows, stuff sacks, and so on—as narrower terms, yet only REI uses the more generic term “Sleeping Gear” to describe this breadth.
- Broader Terms: REI has “Sleeping Gear” as a narrower term under the top-level term “Camp/Hike.” MEC also has a similar top-level term—“Hiking/Camping Gear”—but instead of making “Sleeping Bags” a narrower term they put it at the same level.
- Bags and Pads: MEC puts sleeping pads as a narrower term under sleeping bags. REI doesn’t put them below sleeping bags, but at the same level in the hierarchy.
Which is better? That’s difficult to say. REI is more sophisticated in their categorization, probably because of their larger product line. While REI’s scheme is more sophisticated, it’s also more complicated, so perhaps the simplicity of the MEC approach is better. Most likely, these differences are the result of differing strategies.
Our intention here is not to suggest which is better, only to show how even a simple situation can give many alternative answers. Certainly one can find much to like about these schemes. But in each case, improvements can be made. They are muddled. Concepts are mixed and matched haphazardly. There are questions about scalability and future directions. Material, temperature, gender, and age are combined in surprising and inconsistent ways. For example, why does MEC put sleeping pads as a narrower term of sleeping bag when they are obviously related, yet distinct items? And we’re still confused about the distinction between these two terms in the REI scheme: “Kids’ Camping Bags” and “Kids Backpacking Bags.”
We will return to this example in a future article showing how facets can clarify this situation. But let’s not get too far ahead of ourselves.
For now, the question is: How do you clarify these issues? How do you make these difficult decisions? Making these decisions can quickly get messy in a group environment. Perhaps you need to ask a smaller team to consider the question and report back to the larger group. Doing some analysis, as we did with MEC and REI, and looking at your own strategy should help clarify what it is you want to do. However you decide your questions, be sure to note why you made the decision you did (for more on this, see Step 5).
We have been arguing that a good CV design process is essentially a user-centered process. Getting feedback from users will give you a great deal of insight into the problems we have raised.
A simple and commonly used method of getting feedback is called card sorting. Find some people whom you consider to be your target users. Give them cards with examples of items for sale on your site and ask them to arrange them into groups of like objects, or objects that they believe should be together. Then ask them to label their groups of cards. Look for patterns among their responses, compare the results to your original content labels, and make any necessary adjustments. For some good additional materials on card sorting, see the IA Wiki. Yes, it really is that simple and effective.
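One simple way to look for patterns in card sort results is to count how often participants put the same two cards in one group. The sketch below assumes each participant's sort has been typed up as a mapping from their group label to the cards in that group; the sample data and names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical results: each participant's sort is a dict of label -> cards.
SORTS = [
    {"Sleeping": ["down bag", "sleeping pad"], "Shelter": ["tent", "tarp"]},
    {"Bags": ["down bag"], "Camp setup": ["tent", "tarp", "sleeping pad"]},
    {"Sleep gear": ["down bag", "sleeping pad"], "Tents": ["tent", "tarp"]},
]

def pair_counts(sorts):
    """Count how many participants put each pair of cards in the same group."""
    counts = Counter()
    for sort in sorts:
        for cards in sort.values():
            # Sort the cards so each pair is always stored in the same order.
            for pair in combinations(sorted(cards), 2):
                counts[pair] += 1
    return counts

counts = pair_counts(SORTS)
for (a, b), n in counts.most_common():
    print(f"{a} + {b}: grouped together by {n} of {len(SORTS)} participants")
```

Pairs that nearly everyone groups together are good candidates for sharing a broader term. Pairs that split participants are the ones worth discussing with your team.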
4. Identify the “see also” terms. What else might be interesting to your target audience?
In most cases, related terms need to be identified only for large projects. If you are working on an ecommerce site, related terms are a way to connect products that people might buy at the same time. In other words, you need to identify places where interest in one item might lead to interest in another. If your site users are buying camping boots, do they need socks? If they are buying backpacks, would they be interested in water bottles? Often, these are what the Polar Bear book calls “contextual navigation” (116-118).
To get you started here, think about these possible relationships when considering related terms:
- process/agent (camp fires/matches);
- action/product of action (baking/cakes);
- agent/counteragent (allergies/antihistamine);
- raw material/product (wool/sweater).
Putting this idea of cross-selling into traditional CV notation might look something like this:
Sleeping Bags
NT Down Sleeping Bags
NT Synthetic Sleeping Bags
NT Family Sleeping Bags
NT Cold Weather Sleeping Bags
NT 2-Season Sleeping Bags
NT 3-Season Sleeping Bags
NT Backpacking Sleeping Bags
NT Expedition Class Sleeping Bags
NT Ultralight Sleeping Bags
RT Ultralight Backpacking
RT Sleeping Bag Liners
RT Sleeping Pads
RT Stuff Sacks
(RT = related term)
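Thesaurus conventions generally treat these relationships as reciprocal: if one term lists another as an RT, the second should list the first, and every NT should have a matching BT. When records are maintained by hand, a small check can catch missing back-links. The sketch below assumes a hypothetical record layout, with each term mapped to its BT, NT, and RT lists; it is an illustration, not the format of any particular tool.

```python
# Hypothetical hand-maintained records: term -> {"BT": [...], "NT": [...], "RT": [...]}
RECORDS = {
    "Sleeping Bags": {"NT": ["Down Sleeping Bags"], "RT": ["Sleeping Pads", "Stuff Sacks"]},
    "Down Sleeping Bags": {"BT": ["Sleeping Bags"]},
    "Sleeping Pads": {"RT": ["Sleeping Bags"]},
    "Stuff Sacks": {},
}

# Each relationship type and the type expected on the other side.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}

def missing_reciprocals(records):
    """List (term, relation, target) links whose inverse link is absent."""
    problems = []
    for term, relations in records.items():
        for relation, targets in relations.items():
            for target in targets:
                back = records.get(target, {}).get(INVERSE[relation], [])
                if term not in back:
                    problems.append((term, relation, target))
    return problems

for term, relation, target in missing_reciprocals(RECORDS):
    print(f"{target} lacks {INVERSE[relation]} {term}")
```

In this sample, "Stuff Sacks" never points back to "Sleeping Bags", so the check reports that one link.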
What constitutes a related term? That is something for you to decide. Try to strike the right balance between suggesting options and overwhelming a user with choices. You might want to run the card sorting exercise again, this time giving people a list of items on cards and asking, for each item, if there are any objects from your inventory that they might look for when purchasing it. Adjust your CV accordingly.
5. Establish a record of the rules you are using if you are creating a large thesaurus.
I suspect most CV creators do not take the time to do this, and that is unfortunate. Remember all those decisions about what term goes where? Review the decisions you made and record what the decision was and why you made it. This will enable you to maintain consistency as your CV changes and expands. This makes your system easier to learn, and consequently, training your staff is easier. This is especially important for keeping categories pure if multiple people will be adding terms to content. It also makes for better decision-making in the future.
I am reminded of an interview with cellist Yo-Yo Ma who told some students, “If you make specific choices in the music, we hear them.” He added later in the class, “If you don’t make specific choices, we don’t hear them.” This is as true for the actions of a controlled vocabulary as it is for a piece of music. Be aware of the assumptions you are making and make them conscious choices; users will “hear” them.
Some possible questions to consider here are: When do you include a new term? What constitutes a relationship or RT? When do you delete terms? What is the basis for choosing a preferred term? When are terms singular or plural? Nouns or verbs? How will you deal with punctuation?
One place to look for issues you might want to consider is the ANSI/NISO standard for thesaurus construction. Reviewing these guidelines and deciding what is relevant to your particular situation will help ensure the best possible outcome for your CV creation process. Now is also a great time to review the assumptions you made in Step 1.
6. Implement your controlled vocabulary.
This step is difficult to write about because implementation is extremely dependent on your specific context. The other steps are not easy, but in the real world implementation is often the most difficult. It is also something the literature on CVs rarely tackles in a meaningful way. For now, we will take the metaphorical 50,000-foot view.
If you are using your controlled vocabulary for developing a menu for navigation or categories for browsing, continue your user testing. At this stage you can present a more complete version for users to evaluate. If you have completed some testing earlier, this should involve only minor changes to your CV.
If you are using your controlled vocabulary for searching, get ready for more work: Tweaking the algorithms for a search engine is a difficult job involving lots of tradeoffs. It will also require a good relationship with your IT staff (good thing you started this already in Step 1!). A lot of difficult decisions will need to be made. Examples include how you use punctuation, Boolean operators (when to use AND and when to use OR connectors), and recall versus precision. Multiple word terms can sometimes be difficult (if your CV term is “walking staff” and the user enters “Walking Staff Wood,” does he get any variant terms for “walking staff?”). Your solution will depend on the search engine you are using, the audience, the content, and the tradeoffs you need to make to get your project up and running.
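To make the multiple-word problem concrete, here is one possible approach, sketched under simplifying assumptions: scan the query for the longest phrase that matches a CV term, map it to its preferred term, and keep any leftover words. The variant table and the `expand_query` function are hypothetical, and a real search engine will have its own mechanism for this.

```python
# Hypothetical variant -> preferred term mapping taken from a CV.
VARIANTS = {
    "walking staff": "walking staff",
    "hiking stick": "walking staff",
    "trekking pole": "walking staff",
    "bug spray": "insect repellent",
}

def expand_query(query, variants=VARIANTS):
    """Find CV terms inside a query, longest phrase first, and return
    (preferred terms matched, leftover words)."""
    words = query.lower().split()
    max_len = max(len(v.split()) for v in variants)
    preferred, leftover = [], []
    i = 0
    while i < len(words):
        # Try the longest possible phrase starting at word i, then shorter ones.
        for size in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in variants:
                preferred.append(variants[phrase])
                i += size
                break
        else:
            # No phrase starting here is a CV term; keep the word as-is.
            leftover.append(words[i])
            i += 1
    return preferred, leftover

print(expand_query("Walking Staff Wood"))
```

With this approach, "Walking Staff Wood" still resolves to the CV term "walking staff", with "wood" left over for the search engine to handle as an ordinary keyword. Whether leftover words should narrow the results (AND) or widen them (OR) is one of the tradeoffs described above.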
7. Test and evaluate.
You have done some testing during the CV creation process; now it is time to make sure the assumptions you have made throughout the process are correct when you consider the implementation as a whole.
Start with yourself. Use the site to find various types of information based on assumptions you made earlier. Can you identify which content goes in which slot pretty easily? Can you search and get the results you expect? If using your CV to improve searching, enter a term and carefully look at the first page of returns. Are these the results you want your users to get for this search term?
After you feel like the CV is working as you believe it should, contact some outsiders and ask them to use your site. Do your terms reflect the concepts these people are searching for? Are they getting the results they expect? Are your terms too broad or too narrow? Remember, you are not always going to be successful. This is another time to keep the 80/20 rule in mind.
8. Go back and refine. What can be improved?
A controlled vocabulary is never finished. The goal of the initial creation of your CV is simply to create a system for controlling vocabulary that is agile, easy to update, consistent in both scope (what is covered) and granularity (how deeply it is covered), and helps users find what they are looking for.
However, maintenance is required to keep your CV viable and usable. Constant monitoring, evaluation, and tweaking are critical. This may require daily reviews of search logs, regular testing with users, regular conversations with subject specialists, or other analysis. One of the arguments against using a controlled vocabulary is that it requires so much time to maintain that it doesn’t keep up with the changing terminology of the given field. Therefore, constant analysis is key to success. The list of improvements you can imagine needing to make will always be long, but don’t lose sight of the smaller, daily “housecleaning” tasks.
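A daily log review can be as small as listing the repeated queries that match nothing in your CV. The sketch below assumes you can get the day's queries as a plain list and your CV terms (preferred and variant) as a set. Both inputs, and the threshold of two occurrences, are illustrative choices only.

```python
from collections import Counter

# Hypothetical inputs: terms already in the CV (preferred and variant),
# and one day's search queries pulled from the log.
CV_TERMS = {"sleeping bags", "tents", "insect repellent", "bug spray"}
QUERIES = ["tents", "bivy sack", "Bivy Sack", "bug spray", "bivy sack", "hammock"]

def uncovered_queries(queries, cv_terms, min_count=2):
    """Return queries seen at least min_count times that match no CV term."""
    # Lowercase and collapse whitespace so trivial variants count together.
    counts = Counter(" ".join(q.lower().split()) for q in queries)
    return [(q, n) for q, n in counts.most_common()
            if n >= min_count and q not in cv_terms]

print(uncovered_queries(QUERIES, CV_TERMS))
```

Anything this turns up is a candidate for a new term or variant, which you would then weigh against the rules you recorded in Step 5.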
There is a lot of talk about how controlled vocabularies improve a site’s information architecture. If you decide to create one, however, it is important to realize that an effective controlled vocabulary involves regular maintenance. Doing it right will keep you aware of the dynamic developments of your content and close to the language of your users and their information needs.
- Cooper, Alan (1999). The Inmates are Running the Asylum: Why High-Tech Products Drive Us Crazy and How to Restore the Sanity. SAMS publishing: Indianapolis, IN.
- Lancaster, F.W. (1986). Vocabulary Control for Information Retrieval (2nd Edition). Information Resources Press: Arlington, VA.
- Rosenfeld, Louis, & Morville, Peter. (2002). Information Architecture for the World Wide Web: Designing large scale web sites. (2nd Edition). O’Reilly & Associates: Sebastopol, CA.
- An Annotated Bibliography
Karl Fast is a PhD student in library and information science at the University of Western Ontario. He also has a master’s in LIS. His graduate work has included courses on organization of information, subject analysis, thesaurus construction, and facet analysis.
Fred Leise, president of ContextualAnalysis, LLC, is an information architecture consultant providing services in the areas of content analysis and organization, user experience design, taxonomy and thesaurus creation, and website and back-of-book indexing.
Mike Steckel is an Information Architect/Technical Librarian for International SEMATECH in Austin, TX.