Content Analysis Heuristics


Most website designers are aware that an important part of understanding the background of any website redesign project is performing a content inventory as well as a content analysis.

After all, authorities Lou Rosenfeld and Peter Morville include this famous Venn diagram in their classic Information Architecture for the World Wide Web:


Clearly, we are supposed to understand the current website content before we begin the process of redefining and reorganizing the website.

So we all dutifully go through the website and prepare a content inventory spreadsheet capturing page titles, details of page content, and so on.

“Sample content inventory spreadsheet”

Each content inventory contains a different set of columns and fields; each has a purpose specific to the needs of the particular site being analyzed. Sarah Rice has developed another example (XLS) that’s available as part of the IA Institute’s tools project.

Sarah’s version captures additional information from the site, uses an indented format for capturing page titles at different hierarchical levels, and uses color coding to indicate content types, external links, and open questions.

So doing a content inventory is all well and good, but what exactly is it about the content that we are supposed to understand? What are we supposed to tell our client, other than that the website has 4,321 pages, of which 358 are dead-ends, 427 have no page titles, 27 have content that has expired, there are 432 different document templates in use, and there are 17 distinct document types?
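Summary numbers like these are easy to pull out of an inventory once it is in machine-readable form. Here is a minimal Python sketch; the page records and field names are hypothetical, not a prescribed inventory schema.

```python
# Hypothetical sketch: tallying summary statistics from a content inventory.
# The records and field names below are invented for illustration.
pages = [
    {"url": "/about", "title": "About Us", "template": "standard", "links_out": 4, "expired": False},
    {"url": "/legacy/old-promo", "title": "", "template": "promo-2001", "links_out": 0, "expired": True},
    {"url": "/products", "title": "Products", "template": "standard", "links_out": 12, "expired": False},
]

total = len(pages)
dead_ends = sum(1 for p in pages if p["links_out"] == 0)   # pages with no outgoing links
untitled = sum(1 for p in pages if not p["title"])          # pages missing a title
expired = sum(1 for p in pages if p["expired"])             # pages past their shelf life
templates = len({p["template"] for p in pages})             # distinct templates in use

print(f"{total} pages, {dead_ends} dead-ends, {untitled} without titles, "
      f"{expired} expired, {templates} templates in use")
```

The point of the tally is not the raw numbers themselves but having them at hand when the heuristics below ask sharper questions about the content.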

In her 2002 article on rearchitecting the PeopleSoft website [1], Chiara Fox noted that document inventories and analyses form part of bottom-up IA. “It deals with the individual documents and files that make up the site, or in the case of a portal, the individual sub-sites. Bottom-up methods look for the relationships between the different pieces of content, and uses metadata to describe the attributes found. They allow multiple paths to the content to be built.”

Certainly content relationships are important, as is the development of appropriate metadata to describe content, but are there specific things we can look for during a content inventory? In the remainder of this article, I hope to show that the answer is a resounding “Yes.”

Content Analysis Heuristics

In the fall of 2006, I was working on a navigation taxonomy project for a major media industry client that was redesigning its public-facing website. It was while preparing the content analysis report for that client that I developed the following set of 11 heuristics for analyzing website content.

  • Collocation
  • Differentiation
  • Completeness
  • Information scent
  • Bounded horizons
  • Accessibility
  • Multiple access paths
  • Appropriate structure
  • Consistency
  • Audience-relevance
  • Currency

These heuristics provide an important way to organize my report and help me identify significant problems that I might not otherwise notice. They provide qualitative results and indicate general trends, but are not statistically valid in the strict sense.

While you can use these heuristics for any kind of website or intranet, regardless of size or content, certain heuristics may be less applicable to some sites. For example, a game site designed to encourage users’ exploration may not present bounded horizons; in fact, it would be doing gamers a disservice to reveal the entire game path from the start. So some evaluation is necessary as to whether (and how strongly) a specific heuristic should apply to the site you are designing.

Each of these heuristics will be discussed in detail in turn.


Collocation

Bring together items with similar content or items about the same topic in one area.

Users should be able to find all relevant content easily. Accordingly, collect related content in one area, or at the least, make it accessible through one area. While the exact way content is related may differ (e.g., by document type, by subject, by author, by date), the information that users will want to find in one place should be in one place.

Obviously, if the quantity of content is large enough, users may have to visit different subsections to view all of the related content. In that case, the content organization itself should make it easy for users to understand how different areas are related. When those areas are viewed together, they will provide a unified picture of the product or subject of interest.

The important point here is to not have “dangling” content that lives in one area perhaps because of historical growth of the website, while most of the related content is accessible in another area.


Differentiation

Place dissimilar items or items about different subject areas in different content areas. Use navigation labels for different areas that clearly indicate those differences.

One of the typical ways that websites break this guideline is in the use of Frequently Asked Questions. FAQs often bring together a wide variety of topics on issues that are important for users. Perhaps website creators think they are making it easier for users to find information when they put everything “important” in one place.

The problem for the user is that their search for specific information becomes like looking for the proverbial needle in a haystack. Unless FAQs use a well-thought-out topical arrangement, users may have to read through every question in a long list to find the particular information they are looking for. How much better it would be to separate this content into meaningful sections!

The World Bank’s website is one example of an organized set of FAQs. They use four main topics and clearly identify secondary subject areas for each. Yet even this example is not totally successful in using a good topical arrangement, as the “Ask the Expert” section contains the usual miscellany of important information without topic differentiation.

“World Bank Website: FAQ Section, February 2007”


Completeness

All content mentioned or linked to should exist.

In this day and age, there is no excuse for the 404 Error, Page Not Found. Nor is there any excuse for the “Under Construction” sign on a page. If the content doesn’t exist, don’t lead the user to where it might be sometime in the future.

If you mention a related topical area, be sure that content is actually on the website. Directing users to non-existent information simply breaks their trust in the website.
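A completeness check of this kind is easy to automate. The sketch below is hypothetical: the checker takes a status-lookup function as a parameter, so it can be pointed at a live site (say, via HTTP HEAD requests) or, as here, at stubbed-out test data.

```python
# Minimal sketch of a completeness check: every internal link should
# resolve to real content. `fetch_status` is injected so the checker can
# run against a live site or against test data.
def find_broken_links(urls, fetch_status):
    """Return the URLs that do not resolve to an existing page."""
    return [url for url in urls if fetch_status(url) in (404, 410)]

# Example with a stubbed site map (real use would issue HTTP requests):
site = {"/home": 200, "/products": 200, "/coming-soon": 404}
broken = find_broken_links(site, lambda url: site.get(url, 404))
print(broken)  # ['/coming-soon']
```

Running a check like this as part of regular maintenance keeps 404s and “Under Construction” placeholders from ever reaching users.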

Information Scent

Content labels should be appropriately descriptive of content so that users know they are on the proper path to finding the information they are looking for. Content labels should therefore also reflect information collocation and differentiation.

The idea of information scent was first developed by Peter Pirolli, Stuart K. Card, and Mija M. Van Der Wege of the famous Xerox Palo Alto Research Center (PARC). In their paper [2], they note that, “Information scent is provided by the proximal cues perceived by the user that indicate the value, cost of access, and location of distal information content. In the context of foraging for information on the World Wide Web, for example, information scent is often provided by the snippets of text and graphics that surround links to other pages. The proximal cues provided by these snippets give indications of the value, cost, and location of the distal content on the linked page.”

Simply put, a good website will provide users with strong clues as to the content that can be found by clicking on a specific link.

In his Alertbox column of June 30, 2003, Jakob Nielsen says, “ensure that links and category descriptions explicitly describe what users will find at the destination. Faced with several navigation options, it’s best if users can clearly identify the trail to the prey and see that other trails are devoid of anything edible.

“Don’t use made-up words or your own slogans as navigation options, since they don’t have the scent of the sought-after item. Plain language also works best for search engine visibility: searching provides a literal match between the words in the user’s mind and the words on your site.”


Bounded Horizons

A site’s users should be able to easily understand the breadth of content they are looking at.

While a labyrinthine website that leads users along a single, linear path through groves of rambling information might be appropriate for a conceptual artist’s site, such a principle for organizing content is useless in most cases.

Users should be able to identify in relatively short order the depth and breadth of relevant content.
Providing good navigation cues and a strong hierarchical structure when appropriate means that users quickly learn how long their search for information may take. They can thus make an informed decision whether to continue content exploration on your site or to abandon ship and continue elsewhere.


Accessibility

Users should be able to access the content they want through the browsing hierarchy or by using search.

It may seem obvious, but I have seen sites where search is so poor and navigation hierarchy so limited that it is hit-or-miss whether users can find what they seek. Often, information is hidden by contextual links to content areas not exposed in the main navigation. You are no doubt devoting considerable time and effort to creating content. Let users find it.

Multiple Access Paths

Because users think about content in different ways, they should be able to take multiple paths to get to specific content.

Facets provide one of the important ways to provide multiple paths to content. I’m looking for a blue coat to go with my gray suit. Or I want a wool sweater, because my cotton one won’t cut it in Boulder, Colorado. My wife says it has to be Prada. Size, color, material, designer: each can be the most important way for someone to find an item or some content.

While faceted navigation schemes are often useful for e-commerce sites, they can also be especially useful for information-rich sites. You might provide search filters by document type, date, or author in addition to subject. For scientists, methodology or researcher often become more important than subject in finding relevant research papers.

Multiple access paths provide greater findability for more users.
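The facet idea can be made concrete in a few lines. The catalog and facet names below are invented for illustration; any attribute users care about can serve as a facet.

```python
# Sketch of faceted access: the same catalog reached by color, material,
# or designer. The item data is hypothetical.
catalog = [
    {"name": "blue wool coat", "color": "blue", "material": "wool", "designer": "Prada"},
    {"name": "gray cotton sweater", "color": "gray", "material": "cotton", "designer": "Gap"},
    {"name": "blue cotton coat", "color": "blue", "material": "cotton", "designer": "Prada"},
]

def by_facets(items, **facets):
    """Return items matching every requested facet value."""
    return [i for i in items
            if all(i.get(f) == v for f, v in facets.items())]

print([i["name"] for i in by_facets(catalog, color="blue")])
print([i["name"] for i in by_facets(catalog, designer="Prada", material="wool")])
```

Because each facet is just another attribute, adding a new access path (date, document type, author) costs nothing in the data model; the work is in deciding which facets your users actually think in.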

Appropriate Structure

Organization of content should (1) match users’ mental models of the information space and (2) support the differences in users’ information-seeking behaviors: known-item finding; exploratory browsing; unknown information finding; and refinding.

Whether you have multiple access paths or a single hierarchy, the organization and structure of your site should be appropriate to both the nature of the content and to your users.
As with many of these heuristics, there is no single “best” approach. Rather, based on your knowledge of business context, users, and content, determine whether content access structures are valid for the specific context.


Consistency

Whenever possible, content structures in similar content areas should be consistent.

If all of your products have accessories, they should be accessible through similar links or tabs or icons. Consistency enables users to more quickly build a mental model of your site and to understand how to find information.

Think of the rather complex page structure on Amazon.com for a book:

  • Cover illustration
  • Title
  • Author
  • List price
  • Savings
  • Availability
  • Delivery information
  • New/used copies
  • Customers also bought
  • Editorial reviews
  • Product details
  • What customers ultimately buy
  • Help others find this book
  • Customer tags
  • Customer reviews
  • Customer discussion
  • Listmania
  • Recently viewed items
  • Similar items by category
  • Similar items by subject

Who in their right mind would create such a structure? Obviously, people who did lots of research on their users. Why does this structure work? Because once we have seen it, we know that we will see it again and again and again. Power users of Amazon.com probably know exactly how many turns of their mouse’s scroll wheel it takes to get to the information they want.

This book product page may be long and complex, but it is consistent in structure and format. We know what to expect. A good website provides users with a consistent experience.


Audience Relevance

Content organization allows different audience segments to easily find relevant content.

This heuristic is especially important if your site’s audience comprises multiple distinct segments, such as holiday travelers and business travelers, or students and faculty. In some cases, it might be appropriate to use audience segment as the primary way to organize information.

Additionally, audience relevance may be legally mandated. Drug websites, for instance, are governed by FDA regulations dictating that prescribing information should be available only to health-care professionals, not the general public.

However, even with a relatively unitary audience, you want to be sure that the site’s labeling system is appropriate for its users. It is also important that the site mirror how users think about the site’s content.


Currency

Content should be kept up to date.

Nothing frustrates a user more than finding that the information you provide is out of date: you don’t make that product any more, that color is out of stock, or that drug is no longer indicated for that condition.

Put an expiration date on all content through your CMS, thus ensuring that it is reviewed for currency on a regular basis. That is a good way to ensure that your website provides users with information that is still valid.

Another way to ensure currency is to have a good website maintenance plan in place. Such a plan should cover, among other things: who is responsible for content reviews, extraordinary internal and external events that should automatically trigger a content review, and how users or content authors can suggest a review.
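A review trigger of the kind a CMS would run can be sketched in a few lines; the field names and dates below are hypothetical, not a particular CMS’s API.

```python
# Sketch of a CMS-style currency check: each page carries a review-by
# date, and anything past due is flagged for the maintenance team.
from datetime import date

def pages_due_for_review(pages, today):
    """Return URLs whose review date has arrived or passed."""
    return [p["url"] for p in pages if p["review_by"] <= today]

content = [
    {"url": "/pricing", "review_by": date(2007, 1, 1)},
    {"url": "/team", "review_by": date(2008, 6, 30)},
]
due = pages_due_for_review(content, today=date(2007, 3, 1))
print(due)  # ['/pricing']
```

The mechanism is trivial; the discipline of assigning a review date to every piece of content at creation time is what makes it work.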


Although the above eleven heuristics provide good qualitative information, you may find it helpful to add a five-point scale (derived from the Likert scale), indicating how well the site under analysis conforms to each heuristic:
1. Strongly deviates from the heuristic
2. Deviates from the heuristic
3. Neither deviates nor conforms to the heuristic
4. Conforms to the heuristic
5. Strongly conforms to the heuristic

Providing such a scale may help the client understand the results of your content analysis better than a purely descriptive report.
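One way to keep such scoring consistent across reports is to encode the scale once and generate the summary lines from it. A small hypothetical sketch (the scores are invented for illustration):

```python
# Sketch: generating per-heuristic summary lines from the five-point
# conformance scale. Scores here are invented for illustration.
SCALE = {
    1: "Strongly deviates from the heuristic",
    2: "Deviates from the heuristic",
    3: "Neither deviates nor conforms to the heuristic",
    4: "Conforms to the heuristic",
    5: "Strongly conforms to the heuristic",
}

scores = {"Collocation": 4, "Differentiation": 2, "Currency": 5}
for heuristic, score in scores.items():
    print(f"{heuristic}: {score} ({SCALE[score]})")
```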

Whether you use a numerical scale in discussing the results or not, provide your client with a written content analysis heuristics report. You can offer the analysis as part of your content inventory or content analysis report, or you can create a separate document entirely. The report should include sections describing your evaluation of the website using each of the heuristics (if applicable). In discussing each heuristic, indicate how well the site meets the heuristic in general and then note instances for improvement, or places where the site does not conform to the heuristic at all.

The following are several sections from an actual content analysis report that used these heuristics (modified to mask the company’s identity).

Although [company].com is relatively good at gathering like content into one area, there are a number of exceptions. For example, information on money and vacations is available as a content sub-area under both the Money and the Vacations topical areas. However, the content is different in each place. In essence, there are two separate areas dealing with the same subject of money and vacations.

The most significant problem area with regard to collocation is the Specials section, which offers much content that would be better distributed among and combined with other areas of the site.

The [company].com Specials section is the primary place where the principle of differentiation is not observed. It combines subject areas such as health, relationships and travel, along with a number of the company’s special projects. Because this content area contains such disparate information, users may not always spend enough time looking through it to find relevant information.

Because labels on the website often reflect a supportive and encouraging emotive vocabulary, those labels sometimes obscure important information. For example, it is doubtful that a user looking at “Tips for Living” would realize that there is information on home decoration and time management in that section.

[Company].com generally supports exploratory browsing and unknown information finding. However, shortcomings in the search results (a limit of only 21 results) sometimes make it difficult for users to find specific information.

[Company].com is not always good at providing access to audience-relevant information. For example, teachers may be totally unaware that classroom videos and teaching aids are available in the Library section.

By arming the client with such information, you give them more well-structured ideas about how to improve their website. And that, after all, is the goal of our work.

[1] Fox, Chiara. “Re-architecting from the bottom-up,” June 16, 2002.
[2] Pirolli, Peter, Stuart K. Card, and Mija M. Van Der Wege. “The Effect of Information Scent on Searching Information Visualizations of Large Tree Structures.”
[3] Nielsen, Jakob. “Information Foraging: Why Google Makes People Leave Your Site Faster.” Alertbox, June 30, 2003.

Improving Usability with a Website Index


Indexes are important information-finding tools that can enhance website usability. They offer easy scanning for finding known items, they provide entry points to content using the users’ own vocabulary and they provide access to concepts discussed, but not named, in the text. Perhaps most importantly, site indexes provide direct access to granular chunks of information without the need for traversing multiple links in a hierarchy.

What are indexes?
Before I explore how website indexes can improve usability, let’s start with some background that will help show how they fit into the broader picture, especially since indexes have more to them than people often assume. According to Nancy Mulvaney, an index is “a structured sequence—resulting from a thorough and complete analysis of text—of synthesized access points to all the information contained in the text.”1

What are the important points about this definition? First, that the index is a sequence; that is, it has a known order of items. While most indexes are arranged alphabetically, other orders are possible, such as numerical (for a parts list index) or chronological (for a timeline). But the index isn’t just a list of entries; it is structured. In other words, the index shows relationships between various subjects, thus leading users to more specific or related topics that might meet their information needs more closely.

Most importantly for the construction of an index, a human has looked at and analyzed the text. Although great strides have been made with the technology, automatic classification tools come nowhere near the human brain in terms of accuracy in evaluating text. There is simply too much contextual meaning that texts carry, too much social and cultural knowledge that while not stated in the text, needs to be accounted for when creating the index. Certainly no computer can yet understand the actual meaning of all texts.

Mulvaney’s final point is that the index comprises access points to all the information contained in the text. An index contains all significant mentions of people, places, things and ideas. Important here is the idea of significance. An index should lead users to relevant material, to significant content chunks that provide useful information, rather than to passing mentions of words.

Thus, indexes are not concordances—lists of every occurrence of every word in a text. This is the primary reason why indexes are, in certain cases, much more valuable than searches. Search results are often overwhelming or even useless; the fact that a word or phrase is mentioned in the text does not mean that the subject is discussed in the text. And it is the discussion that provides information for the user.

How do indexes increase usability?
Indexes, as flat lists of terms, are easily scannable. Users need only use their browser’s scroll bar to navigate through the entire index. (Large indexes often provide alphabetical anchor links at the top of the index, which take users quickly to the portion of the index they need to use.) There are no multiple levels to navigate, nor must users decide which branch of a hierarchy to click on, which often results in their missing information they are looking for or taking longer to find it. In fact, the easy scannability of the index on a single page is an important argument against having separate pages for each letter of the alphabet, whenever possible.
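As a sketch of that single-page pattern, the following Python generates an A-Z index with letter anchors at the top. The entries are hypothetical (label, URL) pairs, not drawn from any real site.

```python
# Sketch: render a single-page A-Z index with letter anchors at the top,
# so users can jump to a section without leaving the page.
entries = [
    ("Alliance partners", "/partners"),
    ("Benefits enrollment", "/hr/benefits"),
    ("Branch locations", "/locations"),
]

letters = sorted({label[0].upper() for label, _ in entries})
# Anchor links at the top of the page: A B C ...
html = ['<p>' + ' '.join(f'<a href="#{c}">{c}</a>' for c in letters) + '</p>']
current = None
for label, url in sorted(entries, key=lambda e: e[0].lower()):
    letter = label[0].upper()
    if letter != current:  # start a new lettered section
        html.append(f'<h2 id="{letter}">{letter}</h2>')
        current = letter
    html.append(f'<a href="{url}">{label}</a><br>')
print('\n'.join(html))
```

Because the whole index lives on one page, the browser’s find-in-page and scroll bar work across every entry at once, which is exactly the scannability argument made above.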

Through the use of multiple access points or “see” references, indexes help translate the vocabulary of the users to that of a text. In this example, for instance:

cancer. See oncology

The index is telling the user that this site does have information on cancer, but that it uses the term “oncology” to represent this concept. And, if users click on the link, the index will bring them directly to the relevant information about that term.

“See also” references can lead users to additional or more specific information that might more closely meet their information needs. Every reference librarian knows that many users come to them with ill-formed queries. “See also” references assist users by helping them think about the information they are looking for.

training. See also online training; web-based training
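One hypothetical way to store and render such cross-references is to keep “see” and “see also” data alongside each term; the structure below is invented for illustration, not a standard index format.

```python
# Sketch: storing and rendering "see" and "see also" cross-references.
# The index data is hypothetical.
index = {
    "cancer": {"see": "oncology"},
    "oncology": {"urls": ["/health/oncology"]},
    "training": {"urls": ["/training"],
                 "see_also": ["online training", "web-based training"]},
}

def render(term):
    """Render one index line in back-of-book style."""
    entry = index[term]
    if "see" in entry:  # a pure redirect: the term itself has no locators
        return f"{term}. See {entry['see']}"
    line = term
    if "see_also" in entry:
        line += ". See also " + "; ".join(entry["see_also"])
    return line

print(render("cancer"))    # cancer. See oncology
print(render("training"))  # training. See also online training; web-based training
```

Note the asymmetry the data model enforces: a “see” entry carries no URLs of its own, because its whole job is to redirect the user’s vocabulary to the site’s.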

Indexes are especially useful in “known-item finding,” those cases where users know specifically what they are looking for (or what information they saw previously and want to get back to). They simply find the term in the index and click on the link to go directly to the information. There is no need to drill down through multiple site levels or try to remember what path they took before.

Indexes can also serve an important function by leading users to concepts discussed but not specifically mentioned in the text. For example, a good indexer analyzing a paragraph that talks about Alpo and Purina Dog Chow might add an index entry for “pet nutrition.” Such intellectual analysis and synthesis adds significant value for users. Automated indexing tools fail at providing this kind of added value.

A site index acts as an important complement to the site map or table of contents. While the latter present the high-level (top-down) organization of information on the site, indexes take the bottom-up view, looking at specific, granular chunks of information.

When should a site index be used?
Clearly, small sites have little need for indexes. Usually the navigation labels and page titles themselves will be enough for users to find the information they need (assuming that labels have been well thought out and provide an appropriate information scent).

For extremely large sites, with millions of pages, including everything in the index would be so time consuming and labor intensive as to be uneconomical. In addition, the resulting index would be almost impossible to scan. However, such sites can be improved and their usability increased by providing an index that directs users to the set of information that is most used or that most users need to do their jobs efficiently.

Most mid-sized sites, with hundreds or thousands of pages, can benefit from the additional navigation that site indexes offer, and can be indexed in a reasonable amount of time at a reasonable cost.

How are website indexes created?
Indexing, no matter what the material under consideration, consists of two steps: first, the content is analyzed to establish indexable concepts; then, terms (or labels) for those concepts are created or selected. In website indexing, the URL for the page on which the information resides is captured and used to turn the index term into a hypertext link. For best results, a human mind needs to do the content analysis.

There is software available, such as HTML Index, that helps automate the index preparation process by spidering a site and creating a preliminary version of an index using page titles and named anchors. The indexer then needs to massage those results to create a truly useful index.

Indexers can also create a site index using regular indexing software. CINDEX, MACREX and Sky Professional are the programs most used by professional indexers to assist with important, but time-consuming housekeeping tasks such as alphabetizing entries, checking spelling or verifying cross references. After the initial index entries have been created, they can then be copied or output (with embedded HTML coding) into a content management system’s index page template for later publishing to the website itself.

That process was the one I used to create the site index for PeopleSoft, Inc.’s website, which won an Australian Society of Indexers Web Index Award 2002–2004. Here, for example, is the simple link code used to create the fourth line of the PeopleSoft site index:

<a href="/corp/en/about/pspartner/apply/apply_partner.asp">Alliance partners, applying to become</a><br>

Special codes (available in most indexing programs) were used to “hide” the HTML coding so that the program alphabetized only the actual index labels themselves.
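The effect of that hiding can be sketched without any indexing software: sort the entries by their visible labels, ignoring the markup. (This illustrates the principle only; it is not how CINDEX or MACREX actually implement it.)

```python
# Sketch: alphabetize index entries by their visible labels while
# ignoring the HTML markup around them.
import re

def sort_key(entry):
    """Strip tags so only the visible label drives the sort order."""
    return re.sub(r"<[^>]+>", "", entry).lower()

entries = [
    '<a href="/zebra">zebra printers</a>',
    '<a href="/apply">Alliance partners, applying to become</a>',
    '<a href="/mono">monitors</a>',
]
ordered = sorted(entries, key=sort_key)
for entry in ordered:
    print(entry)
```

Sorting on the stripped label rather than the raw string matters: otherwise every entry would sort on `<a href=...`, and the href paths, not the labels, would determine the order.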

Having the site index use the conventions of back-of-book indexes, for example, indented subheads, makes it instantly recognizable for users. If they have any familiarity at all with using indexes, they will feel right at home with your site index. And that helps make for good usability.

Creating index labels
Label terms for indexes may be created by one of two different methods, depending on whether indexing is being carried out in a “closed” system or an “open” system.

In the former, nothing other than the text itself needs to be considered. The indexer derives index labels using “literary warrant” from the terminology used in the website itself and adjusts the labels as necessary for whatever reason.

Alternately, in an open system, the indexer selects terms from a previously created list of terms that exists separately from the text itself. These term lists may be authority files, simple lists of approved terms, or thesauri, which show relationships between terms (related terms, broader terms or narrower terms) that help the indexer select the most appropriate term to describe the specific text being analyzed. Open system indexing is used in cases where it is necessary to ensure consistency among multiple, related sites or to control vocabulary in a single large, complex site with multiple authors.
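A minimal sketch of the open-system lookup, with an invented authority file mapping non-preferred terms to approved ones:

```python
# Sketch of open-system indexing: index terms are drawn from an
# authority file rather than invented per page. The vocabulary below
# is hypothetical.
authority = {
    "cancer": "oncology",                          # non-preferred -> preferred
    "oncology": "oncology",
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

def preferred_term(candidate):
    """Return the approved label for a candidate term, or None if the
    vocabulary has no entry (a signal to propose a new term)."""
    return authority.get(candidate.lower())

print(preferred_term("Cancer"))        # oncology
print(preferred_term("heart attack"))  # myocardial infarction
```

A full thesaurus would add broader, narrower, and related-term links to each entry, but even this flat authority file is enough to keep multiple indexers (or multiple sites) using one label per concept.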

Who should create site indexes?
Whenever possible, a professional indexer should be hired. Such individuals are thoroughly experienced in analyzing content, accounting for user terminology and in creating an appropriate index structure.

The American Society of Indexers has an indexer locator on its website through which you can find indexers with experience in indexing web/HTML documents.

Corporate librarians often have training or experience in indexing and can also be important resources in identifying individuals with indexing skills.

Index maintenance
Once you have created a fabulous site index and have tested it to ensure that all its links work properly, you need to have an index maintenance policy in place. You will need to consider such things as: How often does the index get updated? Who decides when newly created information gets included? When does ROT (redundant, outdated, or trivial information) get removed? Who is responsible for updating the index?

Keeping this important information access tool up to date will help ensure that your site’s users continue to find what they need when they need it.

For more information:

  • The American Society of Indexers maintains a page on its website listing indexing courses and workshops.
  • Anderson, James D. Guidelines for Indexes and Related Information Retrieval Devices (NISO Technical Report 2, NISO-TR02-1997). Bethesda, Maryland: NISO Press, 1997.
  • The Chicago Manual of Style. 14th ed. Chicago: The University of Chicago Press, 1993.
  • Mulvaney, Nancy. Indexing Books. Chicago: The University of Chicago Press, 1994.
  • Wellisch, Hans H. Indexing from A to Z. Bronx, New York: H. W. Wilson Company, 1991.
  • American Society of Indexers
  • Australian Society of Indexers
  • CINDEX indexing software
  • MACREX indexing software
  • Sky indexing software
Fred Leise, president of ContextualAnalysis, LLC, is an information architecture consultant providing services in the areas of content analysis and organization, user experience design, taxonomy and thesaurus creation, and website and back-of-book indexing.