Tree Testing

Written by: Dave O’Brien

A big part of information architecture is organisation – creating the structure of a site. For most sites – particularly large ones – this means creating a hierarchical “tree” of topics.

But to date, the IA community hasn’t found an effective, simple technique (or tool) to test site structures. The most common method used — closed card sorting — is neither widespread nor particularly suited to this task.

Some years ago, Donna Spencer pioneered a simple paper-based technique to test trees of topics. Recent refinements to that method, some made possible by online experimentation, have now made “tree testing” more effective and agile.

How it all began

Some time ago, we were working on an information-architecture project for a large government client here in New Zealand. It was a classic IA situation – their current site’s structure (the hierarchical “tree” of topics) was a mess, they knew they had outgrown it, and they wanted to start fresh.

We jumped in and did some research, including card-sorting exercises with various user groups. We’ve always found card sorts (in person or online) to be a great way to generate ideas for a new IA.

Brainstorming sessions followed, and we worked with the client to come up with several possible new site trees. But were they better than the old one? And which new one was best? After a certain amount of debate, it became clear that debate wasn’t the way to decide. We needed some real data – data from users. And, like all projects, we needed it quickly.

What kind of data? At this early stage, we weren’t concerned with visual design or navigation methods; we just wanted to test organisation – specifically, findability and labeling. We wanted to know:
* Could users successfully find particular items in the tree?
* Could they find those items directly, without having to backtrack?
* Could they choose between topics quickly, without having to think too much (the Krug Test [1])?
* Overall, which parts of the tree worked well, and which fell down?

Not only did we want to test each proposed tree, we wanted to test them against each other, so we could pick the best ideas from each.

And finally, we needed to test the proposed trees against the existing tree. After all, we hadn’t just contracted to deliver a different IA – we had promised a better IA, and we needed a quantifiable way to prove it.

The problem

This, then, was our IA challenge:
* getting objective data on the relative effectiveness of several tree structures
* getting it done quickly, without having to build the actual site first.

As mentioned earlier, we had already used open card sorting to generate ideas for the new site structure. We had done in-person sorts (to get some of the “why” behind our users’ mental models) as well as online sorts (to get a larger sample from a wider range of users).

But while open card sorting is a good “detective” technique, it doesn’t yield the final site structure – it just provides clues and ideas. And it certainly doesn’t help in evaluating structures.

For that, information architects have traditionally turned to closed card sorting, where the user is provided with predefined category “buckets” and asked to sort a pile of content cards into those buckets. The thinking goes that if there is general agreement about which cards go in which buckets, then the buckets (the categories) should perform well in the delivered IA.

The problem here is that, while closed card sorting mimics how users may file a particular item of content (e.g. where they might store a new document in a document-management system), it doesn’t necessarily model how users find information in a site. They don’t start with a document — they start with a task, just as they do in a usability test.

What we wanted was a technique that more closely simulates how users browse sites when looking for something specific. Yes, closed card sorting was better than nothing, but it just didn’t feel like the right approach.

Other information architects have grappled with this same problem. We know some who wait until they are far enough along in the wireframing process that they can include some IA testing in the first rounds of usability testing. That piggybacking saves effort, but it also means that we don’t get to evaluate the IA until later in the design process, which means more risk.

We know others who have thrown together quick-and-dirty HTML with a proposed site structure and placeholder content. This lets them run early usability tests that focus on how easily participants can find various sublevels of the site. While that gets results sooner, it also means creating a throw-away set of pages and running an extra round of user testing.

With these needs in mind, we looked for a new technique – one that could:
* Test topic trees for effective organisation
* Provide a way to compare alternative trees
* Be set up and run with minimal time and effort
* Give clear results that could be acted on quickly

The technique — tree testing

Luckily, the technique we were looking for already existed. Even luckier was that we got to hear about it firsthand from its inventor, Donna Spencer, the well-regarded information architect out of Australia, and author of the recently released book “Card Sorting”:http://rosenfeldmedia.com/books/cardsorting/.

During an IA course that Donna was teaching, she was asked how she tested the site structures she created for clients. She mentioned closed card sorting, but like us, she wasn’t satisfied with it.

She then went on to describe a technique she called “card-based classification”:http://www.boxesandarrows.com/view/card_based_classification_evaluation, which she had used on some of her IA projects. Basically, it involved modeling the site structure on index cards, then giving participants a “find-it” task and asking them to navigate through the index cards until they found what they were looking for.

To test a shopping site, for example, she might give them a task like “Your 9-year-old son asks for a new belt with a cowboy buckle”. She would then show them an index card with the top-level categories of the site.

The participant would choose a topic from that card, leading to another index card with the subtopics under that topic.

The participant would continue choosing topics, moving down the tree, until they found their answer. If they didn’t find a topic that satisfied them, they could backtrack (go back up one or more levels). If they still couldn’t find what they were looking for, they could give up and move on to the next task.

During the task, the moderator would record:
* the path taken through the tree (using the reference numbers on the cards)
* whether the participant found the correct topic
* where the participant hesitated or backtracked
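
The moderator's notes for one task can be sketched as a small record. This is only an illustration of the three items above, not part of Donna's method: the class name, the `<back>` marker, and the topic labels are all hypothetical, and topic labels stand in for the reference numbers on the cards.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """What a moderator notes for one participant on one task (hypothetical structure)."""
    task: str
    correct_topics: set[str]
    path: list[str] = field(default_factory=list)  # topics chosen, in order
    backtracks: int = 0
    gave_up: bool = False

    def choose(self, topic: str) -> None:
        """Participant picks a topic and moves down the tree."""
        self.path.append(topic)

    def backtrack(self) -> None:
        """Participant goes back up a level; count it and mark it in the path."""
        self.backtracks += 1
        self.path.append("<back>")

    @property
    def found_correct(self) -> bool:
        """True if the participant finished on one of the correct topics."""
        return (not self.gave_up) and bool(self.path) and self.path[-1] in self.correct_topics


# Hypothetical session for the cowboy-belt task; the topic names are invented.
record = TaskRecord(task="Find a belt with a cowboy buckle", correct_topics={"Belts"})
record.choose("Kids")
record.choose("Girls")
record.backtrack()
record.choose("Boys")
record.choose("Belts")
print(record.path, record.backtracks, record.found_correct)
```

A stack of these records, one per participant per task, is all the raw data the analysis needs.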

By choosing a small number of representative tasks to try on participants, Donna found that she could quickly determine which parts of the tree performed well and which were letting the side down. And she could do this without building the site itself – all that was needed was a textual structure, some tasks, and a bunch of index cards.

Donna was careful to point out that this technique only tests the top-down organisation of a site and the labeling of its topics. It does not try to include other factors that affect findability, such as:
* the visual design and layout of the site
* other navigation routes (e.g. cross links)
* search

While it’s true that this technique does not measure everything that determines a site’s ease of browsing, that can also be a strength. By isolating the site structure – by removing other variables at this early stage of design – we can more clearly see how the tree itself performs, and revise until we have a solid structure. We can then move on in the design process with confidence. It’s like unit-testing a site’s organisation and labeling. Or as my colleague Sam Ng says, “Think of it as analytics for a website you haven’t built yet.”

So we built Treejack

As we started experimenting with “card-based classification” on paper, it became clear that, while the technique was simple, it was tedious to create the cards on paper, recruit participants, record the results manually, and enter the data into a spreadsheet for analysis. The steps were easy enough, but they were time eaters.

It didn’t take too much to imagine all this turned into a web app – both for the information architect running the study and the participant browsing the tree. Card sorting had gone online with good results, so why not card-based classification?

Ah yes, that was the other thing that needed work – the name. During the paper exercises, it got called “tree testing”, and because that seemed to stick with participants and clients, it stuck with us. And it sure is a lot easier to type.

To create a good web app, we knew we had to be absolutely clear about what it was supposed to do. For online tree testing, we aimed for something that was:
* Quick for an information architect to learn and get going on
* Simple for participants to do the test
* Able to handle a large sample of users
* Able to present clear results

We created a rudimentary application as a proof of concept, running a few client pilots to see how well tree testing worked online. After working with the results in Excel, it became very clear which parts of the trees were failing users, and how they were failing. The technique worked.

However, it also became obvious that a wall of spreadsheet data did not qualify as “clear results”. So when we sat down to design the next version of the tool – the version that information architects could use to run their own tree tests – reworking the results was our number-one priority.

Participating in an online tree test

So, what does online tree testing look like? Let’s look at what a participant sees.

Suppose we’ve emailed an invitation to a list of possible participants. (We recommend at least 30 participants to get reasonable results – more is good, especially if you have different types of users.) Clicking a link in that email takes them to the Treejack site, where they’re welcomed and instructed in what to do.

Once they start the test, they’ll see a task to perform. In Treejack, the tree is presented as a simple list of top-level topics.

They click down the tree one topic at a time. Each click shows them the next level of the tree.

Once they click to the end of a branch, they have 3 choices:
* Choose the current topic as their answer (“I’d find it here”).
* Go back up the tree and try a different path (by clicking a higher-level topic).
* Give up on this task and move to the next one (“Skip this task”).

In Treejack, the participant selects an answer.

Once they’ve finished all the tasks, they’re done – that’s it. For a typical test of 10 tasks on a medium-sized tree, most participants take 5-10 minutes. As a bonus, we’ve found that participants usually find tree tests less taxing than card sorts, so we get lower drop-out rates.

Creating a tree test

The heart of a tree test is…um…the tree, modeled as a list of text topics.

One lesson that we learned early was to build the tree based on the content of the site, not simply its page structure. Any implicit in-page content should be turned into explicit topics in the tree, so that participants can “see” and select those topics.

Also, because we want to measure the effectiveness of the site’s topic structure, we typically omit “helper” topics such as Search, Site Map, Help, and Contact Us. If we leave them in, it makes it too easy for users to choose them as alternatives to browsing the tree.
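
As a rough sketch of what “a list of text topics” can look like in practice, the snippet below reads an indented outline into a nested structure and drops the helper topics along the way. The outline format, the function name, and the sample topics are our assumptions for illustration; this is not how Treejack itself stores trees.

```python
# Helper topics we leave out of the test (compared in lowercase).
HELPER_TOPICS = {"search", "site map", "help", "contact us"}


def parse_tree(outline: str, indent: int = 2) -> dict:
    """Turn a well-formed indented outline into nested dicts: {topic: {subtopic: {...}}}."""
    root: dict = {}
    stack = [root]      # stack[d] is the dict that holds topics at depth d
    skip_below = None   # depth of a helper topic whose subtopics we are skipping
    for line in outline.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        topic = line.strip()
        if skip_below is not None:
            if depth > skip_below:
                continue            # still inside a helper topic's subtree
            skip_below = None
        if topic.lower() in HELPER_TOPICS:
            skip_below = depth      # omit this topic and everything under it
            continue
        del stack[depth + 1:]       # climb back up to this line's parent
        node: dict = {}
        stack[depth][topic] = node
        stack.append(node)
    return root


# Hypothetical outline; the topics are invented for the example.
outline = """\
Products
  Clothing
  Accessories
About us
  Careers
Help
  FAQ
Contact Us
"""
tree = parse_tree(outline)
print(tree)
```

Keeping the tree as plain indented text makes it easy to paste in from a spreadsheet or a content inventory and to revise between test rounds.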

Devising tasks

We test the tree by getting participants to look for specific things – to perform “find it” tasks. Just as in a usability test, a good task is clear, specific, and representative of the tasks that actual users will do on the real site.

How many tasks? You might think that more is better, but we’ve found a sizable learning effect in tree tests. After a participant has browsed through the tree several times looking for various items, they start to remember where things are, and that can skew later tasks. For that reason, we recommend about 10 tasks per test, presented in a random sequence.

Finally, for each task, we select the correct answers – 1 or more tree topics that satisfy that task.
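
A minimal sketch of the task setup described above: each task carries one or more correct topics, and each participant sees the tasks in a random order. The `Task` class, the function name, and the idea of seeding the shuffle with a participant ID (so the order can be reproduced later) are our assumptions, not a description of Treejack.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    text: str
    correct_topics: frozenset[str]  # 1 or more tree topics that satisfy the task


def task_order(tasks: list[Task], participant_id: str) -> list[Task]:
    """Return the tasks in a random order that is repeatable for a given participant."""
    rng = random.Random(participant_id)  # seeding by ID keeps the order reproducible
    shuffled = list(tasks)               # copy, so the master list is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical tasks and answers, invented for the example.
tasks = [
    Task("Find a belt with a cowboy buckle", frozenset({"Belts"})),
    Task("Find out how to return an item", frozenset({"Returns", "Customer service"})),
    Task("Find a job at the company", frozenset({"Careers"})),
]
for task in task_order(tasks, participant_id="p-001"):
    print(task.text)
```

Randomising per participant spreads the learning effect evenly across tasks instead of letting it always favour the last few.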

The results

So we’ve run a tree test. How did the tree fare?

At a high level, we look at:
* Success – % of participants who found the correct answer. This is the single most important metric, and is weighted highest in the overall score.
* Speed – how fast participants clicked through the tree. In general, confident choices are made quickly (i.e. a high Speed score), while hesitation suggests that the topics are either not clear enough or not distinguishable enough.
* Directness – how directly participants made it to the answer. Ideally, they reach their destination without wandering or backtracking.

For each task, we see a percentage score on each of these measures, along with an aggregate score (out of 10).
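
To make the three measures concrete, here is one way they could be computed from raw results. Treat it as a sketch under stated assumptions: this article does not give Treejack's formulas, so the 30-second cut-off for a “fast” answer, the definition of directness as “no backtracking”, and the 3:1:1 weighting (chosen only so that Success counts most) are all hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    success: bool    # participant chose a correct topic
    backtracks: int  # times they went back up the tree
    seconds: float   # time spent on the task


def pct(count: int, total: int) -> float:
    """Percentage of `total`, or 0.0 when there are no results."""
    return 100.0 * count / total if total else 0.0


def score_task(results, fast_seconds=30.0, weights=(3, 1, 1)):
    """Score one task. The cut-off and weights are assumptions, not Treejack's values."""
    n = len(results)
    success = pct(sum(r.success for r in results), n)
    speed = pct(sum(r.seconds <= fast_seconds for r in results), n)
    directness = pct(sum(r.backtracks == 0 for r in results), n)
    w_success, w_speed, w_direct = weights
    weighted = (w_success * success + w_speed * speed + w_direct * directness) / sum(weights)
    return {
        "success": success,
        "speed": speed,
        "directness": directness,
        "overall": round(weighted / 10, 1),  # rescale 0-100 to a score out of 10
    }


# Hypothetical results for one task from four participants.
results = [
    Result(True, 0, 10.0),
    Result(True, 1, 40.0),
    Result(False, 2, 50.0),
    Result(True, 0, 20.0),
]
print(score_task(results))
```

With these invented numbers the task scores 75% on success and 50% on both speed and directness, which the assumed weighting turns into 6.5 out of 10.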

If we see an overall score of 8/10 for the entire test, we’ve earned ourselves a beer. Often, though, we’ll find ourselves looking at a 5 or 6, and realise that there’s more work to be done.

The good news is that our miserable overall score of 5/10 is often some 8s and 9s brought down by a few 2s and 3s. This is where tree testing really shines – separating the good parts of the tree from the bad, so we can spend our time and effort fixing the latter.
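
The arithmetic behind that observation is easy to check. The sketch below assumes, purely for illustration, that the overall score is a plain average of the per-task scores; the task names, scores, and threshold are invented.

```python
from statistics import mean


def weak_tasks(scores: dict[str, float], threshold: float = 4.0) -> list[str]:
    """Return the tasks scoring below the threshold, worst first."""
    return sorted((t for t, s in scores.items() if s < threshold), key=scores.get)


# Hypothetical per-task scores (out of 10) for a 10-task test.
task_scores = {
    "Task 1": 9, "Task 2": 8, "Task 3": 2, "Task 4": 3, "Task 5": 8,
    "Task 6": 2, "Task 7": 3, "Task 8": 9, "Task 9": 3, "Task 10": 3,
}
print(mean(task_scores.values()))
print(weak_tasks(task_scores))
```

Here four strong tasks and six weak ones average out to 5, and the list of weak tasks tells us which parts of the tree to revisit first.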

To do more detailed analysis on the low scores, we can download the data as a spreadsheet, showing destinations for each task, first clicks, full click paths, and so on.
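
Once the click paths are in hand, a few lines of code can summarise them. This sketch assumes each path has been read from the spreadsheet into a list of topic names; the paths shown are hypothetical, and the column layout of the real download is not something we describe here.

```python
from collections import Counter


def first_clicks(paths: list[list[str]]) -> Counter:
    """Count which top-level topic each participant clicked first."""
    return Counter(path[0] for path in paths if path)


def destinations(paths: list[list[str]]) -> Counter:
    """Count which topic each participant ended on."""
    return Counter(path[-1] for path in paths if path)


# Hypothetical click paths for one task, one list per participant.
paths = [
    ["Kids", "Boys", "Belts"],
    ["Accessories", "Belts"],
    ["Kids", "Girls", "Kids", "Boys", "Belts"],
    ["Men", "Accessories"],
]
print(first_clicks(paths).most_common())
print(destinations(paths).most_common())
```

First clicks show where participants expect a topic to live; destinations show where they actually settled, right or wrong.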

In general, we’ve found that tree-testing results are much easier to analyse than card-sorting results. The high-level results pinpoint where the problems are, and the detailed results usually make the reason plain. In cases where a result has us scratching our heads, we do a few in-person tree tests, prompting the participant to think aloud and asking them about the reasons behind their choices.

Lessons learned

We’ve run several tree tests now for large clients, and we’re very pleased with the technique. Along the way, we’ve learned a few things too:
* Test a few different alternatives. Because tree tests are quick to do, we can take several proposed structures and test them against each other. This is a quick way of resolving opinion-based debates over which is better. For the government web project we discussed earlier, one proposed structure had much lower success rates than the others, so we were able to discard it without regrets or doubts.

* Test new against old. Remember how we promised that government agency that we would deliver a better IA, not just a different one? Tree testing proved to be a great way to demonstrate this. In our baseline test, the original structure notched a 31% success rate. Using the same tasks, the new structure scored 67% – a solid quantitative improvement.

* Do iterations. Everyone talks about developing designs iteratively, but schedules and budgets often quash that ideal. Tree testing, on the other hand, has proved quick enough that we’ve been able to do two or three revision cycles for a given tree, using each set of results to progressively tweak and improve it.

* Identify critical areas to test, and tailor your tasks to exercise them. Normally we try to cover all parts of the tree with our tasks. If, however, there are certain sections that are especially critical, it’s a good idea to run more tasks that involve those sections. That can reveal subtleties that you may have missed with a “vanilla” test. For example, in another study we did, the client was considering renaming an important top-level section, but was worried that the new term (while more accurate) was less clear. Tree testing showed both terms to be equally effective, so the client was free to choose based on other criteria.

* Crack the toughest nuts with “live” testing. Online tree tests suffer from the same basic limitation as most other online studies – they give us loads of useful data, but not always the “why” behind it. Moderated testing (either in person or by remote session) can fill in this gap when it occurs.
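
For the “new against old” comparison above, the improvement can be stated two ways, and it helps to be clear which one is being quoted to the client. A small sketch using the two success rates from our baseline and follow-up tests (the function name is ours):

```python
def improvement(old_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (gain in percentage points, gain relative to the old rate in %)."""
    points = new_pct - old_pct
    relative = 100.0 * points / old_pct
    return points, relative


points, relative = improvement(31, 67)
print(f"{points} percentage points, {relative:.0f}% relative improvement")
```

Going from 31% to 67% is a gain of 36 percentage points, or roughly 116% relative to the original success rate.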

Conclusion

Tree testing has given us the IA method we were after – a quick, clear, quantitative way to test site structures. Like user testing, it shows us (and our clients) where we need to focus our efforts, and injects some user-based data into our IA design process. The simplicity of the technique lets us do variations and iterations until we get a really good result.

Tree testing also makes our clients happy. They quickly “get” the concept, the high-level results are easy for them to understand, and they love having data to show their management and to measure their progress against.

You can sign up for a free Treejack account at “Optimal Workshop”:http://www.optimalworkshop.com/treejack.htm [2].

References

1. “Don’t Make Me Think”:http://www.amazon.com/Dont-Make-Me-Think-Usability/dp/0321344758, Steve Krug
2. Full disclosure: As noted in his “bio”:http://boxesandarrows.wpengine.com/person/35384-daveobrien, O’Brien works with Optimal Workshop.