Developing and Creatively Leveraging Hierarchical Metadata and Taxonomy

When confronted with projects requiring content, document or knowledge management, and presentation, more likely than not, the information architect will be expected to lead or contribute to development of the content classification requirements. And we don’t classify our content without reason.

For site creators, developing and maintaining the infrastructure and processes needed to manage unorganized content would be time-consuming, expensive, and contentious. For site users, it would be maddening to sift through the links in Yahoo’s directory trees if contributors and reviewers hadn’t organized them ahead of time. By organizing content for presentation according to its metadata, we can “contextualize” it for potential users.

In content metadata and hierarchies, however, you will often find a goldmine of implicit and explicit data that you can leverage to creatively contextualize content. Following a brief introduction on taxonomy and metadata (what I call content classification requirements), this article will focus on finding and utilizing such relationships in hierarchies.

Content classification requirements: taxonomy and metadata
“Taxonomy” is a terribly overused term these days. Bob Boiko, in the book Content Management Bible, goes so far as to call it “trendy.” Specifically, a taxonomy is a hierarchical structure for the classification or organization of data. Biologists have historically used taxonomies to classify plants or animals according to a set of natural relationships; in content management and information architecture, we tend to leverage them as a tool for organizing content. (For additional information, see Christina Wodtke’s interview with Samantha Bailey elsewhere on Boxes and Arrows.)

Metadata (data about data) describes an asset and provides us with a meaningful set of attributes that we can use to further classify or consume content. While much metadata is flat or one-dimensional in nature (e.g., size or weight), some of it is hierarchical (e.g., taxonomies), which makes the distinction between metadata and taxonomy fuzzy.

Collectively, I tend to manage taxonomy and metadata needs as simply content classification requirements: taxonomy as a means of organizing content and metadata as a method of further describing it.

If your task, for example, is to classify independent musicians for a website and database, you may choose a taxonomy that organizes the artists by musical style and then create metadata to describe the members of the band, the year of inception, record label, discography, and geography.

Figure 1: Sample taxonomy for classifying music and musicians

Geography is particularly interesting as it has the potential to be hierarchical metadata itself. Let’s say your database expands to include musicians from around the world. Your hierarchy of geography could include country, state or province, region and/or city. That’s a whole new hierarchy, perhaps even a taxonomy. Geography, by itself, has become an effective and alternative means of organizing your content.

Figure 2: Sample geography taxonomy
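
As a rough illustration of the music example above, the following Python sketch stores a taxonomy path (musical style), a hierarchical metadata path (geography), and flat metadata on each artist. The artist names, labels, places, and the helper classified_under are all hypothetical, invented only to show the shape of the data.

```python
from dataclasses import dataclass, field


@dataclass
class Artist:
    name: str
    style: tuple       # taxonomy path, broadest term first
    geography: tuple   # hierarchical metadata path, broadest term first
    metadata: dict = field(default_factory=dict)  # flat descriptive attributes


def classified_under(path, prefix):
    """Return True when `path` sits at or below `prefix` in its hierarchy."""
    return tuple(path[:len(prefix)]) == tuple(prefix)


# Hypothetical records, for illustration only.
artists = [
    Artist("The Example Band", ("Rock", "Indie Rock"),
           ("North America", "USA", "WA", "Seattle"),
           {"label": "Hypothetical Records", "founded": 1998}),
    Artist("Sample Quartet", ("Jazz", "Bebop"),
           ("North America", "Canada", "Ontario", "Toronto"),
           {"label": "Placeholder Music", "founded": 2001}),
    Artist("Placeholder Trio", ("Rock", "Punk"),
           ("Europe", "UK", "England", "London")),
]

# The same records can be organized by either hierarchy.
rock = [a.name for a in artists if classified_under(a.style, ("Rock",))]
north_america = [a.name for a in artists
                 if classified_under(a.geography, ("North America",))]
print(rock)
print(north_america)
```

Storing each classification as a path keeps the two hierarchies independent, so geography can serve as an alternative way into the same content.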

How to develop content classification requirements
When developing a content classification strategy, it’s important that you know your needs, your application, and the technical limitations of your software infrastructure and content producers. It’s quite possible that your content management and delivery needs will tax the capabilities and APIs of your content management tools. It’s also quite possible that you will severely overtax the abilities and/or patience of your content producers (see sidebar). If there’s development to be done or sacrifices to be made, keep these constraints in mind as you design the application.

There’s a very good chance that you will need to revisit and revise your content classification strategy periodically. Life changes, business changes. It is more important to plan for and design a mechanism and set of policies for easy adaptation than to spend countless hours fine-tuning your taxonomy here and now.

Therefore, here are my guidelines for development of your content classification requirements:

Address governance as early as possible in the design cycle.
How will your content classification get revised and extended? How frequently? What person or team will be responsible for the maintenance of your metadata? How will you settle disputes?
Identify the scope.
Are you organizing content for creation? Consumption? Both? A specific department? Some combination of departments? Intranet? Extranet?

Admittedly, it’s important that you not rule out the eventual integration of this work with other initiatives and parts of your business but consider the value of bounding your project. Scope creep is a serious risk as you begin cataloging and classifying your content. Instead of solving all the organization’s problems right now, invest time and energy in defining how your classification requirements can and will change over time.
Consider how the scope of this application relates to your business organization.
Are you striving for the development of an enterprise taxonomy? Enterprise taxonomies are a popular concept right now, especially in organizations initiating development of 3rd and 4th generation intranets. An enterprise taxonomy is typically perceived as a single, monolithic, corporation-wide, structure for the classification of all things related to your business.

Unfortunately, development of an enterprise taxonomy requires the careful coordination, and cooperation, of departments within your organization. Will you be able to coordinate the efforts, language and needs of your IT and sales departments? If so, kudos to you. If not, welcome to the real world.

Disparate divisions within an organization will often use different terms when referring to the same thing. This means, more often than not, the scope of your content classification requirements will be constrained to some subset of the organization or to a particular business process.
Catalog your content.
The exercise of cataloging your content can be very informative. It surfaces a lot of guidance, limitations, and input. More often than not, your classification requirements will vary by content class or type.
Focus on developing good criteria for the definition and extension of your metadata and taxonomy.
Good rules will enable those tasked with management to respond quickly and with reason to requests for adaptation.

Organizing content for management and delivery
So now we can build a strategy for classifying the content we hope to manage and we have attributes. Now, we need a method for presenting our content. Strategies for getting content from a management infrastructure to the delivery framework (internet, intranet, extranet, client-server application, XML-web services, etc.) vary. Here are two common techniques:

1. Universal Hierarchy
A single hierarchy could be used to store and deliver content. When content contributors utilize the content management system, they add, remove, and manage content in a structure that closely resembles the navigation and hierarchy of the delivery framework (your website or application). The navigation structure is your taxonomy.

Figure 3: Organizing content for delivery by Universal Hierarchy

This method is conceptually simple and makes it quite easy to dynamically build your navigation from knowledge of this hierarchy. However, this model does have drawbacks:

Every time you reorganize the website, the organization of content in your management application shifts. Admittedly, this isn’t much of a drawback if you’re managing content for one moderately sized site or if your team of contributors is small.
It is difficult to reuse content in this structure. If you hope to reuse assets throughout your website, where are they organized in this structure?
In an environment with many contributors and diverse security requirements, organizing content (in the management application) in another way, say by contributor or by department, may be more intuitive.
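
To make the “build your navigation from knowledge of this hierarchy” point concrete, here is a minimal Python sketch. It assumes the content hierarchy is available as nested dictionaries; the section names, the URL scheme, and the function build_navigation are hypothetical.

```python
# Hypothetical hierarchy: each key is a section, each value its subsections.
site = {
    "Products": {"Toys": {}, "Games": {}},
    "News": {"Press Releases": {}},
    "About": {},
}


def build_navigation(tree, base="", depth=0):
    """Walk the hierarchy depth-first and return (depth, label, url) entries."""
    entries = []
    for label, children in tree.items():
        url = f"{base}/{label.lower().replace(' ', '-')}"
        entries.append((depth, label, url))
        entries.extend(build_navigation(children, url, depth + 1))
    return entries


nav = build_navigation(site)
for depth, label, url in nav:
    print("  " * depth + f"{label} -> {url}")
```

Because the navigation is derived directly from the storage hierarchy, reorganizing one reorganizes the other, which is the first drawback listed above.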

2. Content Mapping
A more robust, albeit more complex, method of managing content is to maintain structures and metadata in the content management application that are independent of the delivery system’s organization (navigation).

Figure 4: Organizing content for delivery by Content Mapping

Content is organized, at the source, as may be required by your security, workflow, or organizational needs. Perhaps your data lives in a content management system or database where different organizational mechanisms exist. Unfortunately, oftentimes the navigation for your consuming application (the presentation framework) is managed by some other means.

By some rule or algorithm, leveraging your content classification data, material gets “mapped” to the presentation framework. See the example below for an application of this model.

This model is rich with possibilities:

There may be more than one way to organize content (think: content reuse). Given the same set of content, same set of classification criteria, but multiple algorithms, we can now build a delivery framework that allows for many methods of organization.
You no longer need to reorganize your content management application to change the delivery application. Just the algorithms (mappings) change.

However:

If you hope to build your navigation dynamically, often you’ll need to build a tool or alternate hierarchy. You may not find much value in the content’s taxonomy.
Content, in your management environment, may be orphaned in your presentation framework if there are no rules mapping to an accessible part of the site.
Parts of the site may only be sparsely populated. It may not be readily obvious that you are creating gaps (with little or no content) in your site.

While powerful, this technique can be difficult to administer without having a fairly comprehensive understanding of the site design and algorithms for “mapping.”
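
One way to picture content mapping is as a set of rules, each pairing a section of the delivery framework with a test on an asset’s classification data. The Python sketch below is only an illustration under that assumption; the asset records, section names, and the function map_content are hypothetical. It also reports the two risks noted above: orphaned assets and sections left empty.

```python
# Hypothetical assets as they might be classified in the management application.
assets = [
    {"id": "a1", "type": "press release", "division": "Marketing"},
    {"id": "a2", "type": "white paper", "division": "Engineering"},
    {"id": "a3", "type": "image", "division": "Marketing"},
]

# Each rule maps a site section to a predicate over classification data.
rules = {
    "News": lambda a: a["type"] == "press release",
    "Resources": lambda a: a["type"] == "white paper",
    "Events": lambda a: a["type"] == "event",
}


def map_content(assets, rules):
    """Apply every rule to every asset; report orphans and empty sections."""
    sections = {name: [] for name in rules}
    orphans = []
    for asset in assets:
        targets = [name for name, matches in rules.items() if matches(asset)]
        for name in targets:
            sections[name].append(asset["id"])
        if not targets:
            orphans.append(asset["id"])  # no rule places this asset on the site
    empty = [name for name, ids in sections.items() if not ids]
    return sections, orphans, empty


sections, orphans, empty = map_content(assets, rules)
print(sections)
print("orphaned:", orphans)
print("empty sections:", empty)
```

Changing the site means changing the rules, not the stored content, and the orphan and empty-section reports give administrators a quick check on the mappings.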

Creative contextualization of content using hierarchical metadata or taxonomy
Assuming there are hierarchical structures within your content classification system, there is a very good chance that valuable information exists in the hierarchy. By taking advantage of relationships within your hierarchical metadata structures, richer algorithms may be developed for your content delivery framework.

Let’s identify some of these relationships and how you can leverage them:

Ancestors
In the example below, North America is an ancestor of America, which is an ancestor of West (and Washington – WA). Ancestors can be valuable because content classified under an ancestor node may have relevance (albeit with less specificity) to child nodes.

If you are looking at content classified under America, it’s quite possible that there is relevance in the information classified under North America.
Descendents
In the example below, Canada is a descendent of North America. Descendents have value because content classified under a child or descendent node may have greater specificity. This content provides consumers of your delivery framework (users of your website or application) with a means of realizing granularity in the data.

A website visitor reading about your corporation’s news and events relevant to North America may appreciate that there is news specific to Canada or California.
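
As a minimal sketch of the ancestor and descendant relationships described above, the Python below stores the geography hierarchy as a child-to-parent map. The node names come from the text; placing California (CA) under West is an assumption, and the function names are hypothetical.

```python
# Child -> parent map; None marks the root.
parent = {
    "North America": None,
    "America": "North America",
    "Canada": "North America",
    "West": "America",
    "WA": "West",
    "CA": "West",  # assumed placement, for illustration
}


def ancestors(node):
    """Return the ancestors of `node`, nearest first."""
    found = []
    current = parent[node]
    while current is not None:
        found.append(current)
        current = parent[current]
    return found


def descendants(node):
    """Return every node below `node`, depth-first."""
    found = []
    for child, its_parent in parent.items():
        if its_parent == node:
            found.append(child)
            found.extend(descendants(child))
    return found


print(ancestors("WA"))
print(descendants("North America"))
```

Walking up the tree yields progressively less specific context; walking down yields progressively more specific content.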

Figure 5: Nested sets within a geography taxonomy

Nested Sets
Nested sets are the union of data classified under a particular node and all of its descendents. The example, above, shows three sets. One of those sets is North America and all the nodes within it. Content classified under nodes within a single set may have relevance because they are related by something inherent in the structure (they’re all part of the same ancestor).

If you hope to convey, in a single “view” or page, contact information for all of your offices in North America, then you want contact information classified under North America and all its descendent nodes (i.e., within the North America nested set). If the user wants to see a summary list of contact information within a particular region, they navigate to a node of greater specificity (a descendent).
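
The office contact example can be sketched in Python as follows. This assumes a parent-to-children map and a lookup of content by the node it is classified under; the office names and the function names are hypothetical.

```python
# Parent -> children map for the geography hierarchy.
children = {
    "North America": ["America", "Canada"],
    "America": ["West"],
    "West": ["WA", "CA"],
}

# Hypothetical office contacts, keyed by the node they are classified under.
contacts = {
    "North America": ["HQ switchboard"],
    "Canada": ["Toronto office"],
    "WA": ["Seattle office"],
    "CA": ["San Diego office"],
}


def nested_set(node):
    """Return `node` followed by all of its descendants, depth-first."""
    nodes = [node]
    for child in children.get(node, []):
        nodes.extend(nested_set(child))
    return nodes


def content_in_set(node):
    """Collect content classified anywhere within the nested set of `node`."""
    return [item for n in nested_set(node) for item in contacts.get(n, [])]


print(content_in_set("North America"))  # the single-page summary view
print(content_in_set("West"))           # a node of greater specificity
```

Navigating to a descendant simply narrows the set, so the same query serves both the summary view and the regional views.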
Peers
Peer nodes exist within a nested set of the hierarchy. Peers have equal depth within a particular nested set of the hierarchy.

Borrowing from the example hierarchy, below, if a user has navigated to content classified under Educational Toys (at depth 2 of the Toys nested set), you may find it valuable to provide them visibility into Plush Toys. These topics share context (Toys), and by providing visibility into peer nodes, you convey the breadth of your toy inventory.

Figure 6: Set and depth in a product taxonomy
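
Peers can be found by computing each node’s depth within a chosen nested set and selecting the other nodes at the same depth. The Python below is a sketch under that reading; apart from Toys, Educational Toys, and Plush Toys, the product categories are hypothetical, as are the function names. Depth is counted from 1 at the root of the set, matching the “depth 2” wording above.

```python
# Parent -> children map for a product taxonomy (partly hypothetical).
children = {
    "Products": ["Toys", "Games"],
    "Toys": ["Educational Toys", "Plush Toys"],
    "Educational Toys": ["Science Kits"],
    "Games": ["Board Games"],
}


def depths_within(root, depth=1):
    """Map every node in the nested set rooted at `root` to its depth."""
    result = {root: depth}
    for child in children.get(root, []):
        result.update(depths_within(child, depth + 1))
    return result


def peers(node, root):
    """Return the other nodes at the same depth as `node` within `root`'s set."""
    depths = depths_within(root)
    return [n for n, d in depths.items() if d == depths[node] and n != node]


print(peers("Educational Toys", "Toys"))
print(peers("Educational Toys", "Products"))
```

The result depends on which nested set you choose: within Toys the only peer is Plush Toys, while within the wider Products set, nodes from other branches at the same depth also qualify.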

So what’s the point? If you’re building algorithms that map content, from your database or content management system to a delivery framework (website or application), it may be both necessary and beneficial to consider content by these relationships.

For example, let’s say the assets in your content management application are organized by type (press release, white paper, image, etc.) and corporate division; it’s an easy and logical choice that allows you to manage security in the content management application at the department level, but it doesn’t accurately reflect the organization of your internet site.

Additional Guidance

Avoid burdening your content contributors with the maintenance of unnecessary metadata; define only as much metadata as you require for the effective presentation of your assets.
Whenever possible, leverage the information you already have. Identify, and aim to leverage, implicit relationships in the content classification criteria you already collect.
If you’re building your own content classification software, make it extensible – it shouldn’t be necessary for you to revisit the application whenever classification needs vary.

It’s your task to map content from the source (content management application) to the presentation system (internet delivery framework). If several parts of your organization are capable of producing press releases, and these press releases are stored in department-specific areas of the content management application, how do you get them on the site?

Example “algorithms” for populating the press release and investor news sections of your internet site may read like this: “All press release type assets, in a “ready” state (leveraging metadata), except those in the investor relations nested set are queried, sorted in descending order by publish date and rendered in the Press Releases section.” Additionally, “All press release type assets, in a “ready” state, in the investor relations nested set are queried, sorted in descending order by publish date and rendered under Investor Relations > News.”
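
The two “algorithms” above might look like this in Python. This is a sketch, not a prescribed implementation: the division hierarchy, asset records, field names, and function names are all hypothetical.

```python
from datetime import date

# Child -> parent map of corporate divisions (hypothetical).
parent = {
    "Marketing": None,
    "Investor Relations": None,
    "Shareholder Services": "Investor Relations",
}

# Hypothetical assets with type, workflow state, division, and publish date.
assets = [
    {"title": "Product launch", "type": "press release", "state": "ready",
     "division": "Marketing", "published": date(2004, 3, 1)},
    {"title": "Quarterly results", "type": "press release", "state": "ready",
     "division": "Investor Relations", "published": date(2004, 4, 15)},
    {"title": "Dividend notice", "type": "press release", "state": "ready",
     "division": "Shareholder Services", "published": date(2004, 5, 2)},
    {"title": "Draft announcement", "type": "press release", "state": "draft",
     "division": "Marketing", "published": date(2004, 5, 10)},
    {"title": "Trade show recap", "type": "press release", "state": "ready",
     "division": "Marketing", "published": date(2004, 4, 20)},
]


def in_nested_set(division, root):
    """True when `division` is `root` or one of its descendants."""
    while division is not None:
        if division == root:
            return True
        division = parent.get(division)
    return False


def ready_press_releases(assets):
    return [a for a in assets
            if a["type"] == "press release" and a["state"] == "ready"]


def newest_first(items):
    return sorted(items, key=lambda a: a["published"], reverse=True)


def press_releases_section(assets):
    """Ready press releases outside the Investor Relations nested set."""
    return newest_first(a for a in ready_press_releases(assets)
                        if not in_nested_set(a["division"], "Investor Relations"))


def investor_news_section(assets):
    """Ready press releases inside the Investor Relations nested set."""
    return newest_first(a for a in ready_press_releases(assets)
                        if in_nested_set(a["division"], "Investor Relations"))


print([a["title"] for a in press_releases_section(assets)])
print([a["title"] for a in investor_news_section(assets)])
```

Testing membership in the nested set, rather than matching one division exactly, means a release filed under a sub-division of Investor Relations still lands in the right section.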

By leveraging the more complex relationships available in hierarchies, we facilitate the presentation (contextualization) of content without impacting its organization. Too frequently, contextualization algorithms neglect implicitly derived information available in hierarchies. Narrowly scoped queries are used to relate content to its place in the delivery framework.

It’s up to the information architect to find and realize the value of these relationships when developing a content delivery framework. The opportunity for improved usability and greater content visibility using these relationships is tremendous and oftentimes supports the mission of an IA.

Technical issues
Unfortunately, knowing about these relationships is not enough. Pre-assembled content management and delivery frameworks may not provide the APIs or schema necessary to effectively leverage this knowledge. It’s important to understand what you want, and then to find and understand the limitations of your tools.

Maintaining and consuming hierarchical data in tabular relational database management systems (RDBMS) can be difficult, but with creative database administration, the relationships can be assembled in temporary tables and/or discovered by stored procedures.
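
Where the database engine supports recursive queries, a recursive common table expression is another way to discover these relationships. Note that this is a different technique from the temporary tables and stored procedures mentioned above, and not every RDBMS offers it. The sketch below uses Python’s built-in sqlite3 module with an in-memory database; the table name node, its columns, and the function nested_set are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        parent_id INTEGER REFERENCES node(id)
    );
    INSERT INTO node (id, name, parent_id) VALUES
        (1, 'North America', NULL),
        (2, 'America', 1),
        (3, 'Canada', 1),
        (4, 'West', 2),
        (5, 'WA', 4);
""")


def nested_set(conn, root_name):
    """Return (name, depth) rows for a node and all of its descendants."""
    return conn.execute("""
        WITH RECURSIVE subtree(id, name, depth) AS (
            SELECT id, name, 1 FROM node WHERE name = ?
            UNION ALL
            SELECT node.id, node.name, subtree.depth + 1
            FROM node JOIN subtree ON node.parent_id = subtree.id
        )
        SELECT name, depth FROM subtree ORDER BY depth, name
    """, (root_name,)).fetchall()


print(nested_set(conn, "America"))
print(nested_set(conn, "North America"))
```

The table stores only each node’s parent; the query derives descendants and depth on demand, so no extra bookkeeping is needed when the hierarchy changes.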

This is valuable stuff. If it improves content delivery, consider the cost of developing a method for accessing these relationships if it doesn’t already exist.

Summary
When developing your content classification strategies (for both content management and presentation), make a point of evaluating your hierarchical structures for relationships and valuable data hidden within them. Relationships between nodes of a hierarchy are complex, but often mirror the perception of your users.

By leveraging these relationships, you can broaden the scope of your queries, broaden the scope of data presented to users, and drive users toward tangential content.

For More Information:

Boiko, Bob. Content Management Bible. John Wiley & Sons, 2001.
Aifia Tools http://aifia.org/tools/
Series on controlled vocabularies and faceted classification by Karl Fast, Fred Leise and Mike Steckel

Christian Ricci is a consultant, application developer, web designer, and project manager with over 11 years of experience in software design and development, network and server administration, and software project management and engineering. As a Senior Solutions Architect for Saillant Consulting Group, Chris has led portal, content, and document management projects for Qualcomm, Intermountain Health Care, J.D. Edwards, EAS, and the Denver Post.

Site: http://www.chiamonkey.com/
Resume: http://www.chiamonkey.com/mealticket

One comment

Lars Marius Garshol says:

June 9, 2004 at 4:00 am

I found this article interesting and well written, but I have to say I am disappointed that it does not go beyond simple hierarchies and traditional metadata. These are simple and useful tools, but metadata and classification can be taken much further than what is possible with the model outlined in the article.

I wrote a paper on both the traditional methods as well as newer approaches that go much further, which can be found at http://www.ontopia.net/topicmaps/materials/tm-vs-thesauri.html

I would be interested to hear from IA practitioners why they still stick with the traditional methods.
