As of January, 2012, this site is no longer being updated, due to work and health issues

Search Tools

Taxonomies, Categorization, Classification, Categories, and Directories for Searching

The terms taxonomy, ontology, directory, cataloging, categorization and classification are often confused and used interchangeably. These are all ways of organizing information (or things or animals) into categories.

There are a number of applications that can help people create taxonomies and place information objects within their categories, although the amount of automation can vary. Some programs simply allow anyone to manually add a URL to a specific category by submitting a site. Others allow human catalogers to create sophisticated rules to specify certain words and phrases which will place a page in a category. Others accept a "training set" within an existing taxonomy, and will place documents in categories based on similarities. Still others attempt to automate the entire process, grouping pages into topics based on programmatic evaluation of the contents.
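The "training set" approach can be illustrated with a very small sketch. It assumes a hypothetical list of (category, text) pairs and uses plain word overlap as the similarity measure, which is far cruder than what commercial products do; the function names and sample data are invented for illustration only.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train(training_set):
    """Build one word-count profile per category from (category, text) pairs."""
    profiles = {}
    for category, text in training_set:
        profiles.setdefault(category, Counter()).update(words(text))
    return profiles


def classify(text, profiles):
    """Return the category whose profile shares the most words with the text,
    or None when no category shares any word at all."""
    doc = words(text)

    def overlap(profile):
        return sum(min(count, profile[w]) for w, count in doc.items())

    best = max(profiles, key=lambda category: overlap(profiles[category]))
    return best if overlap(profiles[best]) > 0 else None


# Hypothetical training set: documents already filed in an existing taxonomy.
TRAINING = [
    ("Footwear", "running shoes and sneakers"),
    ("Footwear", "leather boots and sandals"),
    ("Music", "singer songwriter album tour"),
]

profiles = train(TRAINING)
print(classify("new album from a folk singer", profiles))
```

A sketch like this will misfile pages whenever vocabulary overlaps by accident, which is one reason human review remains necessary.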

When evaluating these applications, remember that they are simply software. No matter how elegant the algorithms, a computer program cannot understand the concepts in a page as a human can, and will sometimes place pages in the wrong categories. For example, one highly automated system had an "Arts and Humanities" category that included links to an Internet services consulting company and a singer-songwriter's personal home page (along with many more appropriate pages). To serve your site or intranet users, plan for a significant amount of human cataloging and editing.

Glossary and Definitions

A directory is an organized set of links, like those on Yahoo or the Open Directory Project, which allows a web site to display the scope and focus of its content. A directory can cover a single host, a large multi-server site, an intranet or the Web. At each level, the category names provide instant context information to users. Unlike a simple list, such as the results of a search, a directory lets users drill down into more and more specific categories (for example Shopping > Clothing > Footwear > Athletic), which explains how the pages fit into the larger set of information.
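One way to picture the drill-down path is as a tree of nested categories. The sketch below is only an assumption about how such a directory might be stored; the structure, the example.com URLs and the function name are hypothetical.

```python
# Hypothetical directory: nested dicts are categories, lists hold page URLs.
DIRECTORY = {
    "Shopping": {
        "Clothing": {
            "Footwear": {
                "Athletic": ["https://example.com/cross-trainers"],
                "Formal": ["https://example.com/oxfords"],
            }
        }
    }
}


def breadcrumbs(node, url, path=()):
    """Return the category path leading to url, or None if it is not listed."""
    for name, child in node.items():
        here = path + (name,)
        if isinstance(child, dict):
            # Descend into the subcategory and stop at the first match.
            found = breadcrumbs(child, url, here)
            if found:
                return found
        elif url in child:
            return " > ".join(here)
    return None


print(breadcrumbs(DIRECTORY, "https://example.com/cross-trainers"))
```

The returned path is the context that a flat list of search results cannot supply.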

Categorization is the process of associating a document with one or more subject categories. So the entry for a page on cross trainer shoes could go into Running, Manufacturing, Sports Medicine, or Rushkoff, Douglas! All of these are legitimate, depending on the context.
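Because one page can legitimately belong to several categories, a rule-driven categorizer can return every match rather than a single answer. The following is a minimal sketch of cataloger-written word and phrase rules; the rule table and category names are hypothetical, and real products support far richer rule languages.

```python
import re

# Hypothetical rules: each category lists words or phrases that place a page in it.
RULES = {
    "Shopping/Clothing/Footwear": ["running shoes", "sneakers", "cross trainer"],
    "Health/Sports Medicine": ["injury", "sprain", "physical therapy"],
}


def categorize(text, rules=RULES):
    """Return every category that has at least one rule phrase in the text."""
    # Normalize to lowercase words separated by single spaces, padded so that
    # phrases only match on whole-word boundaries.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "
    matches = []
    for category, phrases in rules.items():
        if any(f" {phrase} " in padded for phrase in phrases):
            matches.append(category)
    return matches


print(categorize("Cross trainer shoes reduce ankle injury risk."))
```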

Cataloging and Classification come from libraries, where specialists enter the metadata (such as author, date, title and edition) for a document, apply subject categories to it, and place it into a class (such as a call number) for later retrieval. These terms tend to be used interchangeably with Categorization.

Clustering is the process of grouping documents based on similarity of words, or the concepts in the documents as interpreted by an analytical engine. These engines use complex algorithms including Natural Language Processing, Latent Semantic Analysis, Bayesian statistical analysis, and so on.
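To make "similarity of words" concrete, here is a deliberately simple sketch that groups documents by cosine similarity of their word counts. It is not Latent Semantic Analysis or a Bayesian method; the greedy single-pass grouping and the 0.3 threshold are assumptions chosen only to keep the example short.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a document as a bag of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single pass: each document joins the first cluster whose seed
    document is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, [document indices])
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


DOCS = [
    "running shoes for trail running",
    "trail running shoes review",
    "bayesian statistics lecture notes",
    "lecture notes on bayesian inference",
]
print(cluster(DOCS))
```

Note that the clusters come out unnamed: a person still has to decide what each group means and what to call it.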

A Thesaurus is a set of related terms describing a set of documents. This is not hierarchical: it describes the standard terms for concepts in a controlled vocabulary. Thesauri include synonyms and more complex relationships, such as broader or narrower terms, related terms and other forms of words.
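The relationships a thesaurus records can be sketched as a small data structure. The field names and sample terms below are hypothetical, and the lookup shown (mapping a searcher's word to the standard term and its synonyms) is just one possible use.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One controlled-vocabulary entry and its relationships."""
    preferred: str
    synonyms: set = field(default_factory=set)
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)


# Hypothetical entries, keyed by preferred term.
THESAURUS = {
    "footwear": Term("footwear", synonyms={"shoes"}, narrower={"athletic shoes"}),
    "athletic shoes": Term(
        "athletic shoes",
        synonyms={"sneakers", "trainers"},
        broader={"footwear"},
        related={"running"},
    ),
}


def preferred_term(word, thesaurus=THESAURUS):
    """Map a word to the standard term for its concept, if one is known."""
    word = word.lower()
    for term in thesaurus.values():
        if word == term.preferred or word in term.synonyms:
            return term.preferred
    return None


def expand(word, thesaurus=THESAURUS):
    """Return the preferred term plus its synonyms, sorted, for use in a search."""
    pref = preferred_term(word, thesaurus)
    if pref is None:
        return [word.lower()]
    term = thesaurus[pref]
    return sorted({term.preferred} | term.synonyms)


print(expand("Sneakers"))
```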

Taxonomy is the organization of a particular set of information for a particular purpose. It comes from biology, where it's used to define the single location for a species within a complex hierarchy. Biologists have arguments about where various species belong, although DNA analysis can resolve most of the questions. In informational taxonomies, items can fit into several taxonomic categories.

Ontology is the study of the categories of things within a domain. It comes from philosophy and provides a logical framework for academic research on knowledge representation. Work on ontologies involves schemas and diagrams that show relationships as Venn diagrams, trees, lattices and so on.


Organizational Theory section of the Argus Center for Information Architecture
A splendid annotated bibliography of the best works on organizing information and classification as well as indexing, thesaurus construction and controlled vocabularies.

Content Wire Taxonomies News
Links to relevant articles, updated several times a month.

Guided Tour of Ontology
Definitions and background information for ontologies as part of the Semantic Web.

Taxonomy Primer from Lexonomy
Discusses controlled vocabularies for web sites. Includes recommendations for buying and building vocabularies, and applying them to navigation and searching. Special search features include "synonym rings", while a hierarchical arrangement is often known as a taxonomy. Provides additional tips and suggestions based on extensive experience.

Classification Society of North America

International Federation of Classification Societies

Bibliography on Automatic Text Categorization
Fabrizio Sebastiani, a research scientist at Consiglio Nazionale delle Ricerche (Italian National Research Council), provides a listing of scientific research articles on automatic classification and categorization.

Classification, Indexing, Metadata and Thesauri - link page at UMass Amherst

Taxonomy, Classification and Metadata Resources - link page at Scottish electronic Staff Development Library

See also: Faceted Metadata and Search

Articles and Overviews

Classification Tools

Now listed on the Classification Tools page.
Page Updated 2003-07-07