As of January, 2012, this site is no longer being updated, due to work and health issues
SearchTools.com: Background Topics
Information Retrieval Research
Information Retrieval is the academic discipline which underlies
computer-based text search tools. It tends to concentrate on mathematical
models and algorithms for retrieval quality, but there is a great deal of
valuable research in the field.
Useful IR Sites
- Berkeley Digital Library SunSITE
A wide-ranging site dealing with many aspects of information retrieval and
digital information. Also the home of the SWISH-E
- TREC (Text REtrieval Conference)
A "shootout" conference between various approaches to indexing and
searching for data in very large collections has identified the most successful
approaches in information retrieval. Sponsored by the National Institute
of Standards and Technology (NIST), the conference helps translate theory into
practice, and provides an objective testbed.
Asian Text Retrieval
Evaluation of Asian language text retrieval, question answering and text
summarization, following on the US NIST TREC workshops. Also includes cross-language
information retrieval in Chinese, Korean, Japanese and English. Each round runs
for a year; participants get a chance to perform tests, participate
in discussions, receive evaluations of their software and publish their
results. Anonymous participation is permitted.
- Web IR and IE (Information
Excellent links to conferences, books, research and other related topics,
with short summaries of each link
- SIGIR: Special Interest Group on Information
Retrieval of the ACM (Association for Computing Machinery).
The group covers all aspects of information storage, retrieval and dissemination,
including research strategies, output schemes and system evaluations. The
site mainly provides information on conferences and calls for papers.
- Information Retrieval at the Illinois Institute
Research group with projects in improving retrieval performance, efficiency,
visualization, integrating structured data and text, and so on. Includes an
excellent IR links page.
Books and Articles
- Recommended Reading for IR Research Students [PDF, 1.5 MB] SIGIR Forum, December 2005 by Alistair Moffat, Justin Zobel and David Hawking.
Extensive annotated bibliography of the most important works in Information Retrieval since 1997. Covers topics including TREC results, scaling issues, index compression, multilingual retrieval, multimedia retrieval, statistical, vector and probabilistic approaches, evaluation and testing, and much more.
- After the Dot-Bomb:
Getting Web Information Retrieval Right This Time
First Monday, July 2002 by Marcia Bates
An academic expert on usable information retrieval suggests that if web entrepreneurs
and VCs had known about the history of IR and library experiences, they would
not have wasted investments in problematic approaches such as "push"
technology. She offers seven suggestions to improve web retrieval: use faceted
rather than hierarchical classification; don't try for a single "true"
classification (and avoid the term 'ontology'); use subject and domain information
retrieval vocabulary; remember the Bradford distribution; plan for explosive
growth; provide tools for "human content processing"; learn from
the history of information retrieval.
Bloug, July 10, 2002 by Lou Rosenfeld
A librarian and leading expert in Information Architecture comments on
Marcia Bates' article.
Vox Populi: The Public Searching of the Web December 2001 (v52n12),
Journal of the American Society for Information Science and Technology, by
Dietmar Wolfram, Amanda Spink, B.J. Jansen, Tefko Saracevic
Looks at trends in web search between 1997 and 1999, including fewer terms
in queries, fewer queries per "session", and users viewing fewer
pages of results per query. Topics of searches also changed.
Research from the WWW10 Conference SearchEngineWatch; June 18, 2001
by Danny Sullivan
Short summaries of many technical articles relating to search and information
retrieval presented at the Tenth World Wide Web Conference (May 1-5 2001, Hong Kong).
- A Case Study in Web Search
using TREC Algorithms Proceedings of the WWW10 Conference, May 2-5
2001, Hong Kong, by Amit Singhal and Marcin Kaszkiel (formerly at AT&T
Labs, now at Google)
This experiment compared the standard information-retrieval algorithms used
at TREC with Excite, Google, Lycos (FAST) and AltaVista.
The experimenters used real queries from the Excite logs, limiting them to
those seeking home pages or web sites and excluding queries whose words obviously
match a domain name, such as "ibm". They then indexed and searched
a 17.8 million web page collection using a successful standard TREC "ad-hoc"
(interactive) algorithm. When they compared this to the commercial search
engines, they found that the TREC algorithm was much less likely to find the
home page desired. This is probably due to high-frequency matches of the query
words in other pages.
the Web: The Public and Their Queries February 2001 (v52n3) Journal
of the American Society for Information Science and Technology by Amanda Spink,
Dietmar Wolfram, B.J. Jansen, Tefko Saracevic
Based on analysis of the Excite search engine session logs, finds that
most searches are short and simple, about half search for one item per session,
and almost all people view only one or two results pages. While a small number
of search terms are common, there are also many unique words. Most of these
findings confirm the authors' earlier studies.
Investigation Into the Use of Simple Queries on Web IR Systems Information
Research 2000, by B.J. Jansen
Research looked at short queries (2 to 4 terms), comparing the first 10 results
for simple queries of just the words against complex queries with Boolean
or other query operators. Given the large amount of overlap between these
two modes (about 7 out of 10 were the same), concludes that the search engines
do well enough with just simple queries. The search engines are implementing
the heuristic rule: Place at the top of the results list, those documents
that contain all the query terms and that have all the query terms near each other.
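The heuristic described in this entry (documents that contain every query term, with those terms close together, go to the top of the list) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `tokenize`, `min_span` and `rank`, the whitespace tokenizer, and the toy documents are hypothetical and are not taken from the study or from any real search engine.

```python
from itertools import product

def tokenize(text):
    # Assumption: lowercase whitespace tokens are good enough for a sketch.
    return text.lower().split()

def min_span(tokens, terms):
    """Length in tokens of the smallest window covering every term,
    or None if at least one term is missing from the document."""
    positions = []
    for term in terms:
        hits = [i for i, tok in enumerate(tokens) if tok == term]
        if not hits:
            return None
        positions.append(hits)
    # Brute force over one position per term; fine for 2 to 4 term queries.
    return min(max(combo) - min(combo) + 1 for combo in product(*positions))

def rank(docs, query):
    """Documents with all terms come first, tightest window first;
    documents missing a term keep their original order at the end."""
    terms = tokenize(query)

    def key(item):
        span = min_span(tokenize(item[1]), terms)
        return (1, 0) if span is None else (0, span)

    return [name for name, _ in sorted(docs.items(), key=key)]

# Hypothetical toy collection.
docs = {
    "spread": "search engines rank web pages",
    "adjacent": "web search tools for sites",
    "partial": "a guide to searching",
}
print(rank(docs, "web search"))  # ['adjacent', 'spread', 'partial']
```

A production engine would combine proximity with other signals; the point here is only that a plain multi-word query can be ranked this way without the searcher typing any Boolean operators.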
Next Generation Web Search: Setting Our Sites, in
IEEE Data Engineering Bulletin Special issue on Next Generation Web Search
September 2000 by Marti Hearst
Proposes that new site search engines provide better search experiences by
using metadata and categories to integrate browsing features with search results.
Also mentions other aspects of web search, including question-answering.
Web: The Next Generation, Proceedings of the 9th World Wide Web Conference
(WWW9) Elsevier May 2000 ISBN 0-444-50515-6
Topics include rich query languages, search behavior, relevance ranking, context,
XML and data mining.
Processing & Management Pergamon / Elsevier Science International
Journal, US $919 per year
Covers current research in information retrieval, classification, user
interaction, evaluation, etc. Volume 36, number 2 concentrates on Web search
Strategies in Web Searching Proceedings of the Human Factors &
the Web conference, June 3, 1999 by Raquel Navarro-Prieto, Mike Scaife, and
Results of a small study on web-searching behavior, including cognitive
strategies and interaction with the systems.
- Modern Information
Retrieval Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Addison-Wesley
Longman, May 1999, ISBN 020139829x, $50
Introduction to the current state of information retrieval, including
changes brought by the Web to a field that was previously oriented towards
academia, libraries and corporate networks. Most of the book is not online,
but the chapter on User
Interfaces and Visualization in Modern Information Retrieval, written
by UC Berkeley professor Marti Hearst, provides a valuable academic study
of interfaces for information retrieval and searching, including graphical
overviews and visualization. Get the book from Amazon
and we'll get an affiliate fee.
Gigabytes: Compressing and Indexing Documents and Images (2nd edition)
by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Morgan Kaufmann Publishers;
April 1999; ISBN 1558605703, $54.95
Covers the problems of very large document collections, including compression,
indexing and querying options. Praised by Steve Kirsch of Infoseek, among
others. See also the MG web site
and the excellent reviews.
If you want to buy it from Amazon,
please use this link to support the SearchTools site.
- Report from the 1999
Search Engines Meeting April 1999 by Avi Rappoport.
The presentations and discussions there provided insight into the variety
of options for information retrieval, and the opportunities to go beyond basic
text matching to locate the truly relevant materials. The 1998 meeting apparently
had three main themes: the debt that current search engines owe to Information
Retrieval research, the need to make searching easier, and the convergence
of browse-oriented directories with full-text indexes.
- Mira: Evaluating interactive information
retrieval April 1999, Glasgow UK
Papers from a workshop attempting to measure effective interactions with
Storage and Retrieval Robert R. Korfhage, John Wiley & Sons, June
1997, ISBN 0471143383, $49.95.
text search matters Sunworld, September 1995 by Bill Rosenblatt
Compares text searching to database field searching -- a very useful distinction.
- A World Wide Web
Resource Discovery System WWW4 Conference, December 11-14, 1995 by
Budi Yuwono et al.
Useful introduction to Information Retrieval principles as applied to
- Information Retrieval,
2nd Edition [textbook] C.J. van Rijsbergen, 1979
Complete contents of this book available online: covers the state of research
twenty years ago.
Academic Research Search Engine Projects
A UC Berkeley research project, provides search results in outline form
with the titles of the parent documents displayed, to provide a context.
Rather than showing all the meta description / page start information
in the results list, this system allows searchers to click on an icon
and see that additional data in a frame on the right.
A research project in IBM's Almaden Labs concentrating on providing the
best and most authoritative information on a topic. It does this by tracking
the number of links pointing to a site or page, and the number of links
pointing to other popular pages.
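Scoring a page both by the links pointing to it and by the links it makes to other well-regarded pages is the idea behind hubs-and-authorities link analysis. The sketch below illustrates that general technique and is not the project's actual code; the function name `hubs_and_authorities`, the fixed iteration count, and the toy link graph are all assumptions made for the example.

```python
import math

def hubs_and_authorities(links, iterations=20):
    """links maps each page to the list of pages it links to.
    Returns (hub, authority) score dictionaries."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for target in targets:
                auth[target] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy graph: two pages link to "guide", one also links to "faq".
links = {"portal": ["guide", "faq"], "blog": ["guide"], "guide": []}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get))  # guide
print(max(hub, key=hub.get))    # portal
```

In this toy graph "guide" is the strongest authority because both other pages link to it, and "portal" is the strongest hub because it links to every page that receives links.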
the Web Scientific American June 1999 by Members of the Clever
Project ($7.95 or digital subscription required)
A project of the NCSA (National Center for Supercomputing Applications) to
build a new search infrastructure. This will allow people to search for
information across many large databases, even if they are structured in
A research project on automatically generating lists of common terms and
phrases in context.