As of January 2012, this site is no longer being updated, due to work and health issues.
by Avi Rappoport, Search Tools Consulting
a version of this article appeared in the July/August 1999 issue of Searcher Magazine
It was a mild spring day in Boston, and the two hundred librarians, researchers and developers gathered to discuss search engines were entirely surrounded by thousands of runners, in town to participate in the Boston Marathon. It occurred to me that search engines are also running a race, trying to keep up with the glut of information on the Web as a whole, and even within intranets and local sites. The conference presentations concentrated on locating the most relevant and useful data, using techniques from Information Retrieval research and theory.
Portalization and Other Search Trends
Danny Sullivan of SearchEngineWatch gave a quick tour of search engines on the Web. He covered the recent trend of search engines turning into portals--sites with many features such as news and directories--in hopes of keeping people around so that the sites can sell more advertising.
Although search is no longer the only feature on these sites, they are trying to make the results more relevant by responding better to common searches, such as "travel" or "Microsoft". Most searches are one- or two-word queries, so other improvements include evaluating page popularity (as evidenced by how many links point at a page), and watching user behavior to learn which sites are popular and which are not.
In addition, all major portals have browsable directories, providing selected and approved links rather than relying on just the relevance of search results (more on directories later). Interestingly, Lycos had just announced that it was licensing the Open Directory, which has volunteer subject specialists approving sites for categories, rather than relying on overworked editors.
Another development on webwide search engines is paid placement of items on results pages. GoTo has been selling placements, with clear labeling, for some time, and just before the conference, AltaVista announced that the company would sell special placements on selected search results pages. For example, a search for travel now brings up a link to a travel insurance page that is clearly labeled as a paid link:
Danny pointed out that there are many search-engine-placement promoters, both legitimate and those who "spam" the relevance algorithms. Given that so many of the searches are short, it's hard to argue that the paid listings are any less relevant than the normal results.
Search Realities faced by end users and professional searchers
Carol Tenopir gave a splendid presentation on the history of user-centered research on searching, and current work in testing user experiences. She reported that the main domains of behavior influencing search experience are cognitive, sensorimotor and "affective behavior", meaning the motivational and emotional involvement of the searcher. The mode of a person's affective behavior determines their perseverance and attention to detail, and therefore their search strategies and satisfaction. For example, some people are 'field-dependent' and need a lot of context or they get lost. Others are very visual, and prefer spatial organization of information.
In general, searchers belong to different groups depending on their experience in the particular topical area, cognitive style, technical aptitude, personality type, age and so on.
A recent study of how people search attempted to measure affective behavior, and discovered that many searchers were concerned about their choices, needing additional confirmation and positive feedback. Another study of anxiety confirmed that stress levels go down when searchers find any results, that the levels fluctuate within a single session, and that stress increases the longer people spend searching and the more web sites they visit.
Carol's conclusions were that everyone needs confirmation of their choices, and that flexibility and options are necessary to provide useful information to a wide variety of searchers.
Quantifiable Results: Testing at TREC
Ellen Voorhees discussed the valuable testing done at TREC (the Text REtrieval Conference) sponsored by the United States National Institute of Standards and Technology (NIST). This series of conferences provides a set of realistic test collections, uniform scoring, unbiased evaluators and a chance to see the changes and improvements of search engines over time. It also offers a forum to exchange and incubate research ideas and accelerate technology transfer--many of the innovative products shown at this meeting grew out of TREC research. Participants in the workshops are both academic and commercial researchers, both US and international.
The TREC test collection consists of about 2 GB of combined newspaper articles and government reports. Participants are allowed to work with the collection and perform their own test runs before the official test period. The Topics are the test questions: they can be much longer and more detailed than most end-user searches, with synonyms and useful key terms. The "Ad-Hoc" task asks search engines to take one of these topics, find the best thousand documents from the collection, and rank them in relevance order.
In recent years, all the search engines seem to have reached the same level of quality, although the queries are getting harder. They have adopted many of the same good practices for text retrieval and ranking: careful weighting based on the location of text in the document, and use of Term Frequency (how often the term appears in a document) and Inverse Document Frequency (which boosts terms that appear in few documents in the collection), modified by document length normalization. In addition, the TREC participants have found that a good query makes all the difference in good retrieval. Automatic query expansion adds terms to the query using a thesaurus, neural net or AI techniques. Pseudo-relevance feedback, where the system assumes that the top results are relevant, gathers key terms from those documents and adds them to the original query for a follow-up search; this works well when the original query was fairly detailed.
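As a rough illustration of these weighting ideas, here is a minimal sketch of TF-IDF scoring with document length normalization (toy data structures and a simplified normalization of my own; real engines also weight by the location of text in the document and tune these formulas heavily):

```python
import math
from collections import Counter

def score(query_terms, documents):
    """Rank documents by a simple TF-IDF score with length normalization.

    documents: list of token lists; query_terms: list of tokens.
    Returns (score, document_index) pairs, best first.
    """
    n_docs = len(documents)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in documents:
        for term in set(doc):
            df[term] += 1
    avg_len = sum(len(d) for d in documents) / n_docs
    results = []
    for i, doc in enumerate(documents):
        tf = Counter(doc)
        s = 0.0
        for term in query_terms:
            if term in tf:
                # Inverse Document Frequency: rarer terms carry more weight.
                idf = math.log(n_docs / df[term])
                # Dampen term frequency by relative document length.
                s += (tf[term] / (len(doc) / avg_len + tf[term])) * idf
        results.append((s, i))
    return sorted(results, reverse=True)
```

Even this toy version shows the key behaviors: a term that occurs in every document contributes nothing (its IDF is zero), and long documents do not win merely by repeating terms.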
The Cross-Language Track requires participants to search documents in several languages for queries in English, French, German and Italian. The queries are easier than the English-only tests, and the results are similar, but there is a long way to go yet. Very diverse systems have similar results, and there is little consensus on the best approach.
The Filtering Track is another hard task. Filtering is in some ways harder than searching, because the filter agent has to make a binary decision: is the document relevant, or is it not. The same ambiguity makes it very difficult to evaluate the results: a document which is irrelevant one day may be relevant the next, due to changes in information needs and vocabulary.
The Web Track is a new test, replicating the heterogeneous "dirty data" of the real Web. It's a 2 GB snapshot of a fairly random set of web data, including graphics and other data types. Unlike on the live Web, frames, scripts and image links are ignored. This track will give web search engine developers a realistic testbed for evaluating their algorithms.
Other tracks include Spoken Document Retrieval, Very Large Corpus (100 GB web document collection), High Precision and so on.
As the Search Engines Meeting continued, it became clear how much the products of today owe to the earlier TREC conferences. Having a stable testbed and clear evaluations is an enormous service to the industry as well as the research community, and we can look for more valuable information to come from TREC in the future.
Improvements to Relevance Ranking of Results
During the meeting, there was a good deal of laudable concern about relevance rankings on webwide search engines and other large sites. When there are thousands of matching documents and very short searches, the terms themselves don't define the relevance: the search engine must use other criteria. After all, the main point is not to make best use of an algorithm, it's to provide useful information to searchers!
Byron Dom of IBM showed how link analysis can find the "best" pages on the Web. His CLEVER system (and many others, including Google and Infoseek) tracks how many sites link to a page, and then increases the weight of that page in the results list. CLEVER has a nice algorithm for deciding how much weight to add, based on the quality of the referring pages and how many other good pages each one links to. By relying on the judgment of experts who prove themselves by linking to other good sites, CLEVER can really improve the results ranking. They have tested the system against traditional term-based retrieval algorithms, and found that the top results are much more relevant and useful. This relevance system can be extended to knowledge management tasks and even automatic creation of taxonomies and directories on the fly. Unfortunately, the product is not available for use on the web, due to legal limitations.
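The link-analysis idea can be sketched as a simple mutual-reinforcement iteration in the spirit of what Byron Dom described (this is a generic hub/authority scheme of my own, not IBM's actual CLEVER algorithm):

```python
def hits(links, iterations=20):
    """Score pages by mutual reinforcement: a page's authority grows with
    the hub quality of the pages linking to it, and a hub's quality grows
    with the authority of the pages it links to.

    links: dict mapping page -> list of pages it links to.
    Returns (authority, hub) score dicts.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages that link here.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # Hub: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for d in (auth, hub):
            norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
            for p in d:
                d[p] /= norm
    return auth, hub
```

The "judgment of experts" emerges naturally: a page pointed to by many good hubs accumulates authority, while a page that links to many good authorities becomes a trusted hub.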
Gary Cullis, the chairman of Direct Hit, gave a presentation on his system for tracking user popularity rankings in search results and building lists of the "best" results for specific searches. Essentially, Direct Hit watches search behavior, and adds or subtracts weight for result items based on whether searchers clicked on them and how long they stayed at the site. For example, if someone clicks on the first item in a result list, then comes right back to the results list, that's considered bad, and the item gets a slight downgrade. On the other hand, if someone clicks on an item that is 9th or 24th in the list, it automatically gets extra weight, as being particularly attractive. HotBot uses it for the "best" results sections, and ZDNet uses it on its portal search.
Obviously, this rewards attractive sites that have interesting titles and nice descriptions, but I think that's fine -- pages with good content tend to have good titles, and vice versa. The aggregation of thousands of search clicks means that unusual searcher behavior, such as opening a site into another window, can be ignored as noise. Like all other search engine companies, Direct Hit is ambitious. They believe that they can add substantial value to e-commerce catalog searching by showing related and popular items. It will be interesting to see if they can do it.
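A toy sketch of this kind of click-based weighting might look like the following (the constants and thresholds here are purely illustrative; Direct Hit's real formula is proprietary):

```python
def update_weight(weight, clicked, dwell_seconds, position):
    """Adjust a result item's popularity weight from one observed interaction.

    clicked: whether the searcher selected this result.
    dwell_seconds: how long they stayed before returning to the results list.
    position: the item's rank in the results list (1 = top).
    """
    if not clicked:
        return weight
    if dwell_seconds < 5:
        # Quick bounce back to the results list: considered bad,
        # so the item gets a slight downgrade.
        return weight - 0.1
    boost = 0.1
    if position > 5:
        # A click far down the list signals an especially attractive item,
        # so it earns extra weight.
        boost += 0.05 * (position - 5)
    return weight + boost
```

Aggregated over thousands of searches, small per-click adjustments like these dominate the noise from individual quirks of behavior.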
Directories and Question-Answering -- Revenge of the Librarians
Librarians told everyone that huge lists of arbitrary results would overwhelm people, and they were right. Now Yahoo! is the single most successful web portal and all the search engines are adding directories. Many systems provide directory results when you search, as well as web pages. For common terms and good matches, the category names are useful context in and of themselves, and the option to view selected sites in the categories is better at offering the "good enough" answer that many people are looking for.
LookSmart has a team of several hundred editors adding sites into one or more categories in a hierarchical directory. They use technology as an aid, but rely on human judgment for their value. They see somewhat different behavior than the other search engines, with an average of 2.4 words per query, and very specific questions, such as "What is Letterman's fax number?".
Another approach is building a large database of questions and associated answers. AskJeeves takes this approach, using humans to create the question and answer database with technology assistance for concept mapping. They also provide an automated metasearch interface to offer answers they don't store internally, and partner with webwide search engines such as AltaVista.
LookSmart slides are at www.infonortics.com/searchengines/boston1999/tomassi/
Results Clustering and Topic Categorization
One fruitful approach to improving results presentation is clustering the found documents into useful groups. While it seems obvious to me that search engines could limit their searches to sites cataloged in their directories, and then cluster by category, I don't know of any that are doing that.
Ev Brenner, chair of the panel on Human vs. Automated Categorization, talked about the history of research on Information Retrieval. He traced the movement away from (human) categories, subject headings and thesauruses towards keyword text searching and Boolean queries. Modern sophisticated systems use statistical retrieval techniques and natural-language processing to provide results, rather than catalogs.
Some search engines perform automatic clustering and categorization on results sets, so they are divided into rough groupings by topic. The NorthernLight Search Engine, for example, has some human-designed categories, but most of its results are clustered into Custom Folders automatically. Their ambition is to index and classify everything, not just web pages. They have their own crawler and work with article database companies such as IAC to search their data and offer it for sale when found by a search.
Marc Krellenstein of Northern Light identified a very interesting and fruitful area of research. He pointed out that genres or types of web pages have different formats and can be analyzed differently. For example, a product information page is very different from a list of technical support answers, and both are different from personal home pages. Future search engines and clustering algorithms may be able to use these distinctions to better weight results for certain queries or cluster documents. However, in my search engine tests, I've found that NorthernLight's Custom Folders are arbitrary at best and often distressing -- I found a lot of porn in the most promising group when I searched for information about hotel rooms in London, as well as a folder named "Spike Lee", which made no apparent sense.
James Callen of the University of Massachusetts made the academic case for full-text search with modern relevance rankings being the best approach for information retrieval. He pointed out that categorization is hard to create and maintain, and that even humans can disagree on relevance and classification. So automated categorization can provide significant benefits, allowing browse access even when there are no resources for human evaluation. Jan Peterson of Infoseek, who was in the audience, said that Infoseek was no longer using automated categorization. He said that while humans may disagree, they almost never make gross mistakes in adding pages or sites to categories, whereas automated systems do so to a greater or lesser degree.
The consensus of the panel, and of the meeting, was that automation can help humans, and that automated categorization works best when humans can provide a reality check on the systems.
Marc Krellenstein's slides are at www.infonortics.com/searchengines/boston1999/krellenstein/index.htm
Summarization

Summarization attempts to reduce document text to its most relevant content based on the task and user requirements. Obviously, this is a slippery process at best, but DARPA (the US Defense Advanced Research Projects Agency, early funders of the Internet) has built a testbed system for this research, called SUMMAC. In conjunction with TREC, they have worked out a large set of test documents and relevance judgments, and invited several different summarization engines to participate between 1992 and 1998. The group was trying to discover if the use of summaries could improve the efficiency of searching without loss of accuracy relative to full-text retrieval and relevance. Therese Firmin of the US Defense Dept. explained how it works.
Fifteen groups from all over the world, mainly from research institutions but also from industry (British Telecom, Lexis-Nexis, GE, IBM, TextWise, etc.), participated in the testing. They used various techniques including term frequency and location, co-reference and co-occurrence, passage extraction, etc.
Results indicated that many documents can be summarized successfully, but fixed-length summaries are significantly less helpful than variable-length summaries. The Information Retrieval methods applied to this task work well for query-focused summarization, because the topic focuses the summarization effort.
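To make the query-focused idea concrete, here is a naive extractive sketch of my own (not any SUMMAC participant's actual method): the query terms focus the summarization effort by selecting the sentences that overlap them most.

```python
def query_focused_summary(sentences, query_terms, max_sentences=3):
    """Naive extractive, query-focused summarization: keep the sentences
    sharing the most terms with the query (a toy version of passage
    extraction), and return them in their original order.
    """
    q = set(query_terms)
    # Rank sentence indices by overlap with the query terms.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: len(q & set(sentences[i].lower().split())),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

Because the output length depends on how many sentences actually match, a sketch like this naturally produces variable-length rather than fixed-length summaries.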
The test was difficult and labor-intensive, but developers of this kind of software need repeatable, objective evaluations. Users will accept good summaries and use them for scanning and familiarization, and summaries may be an appropriate substitute for the full text of documents when assessing relevance. Summarization will be widely useful on the Web, especially for search results, for cross-language research, and even multimedia.
Visualization of Results

When people say "visualization", they mean two things. One is simply to use graphical elements carefully in displaying the results of searches; the other is attempting to display the search results graphically, in two or three dimensions, grouped by topic or category. Both approaches are meant to take advantage of the human capacity to process visual information quickly and efficiently. Good visualization should integrate the natural and technical worlds, using natural intuition, spatial cues and perception.
An example of a clustered visual display of retrieved information comes from Cartia Themescape (acquired by Aurigin):
However, even the most ardent advocates, such as James Wise of Integral Videos, admit that there are more software demos than working systems, and that the field has no theoretical underpinnings or widely accepted evaluation criteria. His demonstrations showed that the value of visualization of search results is in organizing and grouping sets of documents so that items on similar topics are close, and the searcher can understand the relationships among large numbers of documents. This relies enormously on the quality of the algorithms used to determine document similarity: I think that the success of this approach depends very strongly on the needs and experience of the searcher.
Carol Tenopir had already pointed out that users can have very different cognitive styles and affective behavior. I realized during this conference that some people are so visually-oriented that they pour their time and effort into developing these systems. Although visualization search products have not succeeded, the fervor of the researchers and developers indicates that for those people, and for the rest of us in some circumstances, visualization is compelling and valuable.
Spoken Document Retrieval

Michael Witbrock, now at Lycos, spoke about projects at Carnegie Mellon on retrieval of spoken documents, including radio programs, dictated text, all the amateur and professional video files now on the Web, and future speech digitizing. He didn't include my best candidate for spoken documents that need retrieval, conference tapes, but I believe that most of his comments apply to that format as well.
Before you can retrieve information from spoken documents, you have to have speech recognition. That technology is now at a good point, where the error rate has gone down to about 5% and there is a commercial market for products. Speech recognition algorithms work for all languages, from English to Japanese and Chinese, and are now encompassing variants from children's to elderly speech.
There has been some good research and evaluation in speech IR, and a track at TREC. Research using CMU's Informedia system has had good results. Measuring the effect of recognition errors on retrieval shows that as the word error rate reaches 30%, the query recall starts to fall. Surprisingly, in small collections or those with documents that match the queries well, the random errors were overwhelmed in the retrieval algorithm, but larger collections and difficult queries have a larger probability of spurious matches. Approaches to improve results include mixed word and phoneme text retrieval, weighting by speech recognition quality, and heuristic estimates of word correctness.
Spoken document retrieval is inevitable, and even feasible now for small collections of text.
I missed the Video Retrieval talk, but believe that many of the topics on Image Content-Based Retrieval are covered in the previous year's presentation at www.infonortics.com/searchengines/boston1998/petkovic/petkovic.htm, at IBM's CueVideo site, and at the IBM Query by Image Content site.
Filtering and Routing and Intelligent Agents
Filtering and Routing allow individuals to set up criteria for incoming data (news feeds, email, press releases, etc.), and only be notified or sent those items that match their interests. Sometimes routing is distinguished by acting on arrival rather than in batches.
All of these technologies rely in part on information retrieval techniques, although they each must adapt those techniques to their specific needs. In particular, the relevance judgment in a filtering operation is binary: the document is either relevant or it is not, unlike a search result. TREC has a filtering track and has had a routing track in the past. In a pre-conference workshop, David Evans of ClariTech described the important need to enrich the initial profile with user feedback, allowing the system to become more precise and useful over time.
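The binary decision and profile enrichment can be sketched as follows (a toy term-weight profile of my own devising, not ClariTech's actual system):

```python
def filter_decision(doc_terms, profile, threshold=0.5):
    """Binary filtering decision: unlike a ranked search result, the
    filter must answer yes or no for each arriving document.

    profile: dict of term -> weight representing the user's interests.
    """
    score = sum(profile.get(t, 0.0) for t in set(doc_terms))
    return score >= threshold

def enrich_profile(profile, doc_terms, relevant, rate=0.1):
    """Adjust the profile from user feedback, so the filter becomes
    more precise and useful over time."""
    for t in set(doc_terms):
        profile[t] = profile.get(t, 0.0) + (rate if relevant else -rate)
    return profile
```

The interesting design problem is visible even here: the threshold forces a hard cutoff on what is really a continuous relevance score, which is exactly why filtering is harder to evaluate than ranked search.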
Intelligent Agents travel a network or the Internet to locate data or track web site changes, evaluating the items using relevance judgments like those of search engines. John Snyder of Muscat described how search agents also need an adaptive query that changes according to the needs and the data available. Agents can also benefit from automatic classification and summarization technologies.
Cross-Language Information Retrieval
CLIR means querying in one language for documents in many languages. It's becoming more important due to the globalization of the economy and the use of multiple languages in corporations and working groups. Approaches include machine-readable dictionaries, parallel and comparable corpora, a generalized vector space model, latent semantic indexing, similarity thesauruses and interlinguas. TREC has had a CLIR track for the past several years, where many of these approaches have been tested.
Elizabeth Liddy of TextWise described their Conceptual Interlingua approach, which uses a concept space where terms from multiple languages are mapped into a language-independent schema. This technique is used for both indexing and querying, and does not require pairwise translation (dictionaries for both the query language and each document language).
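The interlingua idea can be illustrated with a toy concept table (the concept IDs and vocabulary here are invented for illustration; TextWise's actual conceptual schema is far richer):

```python
# Language-specific terms mapped into a language-independent concept space.
CONCEPTS = {
    "dog": "C1", "chien": "C1", "hund": "C1",
    "house": "C2", "maison": "C2", "haus": "C2",
}

def to_concepts(terms):
    """Index or query in concept space rather than in any one language."""
    return {CONCEPTS[t] for t in terms if t in CONCEPTS}

def match(query_terms, doc_terms):
    """Cross-language match via shared concepts: an English query finds a
    French document with no English-French dictionary lookup at all."""
    return to_concepts(query_terms) & to_concepts(doc_terms)
```

Because both indexing and querying map into the same concept space, adding an Nth language requires only one new mapping rather than N-1 pairwise translation dictionaries.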
Both Daniel Hoogterp of Retrieval Technologies and Rick Kenny of PCDocs / Fulcrum described how search fits into corporate knowledge management. These systems attempt to integrate all the information in an institution and allow access in many ways, for a broad base of users. They can make use of the most sophisticated techniques of natural language processing, visualization, summarization, cross-language retrieval, and directory creation.
Data mining means evaluating large amounts of stored data and looking for useful patterns, like the number of widgets sold in November in Peoria. David Evans of ClariTech described the emerging field of text mining, concerned with leveraging the text stored in corporate systems to anticipate trends and address problems. Banks and other institutions have call center notes, surveys, email and many other repositories of data that could be used to make better decisions, but that are not now accessible: this information is free text, so structured queries will not find it; it is of non-uniform quality and language; and there is a lot of it. Text mining uses techniques from information retrieval and other fields to analyze internal structure, parse the content, and provide results, clustering, summarization, and so on. With automatic event identification, conditional responses, reuse of analysis, and graphic presentation of results, users can skim the best of the information easily. They can test hypotheses and investigate situations in ways that were never available before, and use the data to make smarter decisions.
Slides are at: www.infonortics.com/searchengines/boston1999/evans/index.htm
Coming soon to a search engine near you
Steven Arnold of AIT gave a quick overview of the leading edge in search and retrieval software, with many useful buzzwords to be aware of. He discussed research results, technical issues and market realities, with entertaining and pointed comments on the technology. He sees a future with search everywhere, with more sophisticated natural language processing and results clustering. There are more free search engines than ever, and Windows 2000 will come with Microsoft Index Server built in. Future search may use visualization and graphical displays catering to the younger generation's preferences, and portals will spring up everywhere.
Ramana Rao, of the Xerox spinoff company Inxight, described a theoretical view of information-seeking behavior, with ideas about how to "reduce friction" in information access. He, and many others at this conference, see search and browse as two ends of the same spectrum, and a future with searching providing access to multimedia, live news feeds, operational reports, and more, using visualization, language identification, genre detection, summarization, categorization, clustering of search results, metasearch, personalization, channels, recommendations and more. This will all create a world with search "deeply intertwingled" in other tasks.
Ramana Rao's 7 ± 2 Insights on Search Engines
1 & 2: Keep your eye on constraints & resources
3: Integrate search & browsing
4: Increase the interaction bandwidth
5: Leverage extraction & arrangement functions
6: Balance automation & gardening
7: Consider the context of access
8: Search all kinds of stuff (many file formats, including multimedia)