As of January 2012, this site is no longer being updated, due to work and health issues.

XML and Search

About XML

Web pages, written in HTML, include information about how to view text: your browser knows that an H1 means a very large header line. But HTML doesn't give us a way to describe the contents of the text: the meaning is lost because there is no way to tag it. For example, if you have a catalog of hand-carved doors, you probably want to talk about the size, weight, materials, color, price and so on, and it would be great if your browser would sort the list in various ways or let you import the list into a database.

XML, the eXtensible Markup Language, will make this happen. XML is not a set of tags itself: it provides a standard system for browsers and other applications to recognize the data in a tag. Unlike proprietary formats, XML formats are open to all. Consumers will benefit in several ways: we will have more control of our data, there will be many more useful little programs because the files are easy to read, and we will be able to use the data even when the original applications are long gone. When applied to the Web, XML makes information interchange much richer and more interesting: Tim Berners-Lee calls this the Semantic Web.
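To make the door catalog idea concrete, here is a minimal sketch in Python. The tag names (catalog, door, name, material, price) and the sample entries are invented for illustration; they are not part of any standard. The point is only that once the data is tagged, a small program can read and sort it.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: the tag names and entries are assumptions for this example.
CATALOG = """
<catalog>
  <door>
    <name>Oak Panel</name>
    <material>oak</material>
    <price currency="USD">1200</price>
  </door>
  <door>
    <name>Pine Arch</name>
    <material>pine</material>
    <price currency="USD">650</price>
  </door>
  <door>
    <name>Walnut Relief</name>
    <material>walnut</material>
    <price currency="USD">1800</price>
  </door>
</catalog>
"""

def doors_by_price(xml_text):
    """Return (name, price) pairs sorted from cheapest to most expensive."""
    root = ET.fromstring(xml_text)
    doors = [
        (door.findtext("name"), float(door.findtext("price")))
        for door in root.iter("door")
    ]
    return sorted(doors, key=lambda pair: pair[1])

if __name__ == "__main__":
    for name, price in doors_by_price(CATALOG):
        print(f"{name}: {price:.2f}")
```

The same parsed records could just as easily be sorted by material or written out as rows for a database import.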

XML and Search

Many people say that XML will "solve the search problem". By this, they mean that instead of searching the whole text of a page, search engines could use the XML tags as fields to specify which parts of the page to search. This should improve search results, avoid irrelevant pages, and provide more precise listings of the information available.

For example, searching for "Albert Einstein" in a webwide search engine will bring up colleges named after him, university lectures mentioning him, links to sites about him, solar folklore pages, Internet philosophy in German, Jewish History, pages of quotations, biographies of his colleagues, and so on. While these are interesting, you may just have wanted to learn what books and articles he wrote. Wouldn't it be nice if you could just ask for Albert Einstein as an author, and avoid all the irrelevant pages? Even for local site and intranet searching, having fields of data would make it easier to find the most useful items.
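The difference between the two kinds of search can be sketched in a few lines of Python. The records, element names, and two of the three authors below are made up for the example, and a real search engine would use an index rather than scanning every record; this only shows why a field restriction trims the result list.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; element names and most values are assumptions.
RECORDS = """
<library>
  <book>
    <title>Relativity: The Special and the General Theory</title>
    <author>Albert Einstein</author>
  </book>
  <book>
    <title>A Biography of Albert Einstein</title>
    <author>Jane Example</author>
  </book>
  <book>
    <title>Lectures on Physics</title>
    <author>Sam Sample</author>
    <summary>Mentions Albert Einstein in passing.</summary>
  </book>
</library>
"""

def full_text_search(root, phrase):
    """Titles of records where the phrase appears anywhere in the record."""
    phrase = phrase.lower()
    return [
        book.findtext("title")
        for book in root.iter("book")
        if phrase in " ".join(book.itertext()).lower()
    ]

def field_search(root, field, phrase):
    """Titles of records where the phrase appears inside one named field."""
    phrase = phrase.lower()
    return [
        book.findtext("title")
        for book in root.iter("book")
        if any(phrase in (el.text or "").lower() for el in book.iter(field))
    ]

if __name__ == "__main__":
    root = ET.fromstring(RECORDS)
    print(full_text_search(root, "Albert Einstein"))        # every record matches
    print(field_search(root, "author", "Albert Einstein"))  # only the book he wrote
```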

However, adding this information will require extra effort from both site designers and search tools developers. It will only be worth an early investment for sites with very large amounts of data or intensive search needs, but the infrastructure will improve over time for the rest of us.

These are the things that must happen before XML can solve the search problems.

As you can tell, I don't think that XML will solve "the search problem" any time soon. However, in the long run, XML will provide many of the benefits of database searching while still retaining the simplicity of plain text searching. And this is a good thing.

How Text Search Relates to XML Query Languages

Right now, most search tools index and search mainly unstructured documents, such as text, HTML, word processor and PDF files: I'm calling these text search engines. As XML documents provide structured data storage, many people see a need for database-style query languages which provide complex ways to access the data. Such languages are used to retrieve data, in units as small as the text of a single tag, and then group the results to create reports or compute answers to questions such as "how many widgets did we sell in March?" There are significant differences between this kind of querying and text searching.
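As a rough illustration of the database-style side, the widget question can be answered with the limited XPath subset built into Python's ElementTree. This is not XQuery, and the sales data, element names, and month format are assumptions made up for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sales log; structure and values are invented for illustration.
SALES = """
<sales>
  <sale month="2002-03"><item>widget</item><quantity>12</quantity></sale>
  <sale month="2002-03"><item>gadget</item><quantity>5</quantity></sale>
  <sale month="2002-03"><item>widget</item><quantity>8</quantity></sale>
  <sale month="2002-04"><item>widget</item><quantity>30</quantity></sale>
</sales>
"""

def units_sold(xml_text, item, month):
    """Total quantity of one item sold in one month."""
    root = ET.fromstring(xml_text)
    # Two predicates: an attribute test on <sale> and a child-text test on <item>.
    path = f".//sale[@month='{month}'][item='{item}']"
    return sum(int(sale.findtext("quantity")) for sale in root.findall(path))

if __name__ == "__main__":
    print(units_sold(SALES, "widget", "2002-03"))
```

Note that the answer is a computed number, not a list of matching documents, which is one way this kind of querying differs from text search.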

XML Query Protocols

XML Query Working Group

April 2002: see drafts on XQuery 1.0, use cases (especially the Full-text Search section), and integration with XPath 2.0.

Note: there is no concept of relevance ranking in XQuery 1.0, although it's possible to sort results by field contents or expressions.

Introduction to XQuery - written by Howard Katz of XML Query Engine, describes the derivation of the W3C XQuery proposal from XPath and SQL.

Articles on XML Queries Relating to Text Search

Integrating Keyword Search into XML Query Processing 9th World Wide Web conference, Amsterdam: May 15-19, 2000 by Daniela Florescu, Ioana Manolescu, and Donald Kossmann
Proposes using keyword search techniques to approach structured querying over entire sets of XML documents. This would map IR query language to SQL-style queries on fields or even whole records using a "contains" clause and controlling the depth of the subtree evaluation. It recognizes the value of an inverted index, in this case within a relational database.

Observations on Structured Document Query Languages, which starts with a section on integrating structured and full-text queries.

Supporting XML Query Languages in Web Search

XQuery will be the W3C standard for searching XML documents. However, as it does not have any concept of relevance ranking, it doesn't fit well with Web search engines. It would be nice if the public Web search engines could support the query syntax, in an appropriately limited way, to provide free-text access to the indexed documents.
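For readers unfamiliar with relevance ranking, here is a deliberately simple sketch of the idea: each document gets a score based on how often the query words occur, and results come back best first. The scoring formula (query-term count divided by document length) and the sample documents are assumptions chosen for brevity; real search engines use far more elaborate measures.

```python
import re
from collections import Counter

# Hypothetical document texts keyed by file name.
DOCS = {
    "a.xml": "XML search tools index XML documents and XML tags.",
    "b.xml": "A short note about search.",
    "c.xml": "Cooking with garlic.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(docs, query):
    """Return (doc_id, score) pairs, best first.

    Score = occurrences of query terms / number of words in the document.
    Documents with no matching term are left out.
    """
    terms = tokenize(query)
    results = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        counts = Counter(words)
        hits = sum(counts[term] for term in terms)
        if hits:
            results.append((doc_id, hits / len(words)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for doc_id, score in rank(DOCS, "xml search"):
        print(doc_id, round(score, 3))
```

A pure query language would return both matching documents as an unordered or field-sorted set; the ranked version puts the document that is more about the query on top.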

However, text search engines do not have to wait for a query language to start providing access to XML documents. AltaVista, FAST (AllTheWeb and Lycos) and Google now index XML files, so they can be searched as free text. As more XML files are published, these engines can start work on storing and searching tag hierarchies in their indexes.

Links: search indexing robots traditionally follow links in HTML documents. How should they handle links in XLink or XPointer format?
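One possible answer for the simplest case is sketched below: a robot could collect the xlink:href attribute from any element, much as it collects href from HTML anchors. The sample document and its element names are hypothetical, and the sketch handles only simple links; extended links and XPointer fragments (left attached to the URL here) would need more thought.

```python
import xml.etree.ElementTree as ET

# The XLink namespace URI defined by the W3C XLink recommendation.
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical document; element names are invented for illustration.
DOC = """
<guide xmlns:xlink="http://www.w3.org/1999/xlink">
  <ref xlink:type="simple" xlink:href="chapter2.xml">Next chapter</ref>
  <ref xlink:type="simple" xlink:href="http://www.example.com/glossary.xml#terms">Glossary</ref>
  <note>No link here.</note>
</guide>
"""

def extract_links(xml_text):
    """Return every xlink:href value in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree stores namespaced attributes as "{namespace-uri}localname".
    href_key = f"{{{XLINK_NS}}}href"
    return [el.get(href_key) for el in root.iter() if el.get(href_key)]

if __name__ == "__main__":
    for link in extract_links(DOC):
        print(link)
```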

Document Metadata

Metadata is information about information, so document metadata describes the file or page: for example, the author, date, and topic. Adding metadata to documents and indexing the fields provides substantial benefits, allowing searchers to limit their results in useful ways. For example, when searching for word processor reviews on an information site, you may want just recent reviews rather than news. Search tools that store these metadata values allow searchers to improve the precision of the search and get fewer, more relevant, results. For more information, see the Metadata page.
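The word processor example might look like this in a toy search function. The page records, the field names (type, date), and the title-only matching are all assumptions for the sketch; the point is that stored metadata lets the searcher narrow the results after the words have matched.

```python
from datetime import date

# Hypothetical indexed pages with stored metadata fields.
PAGES = [
    {"title": "Word processor roundup", "type": "review", "date": date(2001, 11, 3)},
    {"title": "Word processor vendor merges", "type": "news", "date": date(2001, 12, 1)},
    {"title": "Word processor shootout", "type": "review", "date": date(1997, 5, 20)},
]

def search(pages, words, doc_type=None, since=None):
    """Match all query words in the title, then apply optional metadata limits."""
    wanted = words.lower().split()
    hits = []
    for page in pages:
        title = page["title"].lower()
        if not all(word in title for word in wanted):
            continue
        if doc_type is not None and page["type"] != doc_type:
            continue
        if since is not None and page["date"] < since:
            continue
        hits.append(page["title"])
    return hits

if __name__ == "__main__":
    print(search(PAGES, "word processor"))
    print(search(PAGES, "word processor", doc_type="review", since=date(2001, 1, 1)))
```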

Structural metadata refers to the tags within the document. For example, an <author> tag tells us that the data within the tag will be an author's name. However, this might be the author of a book described in the document, the author of a movie review, and so on. This metadata can also be searched, if the indexing application can recognize and store it correctly.
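One way an indexer could keep the two kinds of author apart is to store each value under its full element path rather than under the bare tag name. The sketch below assumes a made-up review document and a simple slash-separated path format; it is an illustration of the idea, not a description of how any particular search tool works.

```python
import xml.etree.ElementTree as ET

# Hypothetical review document; names and structure are invented.
DOC = """
<review>
  <author>Pat Reviewer</author>
  <book>
    <title>An Example Novel</title>
    <author>Lee Writer</author>
  </book>
</review>
"""

def index_by_path(xml_text):
    """Map each element path (e.g. 'review/book/author') to the text found there."""
    index = {}

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        text = (element.text or "").strip()
        if text:  # skip elements that only hold whitespace and child elements
            index.setdefault(path, []).append(text)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return index

if __name__ == "__main__":
    for path, values in index_by_path(DOC).items():
        print(path, values)
```

With this kind of index, a search limited to review/book/author finds the book's author, while review/author finds the reviewer.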

On the Web, there are several conventions that page creators and search engines already use for indexer control, and it's unclear how well XML fits in with these.

Related Information and Articles

XML: Text and Context Lou Rosenfeld, WebReview, February 5, 1999
Discussion of the nature of "granularity" of text, how chunks of content can be divided up rather than stored and searched all together. Note also the Brief "Data Does Not Equal Information" Digression, contrasting information architecture with data modeling (and information retrieval with data mining).

XML and Semantic Transparency Robin Cover, November 24, 1998
Academic discussion of the problems with considering XML field tags to be "semantic" (meaningful) rather than syntactic (structural). This may be specifically due to the limits of the DTD format, and may be addressed by future schema standards.

XML Information Listings

Page Updated 2001-12-28