As of January 2012, this site is no longer being updated, due to work and health issues.
Web pages, written in HTML, include information about how to view text: your browser knows that an H1 means a very large header line. But HTML doesn't give us a way to describe the contents of the text: the meaning is lost because there is no way to tag it. For example, if you have a catalog of hand-carved doors, you probably want to talk about the size, weight, materials, color, price and so on, and it would be great if your browser would sort the list in various ways or let you import the list into a database.
XML, the eXtensible Markup Language, will make this happen. XML is not a set of tags itself: it provides a standard system for browsers and other applications to recognize the data in a tag. Unlike proprietary formats, XML formats are open to all. Consumers will benefit as we will have more control of our data, there will be many more useful little programs because it's easy to read the files, and we will be able to use the data even when the original applications are long gone. When applied to the Web, it makes information interchange much richer and more interesting: Tim Berners-Lee calls this the Semantic Web.
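As a rough illustration of the door catalog idea, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The element names (catalog, door, name, material, weight_kg, price) and the sample records are invented for this example and do not come from any real standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: tag names and values are made up for illustration.
CATALOG_XML = """
<catalog>
  <door><name>Oak Panel</name><material>oak</material><weight_kg>38</weight_kg><price>950</price></door>
  <door><name>Pine Arch</name><material>pine</material><weight_kg>27</weight_kg><price>620</price></door>
  <door><name>Teak Lattice</name><material>teak</material><weight_kg>41</weight_kg><price>1400</price></door>
</catalog>
""".strip()


def load_doors(xml_text):
    """Turn each <door> element into a dict keyed by its child tag names."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in door} for door in root.findall("door")]


def sort_doors(doors, field, numeric=False):
    """Sort the records by one tagged field, as numbers or as text."""
    if numeric:
        return sorted(doors, key=lambda d: float(d[field]))
    return sorted(doors, key=lambda d: d[field])


doors = load_doors(CATALOG_XML)
by_price = [d["name"] for d in sort_doors(doors, "price", numeric=True)]
by_material = [d["name"] for d in sort_doors(doors, "material")]
print(by_price)
print(by_material)
```

Because each value is tagged, the same few lines could just as easily write the records into a database table. With plain HTML the program would have to guess which text is the price.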
Many people say that XML will "solve the search problem". By this, they mean that instead of searching the whole text of a page, search engines could use the XML tags to specify which parts of the pages to search, as fields, which should improve search results, avoid irrelevant pages and provide more precise listings of the information available.
For example, searching for "Albert Einstein" in a webwide search engine will bring up colleges named after him, university lectures mentioning him, links to sites about him, solar folklore pages, Internet philosophy in German, Jewish History, pages of quotations, biographies of his colleagues, and so on. While these are interesting, you may just have wanted to learn what books and articles he wrote. Wouldn't it be nice if you could just ask for Albert Einstein as an author, and avoid all the irrelevant pages? Even for local site and intranet searching, having fields of data would make it easier to find the most useful items.
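The difference between searching the whole text and searching one field can be sketched in a few lines of Python. The three documents and their tag names (doc, title, author, body) are hypothetical, and a real search engine would use an index rather than scanning every file.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; the tag names are invented for this sketch.
DOCS = {
    "relativity.xml": "<doc><title>Relativity</title><author>Albert Einstein</author>"
                      "<body>A popular exposition.</body></doc>",
    "college.xml": "<doc><title>Admissions</title><author>Registrar</author>"
                   "<body>Welcome to Albert Einstein College.</body></doc>",
    "bio.xml": "<doc><title>Niels Bohr</title><author>A. Writer</author>"
               "<body>Bohr debated Albert Einstein for decades.</body></doc>",
}


def full_text_search(docs, phrase):
    """Match the phrase anywhere in the document, ignoring the tags."""
    phrase = phrase.lower()
    hits = []
    for name, xml_text in docs.items():
        text = " ".join(ET.fromstring(xml_text).itertext())
        if phrase in text.lower():
            hits.append(name)
    return sorted(hits)


def field_search(docs, tag, phrase):
    """Match the phrase only inside elements with the given tag name."""
    phrase = phrase.lower()
    hits = []
    for name, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        if any(phrase in (el.text or "").lower() for el in root.iter(tag)):
            hits.append(name)
    return sorted(hits)


everywhere = full_text_search(DOCS, "Albert Einstein")
as_author = field_search(DOCS, "author", "Albert Einstein")
print(everywhere)
print(as_author)
```

The full-text search returns all three documents, while the author search returns only the one he wrote.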
However, adding this information will require extra effort from both site designers and search tools developers. It will only be worth an early investment for sites with very large amounts of data or intensive search needs, but the infrastructure will improve over time for the rest of us.
These are the things that must happen before XML can solve the search problems.
- Older browsers won't even recognize XML pages. Internet Explorer 5 and Netscape 6 (Gecko) will render XML pages, but for the time being, most sites will need both HTML and XML versions of their pages.
Although the browsers may not display XML, you could set up a system to search XML pages based on the tags, but display the results in XML or HTML according to the browser version or user preference. Or you could use XSLT to generate HTML from XML for display.
- Each industry will have to set up its own standard structure, and there will be huge battles about who controls the standard.
From art to e-commerce to zoology, disciplines will have to agree on the appropriate structures for their various documents. For some groups, that's fairly easy: book reviews, for example, are fairly straightforward. Already, mathematicians, chemists, genealogists and real-estate brokers have set up standard formats for their documents. But other areas will be much harder: for example, medical researchers will have one set of needs for disease information, while practitioners will have many other issues -- the basic information may be shared, but the context is different. Not everyone will want to use the same language to create their tags: English may be common but it's certainly not universal. Groups are now setting up Registries to store and exchange document structure information.
- Web site content providers will have to tag the pages according to some standard structures.
Tagging text is hard to do in the best of circumstances, because it requires writers and editors to understand the context of their data and how it fits in the structure. They will need good tools such as XML-aware editors and databases, and even then, there will be cases where the data doesn't fit into the structure very cleanly. When data can be exported from databases and tagged automatically, the process is much simpler. However, there are millions of HTML pages, many of which include useful data, that will never be touched again.
- Search engine indexing applications will have to hold the tag information as metadata.
The index must store the tags as metadata: without them, the structure will be lost to searchers. In addition to basic tags, the indexes should store the entire hierarchy, so that a searcher can specify the right tag even when the tag names are re-used in a structure. For example, a job-listing document might have a <location> tag for the company headquarters, a different <location> tag for the job site, and yet another <location> tag for the headhunter's office. Obviously, these are very different and should be searched separately.
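One possible way to keep the hierarchy, sketched here in Python, is to key each index entry by the full tag path as well as the word. The job-listing markup, the path notation, and the function names are all assumptions made for this example. It does not describe how any particular search engine stores its index.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical job listing that re-uses <location> in three contexts.
JOB_XML = """<job>
  <company><name>Acme</name><location>Chicago</location></company>
  <site><location>Denver</location></site>
  <recruiter><name>Talent Partners</name><location>Boston</location></recruiter>
</job>"""


def index_with_paths(doc_id, xml_text, index=None):
    """Build an inverted index keyed by (tag path, lowercased word)."""
    index = defaultdict(set) if index is None else index

    def walk(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        for word in (element.text or "").split():
            index[(path, word.lower())].add(doc_id)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return index


def search(index, path, word):
    """Return the documents holding the word at exactly this tag path."""
    return sorted(index.get((path, word.lower()), set()))


def search_tag(index, tag, word):
    """Bare-tag search: report every path ending in the tag that holds the word."""
    hits = {}
    for (path, indexed_word), doc_ids in index.items():
        if indexed_word == word.lower() and path.rsplit("/", 1)[-1] == tag:
            hits[path] = sorted(doc_ids)
    return hits


index = index_with_paths("job-1", JOB_XML)
print(search(index, "/job/site/location", "Denver"))
print(search(index, "/job/company/location", "Denver"))
print(search_tag(index, "location", "Denver"))
```

A search for Denver as the job site finds the listing, while the same word searched as the company headquarters finds nothing. An index that stored only the bare tag name could not tell the two apart.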
- Search engines will have to deal with each standard structure in the collection of indexed documents.
The search indexing program and search engine must understand the DTD or other schema of the documents in the indexed collection (this is easier on local sites and intranets than on the entire Web). Tools such as XSLT can convert from one known structure to another, but the conversion can't be completely automatic: human judgement is required to decide how the tags in one structure map to those in the other.
- Search interfaces will have to display the options in a useful way, so that structural information is available to searchers. A popup menu with all the possible tags in the indexed collection is likely to be worse than useless.
Searching for fields rather than the entire document is harder to explain and requires more effort from users, who have to adapt to the structure of the documents. The interface is more like a database search engine than a simple search field. Text search engines which search through the whole file are sometimes more flexible than database searches which must be limited to specific fields, so search interfaces should offer both options. Designing these interfaces will require usability expertise and testing.
As you can tell, I don't think that XML will solve "the search problem" any time soon. However, in the long run, XML will provide many of the benefits of database searching while still retaining the simplicity of plain text searching. And this is a good thing.
Right now, most search tools index and search mainly unstructured documents, such as text, HTML, word processor and PDF files: I'm calling these text search engines. As XML documents provide structured data storage, many people see a need for database-style query languages which provide complex ways to access the data. Such languages are used to get data, as small as the text of a single tag, then group results to create reports or compute answers to questions such as "how many widgets did we sell in March?" There are significant differences between this kind of querying and text searching:
- Text search works on indexed data, so only the indexing application must recognize the fields and hierarchies and store the information as metadata for each word entry. In a structured query, the system cannot assume a structured index: the query itself must convey more about the structure of the document.
- Text search applies to documents as a whole, while query languages are concerned with different kinds of data, such as a single tagged field or a record made up of several fields, and there may be several of these records in a single document, or fields from several tables joined together.
- Text search returns a list of documents with some information about them as the result. Query languages will often return the data extracted from a document, such as a tagged field or multiple records.
- In addition to "selection" (finding matches), query languages often must perform computations and formatting transformations on found items, combine data from multiple sources and even update documents automatically. Text search does none of these.
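The contrast in the list above can be sketched in Python. This is not XQuery or any real query language. It imitates the two behaviors with the standard library, and the sales records, memo, and tag names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: one file of sales records and one unrelated memo.
SALES_DOCS = {
    "q1-sales.xml": """<sales>
      <sale><item>widget</item><month>March</month><quantity>12</quantity></sale>
      <sale><item>widget</item><month>February</month><quantity>7</quantity></sale>
      <sale><item>gadget</item><month>March</month><quantity>5</quantity></sale>
      <sale><item>widget</item><month>March</month><quantity>3</quantity></sale>
    </sales>""",
    "memo.xml": "<memo><body>Widget marketing plan for March.</body></memo>",
}


def text_search(docs, *words):
    """Text-search style: return names of documents containing every word."""
    hits = []
    for name, xml_text in docs.items():
        text = " ".join(ET.fromstring(xml_text).itertext()).lower()
        if all(word.lower() in text for word in words):
            hits.append(name)
    return sorted(hits)


def quantity_sold(docs, item, month):
    """Query style: select matching records inside documents and total them."""
    total = 0
    for xml_text in docs.values():
        for sale in ET.fromstring(xml_text).iter("sale"):
            if sale.findtext("item") == item and sale.findtext("month") == month:
                total += int(sale.findtext("quantity"))
    return total


matching_docs = text_search(SALES_DOCS, "widget", "March")
march_widgets = quantity_sold(SALES_DOCS, "widget", "March")
print(matching_docs)
print(march_widgets)
```

The text search returns a list of two documents, including the memo, and leaves the reader to find the answer. The query works on individual records and returns a computed number instead of a document list.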
Note: there is no concept of relevance ranking in XQuery 1.0, although it's possible to sort results by field contents or expressions.
Integrating Keyword Search into XML Query Processing, by Daniela Florescu, Ioana Manolescu, and Donald Kossmann. 9th World Wide Web Conference, Amsterdam, May 15-19, 2000.
Proposes using keyword search techniques to approach structured querying over entire sets of XML documents. This would map an IR query language to SQL-style queries on fields or even whole records, using a "contains" clause and controlling the depth of the subtree evaluation. It recognizes the value of an inverted index, in this case within a relational database.
Observations on Structured Document Query Languages, which starts with a section on integrating structured and full-text queries.
XQuery will be the W3C standard for searching XML documents. However, it does not have any concept of relevance ranking, so it doesn't fit well with Web search engines. It would be nice if the public Web search engines could support the query syntax, in an appropriately limited way, to provide free-text access to the indexed documents.
However, text search engines do not have to wait for a query language to start providing access to XML documents. AltaVista, FAST (AllTheWeb and Lycos) and Google now index XML files, so they can be searched as free text. As more XML files are published, these engines can start work on storing and searching tag hierarchies in their indexes.
Metadata is information about information, so document metadata describes the file or page: for example, the author, date, and topic. Adding metadata to documents and indexing the fields provides substantial benefits, allowing searchers to limit their results in useful ways. For example, when searching for word processor reviews on an information site, you may want just recent reviews rather than news. Search tools that store these metadata values allow searchers to improve the precision of the search and get fewer, more relevant, results. For more information, see the Metadata page.
Structural metadata refers to the tags within the document. For example, an <author> tag tells us that the data within the tag will be an author's name. However, this might be the author of a book described in the document, the author of a movie review, and so on. This metadata can also be searched, if the indexing application can recognize and store it correctly.
On the Web, there are several conventions that page creators and search engines already use for indexer control, and it's unclear how well XML fits in with these.
- Title: text search engines show document titles in results, and it would be very useful if XML documents included a <TITLE> tag that indicates how they want this particular document to appear in the list.
- META Keywords tag: additional searchable keywords for this document.
- META Description tag: a summary of this page for searching and displaying in a results list.
- META Robots tag: this tag was introduced to allow page creators to control whether robots would follow links on the page and whether search indexers would store words from this page. XML documents may want to use these as well.
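For comparison, here is a minimal Python sketch of how an indexer might read these four conventions from an HTML page, using the standard library's html.parser module. The sample page, the IndexerHints class, and the read_hints function are invented for this example. An XML-aware indexer would need some agreed equivalent for each field.

```python
from html.parser import HTMLParser


class IndexerHints(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page for illustration.
PAGE = """<html><head>
<TITLE>Hand-Carved Doors Catalog</TITLE>
<META NAME="keywords" CONTENT="doors, carving, oak">
<META NAME="description" CONTENT="A catalog of hand-carved doors.">
<META NAME="robots" CONTENT="noindex, follow">
</head><body>...</body></html>"""


def read_hints(html_text):
    """Return the title, keywords, description, and robots permissions."""
    parser = IndexerHints()
    parser.feed(html_text)
    parser.close()
    robots = {v.strip().lower() for v in parser.meta.get("robots", "").split(",")}
    keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "keywords": keywords,
        "description": parser.meta.get("description", ""),
        "may_index": "noindex" not in robots,
        "may_follow": "nofollow" not in robots,
    }


hints = read_hints(PAGE)
print(hints)
```

In this sample the robots value tells the indexer not to store the page's words but allows it to follow the links.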