As of January 2012, this site is no longer being updated, due to work and health issues.

Checklist for Search Robot Crawling and Indexing

by Avi Rappoport, Search Tools Consulting

This document provides both technical information and some background and insight into what search engine indexing robots should expect to encounter. Technically, the problems arise from misunderstandings and exploitation of anomalies by HTML creators (direct tagging, WYSIWYG and automated systems), and the tendency of browser applications to be very forgiving in their interpretation of pages and links. Therefore, it's impossible to simply read the HTML and HTTP specifications and follow the rules there -- the real world is much messier than that.

Related page: Source code for Web Robot Spiders


Servers, Hosts and Domains

For best results, you should work with the servers and conform to their expectations, derived from the behavior of other search engines. But you'll also need to defend against some tricks that have been developed to improve rankings (search engine spam).

Virtual Hosts and Shadow Servers

Virtual hosts and virtual domains allow one server and IP address to act as though it is many servers. To access them, be sure to include the HTTP/1.1 "Host" field in all requests. Most web hosting services use these features to accommodate client sites. In those cases, you should be sure to accept these URLs and index them under their own host names, even though the IP address may be the same.

However, a few search engine spammers will create multiple hosts and even domains, and point them all at the same pages (occupying more of the desirable high rankings in the results). In addition, they may submit or create links to the IP address and even a hex or ten-digit version of the address. The alternate version is sometimes used to get around firewalls and proxies, but it's also used for search engine spam. You may want to do a random check on IP addresses and make sure that similar IP address pages are not just duplicating data, or at least design for future spam checks on this issue.

User-Agent

Your robot should always include a consistent HTTP Header "User-Agent" field with your spider's name, version information and contact information (either a web page or an email address). The spec says that you should put an email address in the "From" field as well.

Referer Field

Another helpful way of working with webmasters is to include the referring page in the HTTP header "Referer" field (yes, it's misspelled). This is the page containing the link you are following. You may want to add this only when you are doing a first crawl, so you don't have to include it in the index database. In any case, it will help webmasters who read their logs to locate bad links and generally understand what you are doing and how you got where you are.

Accept

The HTTP "Accept" field lets your request define the MIME types of files you want to see. Few robots really want audio files, telnet links, compressed binaries, conferencing and so on, but may accidentally follow those links. Using this header field lets the server return a 406 ("not acceptable") status code when the file requested is not one of the desired types, instead of wasting bandwidth and processing time on both server and client sides.

Robots.txt

First, read the contents and links from the Robots Exclusion Protocol, the Web Robots FAQ and the Guidelines for Robot Writers. They're old but still definitive. There are a couple of additional checks you may find useful:

  • Make sure you can read the robots.txt file whether the line-break characters are LF, CR or CR/LF.
  • Assume that the web server is not case-sensitive (better to be conservative).
  • Some people forget to put a slash / to indicate the root directory: if you see a disallow without a slash, assume it starts at the root.
In general, you should check the robots.txt file before each crawling or indexing run, and re-check it at least once a day.
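
A lenient robots.txt check along these lines might look like the following sketch, using the Python standard library; the user-agent name is a placeholder. It normalizes line endings before parsing and assumes crawling is allowed when no robots.txt can be fetched.

import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

def allowed(url, user_agent="ExampleBot"):
    parts = urllib.parse.urlsplit(url)
    robots_url = parts.scheme + "://" + parts.netloc + "/robots.txt"
    try:
        with urllib.request.urlopen(robots_url, timeout=10) as response:
            text = response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError:
        return True  # no readable robots.txt, so assume crawling is allowed
    # Accept LF, CR or CR/LF line breaks.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(user_agent, url)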

META Robots Tag

In addition to the robots.txt file, the Robots Exclusion Protocol allows page creators to set up robot controls within the header of each page using the Robots META tag. Be sure you recognize these options:

  • <META NAME="robots" CONTENT="noindex"> do not index the contents of the page, but do follow the links.
  • <META NAME="robots" CONTENT="nofollow">, you can index the page but do not follow the links
  • <META NAME="robots" CONTENT="noindex,nofollow">, you should neither index the contents nor follow the links.
  • <META NAME="robots" CONTENT="index,follow">, not required, default behavior.

Be sure that you can handle spaces between noindex and nofollow, capitalization variations, and even a different order (nofollow, noindex).
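
One way to tolerate those variations is to normalize the CONTENT value before acting on it, as in this small Python sketch:

# Tolerant parsing of a Robots META CONTENT value such as "NoIndex , NOFOLLOW".
def parse_robots_meta(content):
    directives = {token.strip().lower() for token in content.split(",")}
    may_index = "noindex" not in directives
    may_follow = "nofollow" not in directives
    return may_index, may_follow

print(parse_robots_meta("nofollow, NOINDEX"))  # (False, False)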

Do not mark these pages as off-limits forever: the settings may change, so you should check them again in future index updates.

Indexing Speed

Although many web servers are perfectly capable of responding to hundreds of requests a second, your indexing robot should be relatively conservative. I recommend fetching no more than one page every ten seconds per site: that is still faster than most other search engines (they tend to do very slow crawls these days), and a rate of one request every ten seconds will make sense to whoever is reviewing the web server log.

Avoid hammering servers! Web hosting providers may have multiple virtual hosts per IP address, and multiple IP addresses per physical machine. They are likely to disallow your robot via robots.txt or IP access controls if you overload them.
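
A minimal per-host politeness delay might be sketched like this: it keeps the time of the last fetch for each host in memory and never requests the same host more than once every ten seconds.

import time

CRAWL_DELAY = 10.0  # seconds between requests to any one host
last_fetch = {}     # host name -> time of the most recent request

def wait_politely(host):
    earliest = last_fetch.get(host, 0.0) + CRAWL_DELAY
    now = time.monotonic()
    if now < earliest:
        time.sleep(earliest - now)
    last_fetch[host] = time.monotonic()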

Server Mirroring and Clustering

Many sites are concerned about server overload, so they use multiple servers. In most cases, it won't matter to the robot, because the pages will be distributed and the links will remain static. However, DNS-based load-balancing servers switch among multiple hosts based on current load, so a robot could get to a page on www1.domain.com that is really the same as the page on www2.domain.com. I don't know of a good solution here, as the IP addresses are different. You may want to set up periodic checks on your index for large amounts of overlap.

Following Links

Simple HTML links are straightforward; however, many links are not simple at all.

HREF Links

Here's a checklist of things to watch out for in HREF links:

  • Port numbers (A HREF="http://www.domain.com:8010"). This is entirely legitimate and the robot should follow this.
  • Anchor links (A HREF="page.html#section"). In this case, the robot should strip the # and the text after it and simply request the page.
  • Extra attributes such as mouseovers, which you should simply ignore

    <A HREF="folder/mainpage.ht ml"
    onMouseOver="display(6);display(8);self.status ='status message'; return true " onMouseOut="display(5);display(7);"><IMG SRC="image.gif" ALIGN="MIDDLE" BORDER="0" WIDTH="24" HEIGHT="23" NATURALSIZEFLAG="3"></A>

  • Quotes in URLs - some content-management programs will insert single or double quotes (', ") in a URL string. In general, stripping these characters should give you a working URL.
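
A simple cleanup routine covering the last two points might look like this sketch, which strips stray quote characters and drops the fragment:

from urllib.parse import urldefrag

def clean_href(raw_href):
    href = raw_href.strip().strip("'\"")  # quotes inserted by some content-management tools
    href, _fragment = urldefrag(href)     # drop the #section anchor
    return href

print(clean_href('"page.html#section"'))  # page.html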

Relative Links

Absolute links (those starting with http://) are easy. Relative links are trickier, because page creators are fairly bad at understanding them, content-management programs generate weird ones and browsers go to great lengths to decode them.

The rules are described in RFC 1808 and RFC 2396: basically it is like Unix addressing, where the location starts from the current directory and goes down through child directories using slashes (/) and up through parent directories using two dots (..). A variation starts with a slash, meaning the root directory of the host. With these tools, any page can refer to any other page on the site, but it can get confusing:

  • Some content-management systems add a dot (.), meaning the current directory, to relative links, even though it's not necessary. Ignore this.
  • Confused page creators can add excessive parent directories, accidentally directing clients above the root level of the host. Browsers attempt to compensate for this by rewriting the URL to start at the host, so your robot should do the same.
  • Other bizarre combinations can occur, such as this one reported by a search engine administrator: <a href="foldername/http://mydomain">. Your robot should be robust enough to ignore these without problems.
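
For what it's worth, Python's standard urljoin function already follows these resolution rules, including ignoring the extra dot and discarding excess parent directories at the root, so a sketch is short:

from urllib.parse import urljoin

base = "http://www.example.com/products/widgets/page.html"
print(urljoin(base, "specs.html"))          # http://www.example.com/products/widgets/specs.html
print(urljoin(base, "./specs.html"))        # same result: the extra dot is ignored
print(urljoin(base, "../../../../a.html"))  # clamped at the root: http://www.example.com/a.html
print(urljoin(base, "/index.html"))         # leading slash means the root of the host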

Capitalization and Case Sensitivity

Some web servers will match words in the path in either upper or lower case letters. Others require exact matches, so a link to www.example.com/SubDir/SpecialPage.html is different from www.example.com/subdir/specialpage.html. Therefore, your robot should store and request pages using the exact case of the original URL. Note that domain and host names are defined by the specification to be case-insensitive, so you don't have to worry about them. (Thanks to Peter Eriksen for reminding me of this problem.)

BASE Tag

Stored in the <HEAD> section of an HTML document, the BASE tag defines an alternate default location for relative links. So if the page is at www.example.com but includes <BASE href="http://www.example.com/subdir/index.html">, all relative links in the document should start from the directory subdir, rather than the root.
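
Continuing the urljoin sketch above, a declared BASE simply replaces the page's own URL as the starting point for resolving relative links:

from urllib.parse import urljoin

# The page at http://www.example.com/ declares
# <BASE href="http://www.example.com/subdir/index.html">,
# so its relative links resolve against the BASE value instead of the page URL.
print(urljoin("http://www.example.com/subdir/index.html", "page.html"))
# -> http://www.example.com/subdir/page.html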

Other Kinds of Links


Link Tag

Rarely used and allowed only in the HEAD section of the document, LINK tags can point to alternate versions of documents such as translations or printable versions. They include the familiar HREF attribute and some metadata, such as the title, MIME type and relationship, that you might find useful.

Links to Frames

The links to pages in framesets are in the src attribute of the FRAME tag, like this: <FRAME src="page.html">, within a larger <FRAMESET> tag. Just treat the SRC as an HREF and you're fine. There may also be normal text, including tags and links, within the <NOFRAMES> tag.

Note that the linked framed page may be somewhat unintelligible by itself, as when the search engine returns a link in a results list, and the end-user clicks on that link. Smart page creators work around this with navigation and JavaScript, but there's not much a search engine can do about it without storing a lot of context information.

Image Links

In general, IMG SRC attributes do not contain links to indexable pages, so requesting them would waste processor power, bandwidth and time.

Client Side Image Maps put the coordinates in <AREA> tags within a larger <MAP> tag, using the familiar HREF format. All you have to do is ignore the shape and coords attributes and follow the HREF.

Server Side Image Maps are special files that exist only on the server. They also include a coordinate dispatch system, but it's harder to get at. You have three choices:

  • ignore these links
  • try sending all or a random set of coordinates to the server and see what happens (for example "http://www.acme.com/cgi-bin/competition?10,27")
  • ask the server to send you the image map file and try to decode it yourself. For an example, see the NCSA tutorial.

Object Links

Objects are the generic class of which Images are specific instances; they also include Java applets, graphical data and so on.

JavaScript Links

JavaScript can generate text using the commands document.write and document.writeln. If you see those commands, I'd recommend parsing through to the end of the JavaScript and extracting any links you can locate, probably indicated by HREF, http://, .html or .htm.

JavaScript is also used for menus and scrolling navigation links. In this example, the onChange handler sets window.location.href from the option values, so the page links appear only as those values.

<p>Choose a Page:</p>
<form name="jsMenu" >
<select name="select"
onChange= "if(options[selectedIndex].value)
window.location.href= (options[selectedIndex].value)" >
<option value="..index.html" selected>Home</option>
<option value="a.html">Page A</option>
<option value="b.html">Page B</option>
<option value="c.html">Page C</option>
</select>
</form>

Again, the best that you're likely to be able to do is parse through this text and locate the ".html" (and other known page extensions).
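
Here is a rough sketch of that approach: a regular expression pulls anything that looks like a page link out of the script text. It will miss dynamically assembled URLs and may return false positives, so treat it as a starting point only.

import re

LINK_PATTERN = re.compile(
    r"""["']([^"'\s]+\.(?:html?|shtml|asp|cfm|pl))["']""", re.IGNORECASE)

def links_from_script(script_text):
    return LINK_PATTERN.findall(script_text)

menu = '<option value="a.html">Page A</option><option value="b.html">Page B</option>'
print(links_from_script(menu))  # ['a.html', 'b.html']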

 

Page Redirection


There are two ways that webmasters can indicate to the clients that a page has moved or is not really available at the original URL.

Server Redirects

Server Redirects send back an HTTP status code value of 301 (moved permanently) or 302 (moved temporarily). They also send back a Location field with the new URL. In that case, the robot should use the new URL and avoid asking for the old one again.

Meta Refresh

The META refresh tag is designed to update the contents of a page, perhaps after performing processing or as part of a graphic interaction. Some sites use the META refresh tag to redirect clients from one page to another, although the HTML 4.01 spec says not to do this. However, it may be the only option for page creators without access to server redirects. The syntax is:

<META http-equiv="Refresh" content="1; URL=page.html">

This slightly odd construction indicates how long to wait before the refresh and gives the link to the target page: note that there are no quotes around the URL itself, only around the whole content value. Your robot should follow this link, but I'm not sure if it should index the contents of the referring page -- perhaps only if it's over a certain length, or if the refresh interval is over a certain time. Some search engine optimization gateway pages use this technique to improve their rankings in the engine while still showing browser clients the full contents of their pages.
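
A sketch of parsing that construction, returning the refresh interval and the target URL:

import re

REFRESH_PATTERN = re.compile(r"^\s*(\d+)\s*;\s*url\s*=\s*['\"]?([^'\"]+)", re.IGNORECASE)

def parse_meta_refresh(content):
    match = REFRESH_PATTERN.search(content)
    if match is None:
        return None, None
    return int(match.group(1)), match.group(2).strip()

print(parse_meta_refresh("1; URL=page.html"))  # (1, 'page.html')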


Directory Listings

Some servers, for example those using Apache's mod_autoindex module, will display a listing of the contents of a directory rather than a web page. The robot should follow these links normally.

File Name Extensions

The robot should accept pages which end in ".txt" with a MIME type of text/plain -- they are almost always useful and worth indexing. However, if the page ends in ".log", it's probably a web server log file, which should not be indexed.

Most HTML pages with a MIME type of text/html which do not end in ".htm" or ".html" are dynamically generated by a script or program on the server. But they are generally straightforward HTML and should be indexed normally.

Common file name extensions include:

  • .ssi and .shtml - Server-Side Includes
  • .pl - Perl
  • .cfm - Cold Fusion
  • .asp - Active Server Pages
  • .lasso - Lasso Web Application Server
  • .nclk - NetCloak
  • .xml - XML text files (MIME type text/xml, becoming increasingly important!)

Dynamic Data Issues

Dynamic HTML pages are generated by server applications when a specific URL is requested. There is no definitive way to know if a page is dynamically generated, but those with URLs including the characters ? (question mark), = (equals sign), & (ampersand), $ (dollar sign) and ; (semicolon) tend to be dynamic. While a few of these pages are rendered as Java applets or JavaScript, most are just HTML assembled on-the-fly, and are easily indexable.

Entry IDs

These help web server analysts track the movement of clients through the site: they tend to add simple data at the end of the URL, such as WebMonkey's ?tw=frontdoor. However, such URLs point to duplicate pages, so you may want to strip the suffix if you recognize a pattern.

Session IDs

These numbers are trickier -- they are generated automatically when a client enters a site, to create a session or state on top of the stateless HTTP protocol. So every time a robot evaluates a page, it will see URLs that are different from those it has seen before, and will attempt to follow those links and index those "new" pages. Session IDs (usually from ASP or Java servers) tend to include the text $sessionID. If possible, your indexing robot should recognize this string and compensate for the apparent discrepancies.
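
One way to compensate is to strip query parameters that look like session IDs before comparing a URL against what is already in the index. The parameter names in this sketch are illustrative, not a complete list.

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAMS = {"sessionid", "$sessionid", "phpsessid", "jsessionid", "sid"}

def strip_session_ids(url):
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(strip_session_ids("http://www.example.com/catalog.asp?item=7&sessionID=ABC123"))
# -> http://www.example.com/catalog.asp?item=7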

Cookies

Cookies are a much more sophisticated way to store state information about a client. For more information, see the Builder.com article. In general, if you include a way to store cookies from a site and send them back on request, interaction with sites that use them will be smoother and more straightforward. Excalibur stores cookies and sends them back correctly.

Domino URLs

Lotus Domino is a web server that generates multiple dynamic views of pages, so that users can show or hide parts of the page. For an example, see the Notes.net author index - clicking on any combination of the triangles will generate a different URL and a different version of the same page. To avoid this, you may want to automatically ignore all URLs that contain but do not end in "OpenDocument" and "OpenView&CollapseView" and ignore all URLs that have "ExpandView" or "OpenView&Start". Excalibur and Ultraseek have settings to do this automatically, and AltaVista Search includes examples of writing these rules.

Infinite Links

Infinite Autolinks are often generated by server applications that simply respond to requests for linked pages. For example, the WebEvent calendar at UMBC provides links to the next month, the next year, and so on, more or less forever. A human will notice that there are no events scheduled for 2010, for example, but a robot will not, and will simply continue to follow links until stopped. You may want to put a limit on pages per top directory, links followed from a single page, or use other throttling techniques to keep your robot under control.

Infinite Loops

Infinitely Expanding Loops generally occur when a server has a special error page (HTTP status 404, file not found), but the page itself contains links to other pages which are not found. This can create URLs that add directory names to the link forever: /error/error/error/error/error.html. I recommend evaluating URLs before following them, and never going more than three or four levels deep with the same directory name.
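
A simple guard along those lines rejects any URL whose path repeats the same directory name too many times:

from urllib.parse import urlsplit

MAX_REPEATS = 3  # tolerate a directory name appearing up to three times

def looks_like_loop(url):
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]
    return any(segments.count(segment) > MAX_REPEATS for segment in segments)

print(looks_like_loop("http://www.example.com/error/error/error/error/error.html"))  # True
print(looks_like_loop("http://www.example.com/news/2003/07/page.html"))              # False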

Default Pages

Sites will probably have links to a single page as both a directory URL (www.example.com/dir/) and a default name within the directory (www.example.com/dir/index.html). The server will automatically serve the default name when the directory is requested. The most common names are:

  • index.html or index.htm
  • default.html or default.htm or default.asp
  • main.html or main.htm
  • home.html or home.htm

You may want to index these separately or check for duplicates and delete one version or the other.
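
If you choose to treat them as duplicates, one approach is to normalize links that end in a known default name down to the bare directory URL; the name list in this sketch is illustrative.

from urllib.parse import urlsplit, urlunsplit

DEFAULT_NAMES = {"index.html", "index.htm", "default.html", "default.htm",
                 "default.asp", "main.html", "main.htm", "home.html", "home.htm"}

def normalize_default_page(url):
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename.lower() in DEFAULT_NAMES:
        return urlunsplit((parts.scheme, parts.netloc, directory + "/", parts.query, ""))
    return url

print(normalize_default_page("http://www.example.com/dir/index.html"))  # http://www.example.com/dir/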


Problem Link Status Codes

The HTTP/1.1 standard includes a set of status and error messages returned by web servers to clients, including browsers and robots. Your robot should recognize these codes and handle them correctly.

2xx: Successful

Web servers will reply with a 200 when they serve a normal page, or when the client sends an "If-Modified-Since" request and the page has been changed since the indicated date. Other 2xx codes can be safely treated the same way.

404: Page Not Found and 410: Gone

When there is no page at the requested URL, web servers are supposed to return a 404 code. This may be permanent or temporary: there's no way to be sure. Many search engines track how many times this code is returned, and purge the page from their system after three or four consecutive errors. If you see a 410 code, the page is gone and there is no way to get it, so you shouldn't try again.

Other 4xx Status Codes

These codes indicate problems with the URL or request. I recommend that you track these URLs and do not retry more than once per month to once per year unless requested.

5xx Status Codes

These codes tend to be transitory problems and you can retry them more often, perhaps once a day or once a week.
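
The retry policy described above could be sketched as a simple classification function; the category names are just labels for the crawler's own scheduler to act on.

def classify_status(code):
    if 200 <= code < 300:
        return "index"            # success: index the page normally
    if code in (301, 302):
        return "follow-redirect"  # use the Location field and retire the old URL
    if code == 304:
        return "unchanged"        # reply to If-Modified-Since: nothing new
    if code == 410:
        return "gone"             # never request this URL again
    if 400 <= code < 500:
        return "retry-rarely"     # URL or request problem: retry monthly at most
    if 500 <= code < 600:
        return "retry-soon"       # usually transient: retry daily or weekly
    return "unknown"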

Updating the Index

To update the index, you should revisit the pages periodically to locate new and changed pages.

It's not clear how often is right. Many search engines track how often various sites and pages change, and revisit them according to their own internal schedules. Note that some servers pay transmission fees by the byte, especially those in low-technology regions, so constant revisiting can cost them significant amounts of money.

Expires

HTTP/1.1 includes an Expires field which tells you when the information on the page is no longer current. This is mainly used for caching but is also good for indexers, which can revisit the site on the expiration date.

If-Modified-Since

Servers using HTTP/1.1 (and extensions of HTTP/1.0) allow clients to send an If-Modified-Since field with a date and only get the contents of a page if the page is marked as modified after that date. If the page is older than the date, the server returns a status code of 304. This is quite efficient, reducing the CPU and network load on both server and client machines.
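
A conditional request might be sketched as follows; last_crawl_timestamp is assumed to be a Unix time recorded by the crawler on its previous visit, and the robot name is a placeholder.

import urllib.error
import urllib.request
from email.utils import formatdate

def fetch_if_modified(url, last_crawl_timestamp):
    request = urllib.request.Request(url, headers={
        "If-Modified-Since": formatdate(last_crawl_timestamp, usegmt=True),
        "User-Agent": "ExampleBot/1.0",  # placeholder robot name
    })
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()       # 200: the page changed, reindex it
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return None                  # unchanged since the last visit
        raise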

Modification Date Problems

Some servers, especially those sending dynamic data, always set the modification date to the current date and time. Some search engines check all pages against the contents of the index before adding them -- if the page is the same as one that is in the index, the new page is considered a duplicate and is not added. Other engines simply re-index the page with the new date.

Note: many pages change only in the content of the links to advertising banners. To avoid excess index updating load, the robot could ignore offsite link changes in IMG tags when computing a checksum.
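
As a simplification of that idea, the checksum in this sketch ignores all IMG tags rather than only offsite ones, so rotating banner images do not make an otherwise unchanged page look new.

import hashlib
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def page_checksum(html_text):
    stripped = IMG_TAG.sub("", html_text)
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()

# Two copies of a page that differ only in their banner IMG tags compare as equal.
print(page_checksum('<p>News</p><img src="ad1.gif">') ==
      page_checksum('<p>News</p><img src="ad2.gif">'))  # True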



 

Indexing Pages

Indexing is not just about following links, it's also about understanding HTML pages and making good decisions based on common practices on the Web.


Attributes

In addition to indexing plain text surrounded by tags, search engines should generally index certain selected tag attribute text. The most important is the ALT attribute in the IMG tag, which contains a textual description of the image linked into the page -- an excellent piece of additional data. Much more rare, but still useful, is the LONGDESC attribute, which is a link to a page containing more information about the image or object: you may want to index this page separately, or incorporate it into the contents of the linking page.

JavaScript Text

JavaScript text not only contains links, but also textual content. In general, you should look for the JavaScript command document.write or document.writeln and then look for text in single or double quotes. Note that the backslash (\) escapes quote marks so they are treated as literals rather than as begin and end points. For more information see ProjectCool's JavaScript Structure Guidelines.

Note that some search engines index whole JavaScripts, without doing any parsing, which includes a lot of junk.
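
Here is a sketch of that kind of extraction: it pulls string literals out of document.write and document.writeln calls and treats backslash-escaped quotes as literal characters rather than string terminators.

import re

WRITE_CALL = re.compile(
    r"""document\.write(?:ln)?\s*\(\s*(?:"((?:[^"\\]|\\.)*)"|'((?:[^'\\]|\\.)*)')""")

def written_text(script_text):
    pieces = []
    for double_quoted, single_quoted in WRITE_CALL.findall(script_text):
        text = double_quoted or single_quoted
        pieces.append(text.replace("\\\"", "\"").replace("\\'", "'"))
    return pieces

print(written_text("document.write('He said \\'hello\\' to me');"))
# -> ["He said 'hello' to me"]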

XML

You can index XML by simply indexing the data between tag fields. It's not wonderful, but it's a lot better than ignoring it. In future versions, tracking the fields and hierarchy will make searching very interesting.

Style Sheets

Style Sheets (most commonly CSS, Cascading Style Sheets) allow web designers to make their pages look nice. Indexers should ignore these, but some do not. Never follow links to pages ending in .css, and never index or extract any text within a page's <STYLE></STYLE> tags.

NOINDEX tag

There is no standard way for page creators to tell search engines not to index parts of pages, such as navigation and copyright data. Therefore, I recommend that you recognize a pseudo-tag, by ignoring all text within the comments <!-- noindex --> and <!-- /noindex -->, and you should write your indexer so it's easy to add other tags or forms for marking text not to be indexed.
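
A sketch of stripping such a span before the page text goes to the indexer:

import re

NOINDEX_SPAN = re.compile(r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
                          re.IGNORECASE | re.DOTALL)

def strip_noindex(html_text):
    return NOINDEX_SPAN.sub("", html_text)

print(strip_noindex("Article text <!-- noindex -->Copyright footer<!-- /noindex --> more text"))
# -> "Article text  more text"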

HTML extended characters

You should recognize and handle HTML extended characters (for example &nbsp; for non-breaking space and &amp; for ampersand) in indexing and in storing a page extract for later display in the results listing.
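
Python's html module handles this decoding directly, as a quick illustration:

import html

# &nbsp; decodes to a non-breaking space character, &amp; to a literal ampersand.
print(html.unescape("Fish&nbsp;&amp;&nbsp;Chips"))  # Fish & Chips, with non-breaking spaces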

Handling Multiple Languages

Indexing text works fairly well even if you don't know much about the language. The more information you can add regarding word breaks and punctuation, the better it works, so consider designing your system in ways that allow future extensions in this area.

Page creators can specify the character encoding of a page in the Content-Type field, either in the HTTP header or in a META http-equiv tag.

Unicode

HTML and XML support Unicode, the almost-universal encoding system. Your indexer should do the same.

Language Codes

The language of a page can be set for a whole page: <HTML lang="de"> or for blocks within a page: <P lang="es">. This is rare but should be recognized when found. For more information, see the HTML spec Language section.

Metadata

In the HTML context, metadata is text in the META tags in the <HEAD> section of the HTML page.

META ROBOTS tags

In addition to the robots.txt file, the Robots Exclusion Protocol allows page creators to set up robot controls within the header of each page. Be sure you recognize these options:

  • <META NAME="robots" CONTENT="noindex"> do not index the contents of the page, but do follow the links.
  • <META NAME="robots" CONTENT="nofollow">, you can index the page but do not follow the links
  • <META NAME="robots" CONTENT="noindex,nofollow">, you should neither index the contents nor follow the links.

I'd recommend re-fetching these pages and checking them in future indexing crawls, however, because the settings may change. For more information, see the HTML Author's Guide to the Robots META tag.

Keywords

Page creators can add keywords to their pages to describe them more clearly, and these should be indexed together with the text of the page. For example, a page on cats and dogs as pets might have the following keywords:

<META NAME="keywords" CONTENT="cats, dogs, small animals, pets, 
companions, chat, catz, dogz"> 

Some keywords are delimited by spaces, some by commas, and some by both, and extra white space should be ignored. Where the creator has added commas, I recommend that you treat the contents as phrases. You should also use this information to help weight the page in the results rankings, but beware of keyword spamming (adding many repetitions of a word).
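
A sketch of that parsing rule: if commas are present, treat each comma-separated piece as a phrase; otherwise split on white space.

def parse_keywords(content):
    if "," in content:
        return [phrase.strip() for phrase in content.split(",") if phrase.strip()]
    return content.split()

print(parse_keywords("cats, dogs, small animals, pets, companions, chat, catz, dogz"))
# -> ['cats', 'dogs', 'small animals', 'pets', 'companions', 'chat', 'catz', 'dogz']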

Description

Page creators can also set up a description of their page so you don't have to extract it. You should include the text of this description in the index as well.
<META NAME="description" CONTENT="All about living with cats and dogs.">

Publication Dates

Web page publication dates are currently derived from the page modification date. This is often wrong, as the page can be opened and saved without any changes. Future systems will store a publication date in the metadata, so an indexer should be ready to get the date from the contents of the page when that system becomes standardized.

Dublin Core

The Dublin Core initiative is extending the META tags to standardize information on authorship, publication, scope, rights and other information. For example, the SearchTools home page could be indexed with the following tags:

<META name="DC.Title" content="Guide to Site, Intranet and Portal Search">
<META name="DC.Creator" content="Avi Rappoport">
<META name="DC.Publisher" content="Search Tools Consulting">
<META name="DC.Language" content="en">

While you may not want to index these tags now, be sure to design for future compatibility.

Page Descriptions

Most search services display some text from the matched pages, allowing searchers to learn something about the contents before they click on the link. This can come from one or more of several sources:

  • the contents of the Meta Description tag, written by the page creator
  • matching lines: sentences which contain the search terms
  • first useful text: avoiding navigation information by locating the first header tag and extracting text starting there
  • summary text: uses a special formula designed to find the sentences which best summarize the page content
  • top text: text from the top of the page

The first three options are the best: search engines which attempt to find the "best" sentences usually fail; those which extract the first text from the page often display useless navigation information or even JavaScript or CSS.


SearchTools Robots Testing
We have a test directory which provides test cases for common robot problems, including robots.txt directives, duplicate files, odd relative links, JavaScript-generated text and links, and more.


Page Updated 2003-07-18


Avi Rappoport of Search Tools Consulting can help you evaluate your search engine, whether it's on a site, portal, intranet, or enterprise. Please contact SearchTools for more information.


Creative Commons License  This information copyright © 2000-2011 Avi Rappoport, Search Tools Consulting. Some Rights Reserved, under the Creative Commons Attribution-Share Alike 3.0 United States License. Always attribute copied content to the page's full URL. Permissions beyond the scope of this license are available upon request.