As of January 2012, this site is no longer being updated, due to work and health issues
Robots, including search indexing tools and intelligent agents, should check a special file in the root of each server called robots.txt, which is a plain text file (not HTML). Robots.txt implements the REP (Robots Exclusion Protocol), which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can Allow access to their web content and Disallow access to cgi, private and temporary directories, for example, if they do not want pages in those areas indexed.
In June 2008, the webwide search engine companies Yahoo, Google, and Microsoft agreed to extend the Robots Exclusion Protocol. They added new elements to robots.txt (an Allow directive, wildcards in URL paths, and a Sitemap link for ease of crawling), along with IP authentication to identify search engine indexing robots, the X-Robots-Tag header field for non-HTML documents, and some additional META robots tag attributes.
About the Robots.txt file
The robots.txt file is divided into sections by the robot crawler's User Agent name. Each section includes the name of the user agent (robot) and the paths it may not follow. You should remember that robots may access any directory path in a URL which is not explicitly disallowed in this file: every path not forbidden is allowed.
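For illustration, a minimal robots.txt with two sections might look like this (the robot name and paths here are invented for the example):

```text
# Rules for one specific robot
User-agent: ExampleBot
Disallow: /drafts/

# Rules for all other robots
User-agent: *
Disallow: /tmp/
```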
Note that disallowing robots is not the same as creating a secure area in your site, as only honorable robots will obey the directives and there are plenty of dishonorable ones. Anything you do not want to show to the entire World Wide Web, you should protect with at least a password.
You can usually read a robots.txt file by just requesting it from the server in a browser (for example, www.searchtools.com/robots.txt). If you look at mine, you'll see a plain text file with many entries, which I generated by reading my server's error reports: I wanted to keep robots from requesting those paths even occasionally.
The older version is documented in the REP (Robots Exclusion Protocol), and all robots should recognize and honor the rules in the robots.txt file. The new 2008 REP has additional features and may not be recognized by all robot crawlers.
- The exact mixed-case directives may be required, so be sure to capitalize Allow: and Disallow:, and remember the hyphen in User-agent:
- An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, it may not check the general all-robots section, so repeat any general directives in the specific section.
- The user agent name can be a substring, such as "Googlebot" (or "googleb"), "Slurp", and so on. It should not matter how the name itself is capitalized.
- Disallow tells robots not to crawl anything which matches the following URL path
- Allow is a new directive: older robot crawlers will not recognize this.
- Historical Note: the 1996 robots.txt draft RFC actually did include "Allow". But everyone seems to have ignored that until around 2005, and even then, it was not documented.
- URL paths are often case sensitive, so be consistent with the site capitalization
- The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL
- In the original REP directory paths start at the root for that web server host, generally with a leading slash (/). This path is treated as a right-truncated substring match, an implied right wildcard.
- One or more wildcard (*) characters can now be in a URL path, but may not be recognized by older robot crawlers
- Wildcards do not lengthen a path -- if there's a wildcard directive path that's shorter, as written, than one without a wildcard, the one with the path spelled out will generally override the one with the wildcard.
- Sitemap is a new directive for the location of the Sitemap file
- A blank line indicates a new user agent section.
- A hash mark (#) indicates a comment
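To see these rules in action, Python's standard library includes a REP parser; here is a small sketch (the robot name and paths are invented for the example; note that urllib.robotparser implements the original protocol plus Allow, but not the 2008 wildcard extension):

```python
from urllib import robotparser

# Feed an in-memory robots.txt to the standard-library parser.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Paths under a disallowed prefix are blocked for every robot...
print(rp.can_fetch("AnyBot", "/private/notes.html"))  # False
# ...and every path not forbidden is allowed.
print(rp.can_fetch("AnyBot", "/index.html"))          # True
```

Because the paths are treated as right-truncated prefixes, /private/notes.html matches Disallow: /private/, while /index.html matches no rule and is allowed by default.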
Examples of Robots.txt Entries
Because nothing is disallowed, everything is allowed for every robot:

    User-agent: *
    Disallow:

Specifically, the mybot robot may not index anything, because the root path (/) is disallowed:

    User-agent: mybot
    Disallow: /

For all user agents, allow everything (2008 REP update):

    User-agent: *
    Allow: /

The BadBot robot can see the robot policy document, but nothing else; all other user-agents are by default allowed to see everything:

    User-agent: BadBot
    Allow: /About/robot-policy.html
    Disallow: /

This only protects a site if "BadBot" follows the directives in robots.txt.
In this example, all robots can visit the whole site except the two disallowed directories and any path that starts with "private" in the host root directory, including items in privatedir/mystuff and the file privateer.html.
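The entry described above did not survive on this page; here is a sketch of what it looked like, with /cgi-bin/ and /tmp/ standing in for the two disallowed directories (those two names are assumptions):

```text
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private
```

Because /private has no trailing slash, the implied right wildcard makes it match privatedir/mystuff and privateer.html as well.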
The blank line indicates a new "record" - a new user-agent section.
BadBot should just go away. All other robots can see everything except any subdirectory named "private" ( using the wildcard character)
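A sketch of the two records just described (the blank line separates them; the wildcard path is an assumption based on the description):

```text
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /*/private
```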
This keeps the WeirdBot from visiting the listing page in the links directory, the tmp directory and the private directory.
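A reconstructed sketch of that entry (the exact file and directory names did not survive, so these paths are assumptions):

```text
User-agent: WeirdBot
Disallow: /links/listing.html
Disallow: /tmp/
Disallow: /private/
```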
All other robots can see everything except the temp directories or files, but should crawl files and directories named "temperature", and should not crawl private directories. Note that the robots will use the longest matching string, so temps and temporary will match the Disallow, while temperatures will match the Allow.
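A sketch of the entry just described (paths follow the description; treat this as an illustration rather than the original):

```text
User-agent: *
Disallow: /temp
Allow: /temperature
Disallow: /private
```

Under longest-match precedence, /temps and /temporary match Disallow: /temp, while /temperatures matches the longer Allow: /temperature.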
If you think this is inefficient, you're right.
Bad Examples - Common Wrong Entries

Use one of the robots.txt checkers to see if your file is malformed.

    User-agent: googlebot
    Disallow /

NO. This entry is missing the colon after the disallow.
NO. Robots will not recognize misspelled User Agent names (it should be "sidewinder"). Check your server logs for the exact User Agent name, and see the listings of User Agent names below.

    User-agent: MSNbot
    Disallow: /PRIVATE
WARNING! Many robots and webservers are case-sensitive, so a Disallow path in the wrong case will not match any root-level folders named private or Private.
WARNING! Robots generally read from top to bottom and stop when they reach a record that applies to them. So WeirdBot would probably stop at the first record, for User-agent: *, instead of seeing its special entry.
If there's a specific User Agent, robots don't check the * (all user agents) block, so any general directives should be repeated in the special blocks.
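For example (with a hypothetical robot name), the general rule is repeated inside the specific block so it still applies:

```text
User-agent: *
Disallow: /tmp/

User-agent: specialbot
Disallow: /tmp/
Disallow: /drafts/
```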
Thanks to B. at Ultraseek support for suggesting a "bad examples" section, Enrico for discussion of precedence, and Melinda, Jim, and AZ for pointing out mistakes in the table of examples, since corrected.
For more information, see the Robots Exclusion Protocol page
These utilities will read a robots.txt file from a web site and report on any problems or issues:
- Google Robots.txt Analyzer (must log into Google Webmaster Tools, Dashboard > Tools) - Recognizes Allow and wildcards, and provides an interactive test to locate errors in robots.txt syntax without having to wait for the googlebot to read the robots.txt file again.
- UKOLN WebWatch /robots.txt checker - As of June 25, 2008, recognizes Allow and wildcard characters, but does not report directives with no user agent, or a semicolon instead of a colon after a directive. University consortium, no advertising.
- SearchEnginePromotion's robots.txt Checker - An SEO site, but the ads are static, not blinking. This one recognizes Allow directives but not wildcards, which are flagged as errors. (As of July 2, 2008)
- Simon Wilkinson's Robots.txt syntax checker - Formerly of the Tardis project and Botwatch; no ads. Flags Allow directives and wildcards as errors, but he may fix that. (As of July 2, 2008)
- Motoricerca Robots.txt Syntax Checker - A low-key SEO site, no ads; does not recognize Allow or wildcards. (As of July 2, 2008)
This free Apache server module watches for spiders that read pages disallowed in robots.txt, and blocks all further requests from those IP addresses. It is particularly useful for blocking email-address harvesters while still allowing legitimate search engine spiders. Be sure to double-check your robots.txt file (use one or more of the checkers above) before installing the module, and watch your server logs carefully. The August 2002 version (0.6) works with Apache 1.3 on FreeBSD and Linux.
Note that your robots.txt file does not have to include complete names or version numbers -- the protocol says "A case insensitive substring match of the name without version information is recommended." That means that you'd do better specifying webcrawler than WebCrawler/3.0 Robot libwww/5.0a.
- List of User-Agents (Spiders, Robots, Browser) at user-agents.org - This one is so current it even has the iPhone Core Media player user-agent. The database is searchable. (2007-07-02)
- List of Known Robot User-Agent Fields - Helpful list of user agents with notes about whether the robots are email collectors (spammers). May no longer be updated.
- Web Robots Database - List of user-agent names, but it's not at all current.
- SearchEngineWatch SpiderSpotting Chart - Definitely antique: it lists Google as "experimental".
- Search Engine Robots on jafsoft - Nice, but last updated January 2006.
For more information about robots on the SearchTools Site:
- Robots Information Page - Summary of the most important things about web crawling robots.
- Robots.txt Page (this page) - Specific information on entries in the robots.txt file, old and new rules.
- META Robots Tag Page - Describes the META Robots tag contents and implications for search indexing robots.
- Robots Exclusion Protocol (REP) Page - Links to definitive sources on the Robots Exclusion Protocol, old and new versions.
- Indexing Robot Checklist - A list of important items for those creating robots for search indexing.
- List of Robot Source Code - Links to free and commercial source code for robot indexing spiders.
- List of Robot Development Consultants - Consultants who can provide services in this area.
- Articles and Books on Robots and Spiders - Overview articles and technical discussions of robot crawlers, mainly for search engines.
- SearchTools Robots Testing - Test cases for common robot problems.
Page Updated 2008-09-19