As of January, 2012, this site is no longer being updated, due to work and health issues

Searching PDF Files

About PDF and the Web

PDF is the Portable Document Format used by Adobe Acrobat. It is designed for brochures, magazines, forms, reports and other materials with complex visual designs which will be printed on PostScript (tm) printers. The format was created to remove machine and platform dependence for the documents, and its goals include design fidelity and typographic control. It was never designed for interactive online reading. However, many word processors, page layout and other programs can create PDF files easily, so many sites are now serving them online.

Adobe has a PDF Plug-In for browsers and some development tools to allow servers to send PDF in chunks ("byte-range serving") rather than downloading the entire file. This improves the user experience of receiving PDF files, but they still lack the speed, simplicity and user control of HTML.
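As an illustration of what byte-range serving asks of the server, the sketch below works out which bytes of a file a single HTTP Range header refers to. It is a minimal example, not Adobe's tooling: the function name parse_byte_range is hypothetical, and it assumes the server only honors single ranges and treats anything else as a request for the whole file.

```python
def parse_byte_range(header, file_size):
    """Return (start, end), inclusive, for a single-range header, or None.

    None means "ignore the header and send the whole file" in this sketch.
    """
    if file_size <= 0:
        return None
    unit, sep, spec = header.partition("=")
    # Only the "bytes" unit and a single range are handled here (assumption).
    if sep != "=" or unit.strip().lower() != "bytes" or "," in spec:
        return None
    first, dash, last = spec.strip().partition("-")
    if dash != "-":
        return None
    try:
        if first == "":
            # Suffix form "bytes=-N": the last N bytes of the file.
            length = int(last)
            if length <= 0:
                return None
            start = max(file_size - length, 0)
            end = file_size - 1
        else:
            # "bytes=A-B" or the open-ended "bytes=A-".
            start = int(first)
            end = int(last) if last else file_size - 1
    except ValueError:
        return None
    if start >= file_size or end < start:
        return None
    # Clamp the end to the last byte that actually exists.
    return start, min(end, file_size - 1)
```

A server using something like this would answer a valid range with only the requested bytes, which is what lets a viewer show the first page of a PDF before the rest has arrived.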

PDF files have a specified page size, for example, and do not reflow in smaller windows, so people with small screens spend a lot of time scrolling around the window. In addition, copying text from a PDF file is very difficult, as sidebar text is included, and selections cannot cross page breaks.

If at all possible, you should serve both HTML and PDF versions of files, designing the HTML for onscreen use and searching, and PDF for printing only. That provides your users with the best format for their task, rather than making too many compromises on one side or the other. HTML files are better for searching as well!

As of July 2003, Jakob Nielsen has an article about the problems of reading PDF files online, including results from usability tests and quotes from users who dislike this format intensely. Everyone seems happy to print from PDF, just not to read it on their computer. In his follow-up article, he recommends either generating an HTML version or at least providing a "gateway" page, which includes a summary, a warning that clicking the link will bring up the PDF file, and the page count and file size. In addition, he recommends that sites avoid having either internal or external search engines index the PDF files.
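A gateway page of the kind described above could be generated with something like the following sketch. The function names, the HTML layout and the size thresholds are all assumptions made for illustration; the only requirements taken from the text are the summary, the warning that the link opens a PDF, the page count and the file size.

```python
from html import escape


def format_size(num_bytes):
    """Return a short human-readable file size (thresholds are illustrative)."""
    if num_bytes < 1024:
        return f"{num_bytes} bytes"
    if num_bytes < 1024 * 1024:
        return f"{num_bytes / 1024:.0f} KB"
    return f"{num_bytes / (1024 * 1024):.1f} MB"


def gateway_page(title, summary, pdf_url, page_count, size_bytes):
    """Build a small HTML gateway page that links to a PDF file."""
    return (
        "<html><head><title>{t}</title></head><body>\n"
        "<h1>{t}</h1>\n"
        "<p>{s}</p>\n"
        '<p><a href="{u}">Download the PDF version</a> '
        "(PDF file, {p} pages, {z}). "
        "Clicking this link opens a PDF file, not a web page.</p>\n"
        "</body></html>\n"
    ).format(
        t=escape(title),
        s=escape(summary),
        u=escape(pdf_url, quote=True),
        p=page_count,
        z=format_size(size_bytes),
    )
```

Because the gateway page is ordinary HTML, it is also the page a search engine can index in place of the PDF itself.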

PDF to HTML Conversion Tools

PDF and Web Site Searching

As mentioned above, PDF files are hard on search engines, and HTML pages are much easier for them to deal with. However, if you must have PDF, please follow these procedures.

Preparing PDF Files for Searching

PDF and Metadata

Metadata is defined as "information about information". For simple search engines, that generally means the document title, description, keywords, file size and modification date, but it can be much richer than that, providing many more ways to describe an object and to search for that object. For more information, see the SearchTools Report on Metadata.

When search tools index PDF files, they can get the text from the PDF information fields, such as a document title and additional keywords. If the document creator didn't enter that information, the indexer may attempt to generate a title, or may just use the file name of the document.
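The fallback behavior described above might look roughly like this inside an indexer. This is a sketch under stated assumptions, not the logic of any particular search engine: choose_title is a hypothetical helper, the Title field and extracted text are assumed to have been read from the PDF by some other component, and the "first reasonably long line" rule is just one simple way an indexer could generate a title.

```python
import os


def choose_title(info_title, extracted_text, file_path):
    """Pick a display title for a PDF in a search index.

    Order of preference: the PDF Title field, then a title generated
    from the document text, then the file name.
    """
    # 1. Use the title the document creator entered, if there is one.
    if info_title and info_title.strip():
        return info_title.strip()
    # 2. Otherwise generate one: the first line with at least four
    #    characters, cut to 80 characters (an illustrative heuristic).
    for line in extracted_text.splitlines():
        line = line.strip()
        if len(line) >= 4:
            return line[:80]
    # 3. Last resort: the file name.
    return os.path.basename(file_path)
```

The third case is the one searchers see as results titled "report2.pdf", which is why filling in the information fields before publishing is worth the effort.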

Adobe XMP

With Acrobat 5.0 and new releases of other products, Adobe is supporting a new eXtensible Metadata Platform (XMP, previously called XAP). This allows the files to contain substantially more information about themselves, including Dublin Core data such as author, description, actual modification date and so on. This has not been widely used and we know of no search engines that take advantage of this metadata.
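To give a sense of what an indexer would have to do to use this metadata, the sketch below reads a few Dublin Core fields from an XMP packet with Python's standard XML parser. The sample packet, the function name dublin_core_fields and the choice of fields are illustrative assumptions, and the code assumes the packet has already been extracted from the PDF file as a string.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by RDF and Dublin Core in XMP packets.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A small hand-written packet for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>A. Writer</rdf:li></rdf:Seq></dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def dublin_core_fields(xmp_xml):
    """Return a dict mapping Dublin Core field names to lists of values."""
    root = ET.fromstring(xmp_xml)
    fields = {}
    for name in ("title", "creator", "description"):
        # Values are stored as rdf:li items inside an rdf:Alt or rdf:Seq.
        items = root.findall(f".//dc:{name}//rdf:li", NS)
        values = [li.text.strip() for li in items if li.text and li.text.strip()]
        if values:
            fields[name] = values
    return fields
```

Calling dublin_core_fields(SAMPLE_XMP) yields the title and creator as lists; a field that is absent from the packet, such as the description here, is simply left out of the result.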

Common PDF File Indexing Problems

PDF files usually contain both text and graphical representations of that text, with indications of exactly where the text should be displayed. However, there are several cases where this does not work for searching:

Displaying PDF Files in Search Results

If you must index PDF files, there are several ways to improve the user experience. Each PDF file is a single entity, often very large, and when the searcher clicks on a link, they suddenly discover that they are downloading a file and may be asked to install a browser plug-in.

Make sure that your search engine results listing does the following

Some search engines, such as PDF WebSearch, will actually display some extracted text from the PDF file, and open the file to the correct location for the matched text. If you have a lot of PDF files, consider these search engines first.

PDF-Compatible Site Search Tools

Acrobat does not provide a search engine for web sites to search PDF. Adobe discusses this issue in the page Searching the Contents of PDF files on a Web Site.

Acrobat version 6 now incorporates the Onix search engine for searching PDF content on local disks and networks. This toolkit is also available for licensing and use in other contexts, such as web site and intranet search engines.

This list is not complete: almost every search engine updated in the last few years now indexes PDF files, some using open-source converters, others attempting to perform more sophisticated and specialized text extraction.


Page Updated 2003-07-31