As of January 2012, this site is no longer being updated, due to work and health issues.
What is a P2P (peer-to-peer) search engine?
First you have to define "peer-to-peer", which currently means storing files in a directory that is accessible to people outside your local network. Essentially, this is file sharing with the entire Internet.
Other uses include reducing corporate server bandwidth bottlenecks, storing enterprise data, distributed processing, knowledge management aggregation, collaboration, automatically distributing updates for software and real-time updating such as auctions and news syndication.
There are two prominent models of P2P searching right now:
- Napster was actually a centralized search engine and index, much like the traditional web search engines. Instead of an indexing robot (spider) that follows links among Web pages, the Napster indexer gathered information from the sharing directories of user sites. Because the shared files were music, mainly MP3s, the Napster indexer recorded metadata about those files, such as the title and artist (as typed by the peer sharer or extracted from the MP3 file), the size of the file, and the speed of the connection.
- Gnutella has no central server, so it is a true peer-to-peer system. It simply queries some nearby computers within what the protocol calls the "horizon": a subset of the entire network of up to 10,000 hosts (in fact, the original protocol was designed for a much smaller user base). The idea is that peer sharers accumulate information as they do searches and store files in their local shared folders. The standard example describes looking for a recipe for strawberry-rhubarb pie: the query is passed along by peer hosts for a set number of relays and then dropped. If any of those peers has ever found and saved a strawberry-rhubarb pie recipe, it will have it available for sharing, providing a sort of replication function.
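The Gnutella relay behavior described above can be sketched as query flooding with a time-to-live (TTL) counter. This is a toy model with hypothetical peer and neighbor structures, not the real wire protocol: each round of forwarding decrements the TTL, duplicate queries are discarded, and when the TTL reaches zero the query is dropped.

```python
# Toy sketch of Gnutella-style query flooding (hypothetical network model).
# Each peer relays the query to its neighbors; the TTL limits how many
# hops the query travels before it is dropped.

def flood_query(peers, neighbors, start, query, ttl):
    """Return the set of peers that answer `query` within `ttl` relays."""
    hits, seen = set(), {start}
    frontier = [start]
    while frontier and ttl > 0:
        ttl -= 1                        # one relay hop consumed per round
        next_frontier = []
        for peer in frontier:
            for n in neighbors.get(peer, []):
                if n in seen:
                    continue            # duplicate queries are discarded
                seen.add(n)
                if query in peers.get(n, set()):
                    hits.add(n)         # this peer shares a matching file
                next_frontier.append(n) # relay onward while TTL remains
        frontier = next_frontier
    return hits

# Peer C has the recipe, two hops away from the requester A.
peers = {"C": {"strawberry-rhubarb-pie.txt"}}
neighbors = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(flood_query(peers, neighbors, "A", "strawberry-rhubarb-pie.txt", 2))
# → {'C'}
```

With a TTL of 1 the same query would die at peer B and find nothing, which is why the "horizon" limits what any one search can see.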
Peer searching has some advantages over centralized indexing and search engines:
- Distributed processing -- no need for huge server farms and enormous indexes
- Freshness of the information -- peer search results reflect what peers are sharing right now, so they don't go stale the way robot-generated indexes or human-edited directories do
- Modular -- no dependencies on any specific server
- Ease of sharing -- does not require a publication step (create a web page or upload to a server) to share information
- Anonymity -- Gnutella, in particular, is designed to obscure the requester's identity
- File-format agnostic -- not limited to HTML or other text files, any file can be shared and found by name
- Local control and flexibility -- can be implemented with security permissions and data structures
Luckily, this is not the first time people have encountered these problems. There are some approaches that peer searching can learn from:
Metasearch engines can send queries to multiple local engines and collate the results.
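The metasearch pattern can be sketched in a few lines. This is a simplified model with hypothetical engines and scores: the same query goes to several engines, duplicate results are collapsed, and the merged list is re-ranked by combined score.

```python
# Minimal sketch of metasearch collation (hypothetical engines and scores).
# Each engine is a callable returning (url, score) pairs for a query.

def metasearch(query, engines):
    combined = {}
    for engine in engines:
        for url, score in engine(query):
            # Collapse duplicates by summing per-engine scores.
            combined[url] = combined.get(url, 0.0) + score
    # Rank the merged results by combined score, best first.
    return sorted(combined, key=combined.get, reverse=True)

engine_a = lambda q: [("http://a.example/pie", 0.9), ("http://b.example/tart", 0.4)]
engine_b = lambda q: [("http://b.example/tart", 0.8)]
print(metasearch("rhubarb", [engine_a, engine_b]))
# → ['http://b.example/tart', 'http://a.example/pie']
```

Real metasearch engines face the harder problem that each engine's scores are on a different scale, so the collation step usually normalizes before merging.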
Z39.50 is the standard developed by libraries for distributed search among many catalogs.
360 Powered "Push-Indexing" uses a distributed agent to index locally and send updates to the master index server.
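The core of a push-indexing agent is the diff step: scan the local files, compare against the last snapshot that was reported, and push only the changes to the master index. This is a rough sketch with hypothetical snapshot and update formats, not the 360 Powered product's actual interface.

```python
# Rough sketch of a push-indexing agent's diff step (hypothetical formats).
# Snapshots map file paths to content hashes; the output is the update
# batch the agent would send to the master index server.

def compute_updates(previous, current):
    """Diff two {path: content_hash} snapshots into a push-index update."""
    added   = {p: h for p, h in current.items() if p not in previous}
    changed = {p: h for p, h in current.items()
               if p in previous and previous[p] != h}
    removed = [p for p in previous if p not in current]
    return {"upsert": {**added, **changed}, "delete": removed}

previous = {"a.txt": "h1", "b.txt": "h2"}
current  = {"a.txt": "h1", "b.txt": "h3", "c.txt": "h4"}
print(compute_updates(previous, current))
```

Because only the deltas travel over the network, the master index stays fresh without re-crawling every peer from scratch.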