Phase One, Bookmark Crawler:

As phase one in my quest to build an open and free peer-to-peer search engine, I wrote a compact Perl script that spiders the URLs in your bookmark file and then builds a searchable database from the results. The script works with a flat file database and has a numerically simple algorithm. Using a flat file keeps searches snappy even with well over 1,000 bookmarks. The script needs no modules beyond the standard Perl distribution; the spider uses IO::Socket, which ships with Perl.

As of now, there are two versions of Bookmark Crawler: one for UNIX (tested on Debian, Mandrake, and Mac OS X) and one for Windows (tested on Windows 98 and XP). The *nix and Windows versions are 98% the same; the only difference is some adjustments for how each OS handles child processes.

Bookmark Crawler is available under the terms of the GPL.

Requirements:
Just Perl. If you have Windows, go grab a free download of Perl from ActivePerl. All the major variations of *NIX, including Linux, *BSD, and Mac OS X, have Perl already installed.

Download:
Linux/BSD/Mac version
Windows version

Changes:
12-08-02 Some code cleanup, courtesy of Andrew Moore.
12-10-02 Added code to convert numeric entities to UTF-8, adapted from Øyvind A. Holm's command line script.
12-16-02 With some help from Andreas Friedrich, I added some more UTF-8 functionality. It isn't perfect, and works best with Latin-1 characters.
01-08-03 Fixed a small description bug.
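
The 12-10-02 change mentions converting numeric entities to UTF-8. For readers curious what that involves, here is a minimal sketch in core Perl. It is not the code from Bookmark Crawler or from the script it was adapted from; the subroutine names codepoint_to_utf8 and entities_to_utf8 are made up for this illustration, and the sketch assumes the input only contains entities of the form &#NNN; or &#xHHH;.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from Bookmark Crawler): encode one Unicode code
# point as a string of UTF-8 bytes. Returns undef for values that are not
# valid code points (surrogates, or anything above U+10FFFF).
sub codepoint_to_utf8 {
    my ($cp) = @_;
    return if $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);
    return chr($cp) if $cp < 0x80;                      # 1 byte (ASCII)
    return chr(0xC0 | ($cp >> 6))                       # 2 bytes
         . chr(0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return chr(0xE0 | ($cp >> 12))                      # 3 bytes
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return chr(0xF0 | ($cp >> 18))                      # 4 bytes
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

# Hypothetical helper: replace decimal (&#233;) and hexadecimal (&#xE9;)
# numeric entities with UTF-8 bytes. Entities that do not name a valid
# code point are left in the text unchanged.
sub entities_to_utf8 {
    my ($text) = @_;
    $text =~ s{&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));}{
        my $cp    = defined $1 ? hex($1) : $2 + 0;
        my $bytes = codepoint_to_utf8($cp);
        defined $bytes ? $bytes : $&;
    }ge;
    return $text;
}

# Usage: writes the text with both entities replaced by raw UTF-8 bytes.
print entities_to_utf8("caf&#233; costs &#x20AC;5"), "\n";
```

Building the bytes by hand keeps the sketch free of non-core modules, in the same spirit as the script itself. Named entities such as &eacute; are outside the scope of this sketch.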