
Phase One, Bookmark Crawler:


As phase one of my quest to build an open and free peer-to-peer search engine, I wrote a compact Perl script that spiders the URLs in your bookmark file and builds a searchable database from the pages it retrieves.
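As a rough illustration of the first step, here is a minimal sketch of pulling URLs out of a bookmark file. It assumes a Netscape-style bookmarks.html layout, and extract_bookmark_urls is a hypothetical helper name, not a function from the actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: collect each distinct http(s) link from bookmark HTML.
# Assumes Netscape-style markup such as <DT><A HREF="...">Title</A>.
sub extract_bookmark_urls {
    my ($html) = @_;
    my (%seen, @urls);
    while ($html =~ m{<a\s[^>]*?href\s*=\s*"(https?://[^"]+)"}gi) {
        my $url = $1;
        push @urls, $url unless $seen{$url}++;   # keep first occurrence only
    }
    return @urls;
}

# Made-up sample data for demonstration.
my $sample = <<'HTML';
<DT><A HREF="http://example.com/" ADD_DATE="1039305600">Example</A>
<DT><A HREF="http://example.org/docs">Docs</A>
<DT><A HREF="http://example.com/">Duplicate</A>
HTML

print "$_\n" for extract_bookmark_urls($sample);
```

A regular expression is good enough for the regular markup browsers write into bookmark files; arbitrary HTML would call for a real parser.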

The script stores its data in a flat-file database and uses a numerically simple algorithm. The flat file keeps searches snappy even with well over 1,000 bookmarks.
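To make the flat-file idea concrete, here is a small sketch of one possible layout and scoring scheme. The tab-separated record format (url, title, text) and the "count the matches" score are assumptions made for illustration; the script's real file format and algorithm are not described here.

```perl
use strict;
use warnings;

# Assumed record layout: url<TAB>title<TAB>text, one page per line.
sub format_record {
    my ($url, $title, $text) = @_;
    s/[\t\r\n]+/ /g for ($url, $title, $text);   # keep each record on one line
    return join("\t", $url, $title, $text) . "\n";
}

# Assumed scoring: number of times the search terms occur in title and text.
sub score_record {
    my ($line, @terms) = @_;
    chomp $line;
    my ($url, $title, $text) = split /\t/, $line, 3;
    return (0, $url, $title) unless defined $text;   # skip malformed lines
    my $score = 0;
    for my $term (@terms) {
        my $hits = () = "$title $text" =~ /\Q$term\E/gi;
        $score += $hits;
    }
    return ($score, $url, $title);
}

# Scan every record and return [score, url, title] entries, best first.
sub search_lines {
    my ($lines, @terms) = @_;
    my @hits;
    for my $line (@$lines) {
        my ($score, $url, $title) = score_record($line, @terms);
        push @hits, [$score, $url, $title] if $score > 0;
    }
    my @sorted = sort { $b->[0] <=> $a->[0] } @hits;
    return @sorted;
}

# In-memory stand-in for the lines of a database file.
my @db = (
    format_record('http://example.com/', 'Perl notes', 'perl sockets and perl forks'),
    format_record('http://example.org/', 'Gardening',  'growing tomatoes'),
);
printf "%d %s (%s)\n", @$_ for search_lines(\@db, 'perl');
```

A linear scan like this needs no index and no database server, which is why a flat file can stay fast at the scale of a personal bookmark collection.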

It is written without any special Perl modules. The spider uses IO::Socket, which ships with the standard Perl distribution.
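For readers unfamiliar with socket-level fetching, the following sketch shows how a page can be retrieved with IO::Socket::INET alone. The helper names, the HTTP/1.0 request, and the User-Agent string are assumptions for illustration, not taken from the script itself.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Build a plain HTTP/1.0 request; the User-Agent value is made up.
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "User-Agent: bookmark-sketch/0.1\r\n\r\n";
}

# Split a raw response into its header block and body.
sub split_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    return (defined $head ? $head : '', defined $body ? $body : '');
}

# Fetch one page over port 80; returns an empty list if the connection fails.
sub fetch_page {
    my ($host, $path, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => $timeout,
    ) or return;
    print {$sock} build_request($host, $path);
    local $/;                 # slurp mode: read until the server closes
    my $raw = <$sock>;
    close $sock;
    return split_response(defined $raw ? $raw : '');
}

# Example call (needs network access):
# my ($head, $body) = fetch_page('example.com', '/', 10);
```

With HTTP/1.0 the server closes the connection after the response, so reading to end-of-file is enough and no chunked-transfer handling is needed.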

As of now, there are two versions of Bookmark Crawler: one for UNIX (tested on Debian, Mandrake, and Mac OS X) and one for Windows (tested on Windows 98 and XP). The *nix and Windows versions are 98% the same; they differ only in how each OS handles child processes.
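The page does not say exactly how the two versions treat child processes, so the following is only a sketch of the Unix-style pattern: run a job in a forked child and read its output through a pipe. run_in_child is a hypothetical name, and the use of POSIX::_exit is a Unix-only choice that would not carry over unchanged to Perl's emulated fork on Windows.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $job in a child process and return what it produced.
sub run_in_child {
    my ($job) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {              # child: do the work, write the result, leave
        close $reader;
        print {$writer} $job->();
        close $writer;            # flushes the pipe before exiting
        POSIX::_exit(0);          # skip END blocks and inherited buffers
    }

    close $writer;                # parent: read until the child closes its end
    my $output = do { local $/; <$reader> };
    close $reader;
    waitpid($pid, 0);             # reap the child
    return defined $output ? $output : '';
}

print run_in_child(sub { "fetched in child process\n" });
```

Isolating each fetch in its own process means a hung or crashed download cannot take the whole crawl down with it.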

Bookmark Crawler is available under the terms of the GPL.

Requirements:
Just Perl. If you have Windows, grab a free download of Perl from ActivePerl.
All the major variations of *NIX, including Linux, *BSD, and Mac OS X, have Perl already installed.

Download:
Linux/BSD/Mac version
Windows version

Contact me

Changes:
12-08-02 Some code clean up, courtesy of Andrew Moore.
12-10-02 Added code to convert numeric entities to UTF-8, adapted from Øyvind A. Holm's command-line script.
12-16-02 With some help from Andreas Friedrich, I added some more UTF-8 functionality. It isn't perfect and works best with Latin-1 characters.
01-08-03 Fixed a small description bug.
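
The 12-10-02 entry mentions converting numeric entities to UTF-8. As a minimal sketch of that idea (not the adapted code itself), the following replaces decimal and hexadecimal entities with their characters and then encodes the result as UTF-8 bytes. The function names are hypothetical, and the sketch assumes the input is ASCII or Latin-1 text rather than already-encoded UTF-8.

```perl
use strict;
use warnings;

# Map a code point to a character; leave invalid values as the original text.
sub entity_to_char {
    my ($code, $original) = @_;
    return $original
        if $code == 0
        || $code > 0x10FFFF
        || ($code >= 0xD800 && $code <= 0xDFFF);   # surrogates are not characters
    return chr $code;
}

# Replace &#NNN; and &#xHH; entities, then encode the string as UTF-8 bytes.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/(&#(\d{1,7});)/entity_to_char($2, $1)/ge;
    $text =~ s/(&#[xX]([0-9a-fA-F]{1,6});)/entity_to_char(hex($2), $1)/ge;
    utf8::encode($text);   # characters in, UTF-8 bytes out
    return $text;
}

print decode_numeric_entities('&#216;yvind, caf&#233;, &#x20AC;5'), "\n";
```

Because utf8::encode treats every character above 127 as something to encode, feeding it text that is already UTF-8 would double-encode it, which matches the note above that Latin-1 input works best.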