Gossamer Forum

Web Crawler addition

I am about to start a project to build a web crawler that will index all of the pages on each site in the links DB. I'll then extend the search script so that it returns the URLs of the reviewed pages first, and then the spidered pages.

Has anyone else done this yet? Time is a little tight on this project, so I'd appreciate any help you guys can give.
Re: Web Crawler addition
This has been discussed a lot in the forum. Try searching for Xavatoria or spider.

John
Re: Web Crawler addition
That's not what I'm looking for. It's pretty trivial to implement a quick spider that's generated on the fly.

I'm looking for information about a _real_ crawler: a backend database complete with relevance-ranked keywords for the sites. My plan is to spider every page on all of the sites stored in the links database, and then offer the user the results from the links DB followed by the crawled pages from the keyword database.
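Roughly, the search flow I'm picturing looks like the sketch below. The two subs (search_links_db and search_keyword_db) are just placeholders for code that doesn't exist yet; the point is only the ordering and the duplicate check.

# Sketch of the two-stage results: reviewed links first, crawled pages after.
# search_links_db() and search_keyword_db() are hypothetical and assumed to
# return lists of URLs.
my $query = "web crawler";

my @link_hits    = search_links_db($query);    # normal links DB matches
my @crawled_hits = search_keyword_db($query);  # matches from the spidered pages

# Don't show a crawled page if the links DB already returned that URL.
my %seen    = map { $_ => 1 } @link_hits;
my @results = (@link_hits, grep { !$seen{$_} } @crawled_hits);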



Re: Web Crawler addition
But Xavatoria is a _real_ crawler. See http://www.xav.com/
Re: Web Crawler addition
And it's the only one I have been able to find. It should be easy to hook into the default search form. It is a full-index search that comes in two flavors: on-site indexing and on/off-site indexing. He kind of dropped the ball on the latter, but it works fine. My only beef is that the indexes seem to take up more space with the on/off-site version. I've got over 5k pages indexed on one site with no slowdown yet (small pages and a... whoops, 8 MB index; maybe they both need space). It's kind of hard to find the on/off-site version. I believe the latest version is here:

ftp://ftp.xav.com/search.txt
Re: Web Crawler addition
All righty then.

On a quest I go. I looked into the Xavatoria script, and his indexes are pretty huge.

I'm planning on using the Data::Dumper routines to store a hash in keyword => file_list format. That way I can do a reverse lookup by keyword.
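Here's a rough sketch of what I mean, assuming the index lives in one hash and gets written out with Data::Dumper (the file name and sub names are just placeholders):

# A rough sketch of the reverse keyword index, saved with Data::Dumper so
# the search script can load it back in later. File name and sub names are
# placeholders, not final code.
use strict;
use Data::Dumper;

my %index;   # keyword => { page_url => 1, ... }

# Record the words found on one spidered page.
sub add_page {
    my ($page, @words) = @_;
    foreach my $word (@words) {
        $index{lc $word}{$page} = 1;
    }
}

# Write the whole hash out as Perl code.
sub save_index {
    open(IDX, "> pages.idx") or die "Can't write pages.idx: $!";
    print IDX Data::Dumper->Dump([\%index], ['*index']);
    close(IDX);
}

# Reverse lookup: which pages mention this keyword?
sub lookup {
    my $word = lc(shift);
    return keys %{ $index{$word} || {} };
}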

There are a couple of routines in the LWP library that will do some of the link walking.
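Something along these lines is what I'm thinking of for the fetch-and-extract step, using LWP::UserAgent and HTML::LinkExtor. It's untested, only grabs <a href> links, and the sub name is mine:

# Sketch of the link-walking part: fetch one page, return the absolute URLs
# it links to. Deciding which links to follow is up to the caller.
use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL;

my $ua = LWP::UserAgent->new;
$ua->timeout(15);

sub links_on_page {
    my $url = shift;
    my $res = $ua->request(HTTP::Request->new(GET => $url));
    return () unless $res->is_success && $res->content_type eq 'text/html';

    # Collect the href attribute of every <a> tag as the page is parsed.
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $parser->parse($res->content);

    # Resolve relative links against the page we actually fetched.
    return map { url($_, $res->base)->abs->as_string } @links;
}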

I promise to post again when I get the code into a working version.


