Gossamer Forum

Re: [katabd] GT spider Plug in for links sql customization

This actually sounds like the FIRST version of the GT spider, which I had at one time. The second version changed things greatly, almost disappointingly.

This was a long time ago, and I don't know if I still have that version anywhere. I wish I did, as it had a cool interface that I'd have liked to expand on.

I never could get the hang of the GT spider. It didn't come with a good set of examples, and it really needed pre-built spider rules to learn from. I tried and tried, but couldn't get it to work. In the past few years, better CPAN modules for fetching, parsing, and traversing have been built, so any new spider will have significant advantages -- it's the interface (as always) that is the problem.

What I've always wanted in a spider is something that no one (not even the really good Windows spiders -- Teleport Pro, Offline Explorer Pro, etc.) has done.

1) Feed in a list of URLs, or a single starting URL.
2) Spider DOWN, pulling in whatever the pages need from above, but not moving sideways.
3) Pull links to external sites one page deep, but spider DOWN from there, e.g. pages below that off-site link, but not sideways from it or above it, and not on to a 3rd server.
4) A rules-based way of parsing the pages that is more intuitive and less technical, sort of like the difference between Perl and C. I want to deal in "pages" rather than content. Or, I want to spider all the pages but only save .gif images, .zip files, or such, ignoring the .html or .txt files as output, while still spidering and following all the .html pages. This breaks most rules-based spidering.
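
To make points 2 and 3 a bit more concrete, here is a rough Python sketch of the "down only, one hop off-site" decision. It is only an illustration under my own assumptions, not how the GT spider or any existing tool works: the function names (`directory_of`, `is_below`, `classify_link`) are made up, "down" is taken to mean "same host, at or below the directory of the root URL", and the "pulling what is needed from above" part (images, stylesheets and so on) is not modeled at all.

```python
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return (host, directory path ending in '/') for a URL.

    Assumption: a path without a trailing slash names a file, so
    '/docs/index.html' has the directory '/docs/'.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return parts.netloc.lower(), path


def is_below(url, root):
    """True if url is on root's host and at or below root's directory."""
    host = urlsplit(url).netloc.lower()
    path = urlsplit(url).path or "/"
    root_host, root_dir = directory_of(root)
    return host == root_host and path.startswith(root_dir)


def classify_link(link, current_root, start_root):
    """Decide what to do with a link found while crawling under current_root.

    Returns ("follow", root_to_use) or ("skip", None).
    """
    if is_below(link, current_root):
        # Going DOWN: keep crawling under the same root.
        return ("follow", current_root)

    link_host = urlsplit(link).netloc.lower()
    root_host = urlsplit(current_root).netloc.lower()
    if link_host == root_host:
        # Same server, but sideways or above the root: do not go there.
        return ("skip", None)

    if current_root == start_root:
        # First hop off-site: the linked page becomes its own root,
        # so only pages below it are followed from there.
        return ("follow", link)

    # Already off-site, and this link leaves again: a 3rd server.
    return ("skip", None)
```

With a starting URL of `http://example.com/docs/index.html`, a link to `http://example.com/docs/a/b.html` is followed, `http://example.com/other/x.html` is skipped as sideways, and `http://other.org/files/list.html` is followed as a new root, so from there only pages under `/files/` on `other.org` are followed.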

It's hard to explain, but most spiders pull in 90% extraneous data: all the banners on pages, off-site links to helper programs, etc. With some, you can't get them to spider the "content" at all, because to reach it you have to loosen the rules, and then irrelevant off-site, one-off links also fit the looser rules for following content links.
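
One way around that, as far as I can tell, is to keep "what to follow" and "what to save" as two separate rule sets, so loosening one doesn't loosen the other. A minimal Python sketch of the idea follows. The names (`FOLLOW_EXTENSIONS`, `SAVE_EXTENSIONS`, `extension_of`, `plan_for`) are hypothetical, and deciding by file extension is my own simplification; a real spider would probably look at the Content-Type header as well.

```python
from urllib.parse import urlsplit
import posixpath

# Assumed rule sets for illustration: pages to parse for more links,
# and files to actually keep on disk.
FOLLOW_EXTENSIONS = {".html", ".htm", ""}   # "" covers directory URLs like /a/
SAVE_EXTENSIONS = {".gif", ".zip"}


def extension_of(url):
    """Return the lower-cased file extension of the URL's path ('' if none)."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lower()


def plan_for(url):
    """Say independently whether a URL should be followed and/or saved."""
    ext = extension_of(url)
    return {
        "follow": ext in FOLLOW_EXTENSIONS,
        "save": ext in SAVE_EXTENSIONS,
    }
```

Here an `.html` page gets followed but not saved, a `.gif` or `.zip` gets saved but not parsed for links, and a `.txt` file gets neither.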


PUGDOG Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Subject Author Views Date
Thread GT spider Plug in for links sql customization katabd 4809 Jun 30, 2004, 8:30 AM
Post Re: [katabd] GT spider Plug in for links sql customization long327 4456 Dec 13, 2004, 5:51 PM
Post Re: [katabd] GT spider Plug in for links sql customization pugdog 4461 Dec 16, 2004, 7:49 PM