Gossamer Forum

GT spider Plug in for links sql customization

Hello
Does anyone know of, or can anyone develop or modify, a GT Spider plug in to do the following..

- We want the directory to continue working as usual...
- We want to be able to enter specific TRUE DOMAIN names (no shared hosting or xoom type domains) to be spidered DEEPLY and indexed (TRUE pages only no CGI or images) and the spider MUST obey all "robots.txt" rules if they exist.
- The spidered results will be saved in a separate table and they WILL NOT HAVE to carry any of the link rating, recommending, or other features built into Links SQL.
- Add a new section to the search results page with results from those domains or even mix them up with the regular results.
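To make the filtering requirements above concrete, here is a rough sketch in Python (not Links SQL or GT Spider code) of the per-URL check such a spider would need: stay on the listed domains, skip CGI and images, and obey robots.txt. The names `should_index`, `build_robots`, `SKIP_EXTENSIONS` and the `ExampleSpider` user agent are made up for illustration, and treating any URL with a query string as "CGI" is my own assumption.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical skip list: image and script extensions we never index.
SKIP_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".cgi", ".pl")


def build_robots(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_index(url, allowed_domains, robots, agent="ExampleSpider"):
    """Return True only for plain pages on an approved domain."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host not in allowed_domains:
        return False  # not one of the hand-entered domains
    path = parts.path.lower()
    # Assumption: a cgi-bin path, a script/image extension, or a query
    # string marks the URL as "not a true page".
    if "/cgi-bin/" in path or path.endswith(SKIP_EXTENSIONS) or parts.query:
        return False
    return robots.can_fetch(agent, url)  # obey robots.txt


robots = build_robots("User-agent: *\nDisallow: /private/\n")
domains = {"example.com"}
print(should_index("http://example.com/about.html", domains, robots))
print(should_index("http://example.com/private/x.html", domains, robots))
```

URLs that pass the check would then be written to the separate results table, leaving the regular directory tables untouched.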

Does anyone know of this being done, or can anyone develop it for us?
Maybe GT, if we are lucky.
Regards
KaTaBd

Users plug In - Multi Search And Remote Search plug in - WebRing plug in - Muslims Directory
Re: [katabd] GT spider Plug in for links sql customization
I am also looking for such a customized plugin. If you are interested, please quote.
Re: [katabd] GT spider Plug in for links sql customization
This actually sounds like the FIRST version of the GT spider, which I had at one time. The second version changed things greatly, almost disappointingly.

This was a long time ago, and I don't know if I have that version anywhere any more. I wish I did, as it had a cool interface that I'd have liked to expand on.

I never could get the hang of the GT spider. It didn't have a good set of examples, and it really needed pre-built spider rules, as examples. I tried and tried, but couldn't get it to work. In the past few years, better CPAN modules for fetching, parsing, and traversing have been built, so any new spider will have significant advantages -- it's the interface (as always) that is the problem.

What I've always wanted in a spider is something that no one (even the really good Windows spiders -- Teleport Pro, Offline Explorer Pro, etc.) has done.

1) feed in a list, or starting URL.
2) spider DOWN, pulling what is needed from above, but not moving sideways.
3) pulling links from external sites, one page deep, but spidering DOWN from there, e.g. pages below that off-site link, but not sideways, or above it, and not on a 3rd server.
4) A rules-based way of parsing the pages that is more intuitive and less technical, sort of the difference between Perl and C, for example. I want to deal in "pages" rather than content. Or, I want to spider all the pages but only save .gif images, .zip files or such, ignoring the .html or .txt files, while still spidering and following all the .html pages. This breaks most rules-based spidering.
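The four points above can be sketched roughly as follows, in Python rather than Perl. This is only one reading of the wish list, not how any existing spider works: "down" is taken to mean "same host, path under the starting page's directory", an off-site link becomes a new starting point exactly once, and "follow" and "save" are kept as two separate rules. All function names and the suffix lists are invented for the example.

```python
import posixpath
from urllib.parse import urlparse

FOLLOW_SUFFIXES = ("/", ".html", ".htm")   # pages to fetch and parse for links
SAVE_SUFFIXES = (".gif", ".zip")           # files to actually keep


def base_dir(url):
    """Return (host, directory) that defines 'down' for a starting URL."""
    parts = urlparse(url)
    path = parts.path or "/"
    directory = path if path.endswith("/") else posixpath.dirname(path)
    if not directory.endswith("/"):
        directory += "/"
    return (parts.hostname or "").lower(), directory


def is_below(url, root):
    """True if url is on root's host and at or under root's directory."""
    host, directory = base_dir(root)
    parts = urlparse(url)
    same_host = (parts.hostname or "").lower() == host
    return same_host and (parts.path or "/").startswith(directory)


def classify(link, root, root_is_offsite):
    """Return (root, is_offsite) to crawl link under, or None to skip it."""
    if is_below(link, root):
        return root, root_is_offsite        # going down: keep the same root
    link_host = (urlparse(link).hostname or "").lower()
    if link_host != base_dir(root)[0] and not root_is_offsite:
        return link, True                   # first off-site hop: new root
    return None                             # sideways, above, or a 3rd server


def should_follow(url):
    path = urlparse(url).path.lower()
    return path == "" or path.endswith(FOLLOW_SUFFIXES)


def should_save(url):
    return urlparse(url).path.lower().endswith(SAVE_SUFFIXES)
```

Keeping `should_follow` and `should_save` apart is what allows "spider every .html page but keep only the .gif and .zip files": a page can be fetched and parsed for links without ever being stored.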

It's hard to explain, but most spiders spider 90% extraneous data, all the banners on pages, off-site links to helper programs, etc. Some, you can't get to spider the "content" because in doing so, you start expanding the rules to off-site, 1-off links, which are irrelevant, but fit the looser expanded rules for following content links.


PUGDOG® Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.