Spider 2.0 Plugin!
Ok This is making me really angry!

I really don't think it is just me, but Spider 2.0 is awkward and difficult to use, and the documentation supplied with it provides little comfort.

I don't mind the fact that I have spent $200; what I mind is not being able to work the thing correctly.

List of problems and possible solutions:-

1. The auto shutdown does not appear to be working properly. (After two full days of playing around with it, I have not once had the system actually stop.)

Solution:- Someone should design an add-on that makes a response appear to confirm whether the spider is pausing, stopping or restarting (if that is already in the program, it does not appear to work).

2. I can't get the spider to stop running.

Solution (well, sort of):- The only way I have found to stop the spider when it does not finish by itself is to access MySQLMan and empty all of the spider tables, then start the spider again. This does not appear to damage any files, but it does stop the spider (which, after four hours of hearing the click it makes on every page refresh, gets really annoying). There may be other ways via Telnet or the like, but I do not have enough knowledge to play around with it enough to find one (if there is code you can use, someone let me know; a rough sketch follows the table list below).

3. I can't figure out how to use the spider properly.

Nope, me neither. I find the documentation cryptic on this point and can only recommend playing around with it until you get it figured out, as that is what I am doing! (PLEASE POST A LIST OF BASIC BUT IN-DEPTH INSTRUCTIONS IF YOU HAVE MANAGED TO FIGURE OUT EXACTLY HOW TO GET SPIDER TO WORK PROPERLY.)

4. If you, like me, are quite new to this and chose to play around with Spider the way I did, be very careful when using MySQLMan to empty tables (as mentioned above), because you can ruin the whole of Links SQL otherwise. Fortunately, when you delete something by accident, there are plenty of ways to retrieve your information which were not available in Links 2.0.

If you do want to empty your spider tables to stop the spider, or so you can start a fresh search, only empty the following tables:-

Link_Spider_Hosts
Link_Spider_Links
Link_Spider_Link_Score_List
Link_Spider_Link_Word_List
Link_Spider_Queue
Link_Spider_Rejected
Link_Spider_Rules
Link_Spider_Sets
Link_Spider_Validate

Do not, and I repeat, do not drop them unless you really know what you're doing, as this makes the script stop working properly! (I know all about that NOW.)

ESPECIALLY DON'T DROP OR EMPTY:-
Link_Word_List or
Link_Score_List
This will stop your search engine from working properly and is an absolute pain to put right.

If you have dropped or emptied a table by accident I would suggest contacting support.
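
For anyone (like me) who wanted actual code for this, below is a minimal sketch of emptying those same spider tables with Perl and DBI instead of MySQLMan. The database name, host, user and password are placeholders you would replace with your own Links SQL settings, and it only deletes rows; it never drops a table:

#!/usr/bin/perl
# Minimal sketch: empty (not drop) the Links SQL spider tables with DBI.
# The connection details below are placeholders; use your own settings.
use strict;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=links;host=localhost',
                       'db_user', 'db_pass', { RaiseError => 1 });

# Only the spider work tables -- never touch Link_Word_List or Link_Score_List.
my @spider_tables = qw(
    Link_Spider_Hosts
    Link_Spider_Links
    Link_Spider_Link_Score_List
    Link_Spider_Link_Word_List
    Link_Spider_Queue
    Link_Spider_Rejected
    Link_Spider_Rules
    Link_Spider_Sets
    Link_Spider_Validate
);

for my $table (@spider_tables) {
    $dbh->do("DELETE FROM $table");   # empties the rows but keeps the table
    print "Emptied $table\n";
}

$dbh->disconnect;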

5. I queue a website for spidering and set the values I require, but Spider keeps bringing up all kinds of weird pages I didn't want.

Solution:-

This can really only be one of two things:-
1. If you have the option (spider_out) set to "1", you are going to get every link the website you are spidering has to offer thrown at you; try setting it to "0" for a more refined search.
2. Unless you have mastered (Edit-Rules), you will find that you are not getting the results you are looking for. As far as I can see, this is a result of the instructions and help not being detailed and clear enough for most of us to understand fully! For me, and probably everyone else with the Spider plugin, it is a case of practice makes perfect. (After about 16 hours of playing yesterday, I did a search for football links and finally got the perfect results using the following rules):-

Keywords Contains 'football' -> Change Score by 10
Description Does Not Contain 'football' -> Change Score by 10

So, as you can see, it is a case of play around with it and it will work.
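
To make the rule idea concrete, here is a conceptual Perl sketch of how "field contains keyword, change score" rules like the two above could be applied to a spidered page. This is not the plugin's actual rule engine; the rule structure, field names and page data are assumptions purely for illustration:

#!/usr/bin/perl
# Conceptual sketch only -- not the Spider plugin's real rule engine.
use strict;

# Each rule names a field, a test, a keyword and a score change.
my @rules = (
    { field => 'Keywords',    test => 'contains',         match => 'football', change => 10 },
    { field => 'Description', test => 'does_not_contain', match => 'football', change => 10 },
);

# Example page data as a spider might extract it (made-up values).
my %page = (
    Keywords    => 'football, soccer, premier league',
    Description => 'News and match reports',
);

my $score = 0;
for my $rule (@rules) {
    my $value = $page{ $rule->{field} } || '';
    my $hit   = index(lc $value, lc $rule->{match}) >= 0;
    $hit = !$hit if $rule->{test} eq 'does_not_contain';
    $score += $rule->{change} if $hit;
}

# Higher-scoring pages are the ones worth keeping.
print "Page score: $score\n";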

If anyone has written any good tips or advice on the spider, I would really like to get hold of them, because I can't spend 16 hours a day trying to add 40-50 links to my website; otherwise it would be quicker to do it manually, and a waste of $200.

Tom Martin
Re: Spider 2.0 Plugin!
Hi Tom,

One of the major problems is that the new spider now tries to bind to a specific address, but it uses a calling convention from the GT::Socket library module that is not found in 2.0.4. The easiest way to fix the problem is to do the following:

In the file spider.pl, look for the following lines (line 188):

my $server_sock = shift;
my $server_port = $CONF->{status_port};
! $server_sock = GT::Socket->server( $server_port ) or die "Can't setup server on port: $server_port";
return $server_sock;
}

Replace the line marked by the "!" by:

require GT::Socket;
$server_sock = GT::Socket->server(
    ( $GT::Socket::VERSION > 1.036 )
        ? {
              port => $server_port,
              host => ( $CONF->{status_host} eq 'localhost' ) ? '' : $CONF->{status_host}
          }
        : $server_port
) or die "Can't setup server on port: $server_port";


This should help with the controls. The added code will test for the version of the GT::Socket library module and call the library appropriately.

If you would like a patch file, email me and I can send you one. I will also be releasing version 2.1 of the spider later today, which will have this problem corrected.

You are correct, spidering is a skill that can only be improved through practice. I'm still exploring ways of improving the spidering process. If anyone has any suggestions or hints please let me know.

Cheers,
Aki


Re: Spider 2.0 Plugin!
Aki:

As Alex and Jack can attest, I have been waiting for a refined spider from GT for quite some time. As the many problem threads here show, it's not quite there yet, and I'm glad I waited. I'm at the stage where, when I buy something, I need it to work correctly the first time.

Let me tell you what I'm looking for and you tell me if it can be done with the product as it stands now or is on the table for future releases. (Understand also that I do not need these features preconfigured out of the box. I simply need to know they can be done with the product.)

1) I need the spider to crawl sites freely but index only the default documents in the root directory of given domains, provided they meet certain criteria I will elaborate on below.

2) Conversely, rather than free roaming, I need the spider to also be able to pull the sites to be crawled from a file only, exported using Links Properties | Export in the Links SQL Admin.
Essentially, this means respidering the links in my entire directory to check conformance to the criteria I have set for inclusion.

3) When it respiders, sites that no longer conform must automatically be emailed using the Contact_Email field, if present, and exported to a separate file.

4) The compiled list of sites still in conformance must then be able to be imported back into Links SQL, complete with category assignments, Contact_Email, and so on.

5) In all cases, the spider must be able to examine pages being submitted for the presence of required content, excluded content, or both, and handle each of these three situations differently (a rough sketch of what I mean follows this list).
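
To illustrate requirement 5 only, here is a conceptual Perl sketch of classifying a fetched page by required and excluded phrases so that each case can be handed to a different action. The phrases, the sample page and the result names are all placeholders of my own, not anything the Spider plugin is known to provide:

#!/usr/bin/perl
# Conceptual sketch for requirement 5 -- placeholder logic, not the Spider plugin.
use strict;

my @required = ('family friendly');      # content that must be present
my @excluded = ('casino', 'adult');      # content that must not be present

sub classify_page {
    my ($html) = @_;
    my $text = lc $html;
    my $has_required = grep { index($text, lc $_) >= 0 } @required;
    my $has_excluded = grep { index($text, lc $_) >= 0 } @excluded;

    return 'required_and_excluded' if $has_required && $has_excluded;
    return 'excluded_only'         if $has_excluded;
    return 'required_only'         if $has_required;
    return 'neither';
}

# Each of the three situations (plus "neither") can then be handled differently.
my $page = '<html><body>A family friendly directory of sites.</body></html>';
print classify_page($page), "\n";        # prints "required_only"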

How far or how close are you to helping me achieve these goals?

Mark Brasche
http://SurfSafely.com/