Gossamer Forum
Spider and rules...
Hey all,

I've also been playing with the Spider plugin and its not-so-easy-to-understand documentation...

I queued one of my sites and can now see the results of the indexing...

I have:
-> tons of robots.txt entries to validate
-> tons of links with an empty "description" field to validate
-> tons of links with an empty "keywords" field to validate
-> tons of links with both "keywords" and "description" fields empty to validate

Then, once I find a way to delete all those empty or robots.txt-related entries, I'll still be left with a good big bulk of links to validate...

First I need to do the step I just talked about, and then I'd like to find a way to compare the links still in my spider index against the "normal" links already in my own directory, and delete the duplicates from the spider...

uhh

Another thing... has anybody figured out how to spider their own indexed links database (from the normal index), add a keywords field, and easily parse the missing "keywords" info into that field?

Well, there are still lots of dark zones for me, but I think we'll face them... Thanks for your help...

Cheers

Re: Spider and rules...
Hi Steve,

To delete things in bulk, the easiest way would be to use the Bulk Operations tool, found in Database->Bulk. Select Validate as the operation table and click Go.

In the URL field, enter robots.txt and click Submit. That will remove all documents with robots.txt in their URL.
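
For what it's worth, the same cleanup can be scripted. Here's a minimal sketch, assuming the validate table is called Spider_Validate and has a URL column (names taken from the script further down; double-check them against your install before running):

Code:
#!/usr/bin/perl
# Sketch: scripted equivalent of the robots.txt bulk delete above.
# The table name 'Spider_Validate' and its 'URL' column are assumptions
# taken from the script below; verify them against your install.
use strict;
use lib '/path/to/admin';       # adjust to your Links admin directory
use Links qw/ $DB $IN /;
use GT::SQL::Condition;
Links::init('/path/to/admin');  # same admin path as above

print $IN->header();

# Delete every validate record whose URL contains robots.txt.
$DB->table('Spider_Validate')
   ->delete(GT::SQL::Condition->new('URL', 'LIKE', '%robots.txt%'));

print "done!";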

For the next three items, save the following CGI script into one of your directories. The script may not work; get back to me if it does, and I'll answer the links and validate table question.

Code:
#!/usr/bin/perl
use strict;

# Point this at your Links admin directory so Links.pm can be found;
# $DB and $IN are not usable until Links::init has run.
use lib '/path/to/admin';
use Links qw/ $DB $IN /;
use GT::SQL::Condition;
Links::init('/path/to/admin');

print $IN->header();

# Remove validate records whose Description or Keywords fields are empty.
my $tbl = $DB->table('Spider_Validate');
$tbl->delete({ Description => '' });
$tbl->delete({ Keywords => '' });

# "Column = NULL" never matches in SQL, so use IS NULL for NULL fields.
$tbl->delete(GT::SQL::Condition->new('Description', 'IS', \'NULL'));
$tbl->delete(GT::SQL::Condition->new('Keywords', 'IS', \'NULL'));

print "done!";

# end of file
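To run it, save it under a name like cleanup.cgi (the name is just an example) in a CGI-enabled directory, make it executable (chmod 755), and call its URL from your browser; it prints "done!" once the deletes have run.
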
About the spider taking already-indexed links and adding keywords to them: the difficulty is getting the keywords in there. If the keywords are not already available on the page, it's rather hard to extract them from the actual document. However, when searches are done, search.cgi looks not only at the keywords but also at the actual document text for word matches, so having Keywords helps, but a page without them can still appear in results.
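
When the keywords do exist in a meta tag on the page, pulling them into the database is scriptable. A minimal sketch, assuming a Keywords column you've added to the Links table and using LWP::Simple for fetching (neither the column nor this approach is part of the Spider plugin itself):

Code:
#!/usr/bin/perl
# Sketch: fill an (assumed) Keywords column on the Links table from each
# page's <meta name="keywords"> tag where one exists. The Keywords column
# and this whole approach are assumptions, not Spider plugin behaviour.
use strict;
use lib '/path/to/admin';       # adjust to your Links admin directory
use Links qw/ $DB $IN /;
use LWP::Simple qw/ get /;
Links::init('/path/to/admin');

print $IN->header();

my $links = $DB->table('Links');
# Only look at links whose Keywords field is still empty.
my $sth = $links->select(['ID', 'URL'], { Keywords => '' });

while (my ($id, $url) = $sth->fetchrow_array) {
    my $html = get($url) or next;   # skip pages that can't be fetched
    # Pull the content attribute out of a meta keywords tag, if present.
    next unless $html =~ m/<meta\s+name=["']?keywords["']?\s+content=["']([^"']*)["']/is;
    $links->update({ Keywords => $1 }, { ID => $id });
}

print "done!";

The regex only catches the common name-then-content attribute order, so treat it as a starting point rather than a robust parser.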

Hope that helps,
Aki