Gossamer Forum
Andy's Spider Bug??
When I spider some sites, I get "Restricted!" in the title.
On the pages where this happens, the title in the page's HTML looks like:
<title>
This is the title.
</title>

If the title is on one line, like <title>This is the title.</title>, it spiders okay.

I went into Spider.pm and changed
foreach (@html) {
m,<title>(.+?)</title>,i and $title = $1;
}

to

foreach (@html) {
m,<title>
(.+?)
</title>,i and $title = $1;
}

and it spiders the questionable pages okay, but then the ones that were normal from the beginning say "Restricted!".
Is there a solution so it picks up either form?

</not a clue>
Re: [Dinky] Andy's Spider Bug??
Odd, I've never seen that problem before. What happens if you change it to:

Code:
foreach (@html) {
m,<title>(.+?)</title>,i and $title = $1;
}

if (!$title) {
foreach (@html) {
m,<title>
(.+?)
</title>,i and $title = $1;
}
}

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] Andy's Spider Bug??
Since '.' doesn't match newlines (unless you use the /s modifier), what about getting rid of newlines and tabs in the string first? Check for the opening <title>, remove newlines, tabs and anything else you don't want, then do the match.

s,[\n\t],,g

Or maybe

m,<title>\n?(.*?)\n?</title>,i and $title=$1

Don't know if that will work or not.
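
For what it's worth, here is a minimal sketch of the newline idea, assuming the page's HTML can be joined into one string before matching. The extract_title helper and the sample pages are made up for illustration and are not taken from Spider.pm. With the /s modifier '.' also matches newlines, so a single pattern should cover both title layouts:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the <title> text out of a list of HTML lines.
sub extract_title {
    my @html = @_;
    my $page = join '', @html;    # match against the whole page, not line by line

    # /s lets '.' match newlines; /i ignores tag case.
    # \s* on both sides drops newlines and indentation around the text.
    if ($page =~ m,<title\b[^>]*>\s*(.*?)\s*</title>,is) {
        my $title = $1;
        $title =~ s/\s+/ /g;      # collapse inner newlines or tabs to one space
        return $title;
    }
    return;                       # no title found
}

my @one_line   = ("<html><head>\n", "<title>This is the title.</title>\n");
my @multi_line = ("<html><head>\n", "<title>\n", "This is the title.\n", "</title>\n");

print extract_title(@one_line),   "\n";
print extract_title(@multi_line), "\n";
```

Both sample pages should give the same title text. Whether @html in the plugin holds one line per element or the whole page is not shown in this thread, which is why the sketch joins it first.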


PUGDOG® Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Re: [pugdog] Andy's Spider Bug??
In that case, how does this work for you, Dinky?

Code:
foreach (@html) {
m,<title>\n?(.*?)\n?</title>,i and $title=$1
}

Cheers

Andy (mod)
Re: [Andy] Andy's Spider Bug??
Used the first one, it works fine...
When I get time, I'll try the second one...
Thanks for all your help!!

</not a clue>
Re: [Dinky] Andy's Spider Bug??
Hi

I have been using this GREAT plugin since it was released.
The bug you mentioned was discussed before, for some time (I think Andy just did not remember).

The changes you, Andy and Pugdog mentioned will not work.
The *conclusion* (if I remember correctly) was that the cause lies in the site environment rather than in the plugin code: the plugin is being blocked from grabbing selected tags.

That might have changed since then, but I do not know for sure.
Regards
KaTaBd

Users plug In - Multi Search And Remote Search plug in - WebRing plug in - Muslims Directory
Re: [katabd] Andy's Spider Bug??
I think the bug you are referring to was the one where it wasn't finding all the URLs on a particular page? It only did this if more than one URL was listed on the same line. That bug should be fixed in the latest version :)

Cheers

Andy (mod)
Re: [Andy] Andy's Spider Bug??
Hi

OK then, as I said, I was not sure.
Regards
KaTaBd

Re: [katabd] Andy's Spider Bug??
Second question,
Some links come back with:
The following URL's have been grabbed, but were decided that they were not suitable for spidering;

What if I want to spider them anyway? Is there no option or way of doing that?
Thanks,

Also, as a future update, how hard would it be to have an admin function where, once you have spidered all the links, you can tick a check box next to each one and then click add, and it adds all of the checked links to the category you selected, instead of adding them individually?

</not a clue>
Re: [Dinky] Andy's Spider Bug??
Quote:
Second question,
Some links come back with:
The following URL's have been grabbed, but were decided that they were not suitable for spidering;

Yeah, they are links that are unsuitable for spidering (.asp, .pl, .cgi, etc.). The reason I put that in is so it will screen out weird URLs, such as javascript: and mailto:.
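
As a rough illustration of that kind of screening (not the plugin's actual code), the sketch below sorts links by URL scheme. The has_fetchable_scheme helper and the list of skipped schemes are assumptions made for the example:

```perl
use strict;
use warnings;

# Assumed list of schemes a spider cannot fetch; not taken from Spider.pm.
my %skip_scheme = map { $_ => 1 } qw(javascript mailto);

# Hypothetical helper: true if the URL has no scheme or a fetchable one.
sub has_fetchable_scheme {
    my ($url) = @_;
    return 1 unless $url =~ /^([a-z][a-z0-9+.\-]*):/i;   # relative URL, no scheme
    return $skip_scheme{ lc $1 } ? 0 : 1;
}

my (@urls, @unused_urls);
for my $url ('http://example.com/page.html',
             '/relative/page.html',
             'mailto:someone@example.com',
             'javascript:void(0)') {
    if (has_fetchable_scheme($url)) {
        push @urls, $url;           # worth spidering
    } else {
        push @unused_urls, $url;    # reported as "not suitable"
    }
}

print "spider:  $_\n" for @urls;
print "skipped: $_\n" for @unused_urls;
```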

Quote:
What if I want to spider it anyway, there is no option or way of doing it???

I may write these URLs to a file and then, at the end of the process, give the option to scan the checked URLs from that list too. Not easy, but I'll have a look into it (not guaranteeing anything this week though, I'm afraid, as I'm totally booked up).

Quote:
Also, as a future update, how hard would it be to have an admin function that once you have spidered all the links, you can click a check box or something and then click add, and it will add all of the clicked spidered links into the category you selected instead of individually???

I'm not sure this is possible. When you click 'add' in the current plugin, it gives you the list of fields, pre-populated, ready to have a category assigned and be added. The problem is that if any errors turn up (e.g. you forget to assign a category), it will give a fatal GT::SQL::error, and the whole process will die :(
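
One general Perl technique that might help with a batch add is wrapping each insert in eval, so a failure on one link is recorded instead of killing the run. This is only a sketch: add_link is a hypothetical stand-in that dies when no category is set, and whether the real GT::SQL errors can be trapped this way has not been tested here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real insert; dies the way a fatal error would.
sub add_link {
    my ($link) = @_;
    die "no category assigned for $link->{url}\n" unless $link->{category};
    return 1;
}

my @checked = (
    { url => 'http://example.com/a', category => 'Perl' },
    { url => 'http://example.com/b' },                      # category forgotten
    { url => 'http://example.com/c', category => 'Perl' },
);

my (@added, @failed);
for my $link (@checked) {
    # eval traps the die, so one bad link does not stop the batch.
    if (eval { add_link($link); 1 }) {
        push @added, $link->{url};
    } else {
        my $err = $@ || "unknown error\n";
        chomp $err;
        push @failed, "$link->{url}: $err";
    }
}

print "added:  $_\n" for @added;
print "failed: $_\n" for @failed;
```

The failed list could then be shown back to the admin so the missing categories can be filled in and those links retried.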

Cheers

Andy (mod)
Re: [Andy] Andy's Spider Bug??
Was just wondering if there was any resolution to this:
The following URL's have been grabbed, but were decided that they were not suitable for spidering
I need to spider some pages with .cgi URLs, but it won't let me.
Thanks,

</not a clue>
Re: [Dinky] Andy's Spider Bug??
Hi. It should be as simple as editing /admin/Plugins/Spider.pm, changing:

Code:
sub extract_links {
    my ($tag, %attr) = @_;

    if ($tag eq 'a') { #or $tag eq 'img'
        foreach my $key (keys %attr) {
            if ($key eq 'href' or $key eq 'src') {
                my $link_url = URI->new($attr{$key});
                my $full_url = $link_url->abs($base_url);
                if ($full_url !~ /\.cgi/i) {
                    push(@urls,$full_url);
                } else {
                    push(@unused_urls,$full_url);
                }
            }
        }
    }
}

...to

Code:
sub extract_links {
    my ($tag, %attr) = @_;

    if ($tag eq 'a') { #or $tag eq 'img'
        foreach my $key (keys %attr) {
            if ($key eq 'href' or $key eq 'src') {
                my $link_url = URI->new($attr{$key});
                my $full_url = $link_url->abs($base_url);
                # no .cgi check any more: every link is queued for spidering
                push(@urls,$full_url);
            }
        }
    }
}
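
A possible middle ground, sketched under the assumption that you would rather switch the .cgi check on and off than delete it. The sort_links helper and the $allow_cgi setting are hypothetical names for this example and do not exist in the plugin:

```perl
use strict;
use warnings;

# Hypothetical setting: set to 0 to get the original .cgi filtering back.
my $allow_cgi = 1;

# Sort absolute URLs into those to spider and those to set aside.
sub sort_links {
    my ($allow, @links) = @_;
    my (@urls, @unused_urls);
    for my $full_url (@links) {
        if ($allow or $full_url !~ /\.cgi/i) {
            push @urls, $full_url;
        } else {
            push @unused_urls, $full_url;
        }
    }
    return (\@urls, \@unused_urls);
}

my @links = ('http://example.com/index.html',
             'http://example.com/cgi-bin/search.cgi');
my ($urls, $unused) = sort_links($allow_cgi, @links);
printf "%d to spider, %d set aside\n", scalar @$urls, scalar @$unused;
```

With the setting on, both sample links are queued; with it off, the .cgi link goes to the set-aside list as before.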

Cheers

Andy (mod)