Gossamer Forum

LinksExtor

Just wondering if anyone has any prior experience with HTML::LinkExtor. I'm trying to use it:

http://www.nihongo.org/snowhare/utilities/perldoc2tree/example/HTML/LinkExtor.html

I tried using their demo, which doesn't even work! There was an extra ; in there somewhere, so I removed that, and now I get:

A fatal error has occured:

Unable to load plugin: Test (Compilation failed in require at admin.cgi line 198.
) at admin.cgi line 200.

Please enable debugging in setup for more details.

Just wondering if anyone has any ideas or prior experience with it :)

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] LinksExtor
Yep I've used it.

I can't help without seeing the code.
Re: [Paul] LinksExtor
Thanks for the reply :)

Code:
# grab the page we want to spider
my $page  = $IN->param('URL');
my $email = $IN->param('Email');
my $name  = $IN->param('Name');

# cut up the domain name...
#my @sliced_url = split(/\//,$page);
#my $domain = $sliced_url[2];
# $page =~ s/http\:\/\/$domain\///g;
# if ($page =~ /http/) { $page = ""; }

print $IN->header();

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

$url = "http://www.sn.no/"; # for instance
$ua = new LWP::UserAgent;

# Set up a callback that collects image links
my @imgs = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag ne 'img'; # we only look closer at <img ...>
    push(@imgs, values %attr);
}

# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});

# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;

# Print them out
print join("\n", @imgs), "\n";

The commented-out lines are there because I'm trying the demo code they give, so there is no need for them to run.

Cheers

Andy (mod)
Re: [Andy] LinksExtor
Are you using strict?
Re: [Paul] LinksExtor
Yeah, but that is another modification I made (adding 'my' declarations for all the variables not already declared).

Cheers

Andy (mod)
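
A note for later readers on why the strict question matters: under "use strict", a variable used without a 'my' declaration is a compile-time error, which is the kind of failure that surfaces as "Compilation failed in require" when the plugin is loaded. The following is only a minimal sketch of that behaviour, not the plugin code itself. It compiles two tiny snippets with a string eval so the failure can be caught and printed; the snippet contents and the variable names $ok and $fixed are made up for illustration.

```perl
use strict;
use warnings;

# Compile a snippet at runtime so a compile error can be caught instead of killing the script
my $ok = eval q{
    use strict;
    $url = "http://www.sn.no/";    # no 'my': strict rejects this at compile time
    1;
};
print "compile failed: $@" unless $ok;

# Same snippet with the declaration added
my $fixed = eval q{
    use strict;
    my $url = "http://www.sn.no/";
    1;
};
print "fixed version compiles\n" if $fixed;
```

The first eval returns undef and leaves the compiler's complaint about the undeclared variable in $@; the second compiles cleanly.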
Re: [Paul] LinksExtor
OK, the problem with their code was just that I had missed a 'my' declaration. Fixed that, and now it works fine. I'm trying to use the following code to get URLs from a specific page:

Code:
print $IN->header();

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $ua = new LWP::UserAgent;

# Set up a callback that collects link URLs
my @urls = ();
my ($p, $res);
sub callback {
    my($tag, %attr) = @_;
    return if $tag eq 'img'; # skip <img ...> tags; every other link tag gets through
    push(@urls, values %attr);
}

# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $page)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $page),
                    sub {$p->parse($_[0])});

# Expand all URLs to absolute ones
my $base = $res->base;
@urls = map { $_ = url($_, $base)->abs; } @urls;

# Print them out
print join("<BR>", @urls), "\n";

The problem I am having is that it is returning images too. For example, http://www.gossamer-threads.com returns:

Quote:
http://www.gossamer-threads.com/includes/threads.css
http://www.gossamer-threads.com/index.htm
http://www.gossamer-threads.com/scripts/index.htm
http://www.gossamer-threads.com/services/index.htm
http://www.gossamer-threads.com/support/index.htm
http://www.gossamer-threads.com/contact/index.htm
http://www.gossamer-threads.com/jobs/index.htm
http://www.gossamer-threads.com/images/black.gif
http://www.gossamer-threads.com/corporate/index.htm
http://www.gossamer-threads.com/perl/gforum/gforum.cgi?forum=16
http://www.gossamer-threads.com/forum/Gossamer_AutoRespond_1.1.1_Security_Fix_P216296/
http://www.gossamer-threads.com/forum/Forum_Guidelines_P216271/
http://www.gossamer-threads.com/forum/Gossamer_Forum_1.1.8_Released!_P213168/
http://www.gossamer-threads.com/forum/Gossamer_AutoRespond_1.1.0_Now_Available!_P211856/
http://www.gossamer-threads.com/forum/Gossamer_Forum_1.1.7_Released!_P205091/
http://www.gossamer-threads.com/scripts/index.htm
http://www.gossamer-threads.com/scripts/webmail/index.htm
http://www.gossamer-threads.com/scripts/links-sql/index.htm
http://www.gossamer-threads.com/scripts/gforum/index.htm
http://www.gossamer-threads.com/scripts/autores/index.htm
http://www.gossamer-threads.com/scripts/mysqlman/index.htm
http://www.gossamer-threads.com/scripts/dbman-sql/index.htm
http://www.gossamer-threads.com/scripts/dbman/index.htm
http://www.gossamer-threads.com/scripts/links/index.htm
http://www.gossamer-threads.com/scripts/fileman/index.htm
http://www.gossamer-threads.com/scripts/register/index.htm
http://www.gossamer-threads.com/services/index.htm
http://www.gossamer-threads.com/support/index.htm
http://www.gossamer-threads.com/perl/gforum/
http://www.gossamer-threads.com/scripts/resources/
http://www.gossamer-threads.com/images/foot_bkgd.gif
http://www.gossamer-threads.com/index.htm
http://www.gossamer-threads.com/scripts/index.htm
http://www.gossamer-threads.com/services/index.htm
http://www.gossamer-threads.com/support/index.htm
http://www.gossamer-threads.com/contact/index.htm
http://www.gossamer-threads.com/jobs/index.htm
http://www.gossamer-threads.com/images/foot_bkgd.gif

I'm still trying to work out how to edit this sub so that it will not pass on certain things, such as .gif, .jpg, and .css files.

Any ideas?

Cheers

Andy (mod)
Re: [Andy] LinksExtor
Try changing:

return if $tag eq 'img';

to

return if $tag eq 'a href';

or

return if $tag eq 'a';
Re: [Paul] LinksExtor
Turns out I needed:

Code:
return if $tag ne 'a'; # we only look closer at <a ...>

Cheers

Andy (mod)
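
For anyone wanting to go one step further and also drop .gif, .jpg, and .css targets, here is a rough sketch of a stricter callback. As far as I understand the module, HTML::LinkExtor calls the callback with the tag name followed by that tag's link attributes as key/value pairs, so tags such as <link href> or <td background> reach it as well, which would explain the stylesheet and image URLs in the list above. To keep the sketch runnable without fetching a page, the parser is left out and the callback is called by hand with made-up arguments in that same shape; the extension list in the pattern is an assumption and can be adjusted.

```perl
use strict;
use warnings;

my @urls;

# Same argument shape as an HTML::LinkExtor callback: tag name, then attribute pairs
sub callback {
    my ($tag, %attr) = @_;
    return if $tag ne 'a';                 # only <a ...> tags
    return unless defined $attr{href};     # only their href attribute
    # Skip targets that end in an unwanted extension (an optional query string is allowed)
    return if $attr{href} =~ /\.(?:gif|jpe?g|css)(?:\?.*)?$/i;
    push @urls, $attr{href};
}

# Simulated calls standing in for what the parser would pass along
callback('link', href       => '/includes/threads.css');
callback('a',    href       => '/index.htm');
callback('td',   background => '/images/black.gif');
callback('a',    href       => '/images/foot_bkgd.gif');
callback('a',    href       => '/perl/gforum/gforum.cgi?forum=16');

print join("\n", @urls), "\n";
```

Only /index.htm and /perl/gforum/gforum.cgi?forum=16 are kept. Picking $attr{href} explicitly, rather than pushing values %attr, also keeps any other link attribute on the same tag out of the list.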