Gossamer Forum

Extract URL + its link title from HTML (Plz Help)

Hi! Does anyone have any idea how to extract the URLs (of a certain type of file) from a webpage and put them in a flat text database?

For example, file.html contains the following:
<a href="../files/num_1.zip">1st one</a>
<a href="http://www.fff.com/files/num_2.zip">2nd one</a>

which should be processed and written to spidered.db as:
1st one|../files/num_1.zip
2nd one|http://www.fff.com/files/num_2.zip

I know this is kind of complicated. I tried, but nothing works :(
Thanks in advance!

Re: Extract URL and its link title
use LWP::Simple;

$src = get ($url);
while ($src =~ m#<a\s href\s*=\s*"?([^"] ?)"?>(. ?)</a>#ig) {
($link, $title) = ($1, $2);
$output .= "$title|$link\n";
}

open FILE, ">>spidered.db" or die "Can't open spidered.db: $!";
print FILE $output;
close FILE;

Jerry Su
widgetz sucks

Re: Extract URL and its link title
Hi! Thanks for replying!=)

This is the script, but it prints nothing. Any idea why?
#!perl
use LWP::Simple;
$URL = "http://www.perl.com";
$src = get($URL);

while ($src =~ m#<a\s+ href\s*=\s*"?([^"] ?)"?>(. ?)</a>#ig) {
($link, $title) = ($1, $2);
$output .= "$title|$link\n";
}
print "$output";

Re: Extract URL and its link title
Is that really your path to Perl? At any rate, you're not sending any headers. Precede the print statement with the following:

print "Content-type: text/html\n\n";
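
For anyone running this as CGI, here is a minimal sketch of the header advice above. The `$output` value is a hypothetical stand-in for what the loop builds, and `text/plain` is an assumption (not what was suggested above) so that a browser shows one link per line instead of collapsing the newlines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the "title|link" lines built by the loop.
my $output = "1st one|../files/num_1.zip\n";

# A CGI response needs a header block ended by a blank line before the body.
# text/plain is an assumption here so the lines are shown unmodified.
my $response = "Content-type: text/plain\n\n" . $output;

print $response;
```

From the command line the header is just printed as ordinary text, so it does no harm there.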


Dan Cool


Re: Extract URL and its link title
Thanks for replying Dan.

I tried adding that line before, but it doesn't help. And yup, that's my correct path to Perl. =)

I tried the script in a browser and on the command line. It won't print a thing. I checked, and it fetches the HTML correctly when I put in the line:
print "$src";

There's just something wrong with the lines following get(). Basically, the script runs but won't print. Thanks.

Re: Extract URL and its link title
Do you have LWP installed?

Jerry Su
widgetz sucks

Re: Extract URL and its link title
Yes indeed:)

Someone said the $title and $link variables seem to be undefined, meaning that nothing will print. I really have no clue what to do. Any ideas?

Thanks for your help.

Re: Extract URL and its link title
Umm, this bulletin board messes with the code: some symbols went missing from what I posted. The + signs got dropped, so the while line should be:

while ($src =~ m#<a\s+href\s*=\s*"?([^"]+?)"?>(.+?)</a>#ig) {
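
Putting that line into a whole script, here is a rough sketch that can be run without a network connection. It assumes the sample links from the first post as input (in the real script `$src` would come from `get($url)`), adds one hypothetical non-zip link to show filtering, and the `.zip` check is only a guess at what "certain type of file" means:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input from the first post, plus one hypothetical non-zip link.
# In the real script this would be: my $src = get($url);
my $src = <<'HTML';
<a href="../files/num_1.zip">1st one</a>
<a href="http://www.fff.com/files/num_2.zip">2nd one</a>
<a href="index.html">home</a>
HTML

my $output = '';
while ($src =~ m#<a\s+href\s*=\s*"?([^"]+?)"?>(.+?)</a>#ig) {
    my ($link, $title) = ($1, $2);
    # Assumption: only .zip links are wanted; adjust the extension as needed.
    next unless $link =~ /\.zip$/i;
    $output .= "$title|$link\n";
}

# Append to the flat-file database, failing loudly if it can't be opened.
open(my $fh, '>>', 'spidered.db') or die "Can't open spidered.db: $!";
print $fh $output;
close($fh) or die "Can't close spidered.db: $!";

print $output;
# prints:
# 1st one|../files/num_1.zip
# 2nd one|http://www.fff.com/files/num_2.zip
```

Keep in mind the pattern only catches anchors where href is the first and only attribute, and where the link text sits on one line, so it will miss some links on real pages.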

Jerry Su
widgetz sucks