Gossamer Forum

Which is better for my purposes [Parser]

Which would be better for parsing only the title and description from a site's meta tags? I'd prefer the more commonly installed module so I can use it on anyone's site.

HTML::Parse, HTML::Parser, HTML::TokeParser, or write my own?

I am currently using LWP::Simple to grab the site into a local variable.

I want the parser to be smart enough to recognise possible missing meta tags, and variations on the meta tag.

Like:

Code:
$metadata =~ s/HTTP-EQUIV/name/gi;

if ($metadata =~ m/>/gi)
{
    $end_metad = pos($metadata);
    $metadata = substr($metadata, 0, $end_metad - 1);
}

$metadata =~ s/Content="//gi;
$metadata =~ s/Content = "//gi;
$metadata =~ s/Content= "//gi;
$metadata =~ s/Content ="//gi;
$metadata =~ s/"//g;

if (($metadata =~ /NAME=DESCRIPTION/i) or
    ($metadata =~ /NAME = DESCRIPTION/i) or
    ($metadata =~ /NAME =DESCRIPTION/i) or
    ($metadata =~ /NAME= DESCRIPTION/i))
{
    $metadiz = $metadata;
    $metadiz =~ s/NAME=DESCRIPTION//gi;
    $metadiz =~ s/NAME = DESCRIPTION//gi;
    $metadiz =~ s/NAME= DESCRIPTION//gi;
    $metadiz =~ s/NAME =DESCRIPTION//gi;
    $metadiz =~ s/\n//g;
}

if (($metadata =~ /NAME=KEYWORDS/i) or
    ($metadata =~ /NAME = KEYWORDS/i) or
    ($metadata =~ /NAME =KEYWORDS/i) or
    ($metadata =~ /NAME= KEYWORDS/i))
{
    $metakeyw = $metadata;
    $metakeyw =~ s/NAME=KEYWORDS//gi;
    $metakeyw =~ s/NAME = KEYWORDS//gi;
    $metakeyw =~ s/NAME= KEYWORDS//gi;
    $metakeyw =~ s/NAME =KEYWORDS//gi;
    $metakeyw =~ s/\n//g;
}
}   # presumably closes an enclosing block not shown in this excerpt


http://www.iuni.com/...tware/web/index.html
Links Plugins

Last edited by Ian: Jun 12, 2002, 8:11 AM
Re: [Ian] Which is better for my purposes [Parser]
Something like this should work just using LWP::Simple to get the page:

Code:
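use LWP::Simple;

# get() returns the whole page as a single string, or undef on failure.
my $page = get($url);
die "Couldn't fetch $url" unless defined $page;
my $description;
my $title;

if ($page =~ m#<meta\s+(?:name|http-equiv)="?description"?\s+(?:content|value)="?([^"]+)"?>#i) {
    $description = $1;
}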

Then to get the title....

Code:
# /s lets the match cross line breaks, in case the title is wrapped.
if ($page =~ m#<title>(.*?)</title>#is) {
    $title = $1;
}

Last edited by Paul: Jun 12, 2002, 8:19 AM
Re: [Ian] Which is better for my purposes [Parser]
HTML::TokeParser.

- wil
Re: [Paul] Which is better for my purposes [Parser]
Hi Paul,

Nice and simple solution (not your code, but the use of only LWP::Simple). I like it.

Will this take into account any variations on how the description appears in the html file? I guess the question marks do this... sorry dumb question!
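
To make the question concrete: the regex above expects double quotes (or none), no spaces around "=", and the name attribute before content. A rough sketch of a more tolerant approach that still needs no extra modules is below. The helper name meta_content and the sample HTML are made up for illustration, and this is still pattern matching rather than real HTML parsing (a ">" inside an attribute value would confuse it).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the content of the
# first <meta> tag whose name or http-equiv attribute equals $wanted.
sub meta_content {
    my ($html, $wanted) = @_;

    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        my %attr;

        # Collect key=value pairs in any order; values may be double-quoted,
        # single-quoted, or bare, with optional spaces around "=".
        while ($attrs =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g) {
            my $value = defined $2 ? $2 : defined $3 ? $3 : $4;
            $attr{lc $1} = $value;
        }

        my $key = defined $attr{name} ? $attr{name} : $attr{'http-equiv'};
        next unless defined $key and lc $key eq lc $wanted;
        return $attr{content};
    }
    return;    # no matching tag found
}

my $html = <<'HTML';
<meta http-equiv="Content-Type" content="text/html">
<META CONTENT = "A test page" NAME = description>
HTML

print meta_content($html, 'description'), "\n";
```

The second tag in the sample has upper-case names, spaces around "=", an unquoted value, and content before name, which are the kinds of variation the single-regex version would miss.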

Thanks for the assistance :)



EDIT: Wil, why TokeParser, if you don't mind explaining?



Last edited by Ian: Jun 12, 2002, 8:21 AM
Re: [Ian] Which is better for my purposes [Parser]
Because it already does what Paul had to build a regex for.

Code:
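use HTML::TokeParser;

my $page = "/path/to/page.html";

# Pass the filename itself; a scalar reference would be parsed as HTML text.
my $parser = HTML::TokeParser->new($page)
    or die "Can't open $page: $!";

my $description;
while (my $token = $parser->get_tag("meta")) {
    my $attr = $token->[1];
    # Keep scanning past other meta tags instead of dying on the first one.
    if (defined $attr->{name} and $attr->{name} =~ /^description$/i) {
        $description = $attr->{content};
        last;
    }
}

die "Meta tag DESCRIPTION not found" unless defined $description;

print $description;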

I guess I just believe in 'using the correct tool for the job'.
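
Since the page is already sitting in a variable after the LWP::Simple fetch, here is a rough sketch of the same idea working on a string and picking up the title as well. The sample HTML is invented and stands in for whatever get() returned; the rest uses only the documented HTML::TokeParser calls (new with a scalar reference, get_tag, get_trimmed_text).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Stand-in for the string that LWP::Simple's get() would return.
my $html = <<'HTML';
<html><head>
<title> Example page </title>
<meta name="keywords" content="perl, html">
<meta name="Description" content="A short summary.">
</head><body></body></html>
HTML

# A scalar reference tells HTML::TokeParser to parse the string itself.
my $parser = HTML::TokeParser->new(\$html) or die "Can't create parser";

my ($title, $description);
while (my $token = $parser->get_tag('title', 'meta')) {
    my ($tag, $attr) = @$token;
    if ($tag eq 'title') {
        # Text up to the closing tag, with surrounding whitespace trimmed.
        $title = $parser->get_trimmed_text('/title');
    }
    elsif (defined $attr->{name} and lc $attr->{name} eq 'description') {
        $description = $attr->{content};
    }
    last if defined $title and defined $description;
}

print "Title: ",       (defined $title       ? $title       : '(none)'), "\n";
print "Description: ", (defined $description ? $description : '(none)'), "\n";
```

Either value is simply left undefined when the tag is missing, so the caller can decide what to do about a page without a description.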

Cheers

- wil

Last edited by Wil: Jun 12, 2002, 8:57 AM
Re: [Wil] Which is better for my purposes [Parser]
Thanks Wil (and thanks for the code example also).

One more question, if you don't mind: how common would the HTML::TokeParser module be on the systems of potential users of my code? Is it a standard module that most hosts would have installed? I know mine has it.

Thanks again.


Re: [Ian] Which is better for my purposes [Parser]
It's part of the default installation as far as I can tell, so yeah, it's pretty portable.

- wil
Re: [Wil] Which is better for my purposes [Parser]
>>Because it already does what Paul had to build a regex for. <<

You'll probably find that the regex is a bit faster than loading a module ;)
Re: [Paul] Which is better for my purposes [Parser]
Have you seen how many dependencies LWP::Simple has? ;-)

- wil
Re: [Wil] Which is better for my purposes [Parser]
The comment you made was regarding HTML::TokeParser and my regex... that has nothing to do with LWP::Simple.

If you want to be fussy, though, I can give you some other ways that will be quicker ;)

Code:
sub get_url {
#--------------------------------------------------
# Grab a url.
# $some_url, $some_path and $some_host are placeholders to fill in.

    my ($buffer, $n);

    require IO::Socket;

    local $^W;
    my $sock = IO::Socket::INET->new(PeerAddr => $some_url,
                                     PeerPort => 80,
                                     Proto    => 'tcp',
                                     Timeout  => 60) || return;
    $sock->autoflush;
    print $sock join("\015\012" => "GET $some_path HTTP/1.0",
                                   "Host: $some_host",
                                   "User-Agent: PaulBot", "", "");

    # Read until EOF (0) or error (undef), appending to $buffer.
    1 while $n = sysread($sock, $buffer, 8*1024, length($buffer));
    return undef unless defined($n);

    if ($buffer =~ m,^HTTP/\d+\.\d+\s+(\d+)[^\012]*\012,) {
        my $code = $1;
        return undef unless $code =~ /^2/;
        # Strip the response headers, leaving only the body.
        $buffer =~ s/.+?\015?\012\015?\012//s;
    }

    return $buffer;
}

Last edited by Paul: Jun 12, 2002, 9:32 AM
Re: [Paul] Which is better for my purposes [Parser]
It's probably a good idea to do it without LWP, as it's not commonly installed (doesn't ship with perl by default).

You could use Paul's example, or look at how verify-child.pl does it in Links SQL, or use GT::URI to get the module if this is something for one of our products.
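
If the script has to run on hosts where you can't know in advance what is installed, one option is to test for a module at runtime and fall back to something else when it's absent. A minimal sketch, assuming nothing beyond core Perl; the helper name have_module is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true if $module can be loaded.
sub have_module {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # LWP::Simple -> LWP/Simple.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(IO::Socket LWP::Simple HTML::TokeParser)) {
    printf "%-18s %s\n", $module,
        have_module($module) ? 'available' : 'missing';
}
```

A script could use the same check to choose between LWP::Simple and a plain socket fetch like the one above.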

Cheers,

Alex
--
Gossamer Threads Inc.