Gossamer Forum: General: Perl Programming: Which is better for my purposes [Parser]

Jun 12, 2002, 8:10 AM

Ian

Veteran (2577 posts)

Post #1 of 11

Which is better for my purposes [Parser]

Which would be better for parsing only the title and description from a site's meta tags? I need a commonly installed module so the code can run on anyone's site.

HTML::Parse, HTML::Parser, HTML::TokeParser, or write my own?

I am currently using LWP::Simple to grab the site into a local variable.

I want the parser to be smart enough to recognise possible missing meta tags, and variations on the meta tag.

Like:

Code:
 $metadata =~ s/HTTP-EQUIV/name/gi;
 if ($metadata =~ m/>/gi)
  {
  $end_metad = pos($metadata);
  $metadata = substr($metadata, 0, $end_metad -1);
  }
 $metadata =~ s/Content="//gi;
 $metadata =~ s/Content = "//gi;
 $metadata =~ s/Content= "//gi;
 $metadata =~ s/Content ="//gi;
 $metadata =~ s/"//g;
 if (($metadata =~ /NAME=DESCRIPTION/i) or
    ($metadata =~ /NAME = DESCRIPTION/i) or
    ($metadata =~ /NAME =DESCRIPTION/i) or
    ($metadata =~ /NAME= DESCRIPTION/i))
  {
  $metadiz = $metadata;
  $metadiz =~ s/NAME=DESCRIPTION//gi;
  $metadiz =~ s/NAME = DESCRIPTION//gi;
  $metadiz =~ s/NAME= DESCRIPTION//gi;
  $metadiz =~ s/NAME =DESCRIPTION//gi;
  $metadiz =~ s/\n//g;
  }
 if (($metadata =~ /NAME=KEYWORDS/i) or
    ($metadata =~ /NAME = KEYWORDS/i) or
    ($metadata =~ /NAME =KEYWORDS/i) or
    ($metadata =~ /NAME= KEYWORDS/i))
  {
  $metakeyw = $metadata;
  $metakeyw =~ s/NAME=KEYWORDS//gi;
  $metakeyw =~ s/NAME = KEYWORDS//gi;
  $metakeyw =~ s/NAME= KEYWORDS//gi;
  $metakeyw =~ s/NAME =KEYWORDS//gi;
  $metakeyw =~ s/\n//g;
  }
 } # presumably closes an enclosing block that is not part of this excerpt
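
The four spacing variants of each test above could probably be folded into one pattern by letting `\s*` absorb the optional spaces around the `=`. This is only a sketch: `meta_field` is a hypothetical helper name, and it assumes the input string holds a single meta tag with a double-quoted content value, as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the content of one named <meta> tag out of a
# string holding that tag. \s* absorbs the optional spaces around "=".
sub meta_field {
    my ($metadata, $field) = @_;
    return undef
        unless $metadata =~ /\b(?:name|http-equiv)\s*=\s*["']?\Q$field\E["']?/i;
    return undef
        unless $metadata =~ /\bcontent\s*=\s*"([^"]*)"/i;
    (my $value = $1) =~ s/\s*\n\s*/ /g;   # fold line breaks into single spaces
    return $value;
}

my $tag = '<META NAME = "Description" CONTENT= "A small test page">';
print meta_field($tag, 'description'), "\n";
```

A missing tag simply comes back as undef, so the caller can decide what to do about it.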

http://www.iuni.com/...tware/web/index.html
Links Plugins

Last edited by:

Ian: Jun 12, 2002, 8:11 AM

Jun 12, 2002, 8:16 AM

Paul

Veteran (19537 posts)

Post #2 of 11

Re: [Ian] Which is better for my purposes [Parser]

Something like this should work just using LWP::Simple to get the page:

Code:
use LWP::Simple;

# get() returns the whole document as one string, or undef on failure
my $page = get($url);
my $description;
my $title;

if (defined $page and $page =~ m#<meta\s+(?:name|http-equiv)="?description"?\s+(?:content|value)="?([^"]+)"?>#i) {
    $description = $1;
}

Then to get the title....

Code:
# /s lets . match newlines, so a title split over several lines is still found
if ($page =~ m#<title>(.*?)</title>#is) {
    $title = $1; 
}

Last edited by:

Paul: Jun 12, 2002, 8:19 AM

Jun 12, 2002, 8:18 AM

Wil

Veteran / Moderator (4108 posts)

Post #3 of 11

Re: [Ian] Which is better for my purposes [Parser]

HTML::TokeParser.

- wil

Jun 12, 2002, 8:20 AM

Ian

Veteran (2577 posts)

Post #4 of 11

Re: [Paul] Which is better for my purposes [Parser]

Hi Paul,

Nice and simple solution (not your code, but the use of only LWP::Simple). I like it.

Will this take into account any variations on how the description appears in the html file? I guess the question marks do this... sorry dumb question!
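
One way to see which variations the pattern from post #2 tolerates is to run it against a few hand-written tags. The sample tags below are made up for illustration, and `description_via_regex` is a hypothetical wrapper name. The optional `"?` covers missing double quotes, but spaces around `=`, reversed attribute order, and single quotes are not matched.

```perl
use strict;
use warnings;

# The pattern from post #2, wrapped in a hypothetical helper.
sub description_via_regex {
    my ($page) = @_;
    return $page =~ m#<meta\s+(?:name|http-equiv)="?description"?\s+(?:content|value)="?([^"]+)"?>#i
        ? $1 : undef;
}

# Made-up sample tags, one variation each.
my @samples = (
    '<meta name="description" content="Quoted">',          # matches
    '<META NAME=description CONTENT="Unquoted name">',     # matches
    '<meta name = "description" content="Spaces at =">',   # no match
    '<meta content="Reversed order" name="description">',  # no match
    q{<meta name='description' content='Single quotes'>},  # no match
);

for my $tag (@samples) {
    my $found = description_via_regex($tag);
    printf "%-8s %s\n", defined $found ? 'match' : 'no match', $tag;
}
```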

Thanks for the assistance :)

EDIT: Wil, why TokeParser, if you don't mind explaining?

Last edited by:

Ian: Jun 12, 2002, 8:21 AM

Jun 12, 2002, 8:55 AM

Wil

Veteran / Moderator (4108 posts)

Post #5 of 11

Re: [Ian] Which is better for my purposes [Parser]

Because it already does what Paul had to build a regex for.

Code:
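  use HTML::TokeParser;

  my $page = "/path/to/page.html";

  # Pass the file name itself; a scalar reference (\$page) would be
  # parsed as HTML text rather than opened as a file.
  my $parser = HTML::TokeParser->new($page) or die "Can't open $page: $!";
  my $description;

    while (my $token = $parser->get_tag("meta")) {

      # $token->[1] is the attribute hash; the name attribute may be absent
      if (($token->[1]{name} || '') =~ /description/i) {
        $description = $token->[1]{content};
        last;
      }
    }

    # Only give up once every meta tag has been checked
    die "Meta tag DESCRIPTION not found" unless defined $description;

    print $description;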

I guess I just believe in 'using the correct tool for the job'.
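
For a page that is already held in a variable (for example, fetched with LWP::Simple), the same approach might look like the sketch below. `title_and_description` is a hypothetical helper name and the sample HTML is made up. Because the parser hands back attributes as a hash with lower-cased names, attribute order and case stop mattering.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return (title, description) from an HTML string.
sub title_and_description {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);   # scalar ref = parse this text
    my ($title, $description);

    while (my $token = $parser->get_tag('title', 'meta')) {
        my ($tag, $attr) = @$token;
        if ($tag eq 'title') {
            # text up to </title>, with whitespace trimmed
            $title = $parser->get_trimmed_text('/title') unless defined $title;
        }
        elsif (lc($attr->{name} || $attr->{'http-equiv'} || '') eq 'description') {
            $description = $attr->{content} unless defined $description;
        }
    }
    return ($title, $description);   # either may be undef if the tag is missing
}

my $html = <<'HTML';
<html><head>
<title>Example page</title>
<META CONTENT="Attribute order does not matter here" NAME="Description">
</head><body></body></html>
HTML

my ($title, $description) = title_and_description($html);
print "Title: $title\nDescription: $description\n";
```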

Cheers

- wil

Last edited by:

Wil: Jun 12, 2002, 8:57 AM

Jun 12, 2002, 8:59 AM

Ian

Veteran (2577 posts)

Post #6 of 11

Re: [Wil] Which is better for my purposes [Parser]

Thanks Wil (and thanks for the code example also).

One more question, if you don't mind: how common would the HTML::TokeParser module be on the systems of potential users of my code? Is it a standard module that most hosts would have installed? I know mine has it.

Thanks again.

Jun 12, 2002, 9:02 AM

Wil

Veteran / Moderator (4108 posts)

Post #7 of 11

Re: [Ian] Which is better for my purposes [Parser]

It's part of the default installation as far as I can tell, so yeah, it's pretty portable.
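
Rather than relying on that, code meant for other people's hosts can check at run time whether the module loads and fall back to a regex if it does not. A minimal sketch; `have_module` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded on this host.
sub have_module {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # HTML::TokeParser -> HTML/TokeParser.pm
    return eval { require $file; 1 } ? 1 : 0;
}

if (have_module('HTML::TokeParser')) {
    print "HTML::TokeParser is available; use the parser\n";
}
else {
    print "HTML::TokeParser is missing; fall back to a regex\n";
}
```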

- wil

Jun 12, 2002, 9:07 AM

Paul

Veteran (19537 posts)

Post #8 of 11

Re: [Wil] Which is better for my purposes [Parser]

>>Because it already does what Paul had to build a regex for. <<

You'll probably find that the regex is a bit faster than loading a module ;)

Jun 12, 2002, 9:08 AM

Wil

Veteran / Moderator (4108 posts)

Post #9 of 11

Re: [Paul] Which is better for my purposes [Parser]

Have you seen how many dependencies LWP::Simple has? ;-)

- wil

Jun 12, 2002, 9:28 AM

Paul

Veteran (19537 posts)

Post #10 of 11

Re: [Wil] Which is better for my purposes [Parser]

The comment you made was regarding HTML::TokeParser and my regex... that has nothing to do with LWP::Simple.

If you want to be fussy, though, then I can give you some other ways that will be quicker ;)

Code:
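sub get_url {
#--------------------------------------------------
# Grab a url. Takes a host name and a path; returns the
# response body, or undef on failure or a non-2xx status.

   my ($some_host, $some_path) = @_;
   my ($buffer, $n) = ('');

   require IO::Socket;

   local $^W;
   # PeerAddr must be a host name (or IP address), not a full URL.
   my $sock = IO::Socket::INET->new(PeerAddr => $some_host,
                                    PeerPort => 80,
                                    Proto    => 'tcp',
                                    Timeout  => 60) || return;
   $sock->autoflush;
   print $sock join("\015\012" => "GET $some_path HTTP/1.0",
                                  "Host: $some_host",
                                  "User-Agent: PaulBot", "", "");

   # Append to $buffer in 8 KB chunks until EOF (0) or error (undef).
   1 while $n = sysread($sock, $buffer, 8*1024, length($buffer));
   return undef unless defined($n);

   # Check the status line, then strip the headers.
   if ($buffer =~ m,^HTTP/\d+\.\d+\s+(\d+)[^\012]*\012,) {
       my $code = $1;
       return undef unless $code =~ /^2/;
       $buffer =~ s/.+?\015?\012\015?\012//s;
   }

   return $buffer;
}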

Last edited by:

Paul: Jun 12, 2002, 9:32 AM

Jun 12, 2002, 9:47 AM

Alex

Administrator (9387 posts)

Post #11 of 11

Re: [Paul] Which is better for my purposes [Parser]

It's probably a good idea to do it without LWP, as it's not commonly installed (doesn't ship with perl by default).

You could use Paul's example, or look at how verify-child.pl does it in Links SQL, or use GT::URI to get the module if this is something for one of our products.
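
A socket-based fetch like the one in post #10 needs the host, port and path pulled out of the URL first. A minimal sketch, assuming plain http:// URLs only; `split_http_url` is a hypothetical helper and is not taken from Links SQL or GT::URI.

```perl
use strict;
use warnings;

# Hypothetical helper: split an http:// URL into (host, port, path).
# Returns an empty list (undef in scalar context) for anything else.
sub split_http_url {
    my ($url) = @_;
    return unless defined $url
        and $url =~ m{^http://([^/:?#]+)(?::(\d+))?([^#]*)}i;
    my ($host, $port, $path) = ($1, $2 || 80, $3);
    $path = "/$path" unless $path =~ m{^/};   # covers "" and "?query"
    return ($host, $port, $path);
}

my ($host, $port, $path) = split_http_url('http://www.example.com/links/index.html');
print "host=$host port=$port path=$path\n";
```

The fragment after `#` is dropped because it is never sent to the server.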

Cheers,

Alex
--
Gossamer Threads Inc.