Gossamer Forum

HTML Parser

Does anyone know where I can find a Perl or PHP script that can parse an HTML file and write the information to the screen or to a file?
Thanks
Re: [barryb12000] HTML Parser In reply to
Parse what exactly?

Do you mean just reading the file, or actually parsing the tags?
Re: [Paul] HTML Parser In reply to
I want to strip out all the HTML tags and any comments so that only the text is left. The file is a web page (on NFL.com), and I want to extract the stats from the games.

Here is an example of the file and what I want to parse.

<!--RUSHINGSTATS--><TR align=right class=bg2>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/1151.htm>Watters</a></td><td>16</td><td>97</td><td>0</td><td>0</td>
<td width=7 class=bg3>&nbsp;</td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/192822.htm>White</a></td><td>10</td><td>52</td><td>0</td><td>0</td></tr>
<TR align=right class=bg3>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/12429.htm>Hasselbeck</a></td><td>5</td><td>9</td><td>0</td><td>0</td>
<td width=7 class=bg3>&nbsp;</td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/235220.htm>Jackson</a></td><td>14</td><td>37</td><td>0</td><td>0</td></tr>
<TR align=right class=bg2>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/187382.htm>Alexander</a></td><td>4</td><td>3</td><td>0</td><td>0</td>
<td width=7 class=bg3>&nbsp;</td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/133260.htm>Couch</a></td><td>1</td><td>1</td><td>0</td><td>0</td></tr>
<TR align=right class=bg3>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/1161.htm>Strong</a></td><td>1</td><td>0</td><td>0</td><td>0</td>
<td width=7 class=bg3>&nbsp;</td>
<TD colspan=5 class=bg3>&nbsp;</td></tr>

Output Example:

Watters|16|97|0|0

White|10|52|0|0

Hasselbeck|5|9|0|0

Jackson|14|37|0|0
Re: [barryb12000] HTML Parser In reply to
Let me post some code that I have used before for extracting keywords out of a web page. It should be pretty obvious what needs to be adjusted for this to work on your web page. If not, let me know.

Code:
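#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser;

my $page = "/path/to/page.html";

# Pass the file name itself; a reference (\$page) would be parsed as literal HTML.
my $parser = HTML::TokeParser->new($page)
    or die "Cannot open $page: $!";

my $content;
while (my $token = $parser->get_tag("meta"))
{
    # $token->[1] is a hash of the tag's attributes.
    if (($token->[1]{name} || '') =~ /keywords/i)
    {
        $content = $token->[1]{content};
        last;
    }
}

# Only give up once every meta tag has been checked.
die "Meta keywords tag not found in $page\n" unless defined $content;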

# $content now holds meta keywords from $page.
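
As a rough idea of the kind of adjustment meant here, the sketch below points the same module at table cells instead of meta tags. It is only a sketch under some assumptions: the sample HTML is two rows copied from the question, stats_lines is a made-up helper name, and it assumes every player has exactly five cells (name plus four numbers) with the &nbsp; cells being spacers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample rows copied from the question; a real script would read the saved page.
my $html = <<'HTML';
<!--RUSHINGSTATS--><TR align=right class=bg2>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/1151.htm>Watters</a></td><td>16</td><td>97</td><td>0</td><td>0</td>
<td width=7 class=bg3>&nbsp;</td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/192822.htm>White</a></td><td>10</td><td>52</td><td>0</td><td>0</td></tr>
HTML

# Hypothetical helper: returns one "name|n|n|n|n" line per player.
sub stats_lines {
    my ($doc) = @_;
    # A reference to a string makes the parser read that string as HTML.
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser";
    my (@lines, @cells);
    while ($parser->get_tag('td')) {
        # Text up to the closing </td>; nested tags such as <a> are skipped.
        my $text = $parser->get_trimmed_text('/td');
        $text =~ s/\xA0//g;      # &nbsp; is decoded to character 160
        next if $text eq '';     # spacer cell, nothing to keep
        push @cells, $text;
        if (@cells == 5) {       # assumed: five cells per player
            push @lines, join('|', @cells);
            @cells = ();
        }
    }
    return @lines;
}

print "$_\n" for stats_lines($html);
```

Comments are skipped automatically because get_tag only stops at tags. If the real page has other tables with td cells, the five-cells-per-player assumption will need tightening.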

- wil
Re: [barryb12000] HTML Parser In reply to
Code:
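use strict;
use warnings;
use LWP::Simple;

my $url = 'http://www.nfl.com';

# get() returns the whole page as one string, or undef if the fetch fails.
my $string = get($url);
die "Could not fetch $url\n" unless defined $string;
my @data;

# Each match captures the player name and the four stat cells that follow it.
while ($string =~ m|<TD align=left width=60><A[^>]+>([^<]+)</a></td><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td>|sgi) {
    push @data, [$1,$2,$3,$4,$5];
}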

If you can let me know the exact URL of the stats page I can make sure it works properly.
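
Once @data is filled, getting the pipe-delimited output asked for is mostly a join. The sketch below uses rows typed in by hand from the sample HTML in the question rather than a live fetch; format_rows and the output file name are made up for illustration.

```perl
use strict;
use warnings;

# Rows in the same shape as @data above: [name, four stat columns].
# These values are taken from the sample HTML in the question.
my @data = (
    ['Watters',    16, 97, 0, 0],
    ['White',      10, 52, 0, 0],
    ['Hasselbeck',  5,  9, 0, 0],
    ['Jackson',    14, 37, 0, 0],
);

# Hypothetical helper: turn each row into one pipe-delimited line.
sub format_rows {
    my @rows = @_;
    return map { join('|', @$_) } @rows;
}

my @lines = format_rows(@data);

# To the screen:
print "$_\n" for @lines;

# To a file (the file name is only an example):
my $outfile = 'rushing_stats.txt';
open(my $fh, '>', $outfile) or die "Cannot write $outfile: $!";
print {$fh} "$_\n" for @lines;
close($fh) or die "Cannot close $outfile: $!";
```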

Last edited by Paul: Mar 21, 2002, 1:43 AM
Re: [Paul] HTML Parser In reply to
Thanks, Paul.

Here is a game from week 1 in the NFL:

http://scores.nfl.com/...20010909_SEA@CLE.htm

Thanks again. I've been beating myself up for weeks now trying to get something to work.