Gossamer Forum: General: Perl Programming: HTML Parser

Mar 20, 2002, 11:43 AM

barryb12000

Novice (32 posts)

Post #1 of 6

HTML Parser

Does anyone know where I can find a Perl or PHP script that can parse an HTML file and write the information to the screen or to a file?
Thanks

Mar 20, 2002, 11:44 AM

Paul

Veteran (19537 posts)

Post #2 of 6

Re: [barryb12000] HTML Parser

Parse what exactly?

Do you mean just read the file, or actually parse the tags?

Mar 20, 2002, 12:04 PM

barryb12000

Novice (32 posts)

Post #3 of 6

Re: [Paul] HTML Parser

I want to strip out all the HTML tags and any comments so that only the text is left. The file is a web page (on NFL.com), and I want to pull the stats from the games out of it.

Here is an example of the file and what I want to parse.

<TR align=right class=bg2>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/1151.htm>Watters</a></td><td>16</td><td>97</td><td>0</td><td>0</td>
<td width=7 class=bg3> </td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/192822.htm>White</a></td><td>10</td><td>52</td><td>0</td><td>0</td></tr>
<TR align=right class=bg3>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/12429.htm>Hasselbeck</a></td><td>5</td><td>9</td><td>0</td><td>0</td>
<td width=7 class=bg3> </td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/235220.htm>Jackson</a></td><td>14</td><td>37</td><td>0</td><td>0</td></tr>
<TR align=right class=bg2>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/187382.htm>Alexander</a></td><td>4</td><td>3</td><td>0</td><td>0</td>
<td width=7 class=bg3> </td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/133260.htm>Couch</a></td><td>1</td><td>1</td><td>0</td><td>0</td></tr>
<TR align=right class=bg3>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/1161.htm>Strong</a></td><td>1</td><td>0</td><td>0</td><td>0</td>
<td width=7 class=bg3> </td>
<TD colspan=5 class=bg3> </td></tr>

Output Example:

Watters|16|97|0|0

White|10|52|0|0

Hasselbeck|5|9|0|0

Jackson|14|37|0|0

Mar 21, 2002, 1:33 AM

Wil

Veteran / Moderator (4108 posts)

Post #4 of 6

Re: [barryb12000] HTML Parser

Let me post some code which I have used before for extracting keywords out of a web page. It should be pretty obvious what needs to be adjusted for this to work on your web page. If not, let me know.

Code:
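#!/usr/bin/perl

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $page = "/path/to/page.html";

    # Pass the file name itself; a scalar reference would be parsed as literal HTML.
    my $parser = HTML::TokeParser->new($page)
        or die "Cannot open $page: $!";

    my $content;
    while (my $token = $parser->get_tag("meta"))
    {
        # $token->[1] is a hash reference holding the tag's attributes.
        my $name = $token->[1]{name};
        if (defined $name && $name =~ /keywords/i)
        {
            $content = $token->[1]{content};
            last;
        }
    }

    # Only give up once every meta tag has been checked.
    die "Meta keywords tag not found in $page\n" unless defined $content;

    # $content now holds meta keywords from $page.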
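
For the stats table posted earlier in this thread, a minimal sketch of the same HTML::TokeParser idea might look like the following. It assumes every player name sits in a link whose href contains /players/ and is followed by exactly four stat cells, as in the sample rows. player_lines is a hypothetical helper name, and the real page may well need adjustments.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns one "Name|stat|stat|stat|stat" string per player.
sub player_lines {
    my ($html) = @_;

    # A scalar reference makes HTML::TokeParser parse the string itself.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser\n";

    my @lines;
    while (my $token = $parser->get_tag("a")) {
        # Assumption: only player links have /players/ in their href.
        my $href = $token->[1]{href};
        next unless defined $href && $href =~ m{/players/};

        my $name = $parser->get_trimmed_text("/a");

        # Assumption: the four cells after the name hold the stats.
        my @stats;
        for (1 .. 4) {
            $parser->get_tag("td") or last;
            push @stats, $parser->get_trimmed_text("/td");
        }
        push @lines, join("|", $name, @stats);
    }
    return @lines;
}

# One row copied from the sample in this thread.
my $sample = <<'HTML';
<TR align=right class=bg2>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/1151.htm>Watters</a></td><td>16</td><td>97</td><td>0</td><td>0</td>
<td width=7 class=bg3> </td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/192822.htm>White</a></td><td>10</td><td>52</td><td>0</td><td>0</td></tr>
HTML

print "$_\n" for player_lines($sample);
```

To run it against a saved copy of the page, read the file into a string first and pass that string to player_lines.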

- wil

Mar 21, 2002, 1:43 AM

Paul

Veteran (19537 posts)

Post #5 of 6

Re: [barryb12000] HTML Parser

Code:
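use strict;
use warnings;
use LWP::Simple;

my $url = 'http://www.nfl.com';

# get() returns the whole page as a single string, or undef on failure.
my $string = get($url);
die "Could not fetch $url\n" unless defined $string;
my @data;

# Each match is one player: the name followed by four stat cells.
while ($string =~ m|<TD align=left width=60><A[^>]+>([^<]+)</a></td><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td>|sgi) {
     push @data, [$1,$2,$3,$4,$5];
}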
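
To get from the matched rows to the pipe-delimited output asked for earlier in the thread, something along these lines should work. It is only a sketch: extract_rows and write_rows are hypothetical helper names, the pattern is based on the one above with case-insensitive matching, and the sample string stands in for the fetched page.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [name, stat, stat, stat, stat] array ref per player.
sub extract_rows {
    my ($html) = @_;
    my @rows;
    while ($html =~ m|<TD align=left width=60><A[^>]+>([^<]+)</a></td><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td>|gi) {
        push @rows, [$1, $2, $3, $4, $5];
    }
    return @rows;
}

# Hypothetical helper: prints each row as a pipe-delimited line to a filehandle.
sub write_rows {
    my ($fh, @rows) = @_;
    print {$fh} join('|', @$_), "\n" for @rows;
}

# One row copied from the sample in this thread, standing in for the real page.
my $sample = <<'HTML';
<TR align=right class=bg2>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/1151.htm>Watters</a></td><td>16</td><td>97</td><td>0</td><td>0</td>
<td width=7 class=bg3> </td>
<TD align=left width=60><A target=_new
href=http://www.nfl.com/players/192822.htm>White</a></td><td>10</td><td>52</td><td>0</td><td>0</td></tr>
HTML

my @data = extract_rows($sample);
write_rows(\*STDOUT, @data);

# To write to a file instead (stats.txt is a placeholder name):
# open my $out, '>', 'stats.txt' or die "Cannot open stats.txt: $!";
# write_rows($out, @data);
# close $out;
```

Because the pattern depends on the exact attribute order in the page markup, it will silently match nothing if the site changes its HTML, so it is worth checking that @data is not empty.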

If you can let me know the exact URL of the stats page I can make sure it works properly.

Last edited by Paul: Mar 21, 2002, 1:43 AM

Mar 21, 2002, 8:43 AM

barryb12000

Novice (32 posts)

Post #6 of 6

Re: [Paul] HTML Parser

Thanks Paul

Here is a game from week 1 in the NFL:

http://scores.nfl.com/...20010909_SEA@CLE.htm

Thanks again. I've been beating myself up for weeks now trying to get something to work.