Gossamer Forum

Re : Using HTML-TokeParser

Hi gurus,

I am trying to parse the following

Code:
<HTML>
<HEAD>
<TITLE>TEST</TITLE>
<BODY>

<!-- Begin Navbar Definition -->
<table>
<tr>
<td>Some Data</td>
</tr>
</table>
<!-- End Navbar Definition -->


</BODY>
</HTML>

I need to strip out the bit between the Navbar Definitions .....

I have tried using HTML TokeParser (and Simple) - I can get either ALL the document .... or just the actual comment lines (the two Navbars) but nothing in between ....

Can anyone help me please ....

TIA

Rab

Last edited by:

Wil: Feb 7, 2003, 8:08 AM
Re: [rab54] Re : Using HTML-TokeParser In reply to
If that's all you need to do, it's probably just as simple to do it manually.

This code assumes those are the only two comments you have in the file:

Code:
$html =~ s|<!--[^-]+-->.+?<!--[^-]+-->||s;
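
If you want the bit between the comments rather than everything around it, the same kind of regex can capture the middle instead of deleting it. Here is a rough sketch, assuming the two marker comments appear exactly as in the first post; the helper name navbar_section and the $sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the two navbar comments,
# or undef if the markers are not found.
sub navbar_section {
    my ($html) = @_;
    # /s lets . match newlines; .*? stops at the first End marker
    my ($middle) = $html =~ m{<!-- Begin Navbar Definition -->(.*?)<!-- End Navbar Definition -->}s;
    return $middle;
}

# Cut-down copy of the sample document, held in a string
my $sample = <<'HTML';
<BODY>
<!-- Begin Navbar Definition -->
<table>
<tr>
<td>Some Data</td>
</tr>
</table>
<!-- End Navbar Definition -->
</BODY>
HTML

my $section = navbar_section($sample);
print $section if defined $section;
```

This only looks for the one pair of markers, so it will not cope with nested or repeated navbar blocks.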
Re: [Paul] Re : Using HTML-TokeParser In reply to
Thanks for your response .... does this mean I don't need HTML-TokeParser module ?

If so how do I get the file assigned to a scalar ?

If I put it in an array and do a foreach over it, and then do the regex, I just get the filename back !

I just want a new file written with the contents of what is between the two <!-- navbar--> statements .....

Cheers once again for any pointers !

Rab
Re: [rab54] Re : Using HTML-TokeParser In reply to
To remove the middle section and keep the start and end bits, try this:
Code:
#!/usr/bin/perl
use warnings;
use strict;

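{
    local $/;    # undefine the record separator so <$in> reads the whole file
    open my $in, 'File.html' or die "Can't open File.html: $!";
    my $html = <$in>;
    close $in;

    # /s lets . match newlines, so the match can span several lines
    $html =~ s/<!-- Begin Navbar Definition.+End Navbar Definition -->//s;
    print $html;
}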


or this:
Code:
#!/usr/bin/perl
use warnings;
use strict;

open my $in, 'File.html' or die "Can't open File.html: $!";

while (<$in>) {
    # the range operator is true from the Begin line through the End line
    print unless /-- Begin Navbar Definition --/
              .. /-- End Navbar Definition --/;
}

close $in;


To just extract the middle bit, try this:
Code:
#!/usr/bin/perl
use warnings;
use strict;

open my $in, 'File.html' or die "Can't open File.html: $!";

while (<$in>) {
    # inside the Begin..End range, but skip the two marker lines themselves
    print if /-- Begin Navbar Definition --/
          .. /-- End Navbar Definition --/
         and !/-- (Begin|End) Navbar Definition --/;
}

close $in;

Cheers,
Dave.
Re: [utinni] Re : Using HTML-TokeParser In reply to
Just an FYI:

open my $in, ...

...is not backward compatible. It's a perl 5.6+ feature. Also you don't need to explicitly close the file handle using this method as it is done automatically.
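
To show what "done automatically" means here, a small sketch: a lexical file handle is closed when the variable goes out of scope. The temp file from File::Temp and the variable names are only there to make the example self-contained:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file so the sketch does not touch real data
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

{
    open my $out, '>', $path or die "Can't open $path: $!";
    print {$out} "Some Data\n";
    # No close here: $out is flushed and closed when it goes out of scope
}

# Read the file back to confirm the data was written
open my $in, '<', $path or die "Can't open $path: $!";
my $line = <$in>;
close $in;    # an explicit close is still allowed

print $line;
```

The automatic close only happens when the last reference to the handle goes away, so a handle declared at file level stays open until the program ends.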

Also as you don't need Begin/End you can change:

(Begin|End)

to:

(?:Begin|End)
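
A quick sketch of the difference, assuming a line that looks like the Begin marker from the sample (the variable names are made up):

```perl
use strict;
use warnings;

my $line = '<!-- Begin Navbar Definition -->';

# Capturing group: the alternative that matched is returned (and put in $1)
my @captured = $line =~ /-- (Begin|End) Navbar Definition --/;

# Non-capturing group: same match, nothing captured, so a successful
# match in list context just returns (1)
my @plain = $line =~ /-- (?:Begin|End) Navbar Definition --/;

print "captured: @captured\n";
print "plain:    @plain\n";
```

Both forms match exactly the same lines; the only difference is whether the matched word is saved.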

Last edited by:

Paul: Feb 12, 2003, 5:17 PM
Re: [Paul] Re : Using HTML-TokeParser In reply to
Quote:
open my $in, ...

...is not backward compatible. It's a perl 5.6+ feature

So is "use warnings". My code is for 5.6 and up since that's what I write all day.

And if you want to be picky, "my" is not backward compatible; it's a 5.0+ feature, but you have to draw the line somewhere, I guess.

Anyway 5.6 is a couple of years old by now, and even 5.8 has been around for a while - surely everyone's upgraded by now! ;)

Seriously though, if people round here don't use 5.6 that often then let me know and I'll adjust my posts accordingly.

Quote:
Also you don't need to explicitly close the file handle using this method as it is done automatically.

True, but good programming practice dictates that operations are balanced (less scope for bugs that way), and I like to promote good programming practice whenever possible. :)

Quote:
Also as you don't need Begin/End you can change:

(Begin|End)

to:

(?:Begin|End)

Spot on. Just personal preference in this case; I find the () construct easier on the eye than (?:) when there's no other capturing going on in the regex.


Cheers,
Dave.
Re: [utinni] Re : Using HTML-TokeParser In reply to
Cheers Dave for your help .... :)

This works a treat from the command line - but I need to output the stuff to a text file ....

I have tried

print FILE $in if /-- Begin Navbar Definition --/
.. /-- End Navbar Definition --/
and !/-- (Begin|End) Navbar Definition --/;

But I get this back

GLOB(0x1abefac)GLOB(0x1abefac)GLOB(0x1abefac)GLOB(0x1abefac)GLOB(0x1abefac)

in the output file :/

Can you shed any light on this ?

Cheers for your continued help ....

Rab
Re: [utinni] Re : Using HTML-TokeParser In reply to
Quote:
And if you want to be picky, "my" is not backward compatible; it's a 5.0+ feature, but you have to draw the line somewhere, I guess.

Well that's just silly ;)

Quote:
Anyway 5.6 is a couple of years old by now, and even 5.8 has been around for a while - surely everyone's upgraded by now!

I think you'd be surprised at how few people have access to 5.8.0.

Quote:
Seriously though, if people round here don't use 5.6 that often then let me know and I'll adjust my posts accordingly.

Well, it's just a matter of compatibility.

Quote:
Spot on. Just personal preference in this case; I find the () construct easier on the eye than (?:) when there's no other capturing going on in the regex.

Using (?:) skips the capture, which will improve performance in a more demanding regex.

Last edited by:

Paul: Feb 13, 2003, 4:16 AM
Re: [rab54] Re : Using HTML-TokeParser In reply to
$in is a GLOB reference (the input file handle), so printing it just prints the reference itself. You need to print $_ to the output file handle, or you can leave the argument out entirely and $_ will be used automatically.
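
Something along these lines should do it. This is only a sketch: the sub name extract_navbar and the output file name are made up, and it assumes each marker comment sits on its own line as in the sample:

```perl
use strict;
use warnings;

# Hypothetical helper: copies the lines between the navbar comments
# from $in_path to $out_path.
sub extract_navbar {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Can't open $in_path: $!";
    open my $out, '>', $out_path or die "Can't open $out_path: $!";

    while (<$in>) {
        # Print the current line ($_) to $out, not the input handle $in
        print {$out} $_
            if /-- Begin Navbar Definition --/ .. /-- End Navbar Definition --/
            and !/-- (?:Begin|End) Navbar Definition --/;
    }

    close $in;
    close $out or die "Can't close $out_path: $!";
    return;
}

# Example call; both file names are placeholders:
# extract_navbar('File.html', 'navbar.txt');
```

The braces around $out are optional, but they make it obvious which argument is the file handle and which is the data.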
Re: [Paul] Re : Using HTML-TokeParser In reply to
Cheers it works ! woo hoo ! :)

Many thanks once again

Rab
Re: [Paul] Re : Using HTML-TokeParser In reply to
Sorry to bug you once again ....

How does using the GLOB reference impact on me using the program through a browser .... as it does not seem to work .... it works fine from the cmd line .....

Cheers



Rab
Re: [rab54] Re : Using HTML-TokeParser In reply to
What behaviour do you see and do you get any errors?
Re: [Paul] Re : Using HTML-TokeParser In reply to
It turned out to be a permissions problem .... doh !

Works fine now .... cheers !!



Rab