Gossamer Forum: General: Perl Programming: Re : Using HTML-TokeParser

Feb 6, 2003, 10:04 AM

rab54

Novice (8 posts)

Post #1 of 13

Re : Using HTML-TokeParser

Hi gurus,

I am trying to parse the following:

Code:
<HTML> 
<HEAD> 
<TITLE>TEST</TITLE> 
<BODY>  

<!-- Begin Navbar Definition -->  
<table>  
<tr>  
<td>Some Data</td>  
</tr>  
</table>  
<!-- End Navbar Definition -->   


</BODY> 
</HTML>

I need to strip out the bit between the two Navbar Definition comments .....

I have tried using HTML TokeParser (and Simple) - I can get either ALL the document .... or just the actual comment lines (the two Navbars) but nothing in between ....

Can anyone help me please ....

TIA

Rab

Last edited by:

Wil: Feb 7, 2003, 8:08 AM

Feb 6, 2003, 10:10 AM

Paul

Veteran (19537 posts)

Post #2 of 13

Re: [rab54] Re : Using HTML-TokeParser

If that's all you need to do, it's probably just as simple to do it manually.

This code assumes those are the only two comments you have in the page:

Code:
# Assumes neither comment has a dash inside its text
$html =~ s|<!--[^-]+-->.+?<!--[^-]+-->||s;
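
For illustration, a minimal sketch of this kind of substitution run against a cut-down copy of the sample page. It is written with the full `<!--` and `-->` comment delimiters, and the helper name `strip_between_comments` and the `$sample` string are made up for this example. It shares the assumption stated above: the page contains only the two comments, and neither has a dash inside its text.

```perl
use strict;
use warnings;

# Hypothetical helper: removes everything from the first comment through
# the second comment, inclusive. Assumes the comment text contains no '-'.
sub strip_between_comments {
    my ($html) = @_;
    $html =~ s|<!--[^-]+-->.+?<!--[^-]+-->||s;   # /s lets . match newlines
    return $html;
}

# Cut-down copy of the page from the first post.
my $sample = join "\n",
    '<BODY>',
    '<!-- Begin Navbar Definition -->',
    '<table><tr><td>Some Data</td></tr></table>',
    '<!-- End Navbar Definition -->',
    '</BODY>',
    '';

print strip_between_comments($sample);
```

Note that this removes the navbar and keeps the rest of the page, so it is the opposite of extracting the navbar on its own.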

Feb 7, 2003, 9:59 AM

rab54

Novice (8 posts)

Post #3 of 13

Re: [Paul] Re : Using HTML-TokeParser

Thanks for your response .... does this mean I don't need the HTML-TokeParser module?

If so, how do I get the file assigned to a scalar?

If I put it in an array and do a foreach over it, and then do the regex, I just get the filename back!

I just want a new file written with the contents of what is between the two comment statements .....

Cheers once again for any pointers !

Rab

Feb 12, 2003, 3:28 PM

utinni

Novice (11 posts)

Post #4 of 13

Re: [rab54] Re : Using HTML-TokeParser

To remove the middle section and keep the start and end bits, try this:

Code:
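#!/usr/bin/perl
use warnings;
use strict;

{
    local $/;   # no record separator, so one read returns the whole file
    open my $in, '<', 'File.html' or die "Can't open File.html: $!";
    my $html = <$in>;
    close $in;

    # Remove everything from the Begin comment through the End comment
    $html =~ s/<!-- Begin Navbar Definition.+End Navbar Definition -->//s;
    print $html;
}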

or this:

Code:
#!/usr/bin/perl
use warnings;
use strict;

open my $in, '<', 'File.html' or die "Can't open File.html: $!";

while(<$in>) {
    # The range operator is true from the Begin line through the End line
    print unless /-- Begin Navbar Definition --/
              .. /-- End Navbar Definition --/;
}

close $in;

To just extract the middle bit, try this:

Code:
#!/usr/bin/perl
use warnings;
use strict;

open my $in, '<', 'File.html' or die "Can't open File.html: $!";

while(<$in>) {
    # Print lines inside the range, but skip the two comment lines themselves
    print if /-- Begin Navbar Definition --/
          .. /-- End Navbar Definition --/
          and !/-- (Begin|End) Navbar Definition --/;
}

close $in;

Cheers,
Dave.

Feb 12, 2003, 5:08 PM

Paul

Veteran (19537 posts)

Post #5 of 13

Re: [utinni] Re : Using HTML-TokeParser

Just an FYI:

open my $in, ...

...is not backward compatible. It's a Perl 5.6+ feature. Also, you don't need to explicitly close the file handle with this method, as Perl closes it automatically when $in goes out of scope.
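
A small sketch of the automatic close, assuming Perl 5.6 or later. The helper names `write_without_close` and `slurp` are hypothetical. The first relies on Perl flushing and closing a lexical file handle when it goes out of scope; the second reads a whole file into one scalar, again without an explicit close.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: writes $text and deliberately omits close().
sub write_without_close {
    my ($path, $text) = @_;
    open my $out, '>', $path or die "Can't write $path: $!";
    print {$out} $text;
    return;   # $out goes out of scope here, so Perl flushes and closes it
}

# Hypothetical helper: reads a whole file into one scalar.
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "Can't open $path: $!";
    local $/;              # no record separator: one read returns everything
    return scalar <$in>;   # $in is closed automatically on return
}

my $dir  = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my $path = "$dir/demo.txt";
write_without_close($path, "line one\nline two\n");
print slurp($path);
```

An explicit close is still the only way to check whether the close itself failed, for example when writing to a full disk.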

Also, as you don't need to capture Begin/End, you can change:

(Begin|End)

to:

(?:Begin|End)
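
To see the difference, a quick sketch that counts the capture groups in each form. The helper `group_count` is hypothetical; `$#+` is Perl's built-in count of the groups in the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of capture groups in the last
# successful match, or -1 if $line does not match $re at all.
sub group_count {
    my ($line, $re) = @_;
    return $line =~ $re ? $#+ : -1;
}

my $line = '<!-- Begin Navbar Definition -->';

# Capturing group: the matched word is saved in $1.
print group_count($line, qr/-- (Begin|End) Navbar Definition --/), "\n";    # 1

# Non-capturing group: same match, nothing is saved.
print group_count($line, qr/-- (?:Begin|End) Navbar Definition --/), "\n";  # 0
```

Both patterns match exactly the same lines; only the bookkeeping differs.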

Last edited by:

Paul: Feb 12, 2003, 5:17 PM

Feb 12, 2003, 11:12 PM

utinni

Novice (11 posts)

Post #6 of 13

Re: [Paul] Re : Using HTML-TokeParser

Quote:

open my $in, ...

...is not backward compatible. It's a perl 5.6+ feature

So is "use warnings". My code is for 5.6 and up since that's what I write all day.

And if you want to be picky, "my" is not backward compatible; it's a 5.0+ feature, but you have to draw the line somewhere, I guess.

Anyway, 5.6 is a couple of years old by now, and even 5.8 has been around for a while - surely everyone's upgraded by now! ;)

Seriously though, if people round here don't use 5.6 that often then let me know and I'll adjust my posts accordingly.

Quote:

Also you don't need to explicitly close the file handle using this method as it is done automatically.

True, but good programming practice dictates that operations are balanced (less scope for bugs that way), and I like to promote good programming practice whenever possible. :)

Quote:

Also as you don't need Begin/End you can change:

(Begin|End)

to:

(?:Begin|End)

Spot on. Just personal preference in this case; I find the () construct easier on the eye than (?:) when there's no other capturing going on in the regex.

Cheers,
Dave.

Feb 13, 2003, 2:10 AM

rab54

Novice (8 posts)

Post #7 of 13

Re: [utinni] Re : Using HTML-TokeParser

Cheers Dave for your help .... :)

This works a treat from the command line - but I need to output the stuff to a text file ....

I have tried

Code:
print FILE $in if /-- Begin Navbar Definition --/
               .. /-- End Navbar Definition --/
               and !/-- (Begin|End) Navbar Definition --/;

But I get this back

GLOB(0x1abefac)GLOB(0x1abefac)GLOB(0x1abefac)GLOB(0x1abefac)GLOB(0x1abefac)

in the output file :/

Can you shed any light on this ?

Cheers for your continued help ....

Rab

Feb 13, 2003, 2:55 AM

Paul

Veteran (19537 posts)

Post #8 of 13

Re: [utinni] Re : Using HTML-TokeParser

Quote:

And if you want to be picky, "my" is not backward compatible; it's a 5.0+ feature, but you have to draw the line somewhere, I guess.

Well, that's just silly ;)

Quote:

Anyway 5.6 is a couple of years old by now, and even 5.8 has been around for a while - surely everyone's upgraded by now!

I think you'd be surprised at how few people have access to 5.8.0.

Quote:

Seriously though, if people round here don't use 5.6 that often then let me know and I'll adjust my posts accordingly.

Well, it's just a matter of compatibility.

Quote:

Spot on. Just personal preference in this case; I find the () construct easier on the eye than (?:) when there's no other capturing going on in the regex.

Using (?:...) means the regex engine doesn't have to save the matched text, which can improve performance in a more demanding regex.

Last edited by:

Paul: Feb 13, 2003, 4:16 AM

Feb 13, 2003, 4:15 AM

Paul

Veteran (19537 posts)

Post #9 of 13

Re: [rab54] Re : Using HTML-TokeParser

$in is a GLOB reference (the input file handle itself), which is why the GLOB(0x1abefac) text ends up in your file. You need to print $_ to the file handle, or you can leave it out entirely and $_ will be used automatically.
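
A sketch of the corrected version, assuming the extracted lines should go to a separate file. The sub name `extract_navbar` and its two path arguments are made up for illustration; the point is that the current line in `$_` is printed to the output handle, not the input handle `$in`.

```perl
use strict;
use warnings;

# Hypothetical helper: copies the lines between the two Navbar comments
# from $in_path to $out_path, leaving out the comment lines themselves.
sub extract_navbar {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Can't open $in_path: $!";
    open my $out, '>', $out_path or die "Can't write $out_path: $!";

    while (<$in>) {
        # Print the current line ($_) to the output handle.
        print {$out} $_
            if /-- Begin Navbar Definition --/ .. /-- End Navbar Definition --/
            and !/-- (?:Begin|End) Navbar Definition --/;
    }

    close $in;
    close $out or die "Can't close $out_path: $!";
    return;
}

# Usage: perl script.pl File.html navbar.txt
extract_navbar(@ARGV) if @ARGV == 2;
```

Writing `print {$out};` with no argument would do the same thing, since `$_` is the default.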

Feb 13, 2003, 4:18 AM

rab54

Novice (8 posts)

Post #10 of 13

Re: [Paul] Re : Using HTML-TokeParser

Cheers, it works! Woo hoo! :)

Many thanks once again

Rab

Feb 13, 2003, 4:35 AM

rab54

Novice (8 posts)

Post #11 of 13

Re: [Paul] Re : Using HTML-TokeParser

Sorry to bug you once again ....

How does using the GLOB reference impact on me using the program through a browser .... as it does not seem to work .... it works fine from the cmd line .....

Cheers

Rab

Feb 13, 2003, 7:30 AM

Paul

Veteran (19537 posts)

Post #12 of 13

Re: [rab54] Re : Using HTML-TokeParser

What behaviour do you see and do you get any errors?

Feb 13, 2003, 8:26 AM

rab54

Novice (8 posts)

Post #13 of 13

Re: [Paul] Re : Using HTML-TokeParser

It turned out to be a permissions problem .... doh !

Works fine now .... cheers !!

Rab