Gossamer Forum

Stripping stuff from an htm file

I have a .htm file with the usual html tags and a single table in the body.

I want to include it in another .htm file that is being built with Perl, but I need to strip out everything except the table when I include it, so that I don't end up with multiple occurrences of the html, head and body tags.

My question is: would it be best to use several regexes to strip the tags (bearing in mind that the body tag has some attributes and I would also want to lose the text in between the title tags), or is there a regex way to strip out everything between and including two special tags that I can put in the page for this purpose?


Thanks for any help.

chmod

Last edited by chmod: Nov 2, 2002, 5:09 AM
Re: [chmod] Stripping stuff from an htm file
You could surround the content you want with a couple of comment tags such as:

Code:
<!-- KEEP -->
<table>
<tr>
<td>Hello</td>
</tr>
</table>
<!-- KEEP -->

Then with a regex you can do:

Code:
my ($wanted) = $string =~ /<!-- KEEP -->(.*?)<!-- KEEP -->/s;

$string would contain the entire file content.
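
To make that concrete, here is a minimal sketch of the whole step. The sample page held in $string is made up for illustration; only the marker comments and the regex come from the post above.

```perl
use strict;
use warnings;

# Hypothetical sample page; in practice $string would hold the .htm file's content.
my $string = <<'HTML';
<html>
<head><title>Old title</title></head>
<body bgcolor="#ffffff">
<!-- KEEP -->
<table>
<tr>
<td>Hello</td>
</tr>
</table>
<!-- KEEP -->
</body>
</html>
HTML

# The s modifier lets . match newlines, so (.*?) can span several lines.
# The non-greedy .*? stops at the first closing marker it finds.
my ($wanted) = $string =~ /<!-- KEEP -->(.*?)<!-- KEEP -->/s;

# If the markers are missing, the match fails and $wanted is undef.
die "KEEP markers not found\n" unless defined $wanted;

print $wanted;
```

To load a real file, something along the lines of `my $string = do { local $/; <$fh> };` on an opened filehandle should read the whole content in one go.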
Re: [Paul] Stripping stuff from an htm file
Thanks Paul, that works like a charm.

After keeping the code I want from the .htm page using your regex, I am also trying to swap out some code with this:
Code:
$wanted =~ s/<!--REPLACE_MENU-->(.*?\s*?)<!--REPLACE_MENU_END-->/<!--REPLACE_MENU-->$menutable<!--REPLACE_MENU_END-->/g;

Because the HTML spans multiple lines I added \s*? to the regex. It works, but I was wondering whether that is the best way to write the regex. I also have a problem with an HTML editor sometimes breaking up tags at spaces and spreading them over two lines, so I have replaced the spaces with _ in the tags to stop that happening.


EDIT: actually my regex works until something has been placed between the tags, then it fails to swap. Is there some code that will ignore everything and anything between the tags for swapping?

I'm trying (.*?\s*?\n*?\r*?\f*?) now.


Thanks again for your help, it's appreciated.

chmod

Last edited by chmod: Nov 3, 2002, 1:29 AM
Re: [chmod] Stripping stuff from an htm file
Quote:
I'm trying (.*?\s*?\n*?\r*?\f*?) now.

Well, I guess I'm wrong with that.
Jeez, regexes have to be the most confusing part of any language.

chmod
Re: [chmod] Stripping stuff from an htm file
Using (.*?) should be fine as long as you add the s modifier to the end of the regex, e.g.:

s/blah/blah/s;
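
Applied to the menu swap from earlier in the thread, that might look like the sketch below. The sample HTML and the contents of $menutable are made up for illustration. The likely reason the earlier attempts failed is that, without the s modifier, . does not match a newline, so .*? followed by \s*? can only cross line breaks at the very end of the captured text.

```perl
use strict;
use warnings;

# Hypothetical replacement markup; $menutable is the variable name used in the thread.
my $menutable = "<table><tr><td>New menu</td></tr></table>";

# Made-up fragment with multi-line content between the markers.
my $wanted = <<'HTML';
<!--REPLACE_MENU-->
<table>
<tr><td>Old menu</td></tr>
</table>
<!--REPLACE_MENU_END-->
<p>Rest of page</p>
HTML

# s: . also matches newlines, so .*? covers anything between the markers.
# g: every marker pair is replaced, not just the first one.
$wanted =~ s/<!--REPLACE_MENU-->.*?<!--REPLACE_MENU_END-->/<!--REPLACE_MENU-->$menutable<!--REPLACE_MENU_END-->/sg;

print $wanted;
```

The markers are written back in the replacement, so the same substitution can be run again on the result later.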