Gossamer Forum

regex troubles

Sometimes I'm amazed at the things I can accomplish with a simple regex, and other times I just feel like a moron. I've been struggling with this one for a good while and just can't get it to work. Basically, I need a regex that strips any empty HTML elements out of a string. They can have any amount of whitespace or other empty elements between the opening and closing tags, but no text, images, or other elements that contain text or images. I'm using PHP, but if any Perl wizards feel like giving it a whirl, I'm sure I can translate.

My embarrassingly crude and incomplete best effort so far follows:

Code:
$string = preg_replace("(<li>\s*</li>|<ul>\s*</ul>|<font[^>]*>( |\n|\r|\t)*</font>|<b>( |\n|\r|\t)*</b>|<td[^>]*>( |\n|\r|\t)*</td>|<tr[^>]*>( |\n|\r|\t)*</tr>|<table[^>]*>( |\n|\r|\t)*</table>)", "", $string);


This works for the <li> items, but that's about it. It seems to stop at that point.

Much gratitude to any kind soul who can lend a hand.

Fractured Atlas :: Liberate the Artist
Services: Healthcare, Fiscal Sponsorship, Marketing, Education, The Emerging Artists Fund
Re: [hennagaijin] regex troubles In reply to
An update in case anyone comes across this post: I came to the conclusion that this isn't possible with a single regex. It probably could be done with a longer subroutine/function using multiple regexes, but I found an alternative solution to the problem for my purposes. FWIW, I'd still be interested in taking a look if anyone has a PHP/Perl code snippet for doing this or something similar.

Re: [hennagaijin] regex troubles In reply to
I expect this won't work in PHP, but with Perl I think you can do something like:

Code:
$string =~ s#<(\w+)[^>]*>\s*</\1[^>]*>##sg;

That captures the tag name, e.g. "ul" or "li", in the first group; \1 then holds that matched value, so the pattern looks for the closing tag of the same name. If the closing tag follows with nothing but whitespace in between, the whole element is stripped out.

So, say you had:

<li></li>

...that would get removed, but:


<li><b>

...wouldn't.

Now, I think I'd better go and test that I got that right.
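
For PHP users: preg_replace uses Perl-compatible regular expressions and accepts the \1 backreference inside the pattern, so the same substitution should carry over more or less unchanged. A minimal sketch, assuming the markup is held in a variable called $string (the sample value here is made up for illustration):

```php
<?php
// Sample input, made up for illustration.
$string = '<li></li><b><font color="red">Stay here</font></b>';

// Same pattern as the Perl version: an opening tag, optional whitespace,
// then a closing tag with the same name (\1). The s flag is carried over
// from the Perl original; preg_replace replaces all matches by default,
// so no equivalent of Perl's g flag is needed.
$string = preg_replace('#<(\w+)[^>]*>\s*</\1[^>]*>#s', '', $string);

echo $string, "\n";
```

Single quotes around the pattern keep PHP from interpreting the backslashes before they reach the regex engine.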
Re: [hennagaijin] regex troubles In reply to
Sometimes I astound myself =)

It works. The following:

Code:
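my $string = q|<li></li><b><font color="red">Stay here</font></b>|;
# Remove any element whose body is empty or whitespace only; \1 is the captured tag name.
$string =~ s#<(\w+)[^>]*>\s*</\1[^>]*>##sg;

# Minimal CGI response header, then the cleaned-up markup.
print "Content-type: text/html\n\n";
print $string;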

Outputs:

<b><font color="red">Stay here</font></b>
Re: [Paul] regex troubles In reply to
Thanks for taking a stab at this. The question, though, is can it handle:

<table cellpadding="1" cellspacing="2" border="0"><tr valign="top"><td><font color="red" size="2">
<ul>
<li><b></b></li>
</ul>
</font></td></tr></table>

I'm looking for something to nuke all of that, since there's no "content" in any of those elements, just other empty elements. The only idea I've had so far is to run the string through a regex replace like yours several times, each pass catching one higher level of elements. But I haven't been able to get that to work...
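
The repeated-pass idea can be sketched in PHP roughly as follows. This is only one way to do it, assuming PHP 5.1 or later (for the $count argument of preg_replace); the function name strip_empty_elements is hypothetical. Each pass removes the innermost empty elements, and the loop stops once a pass removes nothing:

```php
<?php
// Hypothetical helper: repeat the single-pass replace until a pass
// removes nothing, so each pass peels off one more level of nesting.
function strip_empty_elements($html)
{
    // Opening tag, optional whitespace, closing tag of the same name (\1).
    $pattern = '#<(\w+)[^>]*>\s*</\1[^>]*>#s';
    do {
        // $count receives the number of replacements made in this pass.
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            break; // regex engine error: keep what we have so far
        }
        $html = $result;
    } while ($count > 0);
    return $html;
}

// The nested sample from the post above.
$sample = '<table cellpadding="1" cellspacing="2" border="0"><tr valign="top"><td><font color="red" size="2">
<ul>
<li><b></b></li>
</ul>
</font></td></tr></table>';

var_dump(strip_empty_elements($sample) === '');
```

On that sample the loop should work inward to outward (b, li, ul, font, td, tr, table) and leave an empty string, while an element that contains any text is left alone. It still relies on a regex rather than a real HTML parser, so attribute values containing ">" and similar edge cases are not handled.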

The really "advanced" challenge would be figuring out how to make it work even if elements are improperly nested.

Anyway - thanks for your help.
