Gossamer Forum
how to delete html out of a $value

Hello,

I'm working on a script, and I now have the following line as $value:
" This is an example <P> line SPACE SPACE SPACE for demonstrating purposes only."

Now I would like to remove the <P> (and all other HTML), and all the spaces.

Can anyone help me out?

Because this won't work:

$value =~ s/<//g;
$value =~ s/>//g;
$value =~ s/ //g;


Re: how to delete html out of a $value
In general, you can't parse HTML properly with regular expressions. For simple cases (perhaps 95%), you'll be all right, but I can produce a case that is valid HTML that your regexp won't handle properly. This is why you should be using the HTML::Parse module.

However, if you don't have it, or are just looking to do something simple, a regexp is fine. Just be aware of cases like this one:

$input = '<img src="/blah.jpg" alt="<cool image>">';
$input =~ s/<.*?>//gs;
print "Output: $input";

Output: ">

which is not what you want; it should be blank.
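
If you do stay with a regexp, one common workaround is to treat quoted attribute values as a unit, so a ">" inside quotes doesn't end the tag early. Here's a minimal sketch; the name strip_tags is just something I made up for illustration, and it will still trip over comments, script blocks and unbalanced quotes:

```perl
use strict;
use warnings;

# Strip tags, treating "..." and '...' attribute values as opaque chunks
# so that a ">" inside quotes does not terminate the tag.
# strip_tags is a hypothetical helper name, not part of any module.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;
    return $text;
}

my $input    = '<img src="/blah.jpg" alt="<cool image>">';
my $stripped = strip_tags($input);
print "Output: [$stripped]\n";
```

For anything beyond simple, well-formed tags, a real parser is still the safer choice.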

As for removing extra spaces, how about:

$input =~ s/\s+/ /g;
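
To make the whitespace cleanup concrete, here's a small sketch that collapses runs of whitespace and also trims the ends. The helper name squeeze_spaces is made up for this example. Keep in mind that \s only means "whitespace" on the pattern side of s///; the replacement side has to be a literal space.

```perl
use strict;
use warnings;

# squeeze_spaces is a hypothetical helper name used only for this example.
sub squeeze_spaces {
    my ($text) = @_;
    $text =~ s/\s+/ /g;     # any run of whitespace becomes a single space
    $text =~ s/^ | $//g;    # drop the one space left at either end, if any
    return $text;
}

my $clean = squeeze_spaces("  This is   an example\n line.  ");
print "[$clean]\n";
```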

Hope this helps,

Alex
Re: how to delete html out of a $value
I would take a look at the HTML::Parse module, available on CPAN.

No need to reinvent the wheel :)

--Mark

------------------
You can reach me by ICQ at UID# 8602162

Re: how to delete html out of a $value
I definitely agree with Mark. You can also use the HTML::FormatText module to output the text in a slightly neater format. Here is an example:

use HTML::Parse;        # "use" (not "require") so parse_htmlfile gets imported
use HTML::FormatText;

# Parse the file into a syntax tree; parse_htmlfile returns undef on failure
$html = parse_htmlfile("test.html") or die "Can't open test.html: $!";
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);

# Print the formatted text, not the formatter object itself
print $formatter->format($html);

This first parses your HTML, then the formatter renders it as plain text wrapped between the columns given by leftmargin and rightmargin. Make sure you print the result of format() and not $formatter itself, which is only the formatter object.

Hope this helps.



------------------
Fred Hirsch
Web Consultant & Programmer
Re: how to delete html out of a $value
But I need it for this piece of code:

(...and I decided I only really need to delete <P> and the spaces, nothing more.)

$counter = 1;
foreach my $elem (@gesorteerd) {
    my($url,$headline) = $elem =~ m|<a HREF="(.*?)"(?:.*?)>\s*(.*?)\s*</a>|sgi;

    if ($headline =~ /cd-rom|CD|CD-ROM|CD ROM/){
        $output .= "<!-- this headline contained an illegal word -->";
    } else {

        $headline =~ s/<P>//g;
        $headline =~ s/<p>//g;

        # So here I need an extra addition to this piece of code,
        # so that the script also removes the extra spaces.

        $output .= "<br>$counter - <a href=\"$url\" target=\"_new\">$headline</a>\n";

        $counter++;

    }
}
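
For what it's worth, a minimal sketch of how that missing step might look inside the loop. The sample entry in @gesorteerd is invented just so the snippet runs on its own, and the illegal-word check is left out to keep it short:

```perl
use strict;
use warnings;

# Hypothetical sample data; in the real script @gesorteerd is filled elsewhere.
my @gesorteerd = (
    '<a HREF="http://example.com/1">  Big <P>  news   today </a>',
);

my $output  = '';
my $counter = 1;
foreach my $elem (@gesorteerd) {
    my ($url, $headline) = $elem =~ m|<a HREF="(.*?)"(?:.*?)>\s*(.*?)\s*</a>|si;
    next unless defined $headline;    # skip entries that contain no link

    $headline =~ s/<p>//gi;           # /i removes both <P> and <p>
    $headline =~ s/\s+/ /g;           # collapse runs of whitespace to one space
    $headline =~ s/^ | $//g;          # trim a leftover space at either end

    $output .= "<br>$counter - <a href=\"$url\" target=\"_new\">$headline</a>\n";
    $counter++;
}
print $output;
```

The order matters a little: removing <P> first can leave extra spaces behind, so the whitespace collapse should come after it.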

[This message has been edited by chrishintz (edited January 08, 1999).]