Gossamer Forum: General: Perl Programming: pattern matching problem

Jun 17, 2004, 4:21 AM

Lex

User (142 posts)

Post #1 of 8

pattern matching problem

Hi,

I'm trying to get rid of all <br> tags in between <pre> and </pre> tags...

Now why doesn't this work:

Code:
	$rec{'Text'} =~ s|<pre>(.*?)<br>(.*?)</pre>|<pre>$1 $2</pre>|gis;

Thanks!

Jun 17, 2004, 6:11 AM

Andy

Veteran / Moderator (18441 posts)

Post #2 of 8

Re: [Lex] pattern matching problem

Hi. You need to escape things like < and >. Something like this should work;

Code:
$rec{'Text'} =~ s|\<pre\>(.*?)\<br\>(.*?)\</pre\>|<pre>$1 $2</pre>|gis;

Cheers

Andy (mod)
andy@ultranerds.co.uk

Jun 17, 2004, 7:16 AM

Lex

User (142 posts)

Post #3 of 8

Re: [Andy] pattern matching problem

In Reply To:

Hi. You need to escape things like < and >. Something like this should work;

Code:
$rec{'Text'} =~ s|\<pre\>(.*?)\<br\>(.*?)\</pre\>|<pre>$1 $2</pre>|gis;

Thanks Andy, but it doesn't. And just above it, I have:

Code:
	$rec{'Text'} =~ s|</blokrechts>|</span>|gis;  
	$rec{'Text'} =~ s|<pre>(.*?)<br>(.*?)</pre>|<pre>$1 $2</pre>|gis;

The first one works fine so I guess it's something to do with the way I try to get rid of the <br>'s...
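
One possible explanation, sketched here under the assumption that a block holds several <br> tags (the sample string is made up for illustration): each match of the pattern consumes the whole <pre>...</pre> block, so /g has nothing left to rescan and only the first <br> per block gets replaced.

```perl
use strict;
use warnings;

# Hypothetical sample: one <pre> block containing three <br> tags.
my $text = "<pre>a<br>b<br>c<br>d</pre>";

# The substitution quoted above, applied to a copy of the sample.
(my $once = $text) =~ s|<pre>(.*?)<br>(.*?)</pre>|<pre>$1 $2</pre>|gis;

# Count the <br> tags that survive the substitution.
my $left = () = $once =~ /<br>/gi;

print "$once\n";
print "remaining <br> tags: $left\n";
```

Running the substitution in a loop until it stops matching would also work, but a nested substitution or a parser is usually easier to reason about.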

Jun 17, 2004, 7:35 AM

Andy

Veteran / Moderator (18441 posts)

Post #4 of 8

Re: [Lex] pattern matching problem

Trust me... you need to escape the <>, just like you would with [] or ., or else it will be treated as a non-character and, instead of being a direct match, it will be taken as its regex equivalent.

What if you change the ending to one of these:

Code:
\Q<pre>$1 $2</pre>

Code:
\<pre\>$1 $2\</pre\>

?

Andy (mod)
andy@ultranerds.co.uk

Jun 17, 2004, 8:40 AM

Lex

User (142 posts)

Post #5 of 8

Re: [Andy] pattern matching problem

Well it still doesn't work then. I've figured out where the problem is now, but don't know how to solve it.

I'll paste a bit of HTML where the <br>'s should be taken away; the problem is the line endings etc.

Instead of:

Code:
$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

I tried:

Code:
$rec{'Text'} =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $2</pre>%gim;

But that would just erase everything between <pre> and </pre> in the next example:

Code:
<br><b>Medische reden WAO-uitkering, in percentages</b>  
<br><pre> 
                                    Turken  Marokkanen  Nederlanders 
<br>Klachten aan het bewegingsapparaat  36      35          36 
<br>Psychische klachten                 23      26          27 
<br>Overig                              41      39          37 
<br></pre>
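
A minimal sketch of what likely goes wrong in both attempts, using a shortened, hypothetical sample. Without /s the dot never matches a newline (/m only changes ^ and $), and the extra parentheses in (.|\n) renumber the groups: $2 becomes the inner group, and the text after the first <br> lands in $3, which the replacement never uses.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample with newlines inside the <pre> block.
my $text = "<pre>\nhead\n<br>row1\n<br>row2\n</pre>";

# Attempt 1: no /s, so "." stops at the newline right after <pre>
# and the pattern never matches; the text is left unchanged.
(my $with_m = $text) =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

# Attempt 2: (.|\n) crosses newlines, but each inner group also captures.
# $1 = text before the first <br>, $2 = the last single character of it,
# $3 = everything after the first <br> (dropped by the replacement).
(my $nested = $text) =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $2</pre>%gim;

# With non-capturing groups (?:...) the numbering is $1 and $2 again.
# Note this still removes only the first <br> in the block.
(my $fixed = $text) =~ s%<pre>((?:.|\n)*?)<br>((?:.|\n)*?)</pre>%<pre>$1 $2</pre>%gi;

print "attempt 1 changed the text: ", ($with_m eq $text ? "no" : "yes"), "\n";
print "attempt 2:\n$nested\n";
print "non-capturing:\n$fixed\n";
```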

Hmmm... So what I actually want the script to do is the following:

Look for <pre> and </pre> and erase every <br> found between them, whatever else is in there. However: leave the rest!

But I don't know how to do it properly.

(still studying 'Programming Perl')
Thanks for your time anyway!

Jun 18, 2004, 12:44 AM

yogi

Veteran (2199 posts)

Post #6 of 8

Re: [Lex] pattern matching problem

Good morning

Here is a solution using HTML::TokeParser::Simple, adapted from http://www.tek-tips.com/...m/pid/219/qid/861625

Code:
#!/usr/bin/perl 
use strict; 
use HTML::TokeParser::Simple; 

my $in = q| 
<br><b>Medische reden WAO-uitkering, in percentages</b>   
<br><pre>  
                                    Turken  Marokkanen  Nederlanders  
<br>Klachten aan het bewegingsapparaat  36      35          36  
<br>Psychische klachten                 23      26          27  
<br>Overig                              41      39          37  
<br></pre> 
|; 

my $p = HTML::TokeParser::Simple->new(\$in); 

my $pre = 0; 

while (my $token = $p->get_token) { 
    $pre++ if $token->is_start_tag('pre'); 
    $pre-- if $token->is_end_tag('pre'); 
    next if $pre and $token->is_tag('br'); 
    print $token->as_is; 
}

The output is

Code:
<br><b>Medische reden WAO-uitkering, in percentages</b>   
<br><pre>  
                                    Turken  Marokkanen  Nederlanders  
Klachten aan het bewegingsapparaat  36      35          36  
Psychische klachten                 23      26          27  
Overig                              41      39          37  
</pre>

Ivan
-----
Iyengar Yoga Resources / GT Plugins

Jun 18, 2004, 1:17 AM

Lex

User (142 posts)

Post #7 of 8

Re: [yogi] pattern matching problem

Good morning Yogi,

I'll study this and try to start using a parser more often. However, in this case I managed to get the following working (with help from a newsgroup):

Code:
     $rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{ 
         (my $rest = $1) =~ s/<br.*?>//gis; 
         $rest 
     }egis;

It's pretty safe as there is nothing more than <br> in between the <pre> and </pre> tags.
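
For reference, the same idea as a self-contained script. This is only a sketch: the helper name strip_br_in_pre and the sample string are hypothetical, and the patterns are tightened slightly (\b[^>]*) on the assumption that a tag such as <break> should not be treated as <br>; the version above uses .*? instead.

```perl
use strict;
use warnings;

# strip_br_in_pre is a hypothetical helper name, not from the thread.
# It removes every <br ...> tag inside <pre>...</pre> and leaves the rest alone.
sub strip_br_in_pre {
    my ($html) = @_;
    # Outer substitution: find each <pre> block; /e runs the replacement as code.
    $html =~ s{(<pre\b[^>]*>.*?</pre>)}{
        # Inner substitution: work on a copy of the block and drop its <br> tags.
        (my $block = $1) =~ s/<br\b[^>]*>//gi;
        $block;
    }egis;
    return $html;
}

# Made-up sample: <br> tags both outside and inside the <pre> block.
my $in  = "<br>intro\n<br><pre>\nhead\n<br>row1\n<br>row2\n<br></pre>\n<br>outro";
my $out = strip_br_in_pre($in);

print $out, "\n";
```

Only the <br> tags inside the block are removed; the ones before and after it survive. Like any regex approach, this assumes <pre> blocks are not nested.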

But thanks a lot for your code, it's good to be told there are other directions to look at.

Thanks,

Lex

Jul 17, 2004, 7:14 AM

BlueBottle

New User (4 posts)

Post #8 of 8

Re: [Andy] pattern matching problem

Quote:

Mmm, no you don't :) ... < and > are not metacharacters. You were right about escaping [ ] though.
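
This is easy to check with a few one-line matches. The strings below are made up for illustration: escaped and unescaped angle brackets match the same text, while the dot and square brackets do change meaning when left unescaped.

```perl
use strict;
use warnings;

my $text = "<pre>x</pre>";

# < and > are ordinary characters in a Perl regex; escaping them is harmless but unnecessary.
my $plain   = $text =~ /<pre>/   ? 1 : 0;
my $escaped = $text =~ /\<pre\>/ ? 1 : 0;

# The dot is a metacharacter: unescaped it matches any character except a newline.
my $dot_any     = "axb" =~ /a.b/  ? 1 : 0;
my $dot_literal = "axb" =~ /a\.b/ ? 1 : 0;

# Square brackets are metacharacters too: [1] is a character class, \[1\] is literal text.
my $class   = "a[1]" =~ /a[1]/   ? 1 : 0;
my $literal = "a[1]" =~ /a\[1\]/ ? 1 : 0;

print "plain=$plain escaped=$escaped\n";
print "dot_any=$dot_any dot_literal=$dot_literal\n";
print "class=$class literal=$literal\n";
```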

Last edited by:

BlueBottle: Jul 17, 2004, 7:14 AM