Gossamer Forum: General: Perl Programming: regex help please

Jan 25, 2004, 5:42 PM

ronzo

Novice (37 posts)

Post #1 of 6

regex help please

Hello,

New to regex and my head is spinning more than the girl in the "Exorcist".

I'm trying to delete a bunch of html code. For example, this seems to work for deleting all the stuff between the first beginning and ending center tags:

Code:
$remove_center =~ s/<center>(.*?)<\/center>//g;

After those center tags, there are other center tags further down the html page, with a bunch of other code and text in between, all of which I'd like to delete. So any ideas how to remove everything between the first <center> tag and say maybe the third </center> tag?

In the same vein, is there a simple regex for removing everything in a html page between the first instance of a beginning string and the first instance of another string regardless of the amount of text, lines or code between?

That is, say the first and last strings are:

Code:
<form method="post" action="something.cgi" blah blah blah>

which is the first string, and there is a whole bunch of code and text ending with the last string of:

Code:
</form>

I've tried more regex expressions than I can remember, and none work (or produce unexpected results), although it's certainly a learning experience.

Any help would be appreciated.

Thanks,
ronzo

Jan 26, 2004, 2:11 AM

Andy

Veteran / Moderator (18441 posts)

Post #2 of 6

Re: [ronzo] regex help please

I'm not sure I quite understand what you are asking... but something like this?

Code:
#!/usr/bin/perl 

use strict; 

my $string = q| 

<p>just a test</p> 
<center>something here 
and some more 
or here</center><BR><BR> 
<center>something here 
and some more 
or here</center> 

|; 

# The "/" in </center> must be escaped because "/" is the regex delimiter.
$string =~ s/<center>(.*?)<\/center>//sg;

print "Content-type: text/html \n\n"; 
print $string;

That should return something like:

Quote:
<p>just a test</p>
<BR><BR>

This is untested, but it should work :)
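For comparison, here is a minimal sketch of the same per-block removal, written with s{...}{} delimiters so the "/" in the closing tag needs no escaping. The helper name strip_center_blocks and the sample string are made up for illustration. Note that anything sitting between two center blocks is left in place, because the non-greedy ".*?" stops at the nearest closing tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove each
# <center>...</center> block on its own.
#   .*?  non-greedy, stops at the nearest </center>
#   /s   lets "." match newlines, so a block may span lines
#   /g   repeats the substitution for every block found
sub strip_center_blocks {
    my ($html) = @_;
    $html =~ s{<center>.*?</center>}{}sg;
    return $html;
}

# Assumed sample input: two center blocks with <BR><BR> between them.
my $sample = "<p>just a test</p>\n"
           . "<center>something here\nand some more</center><BR><BR>\n"
           . "<center>or here</center>\n";

print strip_center_blocks($sample);
```

The <BR><BR> between the two blocks survives, which is the behaviour to check for if the goal is to remove a whole region rather than individual blocks.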

Cheers

Andy (mod)
andy@ultranerds.co.uk

Last edited by:

Andy: Jan 26, 2004, 8:41 AM

Jan 26, 2004, 8:38 AM

ronzo

Novice (37 posts)

Post #3 of 6

Re: [Andy] regex help please

Hi Andy,

I was wrong... the regex I gave doesn't remove what I wanted. Here's an example of what I'm trying to do:

Code:
<center>begin html removal</center> 


  <table border="0" width="75%"> 
    <tr> 
      <td width="100%">table #1</td> 
    </tr> 
  </table> 


<center>a lot more text and html tags here<br> 
and here</center> 

  <table border="1" width="100%"> 
    <tr> 
      <td width="100%">table #2</td> 
    </tr> 
    <tr> 
      <td width="100%">&nbsp;</td> 
    </tr> 
  </table> 

<center>end html removal</center>

<p>and then from here on is stuff that will stay</p>

So basically I want to delete everything on the page from the first opening center tag (the "begin html removal" line) through the final closing center tag (the "end html removal" line). I could do this line by line, but I'd rather just grab everything all at once and get rid of it if that's possible.

Hope that explains it better.
Thanks,
ronzo

Jan 26, 2004, 8:45 AM

Andy

Veteran / Moderator (18441 posts)

Post #4 of 6

Re: [ronzo] regex help please

What about the modified version? Otherwise I'm afraid I don't have any more ideas :(

Cheers

Andy (mod)
andy@ultranerds.co.uk

Jan 26, 2004, 5:26 PM

fuzzy logic

Enthusiast (854 posts)

Post #5 of 6

Re: [ronzo] regex help please

When you take out the "?" in the "match anything" expression (so ".*?" becomes ".*"), you tell Perl to be greedy and match as much as it can, rather than as little as it can. The "s" modifier makes "." match newlines as well, which is required for the match to span multiple lines. The "g" modifier is not required, since we only need to match once.

Try this:

Code:
my $html = qq| 
 <center>begin html removal</center> 
 <table border="0" width="75%"> 
   <tr> 
     <td width="100%">table #1</td> 
   </tr> 
 </table> 
 <center>a lot more text and html tags here<br> and here</center> 
 <table border="1" width="100%"> 
   <tr> 
     <td width="100%">table #2</td> 
   </tr> 
   <tr>   
     <td width="100%">&nbsp;</td> 
   </tr> 
 </table> 
 <center>end html removal</center> 
 <p>and then from here on is stuff that will stay</p> 
|; 

$html =~ s,(<center>.*</center>),,s; 
print $html;
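The greedy match above runs from the first <center> to the last </center> on the page, so it only does the right thing when no later center block needs to stay. As a rough sketch of the alternatives asked about earlier in the thread, the helpers below stop at the N-th closing tag, or at the first occurrence of an arbitrary end string. The function names and sample data are hypothetical, not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy: first <center> through the LAST </center> in the string.
sub remove_first_to_last {
    my ($html) = @_;
    $html =~ s{<center>.*</center>}{}s;
    return $html;
}

# Lazy group repeated $n times: first <center> through the $n-th
# </center> that follows it. No match (no change) if there are fewer.
sub remove_first_to_nth {
    my ($html, $n) = @_;
    $html =~ s{<center>(?:.*?</center>){$n}}{}s;
    return $html;
}

# First occurrence of $start through the first $end after it.
# \Q...\E quotes both strings so they are matched literally.
sub remove_between {
    my ($html, $start, $end) = @_;
    $html =~ s{\Q$start\E.*?\Q$end\E}{}s;
    return $html;
}

# Assumed sample page: three center blocks to drop, one to keep.
my $page = join "\n",
    '<center>begin</center>',
    '<table><tr><td>table #1</td></tr></table>',
    '<center>middle</center>',
    '<center>end</center>',
    '<p>stays</p>',
    '<center>later</center>',
    '';

# Assumed sample page with a form to drop.
my $form_page = qq{<p>before</p>\n<form method="post" action="something.cgi">\n}
              . qq{<input name="q">\n</form>\n<p>after</p>\n};

print remove_first_to_nth($page, 3);
print remove_between($form_page, '<form method="post"', '</form>');
```

On this sample, remove_first_to_last would also swallow the <p>stays</p> line, because the last </center> comes after it; remove_first_to_nth with 3 leaves it alone. Regexes like these are fine for one-off cleanup of known pages, but they do not understand nested or malformed tags.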

Philip
------------------
Limecat is not pleased.

Last edited by:

fuzzy logic: Jan 26, 2004, 5:28 PM

Jan 26, 2004, 9:05 PM

ronzo

Novice (37 posts)

Post #6 of 6

Re: [fuzzy logic] regex help please

Hi Philip,

Thanks, that works great!

Just picked up the O'Reilly regex book by Friedl, and found a few good web tutorials. Looks like a potentially long learning curve, but it's certainly going to be worth it, especially after seeing how powerful regex can be.

Thanks again, and thanks also to Andy.

ronzo