[Perl] Multiline Regular Expression Searches

Hello, all. Can someone give me the code to find this pattern in a string?

<img (anything) ad (anything) >

I need the search to be case insensitive, and to include carriage returns and such in it. What I have currently does not work:

/<img[^>]+ad>/im

The way I read this, it tells Perl to start a regular expression search (/) and find the string "<img" followed by anything that is not a ">" one or more times ([^>]+), followed by the string "ad>". Then it tells Perl to end the regular expression (/), to be case insensitive (i), and to continue searching even if it encounters a carriage return (m).


Any help is appreciated!
Furry
Re: [Fat_N_Furry] [Perl] Multiline Regular Expression Searches In reply to
Could you give an example of what you are trying to match? That would help in giving a more accurate solution, because any regex I could give right now would probably not be suitable. For example:

/<img(.*?)ad[^>]*>/is;

...that would probably do what you want, but it is not a great regex: it is designed only to match the pattern as you described it, because I can't yet assess what you want to achieve and advise accordingly.
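To make the difference concrete, here is a small sketch comparing the original pattern with the one above. The sample tag and the variable names are made up for illustration; they are not taken from a real page.

```perl
use strict;
use warnings;

# Hypothetical sample: an <img> tag that spans two lines.
my $html = qq{<p>\n<IMG src="/ads/top.gif"\n     alt="banner">\n</p>};

# Original attempt: needs the literal text "ad>" at the very end of the tag.
my $original = ($html =~ /<img[^>]+ad>/im) ? "match" : "no match";

# Suggested pattern: "ad" may appear anywhere after "<img";
# /s lets "." match newlines as well.
my $suggested = ($html =~ /<img(.*?)ad[^>]*>/is) ? "match" : "no match";

print "original:  $original\n";   # prints "original:  no match"
print "suggested: $suggested\n";  # prints "suggested: match"
```

One caveat: because "." with /s also matches ">", the (.*?) part can run past the end of an <img> tag that has no "ad" in it and pick up "ad" from later text, such as the word "read" in a following paragraph.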

Last edited by:

Paul: Jan 6, 2003, 3:35 PM
Re: [Paul] [Perl] Multiline Regular Expression Searches In reply to
Well what I'm trying to do is create a program to scan HTML files on my hard drive (in the cache, to be exact) for ads and remove them.

I can read the files and write them back out, but I need to remove all the occurrences of what are assessed to be ads (images and .SWF files with certain keywords in them). The keywords I'm scanning for are as follows:

ad
banner
powered
sponsor
brought to you by
courtesy of

My program code as it stands now is as follows (I haven't implemented your regular expression yet):

Code:
# Open the input file named on the command line.
open(FILE_IN, "<", $ARGV[0]) || die "Cannot open file $ARGV[0]. Verify that it exists.";

# Split the name into base and extension so "_out" can go in between.
($base_name, $type) = $ARGV[0] =~ /(.*)(\.html)/;
$output_file = $base_name . "_out" . $type;

printf("\$output_file = %s\n", $output_file);
open(FILE_OUT, ">", $output_file) || die "Cannot write to file $output_file";

# Read the input one line at a time; $_ holds the current line.
while (<FILE_IN>)
{
    if (   (/<img[^>]+>/im)
        || (/<img[^>]+/i)
        || (/<img[^>]+.*ad/im)
        || (/<img[^>]+.*sponsor/im)
        || (/<img[^>]+.*courtesy of/im)
        || (/<img[^>]+.*banner/im)
        || (/<img[^>]+.*powered/im)
        || (/<img[^>]+.*brought to you by/im)

        || (/<embed[^>]+\.swf.*ad/im)
        || (/<embed[^>]+\.swf.*sponsor/im)
        || (/<embed[^>]+\.swf.*courtesy of/im)
        || (/<embed[^>]+\.swf.*banner/im)
        || (/<embed[^>]+\.swf.*powered/im)
        || (/<embed[^>]+\.swf.*brought to you by/im)

        || (/<td[^>]+\<embed[^>]+\.swf.*ad<\/td>/im)
        || (/<td[^>]+\<embed[^>]+\.swf.*sponsor<\/td>/im)
        || (/<td[^>]+\<embed[^>]+\.swf.*courtesy of<\/td>/im)
        || (/<td[^>]+\<embed[^>]+\.swf.*banner<\/td>/im)
        || (/<td[^>]+\<embed[^>]+\.swf.*powered<\/td>/im)
        || (/<td[^>]+\<embed[^>]+\.swf.*brought to you by<\/td>/im)
       )
    {
        printf("XX1_1\n"); # Found a match
        # Remove the first matching tag of each kind from this line.
        s/<img[^>]+>//im;
        s/<embed[^>]+>//im;
        s/\<td[^>]+>[^<]*<\/td>//;
        $file .= $_;
    }
    # Write the (possibly modified) line to the output file.
    printf(FILE_OUT "%s", $_);
}
close(FILE_IN);
close(FILE_OUT);
I haven't found out how to get a list of files from the cache and scan them all yet, but I'm working on it.
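One possible way to collect the files is the core File::Find module. This is only a sketch: the helper name html_files_in is hypothetical, and the cache location is an assumption that has to be supplied as a command-line argument (it falls back to the current directory).

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every .html/.htm file under $dir, sorted.
sub html_files_in {
    my ($dir) = @_;
    my @found;
    find(
        sub {
            # $_ is the bare file name; $File::Find::name is the full path.
            push @found, $File::Find::name if -f $_ && /\.html?$/i;
        },
        $dir
    );
    return sort @found;
}

# The cache location is an assumption; pass the real directory as an argument.
my $cache_dir = shift(@ARGV) || '.';
print "$_\n" for html_files_in($cache_dir);
```

Each path it prints could then be handed to the open/while loop above in place of the single command-line file name.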


Thanks for any help with this that you can give!
Furry
Re: [Fat_N_Furry] [Perl] Multiline Regular Expression Searches In reply to
Just a small idea: there are good HTML parser modules. You may want to try one before you write your own parser.

Best regards,
Webmaster33


Re: [Fat_N_Furry] [Perl] Multiline Regular Expression Searches In reply to
The "m" in your regexes isn't doing what you think. "m" puts the match into multiline mode, meaning that ^ and $ match at the beginning and end of each line rather than only at the start and end of the whole string.

I think what you were thinking of was "s", which treats the string as a single line: it lets "." match newlines too, so a ".*" can match across many lines.
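A quick sketch of the two modifiers side by side, using a made-up two-line string. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /m, ^ only matches at the start of the whole string.
my $plain_caret = ($text =~ /^second/)  ? 1 : 0;
# With /m, ^ also matches right after each newline.
my $multi_caret = ($text =~ /^second/m) ? 1 : 0;

# Without /s, "." does not match a newline.
my $plain_dot  = ($text =~ /first.*second/)  ? 1 : 0;
# With /s, "." matches any character, newline included.
my $single_dot = ($text =~ /first.*second/s) ? 1 : 0;

# A negated class such as [^>] matches newlines with or without /s.
my $negated = ("<img\nsrc=x>" =~ /<img[^>]+>/) ? 1 : 0;

print "$plain_caret $multi_caret $plain_dot $single_dot $negated\n";  # prints "0 1 0 1 1"
```

Note that the posted loop reads the file one line at a time with while (<FILE_IN>), so a tag that spans several lines is never in $_ all at once. No modifier can help with that until the file is read into a single string.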

Either way there are lots of issues with those regexes, the major one being that there are far too many :)

You can use parentheses to group sub-expressions without capturing, for example:

(?:ad|sponsor|banner|powered)
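As a rough sketch of how that grouping could replace the long list of regexes, the keywords from earlier in the thread can go into one alternation. The helper name strip_ad_tags and the sample page are hypothetical, and the whole page is assumed to be in one string already.

```perl
use strict;
use warnings;

# Keywords from earlier in the thread, joined into one non-capturing group.
my $keywords = qr/(?:ad|banner|powered|sponsor|brought to you by|courtesy of)/i;

# Hypothetical helper: delete <img> and <embed> tags that contain a keyword.
sub strip_ad_tags {
    my ($html) = @_;
    # [^>]* keeps the match inside one tag, even when the tag spans lines.
    # ${keywords} is written with braces so the "[" after it is not taken
    # as an array subscript.
    $html =~ s/<(?:img|embed)\b[^>]*${keywords}[^>]*>//gi;
    return $html;
}

# Made-up sample: one harmless image, one ad image, one ad Flash embed.
my $page = qq{<img src="logo.gif">\n<img\n src="/ads/1.gif">\n<embed src="Banner.swf">\n<p>text</p>};
my $clean = strip_ad_tags($page);
print $clean, "\n";
```

Here the logo image survives and the other two tags are removed. A bare "ad" also matches inside unrelated words such as "download" or "header", so real use would probably need tighter keywords.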

Last edited by:

Paul: Jan 9, 2003, 4:03 PM