Gossamer Forum

Regex driving me *MAD*

Hi,

Really struggling to find a solution to this one :|

We have the following in an HTML extract, where I'm trying to replace the image path with the value from $image_path (defined in another routine, but that's irrelevant to the problem, as some of the URLs are being changed OK, but not others :/).

The HTML looks like;

<img width=588 height=381 src="35598-Science%20Coursework1_files/image002.gif">

<img width=28 height=28 src="35598-Science%20Coursework1_files/image003.gif">

..and the regex is;

$content =~ s|\Qsrc="\E(.*?)/(.*?)\"\>|src="$image_path/$2.$3">|sig;

There are 3 images in the sample HTML document I'm trying to process;

Quote:
<img width=588 height=381 src="35598-Science%20Coursework1_files/image001.gif">

<img width=28 height=28 src="35598-Science%20Coursework1_files/image002.gif">

<img width=28 height=28 src="35598-Science%20Coursework1_files/image003.jpg">

The 001 and 003 images are being replaced correctly by the regex- but not the 002 image!

Can *anyone* see my obvious mistake? I'm losing my hair over this one (what I've got left ;))

TIA!

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!

Last edited by:

Andy: Mar 24, 2006, 8:04 AM
Re: [Andy] Regex driving me *MAD* In reply to
What do you get in place for image 2 ? (ie what does the regex return?)

Your code works for me with one exception...

$content = "<img width=588 height=381 src=\"35598-Science%20Coursework1_files/image001.gif\"> <img width=28 height=28 src=\"35598-Science%20Coursework1_files/image002.gif\"> <img width=28 height=28 src=\"35598-Science%20Coursework1_files/image003.jpg\">";

$image_path = 'C:\docs';

$content =~ s|\Qsrc="\E(.*?)/(.*?)\"\>|src="$image_path/$2.$3">|sig;

print $content;

and I get...
<img ... /image003.jpg."> <-- note the extra period/dot here

The $3 isn't returning anything, and the period "." doesn't seem to be functioning as intended.

Could there be another SRC="" in a tag that is causing it to break?

example: <script src="blah.js">

... just guessing ...

Last edited by:

Watts: Mar 24, 2006, 10:11 AM
Re: [Watts] Regex driving me *MAD* In reply to
Hi,

Yeah, the extra . was a mess up on my end :D I had $2.$3 (as you stated), but only $1 and $2 had values (it was from a previous attempt, where I tried matching the extension, as well as the filename).
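
For anyone following along, here is a minimal sketch of why the stray dot shows up. It assumes a cut-down one-tag sample string, and the helper name replace_with_stray_group is made up for this example. The pattern has only two capture groups, so $3 is undefined and expands to nothing, while the literal "." in the replacement is still emitted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper reproducing the leftover ".$3" from the earlier attempt.
sub replace_with_stray_group {
    my ($content) = @_;

    # Only two capture groups exist, so $3 is undefined here. With warnings
    # enabled Perl would report an uninitialized value, which is a quick way
    # to spot this kind of leftover. Silenced so the demo prints cleanly.
    no warnings 'uninitialized';
    $content =~ s|src="(.*?)/(.*?)">|src="/$2.$3">|i;
    return $content;
}

print replace_with_stray_group('<img src="dir/image003.jpg">'), "\n";
# prints: <img src="/image003.jpg.">
```

Running the real script with "use warnings" switched on is probably the cheapest way to catch this sort of thing.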

Still can't get it to work right :(

I'm trying a simpler approach now, in a separate .cgi script;

Code:
#!/usr/local/bin/perl

my $DEBUG = 0;

my $string = qq|

<p class=MsoNormal><img width=552 height=216
src="35598-Science%20Coursework1_files/image001.jpg" align=left hspace=12><b><span
lang=EN-GB style='font-size:10.0pt'>Hypothesis: </span></b><span lang=EN-GB
style='font-size:10.0pt'>Saliva, which contains the enzyme amylase, is produced
in the parotid glands that are located in the mouth. Amylase is a
carbohydrase, which breaks down and digests starch. It does this by speeding

<td></td>
<td><img width=28 height=28
src="35598-Science%20Coursework1_files/image003.gif"></td>
</tr>

<p class=MsoNormal style='margin-left:1.25in;text-indent:-1.25in'><span
style='position:absolute;z-index:-4;left:0px;margin-left:-24px;margin-top:18px;
width:588px;height:381px'><img width=588 height=381
src="35598-Science%20Coursework1_files/image002.gif"></span><b><span
lang=EN-GB style='font-size:10.0pt'>Analysing Evidence: </span></b><span
lang=EN-GB style='font-size:10.0pt'>A graph to show the relationship between
the temperature of the starch solution and the time that it took for the
amylase for digest it. </span></p>
|;


my $image_path = 'http://www.foo.com/images/';

$string =~ s|\Qsrc="\E(.*?)/(.*?)\.gif\"\>|src="/$2.gif">|sig;
$string =~ s|\Qsrc="\E(.*?)/(.*?)\.jpg\"\>|src="/$2.jpg">|sig;
#
# $string =~ s|\Qsrc="\E(.*?)/(.*?)\"\>|src="$image_path/$2">|sig;

print $string;

This returns;

Code:
<p class=MsoNormal><img width=552 height=216
src="/image001.jpg" align=left hspace=12><b><span
lang=EN-GB style='font-size:10.0pt'>Hypothesis: </span></b><span lang=EN-GB
style='font-size:10.0pt'>Saliva, which contains the enzyme amylase, is produced
in the parotid glands that are located in the mouth. Amylase is a
carbohydrase, which breaks down and digests starch. It does this by speeding

<td></td>
<td><img width=28 height=28
src="35598-Science%20Coursework1_files/image003.gif"></td>
</tr>

<p class=MsoNormal style='margin-left:1.25in;text-indent:-1.25in'><span
style='position:absolute;z-index:-4;left:0px;margin-left:-24px;margin-top:18px;
width:588px;height:381px'><img width=588 height=381
src="/image002.gif"></span><b><span
lang=EN-GB style='font-size:10.0pt'>Analysing Evidence: </span></b><span
lang=EN-GB style='font-size:10.0pt'>A graph to show the relationship between
the temperature of the starch solution and the time that it took for the
amylase for digest it. </span></p>

As you can see, one of them still isn't updated correctly :| (I'm just trying to get it to work with a basic /imagename.ext format now, to at least try and get it working - I can then add the real URL in later :)).

I just don't get why it's not working. Must be something either really silly - or something really weird going on :/

TIA!

Andy (mod)
Re: [Andy] Regex driving me *MAD* In reply to
I had a similar problem with a very similar code... try removing the tab/space in front of the 2nd src="" attribute. Notice how all the others are left justified, but the one in the middle is not?
Re: [Andy] Regex driving me *MAD* In reply to
Your regex is wrong:

$string =~ s|\Qsrc="\E(.*?)/(.*?)\.gif\"\>|src="/$2.gif">|sig;

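Notice the closing > at the end of the pattern. In your string, the first img tag has more attributes after the src, so there is no > right after the closing quote. With the /s flag, (.*?) can run across newlines, so your match is going all the way to the second tag, the first one that does end in .gif">.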
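
One way to avoid this, as a minimal sketch (not necessarily the exact fix that ended up in the real script): use negated character classes so a capture can never run past the closing quote, and drop the requirement that > follows straight after it. The rewrite_src helper name and the two-tag sample are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite src="dir/file.ext" to src="$image_path/file.ext".
# [^"]* cannot cross the closing quote, so a match can never spill into the
# next tag the way (.*?) with /s can. No trailing > is required, so tags
# with extra attributes after src are handled too.
sub rewrite_src {
    my ($html, $image_path) = @_;
    $html =~ s{src="[^"]*/([^"/]+)"}{src="$image_path/$1"}gi;
    return $html;
}

# Two tags modelled on the thread: one with attributes after src, one without.
my $html = qq|<img width=552 height=216\nsrc="35598-Science%20Coursework1_files/image001.jpg" align=left hspace=12>\n|
         . qq|<img width=28 height=28\nsrc="35598-Science%20Coursework1_files/image003.gif">\n|;

print rewrite_src($html, 'http://www.foo.com/images');
```

Both src values come out pointing at http://www.foo.com/images/, and the align/hspace attributes on the first tag are left alone. Note that $image_path is passed without a trailing slash here, since the replacement adds one.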

Cheers,

Alex
--
Gossamer Threads Inc.

Last edited by:

Alex: Mar 27, 2006, 9:53 AM
Quote Reply
Re: [Alex] Regex driving me *MAD* In reply to
Hi Alex,

As usual, you hit the nail on the head :) Works like a charm! How I missed that, I'll never know :( (too many hours of staring at the same code I guess)

Thanks again!

Andy (mod)
Re: [Andy] Regex driving me *MAD* In reply to
Now just to get around the *lovely* fact that Mr M$ decided to support inline <span><img ..></span> formatting only in IE, so it isn't cross-browser compatible with Firefox. Go Bill Gates - NOT!


Thanks again =) You saved my last bit of hair :P

Andy (mod)
Re: [Andy] Regex driving me *MAD* In reply to
Ignoring the fact that the problem has already been solved, this could be done with the server configuration (such as .htaccess files in Apache if enabled). This is good for four reasons:
  1. It enables the programmer to be lazy. You never have to change the original documents but requests will be forwarded to the proper location.
  2. Cached documents will still function properly.
  3. Documents still referring to the old location will still function.
  4. You'll get fewer 404s in your server log and won't get penalized by search engines.

Re: [mkp] Regex driving me *MAD* In reply to
Hi,

Thanks for the reply. Not really applicable in this case though =) We use Win32::OLE on a central system (an NT machine) to extract the HTML from .doc/.wps/.rtf documents. The data is then put into a database table and pulled out later, when required. That's where this code comes into play (as the stored HTML uses the extracted relative image paths, instead of the full URL ones :)).

Cheers

Andy (mod)