Gossamer Forum

Substitution

Hi everybody

I have a few HTML files in which I want to change some tags. I thought I could use a small Perl program to do this, but I am only a beginner at Perl.

What I want to do is replace all occurrences of
<a href="footnote"> with <a href="footnote" name="ref_footnote">, where 'footnote' is a variable one- or two-digit number. How would I do this?

I imagine that I would do the substitution line by line, but then what about the cases where I have '<a' on one line and 'href="footnote">' on the next line?

Thanks a lot.

Ivan,
Iyengar Yoga Resources
http://www.iyengar-yoga.com/
Re: Substitution In reply to
By having both the HREF and NAME attributes, I assume you intend the anchor to be both a link and a target, correct? You didn't say whether or not you are using "#" (although isn't it required?) so I made it optional...
Code:
s/(<a)([\s\S]+?)(href=)("?#?(\d+)"?)([\s\S]+?)(>)/$1$2$3$4 name="ref_$5"$6$7/i;
As for how to actually use this, it's up to you! That's the most complex regex I've ever written so I'm about fried.
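One way to use it, as a minimal sketch: hold the whole document in a single string rather than going line by line, so that a tag split across two lines is still seen as one tag, and add the g modifier so every anchor is changed. The sample HTML and the variable names $html and $fixed below are made up for illustration; a real script would first read the file into one string (for example by setting "local $/;" before reading from the file handle).

```perl
use strict;
use warnings;

# Hypothetical sample input: the first anchor is split across two lines.
my $html = qq{<p>See note <a\nhref="1">1</a> and <a class="fn" href="#12">12</a>.</p>\n};

# Same pattern as in the post above, with /g added so that every
# matching anchor in the string is rewritten, not just the first one.
(my $fixed = $html) =~
    s/(<a)([\s\S]+?)(href=)("?#?(\d+)"?)([\s\S]+?)(>)/$1$2$3$4 name="ref_$5"$6$7/gi;

print $fixed;
```

Because [\s\S] matches newlines as well as every other character, the pattern needs no extra modifier to cope with the line break inside the first tag.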



Happy Coding,

--Drew
http://www.FindingHim.com
Re: Substitution In reply to
Looks good to me Junko.

When you see it like that it looks so obvious but when you have to write it from scratch it is so difficult.

Paul Wilson.
http://www.wiredon.net/gt/
http://www.perlmad.com/
Re: Substitution In reply to
 
Thanks a lot, I will try it tonight.

Yes, the reason I want to do this is to be able to jump from the text to the footnote and vice versa.

Can you tell me what exactly $1 to $7 refer to after the second forward slash? To the things in the brackets after the first slash? But what about the bracket inside a bracket, i.e. ( .. ( .. ) .. )? What's the general rule?

Regards,

Ivan,
Iyengar Yoga Resources
http://www.iyengar-yoga.com/
Re: Substitution In reply to
Ok......

(<a)
This is matching <a ie the beginning of the a href tag...

([\s\S]+?)
This matches one or more characters of any kind, whitespace or non-whitespace (so newlines too), taking as few as possible

(href=)
This is matching the href= part of the tag

("?#?(\d+)"?)
This matches an opening " if present, then a # if present (both are optional), then one or more digits, then a closing " if present

([\s\S]+?)
This again matches one or more characters of any kind, whitespace or non-whitespace

(>)
This matches the final > of the href tag

/$1$2$3$4 name="ref_$5"$6$7/i;
The $1, $2, etc. hold whatever was matched by the corresponding brackets before the slash, i.e. $1 is what (<a) matched and $2 is what ([\s\S]+?) matched. So after the slash the code just puts back what was originally there, but with name="ref_..." added.

Oh and the i at the end will make the pattern matching case insensitive.

Any clearer?

Paul Wilson.
http://www.wiredon.net/gt/
http://www.perlmad.com/
Re: Substitution In reply to
I don't know why I was using "[\s\S]" instead of ".". I've modified the code and added comments after each block.
Code:
s/
(<a) #($1) find an anchor tag,
(.+?) #($2) then get the value of all preceding attributes if present before
(href=) #($3) the href attribute.
("?\#?(\d+)"?) #($4($5)) get the value, with or without quotes or hash mark
(.+?) #($6) then get all remaining attributes if present
(>) #($7) to the end of the tag
/$1$2$3$4 name="ref_$5"$6$7/ix; # rebuild the beginning anchor
# and allow case insensitivity and comments
To explain further about ("?\#?(\d+)"?):
$4 is the value of the whole block, which will contain the beginning and ending quote (if present), the hash mark (if present), and the actual block of numbers.
$5 is only the block of numbers found (\d+).
If $4 has neither quotes nor the hash, $4 and $5 hold the same value.
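On the general rule for nested brackets: groups are numbered by the position of their opening bracket, counting from the left, so an outer group always gets a lower number than any group nested inside it. A small sketch that prints what each group captured; the sample tag and the names $tag and @groups are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical anchor tag with attributes before and after href.
my $tag = '<a class="fn" href="#12" title="note">';

# In list context a successful match returns the captured groups in order,
# so $groups[0] corresponds to $1, $groups[1] to $2, and so on.
my @groups = $tag =~ /(<a)([\s\S]+?)(href=)("?#?(\d+)"?)([\s\S]+?)(>)/i;

for my $i (0 .. $#groups) {
    printf "\$%d = [%s]\n", $i + 1, $groups[$i];
}
```

Here the outer group around the quoted value opens fourth and the nested (\d+) opens fifth, which is why the digits alone end up in $5.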

Happy Coding,

--Drew
http://www.FindingHim.com
Re: Substitution In reply to
Dear junko and PaulWilson

Thanks a lot for the help. I tried the regular expression, and it did work after I deleted the "x" at the end (because that produced an error) and added "g" (replace all occurrences) and "m" (because my string consisted of multiple lines). Afterwards, I even wrote my own regular expression for something related.

Thanks again.

Ivan,
Iyengar Yoga Resources
http://www.iyengar-yoga.com/
Re: Substitution In reply to
I assume the error was something like "unmatched () in regexp...", right? I bet you stripped out the comments and reformatted it. The 'x' is only there so that you can break the code up and have comments (hence I escaped the hash).

I should have left the code alone... multi-line matching worked before I changed ([\s\S]+?) to (.+?), and without the 'm' option, because [\s\S] matches newlines while . does not by default. I can't believe I missed that.
Code:
s/(<a)([\s\S]+?)(href=)("?#?(\d+)"?)([\s\S]+?)(>)/$1$2$3$4 name="ref_$5"$6$7/i;
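To illustrate the difference, here is a small sketch comparing "." and "[\s\S]" on a tag that is split across two lines. As far as the Perl documentation goes, the m modifier only changes where ^ and $ match, while the s modifier is the one that lets "." match a newline. The sample tag and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical tag with a line break between "<a" and "href".
my $tag = qq{<a\nhref="3">};

# Each variable is 1 if the pattern matched and 0 if it did not.
my $dot_plain   = $tag =~ /<a.+?href=/      ? 1 : 0;  # "." does not match the newline
my $dot_with_m  = $tag =~ /<a.+?href=/m     ? 1 : 0;  # /m only affects ^ and $
my $dot_with_s  = $tag =~ /<a.+?href=/s     ? 1 : 0;  # /s lets "." match the newline
my $class_plain = $tag =~ /<a[\s\S]+?href=/ ? 1 : 0;  # [\s\S] matches any character

print "dot, no modifier:    $dot_plain\n";
print "dot with /m:         $dot_with_m\n";
print "dot with /s:         $dot_with_s\n";
print "[\\s\\S], no modifier: $class_plain\n";
```

So either keeping [\s\S] or using "." together with the s modifier should handle tags that wrap onto the next line.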
Happy Coding,

--Drew
http://www.FindingHim.com