Gossamer Forum

Regex Problem...

Can anyone suggest a regex to replace this? The aim is to grab all URLs starting with http:// in an HTML page.

Code:
my @html = get($page);

foreach (@html) {

# now scan through to find URL's...
m,http://(.+?),g and print $1;

}

It just doesn't seem to be matching.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] Regex Problem... In reply to
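Your regex will match any element of @html containing http://, but the non-greedy (.+?) sits at the very end of the pattern, so it stops as soon as it can and $1 only ever holds the single character after http://.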
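
To see what the original pattern actually captures, here is a minimal sketch. The sample string and the sub names (first_capture_lazy, all_urls) are made up for illustration, and the character class used in the second sub is only one assumed way to decide where a URL ends, not a full URL grammar.

```perl
use strict;
use warnings;

# Hypothetical sample line; not taken from any real page.
my $line = 'see <a href="http://www.example.com/a.html">A</a> and http://example.org/b';

# The original pattern: a lazy quantifier with nothing after it
# is satisfied by the shortest possible match, which is one character.
sub first_capture_lazy {
    my ($text) = @_;
    if ($text =~ m,http://(.+?),) {
        return $1;
    }
    return undef;
}

# One possible fix (an assumption, not the thread's answer): capture up to
# the next whitespace, quote, or angle bracket, and use /g in list context
# to collect every match.
sub all_urls {
    my ($text) = @_;
    my @urls = $text =~ m{(http://[^\s"'<>]+)}g;
    return @urls;
}

print first_capture_lazy($line), "\n";   # prints "w"
print "$_\n" for all_urls($line);
```

The second sub returns whole URLs including the path, which is what the original question asked for.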

I was playing with URL regexes for one of my own scripts and came up with this. I'm not sure how good it is, but it works for my purposes. Note that it captures only the host name, not the path.

Code:
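use LWP::Simple;    # assuming get() is the one LWP::Simple exports

my $html = get($url);    # the whole page comes back as one string

# In scalar context, /g resumes after the previous match on each pass,
# so the loop visits every http:// host in the page.
while ($html =~ m|http://((?:[a-z0-9-]+\.)?[a-z0-9-]+\.[a-z0-9]+(?:\.[a-z0-9\.]*)?)|sg) {
    print "$1\n";
}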

Last edited by Paul: Jan 30, 2003, 3:55 AM
Re: [Paul] Regex Problem... In reply to
Doesn't match anything (i.e. no URLs appear in the page output).

Thanks for the try though :)

Andy (mod)
Re: [Andy] Regex Problem... In reply to
Seems ok for me.

Code:
my $url = 'http://www.adsql.com/';
print "Content-type: text/html\n\n";
print $url =~ m|http://((?:[a-z0-9-]+\.)?[a-z0-9-]+\.[a-z0-9]+(?:\.[a-z0-9\.]*)?)|;

Output:

www.adsql.com
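
That one-liner prints the host because a match in list context returns its captures. A small sketch of the same idea with /g, which returns the capture from every match at once, might look like this. The names $host_re and hosts_in and the sample text are illustrative; the pattern itself is the one from the post above.

```perl
use strict;
use warnings;

# The host pattern from the post above, precompiled for reuse.
my $host_re = qr|http://((?:[a-z0-9-]+\.)?[a-z0-9-]+\.[a-z0-9]+(?:\.[a-z0-9\.]*)?)|;

# In list context, m//g returns the captured text of every match.
sub hosts_in {
    my ($text) = @_;
    my @hosts = $text =~ /$host_re/g;
    return @hosts;
}

# Hypothetical input with two absolute URLs.
print join(", ", hosts_in('http://www.adsql.com/ and http://perl.org/docs')), "\n";
```

Since the pattern has no /i modifier, upper-case host names would be missed.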
Re: [Paul] Regex Problem... In reply to

Web Images Groups Directory News

Advanced Search
Preferences
Language Tools
New! Take your search further. Take a Google Tour.

Advertise with Us - Search Solutions - Services & Tools - Jobs, Press, & Help

©2003 Google - Searching 3,083,324,652 web pages

Spider

Please tick which sites you want to spider, and get the details for....

SELECT ALL?

That's pretty much what I see (plus a few extra buttons, but the forum won't let me show them). The code I was using is:

Code:
my @html = get($page);

foreach (@html) {

m|http://((?:[a-z0-9-]+\.)?[a-z0-9-]+\.[a-z0-9]+(?:\.[a-z0-9\.]*)?)|sg and print "$1<BR>";
print $_;
}

As you can see, it's a basic adaptation of yours, to accommodate the array. However, no URLs are printed out.

Cheers

Andy (mod)
Re: [Andy] Regex Problem... In reply to
Try it like the code I originally had instead of pushing everything into an array.
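
The difference between the two versions can be sketched without a network call. Here fake_get is a hypothetical stand-in that returns a fixed string; it assumes, as with LWP::Simple's get, that the whole document comes back as one scalar, so assigning it to an array yields a single element. The simplified pattern in $re is also just for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get(): returns the page as one string.
sub fake_get {
    return qq{<a href="http://www.example.com/">one</a>\n}
         . qq{<a href="http://www.example.org/">two</a>\n};
}

my $re = qr|http://([a-z0-9.-]+)|;   # simplified host pattern, illustration only

# Array version: the array holds one element, and "m//g and ..." in
# scalar context reports only one match per element.
sub via_array {
    my @html = fake_get();
    my @found;
    foreach (@html) {
        /$re/g and push @found, $1;
    }
    return @found;
}

# Scalar version: the while loop resumes after each match until none remain.
sub via_scalar {
    my $html = fake_get();
    my @found;
    while ($html =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

my @from_array  = via_array();
my @from_scalar = via_scalar();
print "array version found: @from_array\n";
print "scalar version found: @from_scalar\n";
```

This accounts for the array version finding fewer matches, not for it finding none. A page whose links are all relative, such as href="/search", contains nothing for any http:// pattern to match.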
Re: [Paul] Regex Problem... In reply to
I did... still nothing (I tried yours first of all, then tried an adaptation of how I used to have it).

Cheers

Andy (mod)
Re: [Andy] Regex Problem... In reply to
It definitely works in my original example so I don't see why nothing prints.
Re: [Andy] Regex Problem... In reply to
You tried HTMLTokeParser? I'm sure there are plenty of solutions for this on the CPAN ...

- wil
Re: [Wil] Regex Problem... In reply to
HTMLTokeParser doesn't seem to exist. I tried looking on Google.com and PerlDoc.com :-|

Cheers

Andy (mod)
Re: [Andy] Regex Problem... In reply to
He means HTML::TokeParser

But HTML::LinkExtor is your best bet.
Re: [Paul] Regex Problem... In reply to
Yeah, either one would do the job just fine.

- wil
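
For reference, a rough sketch of the HTML::TokeParser route, assuming the module is installed from CPAN. The HTML fragment and the sub name http_links are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical HTML fragment used only for this example.
my $html = <<'HTML';
<p><a href="http://www.example.com/">Example</a>
<a href="/relative/page.html">Relative</a></p>
HTML

# Return the href of every <a> tag whose target starts with http://.
sub http_links {
    my ($doc) = @_;
    # A reference to a scalar tells the parser to read from that string.
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser";
    my @links;
    # get_tag("a") skips ahead to the next <a> start tag; element 1 of
    # the returned token is a hash ref holding the tag's attributes.
    while (my $tag = $parser->get_tag("a")) {
        my $href = $tag->[1]{href};
        push @links, $href if defined $href && $href =~ m{^http://};
    }
    return @links;
}

print "$_\n" for http_links($html);
```

Because the parser works on tags rather than raw text, an http:// string that merely appears in the page's prose is not picked up.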
Re: [Wil] Regex Problem... In reply to
They would indeed. I think HTML::LinkExtor is a little simpler and handles this task specifically. The HTML::TokeParser POD is a bit cryptic.
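
A minimal sketch of the HTML::LinkExtor approach, assuming the module (part of the HTML-Parser distribution on CPAN) is installed. The sub name extract_links, the HTML fragment, and the base URL are hypothetical. Unlike a pattern anchored on http://, this also sees relative links and, given a base URL, resolves them to absolute ones.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect every link in $html, resolved against $base.
sub extract_links {
    my ($html, $base) = @_;
    # No callback is passed, so links accumulate inside the parser;
    # the base URL makes relative links absolute.
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;
    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;           # tag name, then attribute => URL pairs
        push @urls, map { "$_" } values %attr;   # stringify the URL objects
    }
    return @urls;
}

# Hypothetical fragment and base URL.
my $html = '<a href="/about.html">About</a> <img src="http://www.example.com/logo.gif">';
print "$_\n" for extract_links($html, 'http://www.example.com/');
```

This returns links from every link-carrying tag, not just anchors. Filtering on $tag eq 'a' inside the loop would restrict it to anchors.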