Gossamer Forum

regexp to decode url?

Hi folks,

Could somebody help me with a regular expression to decode a URL? It's for a navbar. I'm getting the current location from the REQUEST_URI environment variable, and now I want to split the string up into directories and put them into an array. The problem is that when I use "split", the leading forward slash produces an empty first element in the array, and I don't want the filename either.

For example, I would like to return an array like this:

@dirs = ("services", "design", "samples");

from all of these:

/services/design/samples
/services/design/samples/
/services/design/samples/index.htm

Any suggestions?

Thanks,
adam
Re: regexp to decode url?
try http://agora.leeds.ac.uk/nik/Perl/

hope this helps.....
Re: regexp to decode url?
Weeeell, I was looking for a regexp to get rid of the first slash and the filename, but I guess maybe it's time I *did* delve into the complicated world of them and learn it for myself! :)

The link should have been http://agora.leeds.ac.uk/nik btw, but I found it fast enuff, so thanks.

adam
Re: regexp to decode url?
If you want to really decode a URL and get things like protocol, hostname, relative link, absolute link, etc, use the URI module available from CPAN.

If it's pretty simple and those are the only URLs, you could do:

$input =~ m,/([^/]+)/([^/]+)/([^/]+),;

and you'll have services, design and samples in $1, $2, $3. It will work with all three samples you provided but wouldn't work if the input didn't look like:

/something/something/something
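
A minimal sketch of how that match could be used, assuming exactly three leading directories as described above. The helper name first_three_dirs is hypothetical; it relies on a match in list context returning its captures, so the result lands in an array instead of $1, $2, $3.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first three path components,
# or an empty list when the input has fewer than three.
sub first_three_dirs {
    my ($input) = @_;
    # In list context a successful match returns its captures.
    my @dirs = $input =~ m,/([^/]+)/([^/]+)/([^/]+),;
    return @dirs;
}

for my $uri ('/services/design/samples',
             '/services/design/samples/',
             '/services/design/samples/index.htm') {
    print join(', ', first_three_dirs($uri)), "\n";
}
```

Each of the three sample inputs should give the same three elements; anything with fewer than three components gives an empty list.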

Cheers,

Alex

Re: regexp to decode url?
Alex,

That's what the input *should* always look like, but I guess there'll always be anomalies. I'll have a look at the URI module as well.

Thanks,
adam
Re: regexp to decode url?
Back to this one again. :)

Ok, I guess my real problem isn't putting it into an array, it's stripping off the filename or forward slash at the end if there is one.

So is it possible to get something like this:

dir1/dir2/dir3

from the three examples I gave above? The string wouldn't necessarily be three dirs though (Alex's example depended on it); it could be one or ten...

Cheers,
adam
Re: regexp to decode url?
Sure:

Code:
$input =~ s,
^/? # Find 0 or 1 leading slashes.
(.+?) # Store everything in the middle in $1
/ # Followed by a slash.
[^/]+$ # Followed by the file name and end of string.
,$1,x; # Replace with just the middle.

would do the trick.

Cheers,

Alex

[This message has been edited by Alex (edited May 12, 1999).]
Re: regexp to decode url?
Hi again Alex,

Sorry to be a bother, but that's still not getting the result I want. If there's a trailing slash, but no filename, it's leaving the leading and trailing slashes.

Also, I'm a bit worried that if the request_uri had no trailing slash, but was referring to a directory, that it would strip that off thinking it was a filename. Apache seems to rewrite the URI internally now, but I like to be sure.

So can the regex check to see if it's a valid filename (somethingdotsomething)? I thought it would be something like *\.* but that doesn't work.
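
One way this might be handled with substitutions alone is sketched here. It assumes a "filename" means a last component with a dot and an extension; the helper name dir_path is hypothetical. Note that under this assumption a directory whose name contains a dot (say "v1.2") would be stripped too.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a request path to "dir1/dir2/dir3".
# Assumption: a trailing component of the form name.ext is a filename.
sub dir_path {
    my ($input) = @_;
    $input =~ s,^/+,,;                     # drop leading slashes
    $input =~ s,/+$,,;                     # drop trailing slashes
    $input =~ s,(?:^|/)[^/]*\.[^/.]+$,,;   # drop a trailing name.ext, if any
    return $input;
}

for my $uri ('/services/design/samples',
             '/services/design/samples/',
             '/services/design/samples/index.htm') {
    print dir_path($uri), "\n";
}
```

A last component without a dot is left alone, so a directory requested without its trailing slash is not mistaken for a file.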

Thanks again Alex,
adam
Re: regexp to decode url?
Back again! :)

Ok, I think the easiest way to do it is to split the request_uri with the forward slash:

Code:
$uri = $ENV{REQUEST_URI};
@dirs = split("/",$uri);

Now I can get rid of the first forward slash with shift:

Code:
shift(@dirs);

And I can get the last value like this:

Code:
$length = $#dirs;
$last = $dirs[$length];

So now all I have to do is check to see if $last is a valid filename, and if it is I can remove it with:

Code:
pop(@dirs);
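
The split steps above can be sketched in one piece. This is only an illustration: the helper name path_parts is hypothetical, the query-string stripping is an added assumption, and grep is swapped in for shift so that an input without a leading slash does not lose its first directory.

```perl
use strict;
use warnings;

# Hypothetical helper: split a request path into its non-empty parts.
sub path_parts {
    my ($uri) = @_;
    $uri =~ s/\?.*//s;                        # assumption: ignore any query string
    return grep { length } split m{/}, $uri;  # skip the empty leading field
}

# Fall back to a sample path when not running under a web server.
my @dirs = path_parts($ENV{REQUEST_URI} // '/services/design/samples/index.htm');
my $last = @dirs ? $dirs[-1] : '';            # same element as $dirs[$#dirs]
print "dirs: @dirs\nlast: $last\n";
```

split already discards trailing empty fields, so a trailing slash needs no special handling here.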

Am I right so far? So now all I have left to do is actually CHECK for a filename. Filenames won't actually exist in this setup, I'm using mod_rewrite to send everything off to this script, so by rights the server will be looking for the first filename in DirectoryIndex, which in this case is "index.htm". However, for later scripts, it would be nice to check for any valid filename, excluding ones without an extension. As I said, I thought it would be *\.*, but that isn't valid (obviously now I come to look at it again!). So I looked at FileMan and then I reckoned:

m,^([A-Za-z0-9\-_.]\.[A-Za-z])$,)

...would do the job, but no. So any ideas what it would be?
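
As far as I can tell, that pattern fails because each character class matches exactly one character (there is no + after either one), and there is a stray ")" at the end. A hedged sketch of a check that might do the job follows. The name looks_like_filename is hypothetical, and "valid filename" is assumed to mean name characters, a dot, then an extension of letters and digits.

```perl
use strict;
use warnings;

# Hypothetical check: one or more name characters, a dot,
# then one or more letters or digits as the extension.
sub looks_like_filename {
    my ($name) = @_;
    return $name =~ m,^[A-Za-z0-9_.-]+\.[A-Za-z0-9]+$, ? 1 : 0;
}

my @dirs = ('services', 'design', 'samples', 'index.htm');
# Remove the last element only when it looks like a filename.
pop @dirs if @dirs && looks_like_filename($dirs[-1]);
print join('/', @dirs), "\n";
```

Names without an extension, such as "samples", do not match, so they stay in the array as directories.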

I'll build me an indexable version of dbMan if it kills me! :)

adam