
florz at florz
Sep 14, 2008, 11:04 AM
Request URI (path) normalisation
Hi,

this email arose from a discussion on #catalyst/irc.perl.org, where my (more or less) original question was what the format is of the string that Regex actions match against. As nobody really seemed to know the answer, it turned into a discussion of basic URI semantics and finally, more or less, led to the conclusion that the current implementation of Regex (at least) is probably broken.

Part of that conclusion isn't from first-hand experience on my part, but rather from Sebastian Riedel's examination of the source of the current version, AFAICT - the Debian backport package (5.7006) I am using behaves differently. So please forgive me should this invalidate parts of the following.

So, to finally get to the meat of it: according to sri's examination, Catalyst simply extracts the path component from the URI, but doesn't do any normalisation on it. This would mean that a request for http://bar/foo would have a different string matched against the regexes than a request for http://bar/f%6fo . As those two URIs are mandated to be equivalent (to refer to the same resource) by the URI RFC (3986, 2.3), this behaviour makes it pretty difficult to write standards-compliant software, as you'd have to match against ^(?:f|%66)(?:o|%6[fF]){2}$ for the example given above to meet the requirements. I've got no clue whether other action types may be affected by this, too.

The behaviour I would consider sensible would be to normalise the path in such a way that any two URI paths that are mandated by the RFC to be equivalent result in the exact same string, and any two URI paths that are not mandated by the RFC to be equivalent result in different strings. IMO, in addition, as many characters as possible should be in unescaped form after normalisation. For the path alone, that would mean that only slashes inside path components would really have to be escaped (plus the percent sign itself, as otherwise a literal "%2F" could not be told apart from an escaped slash).
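To make the idea a bit more concrete, here is a minimal sketch of just the part that the RFC mandates (3986, 2.3 and 6.2.2): escapes of unreserved characters are decoded, and the hex digits of all remaining escapes are uppercased. It is written in Python purely for illustration and is not Catalyst code; the name normalise_path is made up for this example. It deliberately does not go further and unescape reserved characters, as that would be a policy decision on top of what the RFC requires.

```python
import re

# Unreserved characters as listed in RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-escape: "%" followed by exactly two hex digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalise_path(path):
    """Hypothetical helper: normalise the percent-escapes of a raw URI path."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            # Escaped unreserved character: the RFC says it is equivalent
            # to the plain character, so decode it (%6f becomes o).
            return char
        # Anything else stays escaped; only the hex case is unified
        # (%2f becomes %2F).
        return "%" + match.group(1).upper()

    # re.sub makes a single pass, so decoded output is never decoded again.
    return _ESCAPE.sub(fix, path)


# With normalisation in front, a plain regex covers all equivalent spellings.
pattern = re.compile(r"^/foo$")
for raw in ("/foo", "/f%6fo", "/%66%6F%6f"):
    assert pattern.match(normalise_path(raw))

# An escaped slash is a reserved character: it stays escaped and therefore
# stays distinct from a real path separator.
assert normalise_path("/a%2fb") == "/a%2Fb"
```

With something like this applied before dispatch, the regex from the example above could simply be ^foo$ (or ^/foo$, depending on whether the leading slash is part of the matched string, which is exactly the thing that would need documenting).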
I assume that additionally escaping the ASCII control range might be a good idea for security reasons, given that the path may end up being used on syscall or shell interfaces. If the result is supposed to be safe for direct injection into a URI, any other URI reserved characters should probably be escaped, too. But above all, I think the important thing is consistent, documented normalisation, independent of the engine.

Well, I guess this email is somewhat open-ended so far. But I don't really know what the next step should be, so I'll leave it at that. Please don't flame me for it ;-)

Florian

_______________________________________________
Catalyst-dev mailing list
Catalyst-dev [at] lists
http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst-dev