
schmorp at schmorp
May 19, 2008, 9:20 PM
Post #27 of 52
Re: on the almost impossibility to write correct XS modules
On Mon, May 19, 2008 at 06:28:12PM -0700, Glenn Linderman <perl[at]NevCal.com> wrote:
> >>1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because
> >>it was (a) easy numerically (b) worked well on platforms that use Latin1
> >>as their native encoding.
> >
> >Which platform is that? I really don't know *any* such platform.
>
> You don't have to know of one to figure out that the present scheme
> works fine on such a platform if it exists.

True, but you stated that those platforms were the reason why the automatic conversion worked that way. If no such platform exists, your argument is moot, because nobody would implement a scheme because it is useful on no platform.

> Since it was done this way, I would assume it must have been useful
> somewhere... but perhaps it was just ASCII platforms for which it worked
> well.

Whatever an ASCII platform is, it would be fine with just about any other such conversion.

> >Note also that the automatic conversion in perl doesn't assume any
> >encoding *at all*, so this is simply not true.
>
> Perl assumes an encoding for various operations; you've stated that. My

Yes, but automatic upgrade is _not_ one of them.

> saying that Perl assumes an encoding, is simply a collection: the set of
> all Perl operations that assume an encoding.

Fine, but automatic upgrade, which is what you were talking about, is not in that set. Point being?

> The conversion of internal string formats does assume that all the
> characters representable by various numbers in the octet format
> (internal UTF8 flag turned off) convert to the same number in the
> multi-bytes format (internal UTF8 flag turned on).

Yes.

> This is equivalent to converting from Latin1 to Unicode (UTF-8) for the
> range of numbers corresponding to Unicode code points (which applies to
> all the numbers that are representable in the octet format).

No, it is not.
If the source data isn't latin1-encoded to begin with, then converting from latin1 to unicode is not a sensible operation to apply. Automatic upgrade, however, is, and that's because it does not apply any such interpretation to the scalar. This is a subtle but crucial difference.

> If you are able to disagree with that, then you are simply being
> disagreeable, which doesn't help get the bugs fixed.

"If you don't agree with me you are not helpful"? Now that's a nice strawman argument :/

> >This is not what happens. Perl simply does not assume any encoding. If
> >you have an 8-bit filename encoded in latin1 then perl doesn't treat it
> >any different than an 8-bit filename encoded in koi8-r (another "ANSI"
> >encoding).
>
> The conversion of numeric characters from an 8-bit representation to a
> UTF8 multi-byte representation within Perl is often referred to as
> "assuming a latin1 encoding" by many discussions on this list.

In an informal way, you may well do that. When talking about unicode semantics in perl, however, being so sloppy will not do, because it is important that the upgrade process works regardless of any encoding (and is reversible).

> know, and I know, that it is simply two different representations of the
> list of numbers that make up a string.

Unfortunately, perl doesn't really handle it that way. Regexes, for example, treat the same number on the perl level differently depending on how it's encoded internally. And this is a problem.

> But describing it the other way
> helps other people understand it, and it is not particularly false.

In my (not small) experience in explaining it to people, telling them "not particularly wrong" things about perl's unicode handling scares them, because they do not want perl to interpret their, say, koi8-r data as latin1 in any way.
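To make the distinction above concrete, here is a minimal sketch of what an upgrade does and does not change. The sample octets and the helper name `codepoints` are illustrative assumptions, not anything from the thread; only `utf8::upgrade`, `utf8::downgrade` and `utf8::is_utf8` are real perl functions. The last statement illustrates the regex caveat, and what it prints may depend on the perl version and on which pragmas are active, so no output is claimed for it.

```perl
use strict;
use warnings;

# Hypothetical helper: the list of code points a string holds at the perl level.
sub codepoints { return map { ord } split //, $_[0] }

# Hypothetical sample: four octets that some program treats as koi8-r text.
my $octets   = "\xf0\xd2\xc9\xd7";
my $upgraded = $octets;
utf8::upgrade($upgraded);    # switches the internal representation only

printf "flag before: %d, after: %d\n",
    (utf8::is_utf8($octets) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);
print "code points before: @{[ codepoints($octets) ]}\n";
print "code points after:  @{[ codepoints($upgraded) ]}\n";
print "eq still holds\n" if $octets eq $upgraded;

# The upgrade is reversible: downgrading restores the octet representation.
my $round_trip = $upgraded;
utf8::downgrade($round_trip);
print "round trip ok\n"
    if $round_trip eq $octets && !utf8::is_utf8($round_trip);

# The caveat: a regex may classify the same code point differently
# depending on the internal representation of the string it matches.
printf "\\w matches downgraded: %s, upgraded: %s\n",
    ($octets   =~ /^\w/ ? "yes" : "no"),
    ($upgraded =~ /^\w/ ? "yes" : "no");
```

Nothing in the upgrade step asks which 8-bit encoding the octets were in; whatever the program believed about them before the upgrade is equally true afterwards.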

> you want to convince people of things, you should attempt to use their
> terminology as much as possible, and explain the problems in a way
> they'll understand it, rather than telling them they don't know what
> they are talking about...

Well, some people, like Jan, clearly don't understand the issues. Also, my terminology is their terminology. Perl simply doesn't interpret your string as latin1 when upgrading. That's a fact, in your terminology or in mine.

> >upgrading and downgrading doesn't change that, or at least shouldn't
> >change that. where it does, it affects unix as much as any other platform.
>
> It could; are you referring to a particular version of Unix here?

No, all versions are the same here, right down to good old POSIX, or even ISO-C.

> And what is its native 8-bit encoding?

Unix, by specification, has no native (or preferred) 8-bit encoding, just like windows. There really isn't much of a difference in the 8-bit department, except that unix interprets your data much less than windows does (for example, filesystem interaction neither checks nor cares about character encodings).

> I can neither agree nor disagree with your statement here, without
> knowing more facts about the unix you are referring to.

There is only one, really, because they all work the same.

> >>Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all
> >>code that attempts to work with the constraints of 1 and 2.
> >
> >This would probably be true if 1) and 2) were real, but they are not.
>
> They are real; they are just stated in different terms than you prefer
> to use.

Sorry, but that's bullshit. 1), for example, claims this was implemented for the sake of platforms that don't exist, which is not a sensible argument. This has nothing to do with terminology. It also has nothing to do with me personally.

> >>have prevented, by example of a widely-used platform, the assumption
> >>throughout lots of Perl code, that all 8-bit data is assumed to be
> >>Latin1 implicitly.
> >
> >Perl doesn't do that anywhere on any platform, to my knowledge. Make an
> >example of a platform that expects filenames as latin1.
>
> Every time Perl alters the internal UTF8 flag, and correspondingly the
> representation of the string data, it makes the assumption that there is
> no numeric difference between the octet encoding and the multi-bytes
> encoding.

Exactly. It makes no assumption about the character encoding itself, because the function is encoding-agnostic. It doesn't interpret your data as latin1.

> The only character sets for which this is true is Latin1 and
> Unicode, AFAIK

It is true for other encodings as well, such as ascii. In fact, here is a good example of why forcing an encoding interpretation on upgrading/downgrading is wrong: the assumption of no numeric difference between upgrading and downgrading is true for *any* 8-bit encoding and it is also true for *any* codeset, simply because the numbers do not change.

If you have koi8-r data (which is not compatible with latin1), then upgrading and downgrading will not alter the fact that it is koi8-r data (in current perls, and outside e.g. the buggy win32 module, which enforces a different interpretation and breaks if strings get upgraded).

This is why your enforcing of such an interpretation is scary: many people still handle such data, and they need the assurance that perl doesn't tinker with their characters silently.

The transformation as it is was not chosen because of your two reasons. It was chosen because it doesn't alter the string on the perl level. If you take a string, upgrade it and dissect it, it will contain the same codepoints.

Any other transformation (like the one proposed by Jan) doesn't have this property, and since it isn't documented when perl does these upgrades and downgrades, this is exactly why that proposal is broken by design.

If it was accompanied by making upgrades and downgrades *explicit*, i.e.
perl would die when you concatenated an upgraded and a downgraded string, or would never silently upgrade/downgrade on its own so that you always have to force it manually, then this model would become workable, at the expense of making perl strings type-ful, as there would be two incompatible string types. Of course, this wouldn't be very perl.

> >(you can select this under unix, yes, but you can do so under windows as
> >well).
>
> So there you have answered your own question about platforms.

Yes, all humans are chinese because all chinese are humans. I said that as a special case you can make it true, but in general it is not.

> issue arises because Perl for Windows does not require Windows to be
> configured to use Latin1 as the default code page;

And neither does it under unix. So your argument is wrong again, because the issue does not arise because of "anything windows" at all.

> neither does it
> convert to or from Latin1 (or anything else) when calling Windows APIs;

And neither does it on unix.

> but it does assume numerical equality when converting between octet and
> multibytes strings, and that is only valid for Latin1 and Unicode.

No, numeric equality is true for every encoding in the world that uses codepoints <256 - think about it. For example, the number 177 means the same koi8-r character regardless of whether it was upgraded or not. This is the property that is useful. Having latin1 is not that useful, and it is not implemented in perl anyway (see regexes, for example).

> Hence, it assumes Latin1 during that conversion.

Wrong. It doesn't do so. The conversion used was chosen because it doesn't change codepoints - character 177 stays character 177 - not because latin1 is particularly important for any specific platform.

> you read it...
> but I would be interested, if, setting aside the
> disagreements you stated above, if you think a scheme such as I outlined
> could be a helpful solution for Perl, using your mental model of
> strings, implicit internal format conversions, and such, which I think
> is reasonably accurate, even if it doesn't use the same terminology that
> most people on this forum use.

Well, to me, this is a mailing list, but maybe my terminology is wrong there, too. I do think I use the same terminology as everybody else here. I am just being more exact in what I say, because if you are sloppy, you fail to communicate the important differences and fall into traps like you did above, because you couldn't escape the "character encoding" mental model.

As for your points... as I outlined, the problem is not so much backwards compatibility - perl 5.6 is totally different from 5.8, and 5.8 differs from 5.10 in many such encoding issues. The problem is mainly bugs, so while the concern is valid, I don't see how one could keep compatibility, because the question is what to keep compatibility with - 5.8, 5.6, 5.005, 5.10? Choose one, all are different.

I am also not sure whether programs rely on the broken semantics - my experience is that e.g. reading a filename (%APPDATA%) from an environment variable and trying to access files that way doesn't work when ansi and unicode disagree on encoding (which is the case even on my latin1 system, btw.)

But then, perl on windows is differently broken depending on which perl you use - activestate has a really broken fork, for example, and handles filenames differently than other perls on windows.

I am not sure how many people really rely on that behaviour, and I am not sure whether this couldn't just be fixed by enforcing a single encoding. But my experience is limited - I know the windows APIs and the problems associated with not having a single format in which to store filenames.
On unix, this is positively better, as there is only ever a single format to store filenames in that works regardless of locale (the problems start when you interpret these filenames).

So I will only comment on E and F.

I think the pragma already exists, namely "use locale". If I "use locale" in my program, I would expect perl to apply the current locale to any strings, in regexes or elsewhere (to the extent possible).

If I don't "use locale", then I would expect regexes to interpret my strings as unicode, regardless of the utf-8 flag, which I can't see in my source (the "surprising" behaviour).

Regarding filenames, this is very easy on unix: all filenames are interpreted as octet strings with no specific encoding (perl cannot know the encoding of filenames on unix), so the functions all have to downgrade, and if that fails, we have a bug (filenames are not locale-dependent on unix, they are simply octet strings where only "/" and \000 are interpreted). (If it does not fail, it might still be a bug, but we cannot detect this.)

I know "use locale" has weird side effects, but it basically boils down to what perluniintro calls "native 8-bit encoding" (fortunately, it is not even limited to 8-bit).

Even if there were a need for a new pragma, I wouldn't call it "compatibility", because both behaviours are useful. The difference is that I can control which interpretation is applied to my strings and do not have to rely on an invisible flag on my scalars.

But then, "locale" maps exactly onto the concept of "native encoding", because my unix process might run in a locale using koi8-r, and then I would want a way to take advantage of the locale w.r.t. interpreting my koi8-r data. (Do not get confused by the mention of POSIX in the locale manpage; locales are an ISO-C thing and ought to exist on windows as well.)
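The remark above that filename functions have to downgrade, and that a failed downgrade indicates a bug, could be sketched roughly as follows. The helper name `filename_octets` and the sample names are hypothetical; this is not how perl's builtins are implemented, only an illustration that uses the real `utf8::downgrade` with its optional second argument, which makes it return false instead of dying.

```perl
use strict;
use warnings;

# Hypothetical helper: return the octet form of a filename, or die if the
# name contains code points above 255 and so has no octet representation.
sub filename_octets {
    my ($name) = @_;
    my $octets = $name;              # work on a copy, leave the caller's string alone
    utf8::downgrade($octets, 1)      # true second argument: return false on failure
        or die "filename contains characters above 255\n";
    return $octets;
}

# A name that got upgraded somewhere along the way still yields the same octets.
my $name = "caf\xe9.txt";
utf8::upgrade($name);
my $octets = filename_octets($name);
print length($octets), " octets, utf-8 flag ",
    (utf8::is_utf8($octets) ? "on" : "off"), "\n";

# A name with a code point above 255 cannot be downgraded.
my $ok = eval { filename_octets("\x{263a}.txt"); 1 };
print "wide name rejected\n" unless $ok;
```

Note that the helper never guesses an encoding: it either recovers the exact octets or reports that no octet form exists.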
So for me, this is not a compatibility issue - right now, I don't think anybody relies on the utf-8 flag behaviour in perl (a great deal has changed between 5.6 and 5.8, and less has changed between 5.8 and 5.10, so those programs need fixing already).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      pcg[at]goof.com
      -=====/_/_//_/\_,_/ /_/\_\