
tchrist at perl
May 19, 2008, 11:07 PM
Post #5 of 19
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
In his epistle of "Tue, 20 May 2008 06:56:10 +0200."
<20080520045610.GB16896[at]schmorp.de>
Marc Lehmann <schmorp[at]schmorp.de> graciously explained:

> On Mon, May 19, 2008 at 06:54:39PM -0600,
> Tom Christiansen <tchrist[at]perl.com> wrote:

> [Thanks for summarising all the possible fears :]

Oh, there might be more, you know. Haven't thought much on it.
Those were just the ones that came to mind, both the relevant
and the ir-.

>>> part of the codebase) assuming this is just an internal flag,
>>> as originally designed, will kill perl in the long run.

>> The end of the world is near, eh?

> I meet a lot of people who would like to use Unicode in perl, but fail
> to do so because they run into the problems mentioned and claim it
> should be much easier (yes, it should, and certainly less random).

> But almost all of the issues they run into, *iff* they really want to
> use Unicode and are open to learning a bit about it first originate in
> the utf-8 flag that they cannot see in their perl sources yet that
> affects so many things.

I see you've been talking with Phil Harvey again. :-)

> Basically, if you don't have it set right, stuff breaks everywhere,
> whether it's perl core functions or XS modules, but the brakage [.SIC:
> probably meant to read "breakage" unless one's foot hits the brakes
> instead of the accelerator --tchrist] is not universal (if it were, it
> would simply be a different model; the problem is the inconsistency).

But is it a foolish one for little minds to worry about, or a great
one for bigger minds to mull over?

I believe that Phil, for example, due perhaps to such things as you
allude to, tries quite hard to be Unicode-agnostic. By that I mean
he insistently uses byte-interfaces only, even though he sometimes
has to encode or decode byte-data into Unicodepoints.

That said, I always feel there's something *WRONG* if I find myself
having to resort to Encode's encode or decode functions. Can't quite
say why.
>>> Regarding the perlunicode manpage, it is basically a helpless case.

>> "Helpless"?

>> More rhetorical hyperbole, I presume, since if it is incorrectly
>> worded, the road to helping it is obvious: just send patches.

> I certainly won't send patches if people tell me before I submit
> them that the current manpage is correct. I can waste my time in
> better ways.

Wise lesson, that. However, it never stopped me from doing so.
It's like how all change to the world comes from unreasonable people.

> I would probably submit patches if the process to do so would be
> easier, and the first step would be an agreement of the existing
> perl5-porters on how strings are to be interpreted.

That might require further instruction so that we can all be on
the same, um, code page.

> Note I am not asking for agreement on how it *should* be done or what
> would be better, but an agreement on which semantics will be acceptable
> and which are not.

>>> This is completely untrue: in earlier releases, "use utf8/use
>>> bytes" switched between interpreting the strings as utf-8 vs.
>>> bytes, and did nothing about Unicode-awareness.

>> *That*, I think, is more a matter of casuistry than of correctness.

> Maybe, but I still expect *one* manpage to be consistent to itself - if it
> defines operation as one thing and contrasts it with some other meaning of
> operation then they better should be the same thing - if you compare, then
> apples to apples and oranges to oranges.

First, those are hardly category errors. After all, would you
not agree that:

    * Both are fruit.
    * Both are juicy.
    * Both are often served juiced with breakfast.
    * Both start green and then usually fall somewhere into
      red-orange area of the visible spectrum.
    * Both are usually of a similar size.
    * Both are of topologically equivalent shape.

So for more punch, you might sometimes consider trying the alternate
aphorism of comparing apples with avarice, or oranges with oratory.
Just an idea.
:-)

>>> Perl currently implements a model where encoding is *not*
>>> attached to perl scalars,

Second, I am in need of deeper understanding, or more sleep, to see
how my statement regarding casuistry does not apply.

>> Right: it's supposed to be attached to the I/O Layer alone.

> More correctly to the interfaces communicating with the outside world,
> which includes other things as well (for example filenames, or XS
> modules).

Well...

At one level, nearly all meaningful communication "with the outside
world" falls within the category of being I/O, with signals and exit
status being the most common exceptions. And timing attacks don't
count. :-)

But I still think that you are asking a lot if you want to make the
claim that filenames as used to access the system's underlying files
VIA ITS OWN INTERFACES are data rather than metadata. And I don't
think that filesystem metadata is reliably treated as anything but
bytes, at least on systems with which I am conversant.

Sure, their contents are certainly data, but even that has its limits.
The restrictions under the BUGS section of the perl(1) manpage still
apply: you are for the most part at your system's mercy. If it
provides byte-access seeks only, not variable-width utf-8 encoded
positions, you can't do much about that. Well, not much that I'd
care to do, at least.

> But the basic theme is indeed "I/O" here - one way to treat characters
> is to encode/decode them when they leave/enter perl and use Unicode
> semantics within.

That sounds sane.

>>> and neither is *Unicodeness* attached to perl scalars.

>> And now you've lost me.

>> If that is universally true, then might you gently explain how
>> an SV set to chr(500) always has its UTF8 flag turned on?

> Easy, it is the only way for perl to internally represent characters
> with a value of 500.

Of course.
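For anyone following along, the flag behaviour under discussion can be
poked at directly. This is only a minimal sketch, assuming perl 5.8.1
or later with no `use encoding` or locale pragma in effect; the
variable names are illustrative and not from either poster.

```perl
use strict;
use warnings;

# chr(500) does not fit in one byte, so perl must store it UTF-8 encoded.
my $wide = chr(500);

# chr(32) fits in one byte and is normally stored without the flag.
my $space = chr(32);

# The same character can also be stored with the flag on.
my $upgraded = $space;
utf8::upgrade($upgraded);

printf "chr(500) flag: %s\n", utf8::is_utf8($wide)     ? "on" : "off";  # on
printf "chr(32)  flag: %s\n", utf8::is_utf8($space)    ? "on" : "off";  # off
printf "upgraded flag: %s\n", utf8::is_utf8($upgraded) ? "on" : "off";  # on

# String equality compares characters, not the internal representation.
printf "still equal:   %s\n", $space eq $upgraded ? "yes" : "no";       # yes
```

The utf8::is_utf8 and utf8::upgrade functions are available without
loading the utf8 pragma. The sketch shows only that the flag tracks the
internal storage format; it does not settle what the flag ought to mean.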
> If that 500 is for example the second character in v5.500, then this
> might simply be the perl 5.500 version string (or part of an ip
> address in the game "uplink" stored in a compact string form, e.g.
> v478.321.571.277).

I'm a little bit queasy about v-strings, thank you very much.

> No Unicode anywhere in sight.

Did I say there was?

> Note also that I can have the Unicode character 32 stored with or
> without the UTF8 flag, which doesn't change the fact that it is
> still the Unicode character 32.

Oh, now that I'm not sure I agree with. But I fear we may be
back to casuistry again.

> This is the insane part - I wouldn't expect even an expert perl
> programmer to predict how $s gets interpreted here.

No, neither would I.

> So, my "end of the world" in a more verbose way and less drastic,
> would be "as long as perl has this totally unpredictable rules on
> character interpretation it will not gain wide acceptance for
> Unicode usage".

>> Armed with that output, it seems to me that if you are correct,
>> then UTF8 is not "Unicodeness", but that if it is, then you are
>> not correct.

> I think the examples above made it clear that the UTF8
> flag is not "Unicodeness".

I think I'd better sign off. Perhaps sleep will make your
statement obviously true to me. It isn't now.

>> * Start by explain just what it is that you are calling
>> Unicodeness.

> Not sure what you mean - Unicodeness is the state of being Unicode.

That's either a trivial tautology of no significance, or
something deeper than I can now fathom.

> More explicitly, a perl string contains Unicode characters when it
> contains Unicode characters

Now *THAT* belongs in the formerly mentioned set.

> - the interpreter itself does not know this currently, nor do I
> see a way for it to do so except by forcing the user to make
> this explicit.

Ever read in a string, or grabbed something from @ARGV or %ENV,
that you had to do this to:

    $num = oct($num) if $num =~ /^0/;

And if you did, did this "bother" you?
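As a rough illustration of that idiom, here is a small sketch of making
the interpretation of an incoming string explicit. The helper name
normalize_number is hypothetical, invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: treat a leading "0" the way perl source would,
# so "0755" is octal, "0x1F" is hex and "0b101" is binary.
sub normalize_number {
    my ($num) = @_;
    $num = oct($num) if $num =~ /^0/;
    return $num;
}

print normalize_number("0755"),  "\n";   # 493
print normalize_number("0x1F"),  "\n";   # 31
print normalize_number("0b101"), "\n";   # 5
print normalize_number("42"),    "\n";   # 42
```

Perl has no way to know that the string "0755" was meant as octal; the
caller has to say so, much as a caller has to say which bytes are
meant as encoded characters.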
> For example, "open" (at least on Unix) cannot support Unicodeness,
> because the system interfaces do not allow for Unicode to be used -
> you have to encode it, and it is not clear which encoding is the
> right one.

> So open would be an operation that enforces octet semantics, because
> that's what the system interface relies on.

Well, there you have it then, don't you?

Good night.

/* HIC JACENT VERBA DELETA */

>> Thank you.

> Nice to have you around again.

Oh sure, *now* you say that. Just wait. :-)

Anyway, it's SUMMER, and this is a fluke; I shouldn't even be here.

--tom