
rgs at consttype
Dec 15, 2009, 2:23 PM
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?
2009/12/13 karl williamson <public [at] khwilliamson>:
> I'm inclined to think not.

So do I, finally, after all the discussion about backwards compatibility.

> I just think it is too much to spring on people with no real warning. As
> noted before, several CPAN modules that are in blead failed with this
> change. Gerard has said that Kurila experienced these same module failures,
> but I haven't heard back from him about what others had the same pattern.
>
> Even if all the failures are just bugs that are getting exposed for the
> first time, but could crop up anytime with the right sets of input, I think,
> similar to Jesse, that we shouldn't be the apparent breakers of a
> bunch of CPAN.
>
> So here is my proposal:
> 1) We continue to have 'use legacy'. I agree with Rafael and Aristotle
> about this.
>
> 2) I will submit a patch that just flips the default. People will for the
> first time not have to do a utf8::upgrade or a Unicode::Semantics all over
> the place to get the new effect. They can just do a 'no legacy' at the
> beginning of their program to get the effect, except, unfortunately, for
> modules outside their control.
>
> 3) We announce in perldelta, perhaps other places, that the plan is to flip
> the state in 5.14.

I don't think it's feasible to flip the default state of a pragma from
one release to another -- at least not without a very explicit version
requirement (as in C<use 5.12.0>). Without that precaution, it would
lead to too much confusion.

I also think that the advantage of having a legacy.pm in addition to
feature.pm is only real when the C<use legacy> behaviour is not on by
default. That's the main factor that differentiates legacy features
from new features.

So, I would think that it would be better to remove legacy, and make
"unicode8bit" a feature, in the sense of feature.pm. Also, that means
that it would be loaded by default by C<use 5.12.0>, like other
features. (The problem here being qr// regexes that leak to other
scopes in which the feature is not in effect.)

Alternatively, or additionally, a new //u switch could turn on the new
behaviour per-regex.

> 4) My patch for regex case-sensitive matching be placed into blead, knowing
> that it is not the default, and we document the flaw that Yves has mentioned
> that an already compiled re that is compiled into a surrounding one will
> have the surrounding's state. Note that this is not a problem if someone
> just has the one call to 'no legacy' at the beginning of their code, as the
> state will be constant throughout their code.
>
> 5) I will find time in the next few days to work on a patch for case
> insensitive matching similar to the one previously submitted for case
> sensitive. It will suffer from the same flaw. To hopefully please Jesse, I
> will first submit a patch that extends the fold testing .t to many more
> cases that show, I'm afraid, many more flaws in the existing scheme, not
> limited to the Unicode bug. Here's an example that surprised me:
>
> ./perl -I./lib -E 'my $c = chr(0xe0); utf8::upgrade $c; say $c =~ /\x{c0}/i'
>
> doesn't print 1, even though the string is in utf8. Swapping the c0 and e0
> does work.
>
> 6) I have given up for now on fixing the user-defined case overriding to not
> be sensitive to utf8ness. It turns out it is misdocumented; it isn't as
> restrictive as it says it is. Instead of it having to be on a global level,
> it actually is on a package level. It's been many years since I have had to
> worry about real-time issues, and processor speeds and apparently optimizers
> have improved significantly since then. So, perhaps I'm overly conservative
> here. I thought it would be an acceptable slow-down to add a quick test for
> every call to uc() etc to check if a global case override has been defined.
> But the testing becomes more intensive when it might not be a global.
> Correct me if I'm wrong. Should we penalize everyone (unless the penalty
> is lighter than I think) for a feature that we're not sure is used at all?
> A deficiency of this feature is that you can't just override a few
> mappings. If you override any, you must furnish a complete set of casings.
> We tell you how to find the current complete set as an aid for that, but
> still it's a pain. But on the other hand, if the function did know that
> there is no override mapping defined (which will be the case 99.9999% of the
> time), it could save time for code points in the Latin1 range which would
> not have to go out to utf8_heavy.pl.
>
> If this proposal is acceptable, programmers could have the Unicode bug at
> bay in 5.12.
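
For anyone following along, here is a minimal sketch of the behaviour being discussed: the same Latin1 character case-maps differently depending on its internal representation, and a lexically scoped feature or a per-regex switch can force Unicode rules. It is only an illustration, not the patch discussed above. It assumes a Perl 5.14 or later, where the feature ended up being spelled 'unicode_strings' (the working name in this thread is "unicode8bit") and where a /u regex modifier exists. The variable names are made up for the example.

```perl
use strict;
use warnings;

# Same character (U+00E0, a with grave), two internal representations.
my $native   = chr(0xE0);      # stored as a single byte
my $upgraded = chr(0xE0);
utf8::upgrade($upgraded);      # same string value, UTF-8 internal storage

# The two strings compare equal ...
my $same = $native eq $upgraded;

# ... but without the feature, uc() only applies Unicode rules
# to the upgraded copy.
my $uc_native   = uc $native;      # stays U+00E0
my $uc_upgraded = uc $upgraded;    # becomes U+00C0

# Lexically scoped feature: Unicode rules regardless of representation.
# (Assumption: 'unicode_strings' is the name used by released Perls.)
my ($uc_feature, $scoped_re);
{
    use feature 'unicode_strings';
    $uc_feature = uc $native;
    $scoped_re  = qr/\x{c0}/i;     # compiled while the feature is in effect
}

# Per-regex switch: /u asks for Unicode rules for this pattern only.
my $match_default = $native =~ /\x{c0}/i  ? 1 : 0;
my $match_u       = $native =~ /\x{c0}/iu ? 1 : 0;

# The qr// object is used here outside the scope it was compiled in.
my $match_scoped  = $native =~ /^$scoped_re$/ ? 1 : 0;

printf "equal strings:        %s\n", $same ? "yes" : "no";
printf "uc(native):           U+%04X\n", ord $uc_native;
printf "uc(upgraded):         U+%04X\n", ord $uc_upgraded;
printf "uc(native) + feature: U+%04X\n", ord $uc_feature;
printf "/i match:             %d\n", $match_default;
printf "/iu match:            %d\n", $match_u;
printf "qr// from the scope:  %d\n", $match_scoped;
```

The last check is the scoping question raised above: on the Perls I have in mind, the compiled qr// keeps the rules it was compiled with when it is interpolated elsewhere, but treat that as something to verify on your own build rather than a guarantee.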