
rgarciasuarez at gmail
Nov 14, 2009, 3:08 PM
Post #2 of 4
Re: PATCH: Add code to solve the casing portion of the "Unicode bug"
2009/11/13 karl williamson <public [at] khwilliamson>:
> Attached is a patch that adds code to almost fix the case changing portion
> of the "Unicode bug" (see
> http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182). This is a
> significant portion of perltodo's UTF-8 revamp.

Thanks, applied as 00f254e235ff10d6223aa9a402ad5b7a85689829.

> What this means is that characters whose ordinals on ASCII machines are in
> the 128-255 range will have Unicode semantics as far as case changing goes
> regardless of whether they are encoded in utf8 or not. The reason I say it
> 'almost' fixes this is that any user-defined case mapping is still not
> called unless the scalar is in utf8. See below for a discussion on that.
>
> I am leaving the legs that implement this behavior disabled by default until
> the smokes show that the new stuff doesn't break anything; then I'll flip
> the bit to enable it. If you want to play with it in the meantime, you can
> say 'no legacy "unicode8bit"';

Prudence is good. (So currently in blead the legacy-unicode8bit pragma
has the reverse of the meaning it will have in 5.12.)

> I've proposed changing the name of this; so that may happen, but it doesn't
> affect the heart of the code, so I'm delivering it now; it's easy to change
> the name later.

I still like that name; the documentation might need improvements though.
When I read this line in the SYNOPSIS:

    use legacy ':5.10'; # Keeps semantics the same as in perl 5.10

as a perl user, I'm not certain of how that works, because it's not
clear that the doc itself depends on a version strictly greater than
5.10. Maybe it's not such a good idea to use versioned bundles like
feature.pm does.

> I don't understand Perl magic. So I have tried to avoid touching anything
> around that. But there are a couple comments from the old code that make me
> think I really don't understand what's going on in regard to that. I'm now
> thinking the comments are obsolete.
> If someone would care to look at them,
> they are at lines 3963 and 4227 in pp.c and both say the same thing:
> "Overloaded values may have toggled the UTF-8 flag on source, so we need to
> check DO_UTF8 again here". This doesn't make sense to me based on the code
> earlier in the functions they occur in, so I think they are wrong.

I'll look, but I don't promise to understand either.

> There are minor changes in several files: macros are created in headers to
> access the bit and look up the case change mappings. perl.h has two tables
> added with the mappings for all 256 Latin1 characters that give
> respectively, 1) their lowercased values; and 2) their upper and title cased
> values. I removed trailing blank space in all the submitted files.
>
> There are significant changes in the casing functions in pp.c to accommodate
> this new behavior. Basically, if the bit is set, the case change is looked
> up via these tables instead of the existing mechanism. If the bit is off or
> 'use locale' or 'use bytes' is in effect the existing mechanism is used.
>
> Complications arise because three characters in the latin1 range require
> special handling when title and uppercasing them. Two of them have the case
> change out-of-range, which means that the result has to be converted to
> utf8, and the other expands to two in-range characters. The code in the
> ucfirst() function was rearranged to compute the case change first, so as to
> know if the length of the result changes. If it doesn't change, and the
> scalar isn't read-only etc, then only the first character needs to be
> touched. Also, the uc() function has to cope with the possibility that in
> mid-stream it will have to decide to upgrade the result to utf8.
>
> I tried to make this efficient, and not slow things down from what they are
> now. It may actually be faster than the current implementation in general
> because it does a table look up, which avoids some tests.
> One test was added in the inner loop for upper casing, but then again the
> table lookup took out two tests.
>
> There are two things I started to implement, but left #ifdef'd out
> currently:
>
> One of them implements the context sensitive casing that Unicode defines.
> This is not implemented because I need more time to think about things; for
> one, Unicode has also recently revised their guidelines on this and I
> haven't looked at the new ones.
>
> The other would change the case of a utf8-encoded character in the Latin1
> range by using the built-in tables without having to go to the swash. It is
> disabled because it would break the ability of user-defined case mappings
> overriding the default behavior for characters in that range.
>
> Which brings us to the topic of these user-defined case-changing mappings.
> I just haven't gotten around to figuring out how to tell if such a function
> is in existence or not. Currently, they must be in main::. Since this is a
> very obscure corner of the language, I'm deferring fixing it until later.
> Rafael gave me one hint about how to figure out if such a mapping function
> exists, but I haven't pursued it yet. I understand that Zefram has been
> working on lexically-scoped subroutines, so that could affect this.

I agree with the deferring. Is anyone aware of any code using
user-defined case mappings, by the way?
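
For anyone following the description of the Latin1 tables and the three special characters, here is a minimal sketch in C of how a table-driven uppercase lookup with those exceptions might look. It is not the code from perl.h or pp.c: the names upper_latin1, init_upper_latin1 and upper_one are hypothetical, the table is filled at run time rather than written out as a 256-entry initializer, and an ASCII platform is assumed. The three exceptions follow the Unicode mappings: U+00DF uppercases to the two characters "SS", while U+00FF and U+00B5 uppercase to U+0178 and U+039C, both of which lie outside the 0-255 range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the code from perl.h or pp.c. */

static unsigned char upper_latin1[256];  /* in-range uppercase mappings */
static int table_ready = 0;

/* Fill the table once (ASCII platform assumed). Characters without a
 * single in-range uppercase form (0xB5, 0xDF, 0xFF) map to themselves
 * here and are special-cased in upper_one(). */
static void init_upper_latin1(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        if (c >= 'a' && c <= 'z')
            upper_latin1[c] = (unsigned char)(c - 32);
        else if (c >= 0xE0 && c <= 0xFE && c != 0xF7)  /* 0xF7 is the division sign */
            upper_latin1[c] = (unsigned char)(c - 32);
        else
            upper_latin1[c] = (unsigned char)c;
    }
    table_ready = 1;
}

/* Uppercase one Latin1 character using Unicode rules.
 * Writes 1 or 2 code points into out and returns how many were written.
 * Sets *needs_utf8 to 1 when a result code point is above 0xFF. */
static size_t upper_one(unsigned char c, unsigned int out[2], int *needs_utf8)
{
    assert(out != NULL && needs_utf8 != NULL);
    if (!table_ready)
        init_upper_latin1();
    *needs_utf8 = 0;
    switch (c) {
    case 0xDF:                  /* sharp s expands to two in-range characters */
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    case 0xFF:                  /* y with diaeresis: result is out of range */
        out[0] = 0x178;
        *needs_utf8 = 1;
        return 1;
    case 0xB5:                  /* micro sign: result is Greek capital mu */
        out[0] = 0x39C;
        *needs_utf8 = 1;
        return 1;
    default:                    /* everything else is a plain table lookup */
        out[0] = upper_latin1[c];
        return 1;
    }
}
```

A caller looping over a byte string would not know until it meets 0xFF or 0xB5 that the output has to be utf8, which is the mid-stream upgrade mentioned for uc(). The two-character result for 0xDF is why computing the case change before touching the buffer is useful in ucfirst().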