
Mailing List Archive: Perl: porters

PATCH: Add code to solve the casing portion of the "Unicode bug"

 

 



public at khwilliamson

Nov 12, 2009, 10:00 PM

Post #1 of 4
PATCH: Add code to solve the casing portion of the "Unicode bug"

Attached is a patch that adds code to almost fix the case changing
portion of the "Unicode bug" (see
http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182). This is a
significant portion of perltodo's UTF-8 revamp.

What this means is that characters whose ordinals on ASCII machines are
in the 128-255 range will have Unicode semantics as far as case changing
goes regardless of whether they are encoded in utf8 or not. The reason
I say it 'almost' fixes this is that any user-defined case mapping is
still not called unless the scalar is in utf8. See below for a
discussion on that.

I am leaving the legs that implement this behavior disabled by default
until the smokes show that the new stuff doesn't break anything; then
I'll flip the bit to enable it. If you want to play with it in the
meantime, you can say 'no legacy "unicode8bit"';

I've proposed changing the name of this; so that may happen, but it
doesn't affect the heart of the code, so I'm delivering it now; it's
easy to change the name later.

I don't understand Perl magic. So I have tried to avoid touching
anything around that. But there are a couple comments from the old code
that make me think I really don't understand what's going on in regard
to that. I'm now thinking the comments are obsolete. If someone would
care to look at them, they are at lines 3963 and 4227 in pp.c and both
say the same thing: "Overloaded values may have toggled the UTF-8 flag
on source, so we need to check DO_UTF8 again here". This doesn't make
sense to me based on the code earlier in the functions they occur in, so
I think they are wrong.

There are minor changes in several files: macros are created in headers
to access the bit and look up the case change mappings. perl.h has two
tables added with the mappings for all 256 Latin1 characters; they give,
respectively, (1) their lowercased values and (2) their upper- and
title-cased values. I removed trailing blank space in all the submitted files.

There are significant changes in the casing functions in pp.c to
accommodate this new behavior. Basically, if the bit is set, the case
change is looked up via these tables instead of the existing mechanism.
If the bit is off or 'use locale' or 'use bytes' is in effect the
existing mechanism is used.
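
A minimal sketch of the table-lookup idea, not the code from the patch. The names latin1_lc_table, init_latin1_lc_table, lc_latin1 and the unicode_8bit parameter are hypothetical, and the table is filled at run time here only to keep the example short. The "existing mechanism" is reduced to ASCII-only lowercasing, and locale and 'use bytes' handling is left out.

```c
#include <stddef.h>

/* Hypothetical 256-entry table: index is a Latin1 code point,
   value is its lowercase form (always still within Latin1). */
static unsigned char latin1_lc_table[256];

/* Fill the table once. A real implementation would more likely
   use a static initializer in a header. */
static void init_latin1_lc_table(void)
{
    for (unsigned i = 0; i < 256; i++) {
        unsigned char lc = (unsigned char)i;
        if (i >= 'A' && i <= 'Z')
            lc = (unsigned char)(i + 0x20);      /* ASCII letters */
        else if (i >= 0xC0 && i <= 0xDE && i != 0xD7)
            lc = (unsigned char)(i + 0x20);      /* U+00C0..U+00DE, skipping the multiplication sign */
        latin1_lc_table[i] = lc;
    }
}

/* Lowercase a non-UTF-8 byte string in place.
   unicode_8bit != 0 : Unicode semantics via one table lookup per byte.
   unicode_8bit == 0 : simplified legacy behaviour, ASCII letters only. */
static void lc_latin1(unsigned char *s, size_t len, int unicode_8bit)
{
    for (size_t i = 0; i < len; i++) {
        if (unicode_8bit)
            s[i] = latin1_lc_table[s[i]];
        else if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (unsigned char)(s[i] + 0x20);
    }
}
```

With the flag on, the inner loop is a single indexed load per byte, with no range tests.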

Complications arise because three characters in the latin1 range require
special handling when title and uppercasing them. Two of them have the
case change out-of-range, which means that the result has to be
converted to utf8, and the other expands to two in-range characters.
The code in the ucfirst() function was rearranged to compute the case
change first, so as to know if the length of the result changes. If it
doesn't change, and the scalar isn't read-only etc, then only the first
character needs to be touched. Also, the uc() function has to cope with
the possibility that in mid-stream it will have to decide to upgrade the
result to utf8.
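
The mid-stream upgrade can be sketched as below. This is an illustration under stated assumptions, not the patch's code. It assumes the three special characters are U+00B5 (uppercase U+039C), U+00FF (uppercase U+0178) and U+00DF (uppercase "SS"), which is what the Unicode data says as far as I know. All function names are hypothetical. The caller must supply a destination buffer of at least 2 * len bytes.

```c
#include <stddef.h>

/* Uppercase one Latin1 code point. The result may be above 0xFF.
   U+00DF is handled by the caller because it expands to two characters. */
static unsigned latin1_upper_cp(unsigned char c)
{
    if (c >= 'a' && c <= 'z') return c - 0x20u;
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7) return c - 0x20u;
    if (c == 0xB5) return 0x039C;   /* MICRO SIGN -> GREEK CAPITAL MU */
    if (c == 0xFF) return 0x0178;   /* y with diaeresis -> capital form */
    return c;
}

/* Convert n Latin1 bytes in buf to UTF-8 in place, working backwards
   so unread input is never overwritten. Returns the new length. */
static size_t upgrade_in_place(unsigned char *buf, size_t n)
{
    size_t high = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] >= 0x80) high++;
    size_t w = n + high;
    for (size_t i = n; i-- > 0; ) {
        unsigned char c = buf[i];
        if (c >= 0x80) {
            buf[--w] = (unsigned char)(0x80 | (c & 0x3F));
            buf[--w] = (unsigned char)(0xC0 | (c >> 6));
        } else {
            buf[--w] = c;
        }
    }
    return n + high;
}

/* Write a code point below 0x800 as UTF-8; returns bytes written. */
static size_t put_cp(unsigned char *d, unsigned cp)
{
    if (cp < 0x80) { d[0] = (unsigned char)cp; return 1; }
    d[0] = (unsigned char)(0xC0 | (cp >> 6));
    d[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

/* Uppercase a Latin1 string into dst (capacity >= 2 * len).
   *is_utf8 is set to 1 only if an out-of-range result forced an upgrade. */
static size_t latin1_uc(const unsigned char *src, size_t len,
                        unsigned char *dst, int *is_utf8)
{
    size_t d = 0;
    *is_utf8 = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c == 0xDF) {                 /* sharp s expands to "SS" */
            dst[d++] = 'S';
            dst[d++] = 'S';
            continue;
        }
        unsigned cp = latin1_upper_cp(c);
        if (cp > 0xFF && !*is_utf8) {    /* decide mid-stream to upgrade */
            d = upgrade_in_place(dst, d);
            *is_utf8 = 1;
        }
        if (*is_utf8)
            d += put_cp(dst + d, cp);
        else
            dst[d++] = (unsigned char)cp;
    }
    return d;
}
```

The common case (no out-of-range result) never leaves the byte-for-byte path. Only a string that actually contains one of the two out-of-range characters pays for the conversion.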

I tried to make this efficient, and not slow things down from what they
are now. It may actually be faster than the current implementation in
general because it does a table look up, which avoids some tests. One
test was added in the inner loop for upper casing, but then again the
table lookup took out two tests.

There are two things I started to implement, but left #ifdef'd out
currently:

One of them implements the context sensitive casing that Unicode
defines. This is not implemented because I need more time to think
about things; for one, Unicode has also recently revised their
guidelines on this and I haven't looked at the new ones.

The other would change the case of a utf8-encoded character in the
Latin1 range by using the built-in tables without having to go to the
swash. It is disabled because it would break the ability of
user-defined case mappings to override the default behavior for
characters in that range.

Which brings us to the topic of these user-defined case-changing
mappings. I just haven't gotten around to figuring out how to tell if
such a function is in existence or not. Currently, they must be in
main::. Since this is a very obscure corner of the language, I'm
deferring fixing it until later. Rafael gave me one hint about how to
figure out if such a mapping function exists, but I haven't pursued it
yet. I understand that Zefram has been working on lexically-scoped
subroutines, so that could affect this.
Attachments: 0001-add-code-for-Unicode-semantics-for-non-utf8-latin1-c.patch (53.8 KB)


rgarciasuarez at gmail

Nov 14, 2009, 3:08 PM

Post #2 of 4
Re: PATCH: Add code to solve the casing portion of the "Unicode bug"

2009/11/13 karl williamson <public [at] khwilliamson>:
> Attached is a patch that adds code to almost fix the case changing portion
> of the "Unicode bug" (see
> http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182).  This is a
> significant portion of perltodo's UTF-8 revamp.

Thanks, applied as 00f254e235ff10d6223aa9a402ad5b7a85689829.

> What this means is that characters whose ordinals on ASCII machines are in
> the 128-255 range will have Unicode semantics as far as case changing goes
> regardless of whether they are encoded in utf8 or not.  The reason I say it
> 'almost' fixes this is that any user-defined case mapping is still not
> called unless the scalar is in utf8.  See below for a discussion on that.
>
> I am leaving the legs that implement this behavior disabled by default until
> the smokes show that the new stuff doesn't break anything; then I'll flip
> the bit to enable it.  If you want to play with it in the meantime, you can
> say 'no legacy "unicode8bit"';

Prudence is good. (So currently in blead the legacy "unicode8bit" pragma
has the reverse of the meaning it will have in 5.12.)

> I've proposed changing the name of this; so that may happen, but it doesn't
> affect the heart of the code, so I'm delivering it now; it's easy to change
> the name later.

I still like that name; the documentation might need improvements
though. When I read this line in the SYNOPSIS :

use legacy ':5.10'; # Keeps semantics the same as in perl 5.10

as a perl user, I'm not certain of how that works, because it's not
clear that the doc itself depends on a version strictly greater than
5.10. Maybe it's not such a good idea to use versioned bundles like
feature.pm does.

> I don't understand Perl magic.  So I have tried to avoid touching anything
> around that.  But there are a couple comments from the old code that make me
> think I really don't understand what's going on in regard to that.  I'm now
> thinking the comments are obsolete.  If someone would care to look at them,
> they are at lines 3963 and 4227 in pp.c and both say the same thing:
> "Overloaded values may have toggled the UTF-8 flag on source, so we need to
> check DO_UTF8 again here".  This doesn't make sense to me based on the code
> earlier in the functions they occur in, so I think they are wrong.

I'll look, but I don't promise to understand either.

> There are minor changes in several files: macros are created in headers to
> access the bit and look up the case change mappings.  perl.h has two tables
> added with the mappings for all 256 Latin1 characters that give
> respectively, 1) their lowercased values; and 2) their upper and title cased
> values.  I removed trailing blank space in all the submitted files.
>
> There are significant changes in the casing functions in pp.c to accommodate
> this new behavior.  Basically, if the bit is set, the case change is looked
> up via these tables instead of the existing mechanism.  If the bit is off or
> 'use locale' or 'use bytes' is in effect the existing mechanism is used.
>
> Complications arise because three characters in the latin1 range require
> special handling when title and uppercasing them.  Two of them have the case
> change out-of-range, which means that the result has to be converted to
> utf8, and the other expands to two in-range characters. The code in the
> ucfirst() function was rearranged to compute the case change first, so as to
> know if the length of the result changes.  If it doesn't change, and the
> scalar isn't read-only etc, then only the first character needs to be
> touched.  Also, the uc() function has to cope with the possibility that in
> mid-stream it will have to decide to upgrade the result to utf8.
>
> I tried to make this efficient, and not slow things down from what they are
> now.  It may actually be faster than the current implementation in general
> because it does a table look up, which avoids some tests.  One test was
> added in the inner loop for upper casing, but then again the table lookup
> took out two tests
>
> There are two things I started to implement, but left #ifdef'd out
> currently:
>
> One of them implements the context sensitive casing that Unicode defines.
>  This is not implemented because I need more time to think about things; for
> one, Unicode has also recently revised their guidelines on this and I
> haven't looked at the new ones.
>
> The other would change the case of a utf8-encoded character in the Latin1
> range by using the built-in tables without having to go to the swash.  It is
> disabled because it would break the ability of user-defined case mappings
> overriding the default behavior for characters in that range.
>
> Which brings us to the topic of these user-defined case-changing mappings.
>  I just haven't gotten around to figuring out how to tell if such a function
> is in existence or not.  Currently, they must be in main::.  Since this is a
> very obscure corner of the language, I'm deferring fixing it until later.
>  Rafael gave me one hint about how to figure out if such a mapping function
> exists, but I haven't pursued it yet.  I understand that Zefram has been
> working on lexically-scoped subroutines, so that could affect this.

I agree with the deferring. Is anyone aware of any code using
user-defined case mappings, by the way?


nick at ccl4

Nov 16, 2009, 6:13 AM

Post #3 of 4
Re: PATCH: Add code to solve the casing portion of the "Unicode bug"

On Thu, Nov 12, 2009 at 11:00:52PM -0700, karl williamson wrote:
> Attached is a patch that adds code to almost fix the case changing
> portion of the "Unicode bug" (see
> http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182). This is a
> significant portion of perltodo's UTF-8 revamp.

I've not read the patch. I can comment on this:

> I don't understand Perl magic. So I have tried to avoid touching
> anything around that. But there are a couple comments from the old code
> that make me think I really don't understand what's going on in regard
> to that. I'm now thinking the comments are obsolete. If someone would
> care to look at them, they are at lines 3963 and 4227 in pp.c and both
> say the same thing: "Overloaded values may have toggled the UTF-8 flag
> on source, so we need to check DO_UTF8 again here". This doesn't make
> sense to me based on the code earlier in the functions they occur in, so
> I think they are wrong.

They're not wrong. The calls SvPV_force_nomg() and SvPV_nomg_const() are
both capable of changing the state of SVf_UTF8 on source. So, in this code:


    if (SvPADTMP(source) && !SvREADONLY(source) && !SvAMAGIC(source)
        && SvTEMP(source) && !DO_UTF8(source)
        && (IN_LOCALE_RUNTIME || ! IN_UNI_8_BIT)) {

        ...

        s = d = (U8*)SvPV_force_nomg(source, len);
        min = len + 1;
    } else {

        ...

        if (SvOK(source)) {
            s = (const U8*)SvPV_nomg_const(source, len);

        ...

    }

    /* Overloaded values may have toggled the UTF-8 flag on source, so we need
       to check DO_UTF8 again here. */

    if (DO_UTF8(source)) {


The two evaluations of DO_UTF8(source) may give different answers.

It's rare - it happens if source is actually a reference to an overloaded
value, and that value returns something of the opposite UTF-8-ness to the
previous time. Normally that doesn't happen, because there's an unwritten
rule that overloading is idempotent, unlike tie, but

a: nothing enforces that
b: because every scalar reference starts out "not UTF-8", even for a "well
behaved" (ie idempotent) overloaded value, if it returns UTF-8, then (only)
on the first call to get its value will the flag be turned on. So anything
checking it *before* the call will get the wrong idea.
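
Case b can be modelled with a toy example. This is not Perl's API: ToyScalar, toy_get_pv and flag_changed_across_fetch are hypothetical stand-ins for an SV, the SvPV-style fetch, and the two DO_UTF8 checks. The only point is that the flag read before the fetch can differ from the flag read after it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a scalar. NOT Perl's SV. */
typedef struct {
    const char *bytes;
    size_t      len;
    bool        utf8_flag;              /* plays the role of SVf_UTF8 */
    bool        overloaded;             /* value comes from an overload method */
    bool        overload_returns_utf8;  /* what that method would return */
} ToyScalar;

/* Fetch the string value. For an overloaded value this is the moment
   the flag becomes known, so the fetch has a side effect on it. */
static const char *toy_get_pv(ToyScalar *sv, size_t *len)
{
    if (sv->overloaded)
        sv->utf8_flag = sv->overload_returns_utf8;
    *len = sv->len;
    return sv->bytes;
}

/* Returns true when a flag value cached before the fetch would be stale. */
static bool flag_changed_across_fetch(ToyScalar *sv)
{
    bool before = sv->utf8_flag;   /* like the first DO_UTF8(source) */
    size_t len;
    (void)toy_get_pv(sv, &len);
    bool after = sv->utf8_flag;    /* like the second DO_UTF8(source) */
    return before != after;
}
```

In this model a plain string never changes across the fetch, while an overloaded value that returns UTF-8 flips from "not UTF-8" to "UTF-8" on its first fetch.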


If there's a comment in there, there's probably a regression test for it.
I went through and added a lot of tests, fixing the problems, commenting
the fixes.

Nicholas Clark


public at khwilliamson

Nov 16, 2009, 8:52 AM

Post #4 of 4
Re: PATCH: Add code to solve the casing portion of the "Unicode bug"

Nicholas Clark wrote:
> On Thu, Nov 12, 2009 at 11:00:52PM -0700, karl williamson wrote:
>> Attached is a patch that adds code to almost fix the case changing
>> portion of the "Unicode bug" (see
>> http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182). This is a
>> significant portion of perltodo's UTF-8 revamp.
>
> I've not read the patch. I can comment on this:
>
>> I don't understand Perl magic. So I have tried to avoid touching
>> anything around that. But there are a couple comments from the old code
>> that make me think I really don't understand what's going on in regard
>> to that. I'm now thinking the comments are obsolete. If someone would
>> care to look at them, they are at lines 3963 and 4227 in pp.c and both
>> say the same thing: "Overloaded values may have toggled the UTF-8 flag
>> on source, so we need to check DO_UTF8 again here". This doesn't make
>> sense to me based on the code earlier in the functions they occur in, so
>> I think they are wrong.
>
> They're not wrong. The calls SvPV_force_nomg() and SvPV_nomg_const() are
> both capable of changing the state of SVf_UTF8 on source. So, in this code:
>
>
> if (SvPADTMP(source) && !SvREADONLY(source) && !SvAMAGIC(source)
> && SvTEMP(source) && !DO_UTF8(source)
> && (IN_LOCALE_RUNTIME || ! IN_UNI_8_BIT)) {
>
> ...
>
> s = d = (U8*)SvPV_force_nomg(source, len);
> min = len + 1;
> } else {
>
> ...
>
> if (SvOK(source)) {
> s = (const U8*)SvPV_nomg_const(source, len);
>
> ...
>
> }
>
> /* Overloaded values may have toggled the UTF-8 flag on source, so we need
> to check DO_UTF8 again here. */
>
> if (DO_UTF8(source)) {
>
>
> The two evaluations of DO_UTF8(source) may give different answers.
>
> It's rare - it happens if source is actually a reference to an overloaded
> value, and that value returns something of the opposite UTF-8-ness to the
> previous time. Normally that doesn't happen, because there's an unwritten
> rule that overloading is idempotent, unlike tie, but
>
> a: nothing enforces that
> b: because every scalar reference starts out "not UTF-8", even for a "well
> behaved" (ie idempotent) overloaded value, if it returns UTF-8, then (only)
> on the first call to get its value will the flag be turned on. So anything
> checking it *before* the call will get the wrong idea.
>
>
> If there's a comment in there, there's probably a regression test for it.
> I went through and added a lot of tests, fixing the problems, commenting
> the fixes.
>
> Nicholas Clark
>

Ah! I figured that the calls could flip the flag, but the comment still
doesn't make sense to me, because you would have to check the utf8ness
of the source at this point even if the state couldn't change. Perhaps
an earlier version of the code somehow was structured so that DO_UTF8
was only called once, but I don't see how, unless it stashed the result
in a variable to save re-evaluating it. So, what the comment really
means is you can't stash the result of DO_UTF8 in a variable; you have
to evaluate the macro. I guess the reason I find it confusing is I
don't see the code using variables to save re-evaluating macros. So the
comment still seems gratuitous; if no one objects, I'll change it to the
following the next time I make substantive changes to the file:

/* Note. Overloaded values may have toggled the UTF-8 flag on source,
   so DO_UTF8 may give a different result here than it did above */
