Mailing List Archive: Perl: porters

Should Unicode semantics be the default for Latin1 characters in 5.12?

 

 



public at khwilliamson

Dec 13, 2009, 12:06 PM

Post #1 of 10 (490 views)
Should Unicode semantics be the default for Latin1 characters in 5.12?

I'm inclined to think not.

I just think it is too much to spring on people with no real warning.
As noted before, several CPAN modules that are in blead failed with this
change. Gerard has said that Kurila experienced these same module
failures, but I haven't heard back from him about which others showed
the same pattern.

Even if all the failures are just bugs that are getting exposed for the
first time, but could crop up anytime with the right sets of input, I
think, like Jesse, that we shouldn't be the apparent breakers of a
bunch of CPAN.

So here is my proposal:
1) We continue to have 'use legacy'. I agree with Rafael and Aristotle
about this.

2) I will submit a patch that just flips the default. People will for
the first time not have to do a utf8::upgrade or a Unicode::Semantics
all over the place to get the new effect. They can just do a 'no
legacy' at the beginning of their program to get the effect, except,
unfortunately, for modules outside their control.

3) We announce in perldelta, perhaps other places, that the plan is to
flip the state in 5.14.

4) My patch for regex case-sensitive matching should be placed into
blead, knowing that it is not the default, and we document the flaw that
Yves has mentioned: an already compiled re that is compiled into a
surrounding one will have the surrounding one's state. Note that this is
not a problem if someone just has the one call to 'no legacy' at the
beginning of their code, as the state will be constant throughout their
code.

5) I will find time in the next few days to work on a patch for case
insensitive matching similar to the one previously submitted for case
sensitive. It will suffer from the same flaw. To hopefully please
Jesse, I will first submit a patch that extends the fold testing .t to
many more cases that show, I'm afraid, many more flaws in the existing
scheme, not limited to the Unicode bug. Here's an example that
surprised me:
./perl -I./lib -E 'my $c = chr(0xe0); utf8::upgrade $c; say $c =~ /\x{c0}/i'

doesn't print 1, even though the string is in utf8. Swapping the c0 and
e0 does work.

6) I have given up for now on fixing the user-defined case overriding to
not be sensitive to utf8ness. It turns out it is misdocumented; it
isn't as restrictive as it says it is. Instead of it having to be on a
global level, it actually is on a package level. It's been many years
since I have had to worry about real-time issues, and processor speeds
and apparently optimizers have improved significantly since then. So,
perhaps I'm overly conservative here. I thought it would be an
acceptable slow-down to add a quick test for every call to uc() etc to
check if a global case override has been defined. But the testing
becomes more intensive when it might not be a global. Correct me if I'm
wrong. Should we penalize everyone (unless the penalty is lighter than
I think) for a feature that we're not sure is used at all? A deficiency
of this feature is that you can't just override a few mappings. If you
override any, you must furnish a complete set of casings. We tell you
how to find the current complete set as an aid for that, but still it's
a pain. But on the other hand, if the function did know that there is
no override mapping defined (which will be the case 99.9999% of the
time), it could save time for code points in the Latin1 range which
would not have to go out to utf8_heavy.pl.

If this proposal is acceptable, programmers could keep the Unicode bug
at bay in 5.12.
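
To make the "Unicode bug" concrete: the same character can behave
differently depending on whether the string happens to be stored as
UTF-8 internally. A minimal sketch, assuming an ASCII-platform perl with
no locale and no Unicode-semantics pragma in effect (the variable names
are illustrative only):

```perl
use strict;
use warnings;

# U+00E0 (a with grave) stored as a single native byte.
my $native = chr(0xe0);

# The same character, but with the internal UTF-8 representation.
my $upgraded = chr(0xe0);
utf8::upgrade($upgraded);

# Perl considers the two strings equal...
my $same = ($native eq $upgraded) ? 1 : 0;

# ...yet uc() applies Unicode case mapping only to the upgraded one.
my $uc_native   = sprintf '%02x', ord uc $native;
my $uc_upgraded = sprintf '%02x', ord uc $upgraded;

print "equal: $same, uc(native): $uc_native, uc(upgraded): $uc_upgraded\n";
```

Under those assumptions the two strings compare equal, while uc()
leaves the native one at e0 and maps the upgraded one to c0.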


juerd at convolution

Dec 13, 2009, 12:34 PM

Post #2 of 10 (467 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

karl williamson skribis 2009-12-13 13:06 (-0700):
> I'm inclined to think not.

I'm very strongly inclined to think they should.

> As noted before, several CPAN modules that are in blead failed with
> this change.

Then let's work together to fix these modules in both 5.8 and 5.12.
Apparently a mere "use legacy" does this.

> Gerard has said that Kurila experienced these same module
> failures

Kurila has a WILDLY different string model.

> 2) I will submit a patch that just flips the default. People will for
> the first time not have to do a utf8::upgrade or a Unicode::Semantics
> all over the place to get the new effect. They can just do a 'no
> legacy' at the beginning of their program to get the effect, except,
> unfortunately, for modules outside their control.
> 3) We announce in perldelta, perhaps other places, that the plan is to
> flip the state in 5.14.

Will this automatically fix the CPAN modules, somehow?

I don't see why the change should be postponed by another few years.
Perl 5.10 will be around for everyone who's really dependent on the old
behaviour as the default, for a very long time. Just like enough
businesses are still using 5.6 or 5.005 even. Postponing the inevitable
just to buy people a little more time is a bad idea. A new non-bugfix
release is the perfect opportunity to introduce the change.

Besides, flipping a default essentially means that lots of code without
"use/no legacy" will break just as it will now.

If an in-between version is needed, then I think it would be better to
put that in the 5.10 series! Don't delay the progress of the whole of
Perl, because of one little issue that affects only people who haven't
been paying attention for years (even if that is, perhaps, the larger
part of the user base).
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales [at] convolution>


jesse at fsck

Dec 14, 2009, 10:07 AM

Post #3 of 10 (463 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

> If an in-between version is needed, then I think it would be better to
> put that in the 5.10 series! Don't delay the progress of the whole of
> Perl, because of one little issue that affects only people who haven't
> been paying attention for years (even if that is, perhaps, the larger
> part of the user base).

That argument...doesn't work so well. Between the current "critical
fixes only" maint policy and the fact that this wasn't there when 5.10.0
shipped, it's pretty much a non-starter.

-Jesse


demerphq at gmail

Dec 14, 2009, 10:17 AM

Post #4 of 10 (463 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

2009/12/13 karl williamson <public [at] khwilliamson>:
> I'm inclined to think not.

Then we came to the same conclusion. Changing the semantics WILL break
LOTS of stuff.

> I just think it is too much to spring on people with no real warning. As
> noted before, several CPAN modules that are in blead failed with this
> change.  Gerard has said that Kurila experienced these same module failures,
> but I haven't heard back from him about what others had the same pattern.
>
> Even if all the failures are just bugs that are getting exposed for the
> first time, but could crop up anytime with the right sets of input, I think
> that, similar to Jesse, that we shouldn't be the apparent breakers of a
> bunch of CPAN.
>
> So here is my proposal:
> 1) We continue to have 'use legacy'.  I agree with Rafael and Aristotle
> about this.
>
> 2) I will submit a patch that just flips the default.  People will for the
> first time not have to do a utf8::upgrade or a Unicode::Semantics all over
> the place to get the new effect.  They can just do a 'no legacy' at the
> beginning of their program to get the effect, except, unfortunately, for
> modules outside their control.
>
> 3) We announce in perldelta, perhaps other places, that the plan is to flip
> the state in 5.14.
>
> 4) My patch for regex case-sensitive matching should be placed into blead,
> knowing that it is not the default, and we document the flaw that Yves has
> mentioned: an already compiled re that is compiled into a surrounding one
> will have the surrounding one's state.  Note that this is not a problem if
> someone just has the one call to 'no legacy' at the beginning of their code,
> as the state will be constant throughout their code.
>
> 5) I will find time in the next few days to work on a patch for case
> insensitive matching similar to the one previously submitted for case
> sensitive.  It will suffer from the same flaw.  To hopefully please Jesse, I
> will first submit a patch that extends the fold testing .t to many more
> cases that show, I'm afraid, many more flaws in the existing scheme, not
> limited to the Unicode bug.  Here's an example that surprised me:
>  ./perl -I./lib -E 'my $c = chr(0xe0); utf8::upgrade $c; say $c =~ /\x{c0}/i'
>
> doesn't print 1, even though the string is in utf8.  Swapping the c0 and e0
> does work.
>
> 6) I have given up for now on fixing the user-defined case overriding to not
> be sensitive to utf8ness.  It turns out it is misdocumented; it isn't as
> restrictive as it says it is.  Instead of it having to be on a global level,
> it actually is on a package level.  It's been many years since I have had to
> worry about real-time issues, and processor speeds and apparently optimizers
> have improved significantly since then.  So, perhaps I'm overly conservative
> here.  I thought it would be an acceptable slow-down to add a quick test for
> every call to uc() etc to check if a global case override has been defined.
>  But the testing becomes more intensive when it might not be a global.
>  Correct me if I'm wrong.  Should we penalize everyone (unless the penalty
> is lighter than I think) for a feature that we're not sure is used at all?
>  A deficiency of this feature is that you can't just override a few
> mappings.  If you override any, you must furnish a complete set of casings.
>  We tell you how to find the current complete set as an aid for that, but
> still it's a pain.  But on the other hand, if the function did know that
> there is no override mapping defined (which will be the case 99.9999% of the
> time), it could save time for code points in the Latin1 range which would
> not have to go out to utf8_heavy.pl.
>
> If this proposal is acceptable, programmers could keep the Unicode bug at
> bay in 5.12.

++

yves



--
perl -Mre=debug -e "/just|another|perl|hacker/"


rgs at consttype

Dec 15, 2009, 2:23 PM

Post #5 of 10 (451 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

2009/12/13 karl williamson <public [at] khwilliamson>:
> I'm inclined to think not.

So do I, finally, after all the discussion about backwards compatibility.

> I just think it is too much to spring on people with no real warning. As
> noted before, several CPAN modules that are in blead failed with this
> change.  Gerard has said that Kurila experienced these same module failures,
> but I haven't heard back from him about what others had the same pattern.
>
> Even if all the failures are just bugs that are getting exposed for the
> first time, but could crop up anytime with the right sets of input, I think
> that, similar to Jesse, that we shouldn't be the apparent breakers of a
> bunch of CPAN.
>
> So here is my proposal:
> 1) We continue to have 'use legacy'.  I agree with Rafael and Aristotle
> about this.
>
> 2) I will submit a patch that just flips the default.  People will for the
> first time not have to do a utf8::upgrade or a Unicode::Semantics all over
> the place to get the new effect.  They can just do a 'no legacy' at the
> beginning of their program to get the effect, except, unfortunately, for
> modules outside their control.
>
> 3) We announce in perldelta, perhaps other places, that the plan is to flip
> the state in 5.14.

I don't think it's feasible to flip the default state of a pragma from
one release to another -- at least not without a very explicit version
requirement (as in C<use 5.12.0>). Without that precaution, that will
lead to too much confusion.

I also think that the advantage of having a legacy.pm in addition to
feature.pm is only real when the C<use legacy> behaviour is not on by
default. That's the main factor that differentiates legacy features
from new features.

So, I would think that it would be better to remove legacy, and make
"unicode8bit" a feature, in the sense of feature.pm. Also, that means
that it would be loaded by default by C<use 5.12.0>, like other
features. (The problem here being qr// regexs that leak to other scopes
in which the feature is not in effect.)

Alternatively, or additionally, a new //u switch could turn on the new
behaviour per-regex.
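
A sketch of what such a lexically scoped opt-in looks like in practice.
It assumes perl 5.14 or later, where the feature eventually shipped under
the name "unicode_strings" (not "unicode8bit") and a /u regex modifier
exists; neither name was settled when this was written, and the variable
names are illustrative only:

```perl
use strict;
use warnings;

# Only check the version: "use 5.014" would itself switch the feature on,
# which would hide the difference this sketch is meant to show.
BEGIN { require 5.014 }

my $c = chr(0xe0);    # native (non-UTF-8) string holding U+00E0

# Outside the feature's scope: old semantics for the native string.
my $plain = sprintf '%02x', ord uc $c;

my $scoped;
{
    # Lexically scoped opt-in; the effect ends with the enclosing block.
    use feature 'unicode_strings';
    $scoped = sprintf '%02x', ord uc $c;
}

# Per-regex opt-in, independent of the feature.
my $default_match = ($c =~ /\xc0/i)  ? 1 : 0;
my $u_match       = ($c =~ /\xc0/iu) ? 1 : 0;

print "plain: $plain, scoped: $scoped, /i: $default_match, /iu: $u_match\n";
```

The version check is done with require rather than use because a
C<use 5.12.0> (or later) line loads the feature bundle, which is exactly
the "loaded by default" behaviour described above.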

> 4) My patch for regex case-sensitive matching should be placed into blead,
> knowing that it is not the default, and we document the flaw that Yves has
> mentioned: an already compiled re that is compiled into a surrounding one
> will have the surrounding one's state.  Note that this is not a problem if
> someone just has the one call to 'no legacy' at the beginning of their code,
> as the state will be constant throughout their code.
>
> 5) I will find time in the next few days to work on a patch for case
> insensitive matching similar to the one previously submitted for case
> sensitive.  It will suffer from the same flaw.  To hopefully please Jesse, I
> will first submit a patch that extends the fold testing .t to many more
> cases that show, I'm afraid, many more flaws in the existing scheme, not
> limited to the Unicode bug.  Here's an example that surprised me:
>  ./perl -I./lib -E 'my $c = chr(0xe0); utf8::upgrade $c; say $c =~ /\x{c0}/i'
>
> doesn't print 1, even though the string is in utf8.  Swapping the c0 and e0
> does work.
>
> 6) I have given up for now on fixing the user-defined case overriding to not
> be sensitive to utf8ness.  It turns out it is misdocumented; it isn't as
> restrictive as it says it is.  Instead of it having to be on a global level,
> it actually is on a package level.  It's been many years since I have had to
> worry about real-time issues, and processor speeds and apparently optimizers
> have improved significantly since then.  So, perhaps I'm overly conservative
> here.  I thought it would be an acceptable slow-down to add a quick test for
> every call to uc() etc to check if a global case override has been defined.
>  But the testing becomes more intensive when it might not be a global.
>  Correct me if I'm wrong.  Should we penalize everyone (unless the penalty
> is lighter than I think) for a feature that we're not sure is used at all?
>  A deficiency of this feature is that you can't just override a few
> mappings.  If you override any, you must furnish a complete set of casings.
>  We tell you how to find the current complete set as an aid for that, but
> still it's a pain.  But on the other hand, if the function did know that
> there is no override mapping defined (which will be the case 99.9999% of the
> time), it could save time for code points in the Latin1 range which would
> not have to go out to utf8_heavy.pl.
>
> If this proposal is acceptable, programmers could keep the Unicode bug at
> bay in 5.12.


ben at morrow

Dec 15, 2009, 3:05 PM

Post #6 of 10 (450 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

Quoth rgs [at] consttype (Rafael Garcia-Suarez):
>
> I don't think it's feasible to flip the default state of a pragma from
> one release to another -- at least not without a very explicit version
> requirement (as in C<use 5.12.0>). Without that precaution, that will
> lead to too much confusion.
>
> I also think that the advantage of having a legacy.pm in addition to
> feature.pm is only real when the C<use legacy> behaviour is not on by
> default. That's the main factor that differentiates legacy features
> from new features.
>
> So, I would think that it would be better to remove legacy, and make
> "unicode8bit" a feature, in the sense of feature.pm. Also, that means
> that it would be loaded by default by C<use 5.12.0>, like other
> features. (The problem here being qr// regexs that leak to other scopes
> in which the feature is not in effect.)

Finally! +lots, since I never understood the point of 'legacy' in the
first place, and thought this was the sort of thing 'feature' was for.

> Alternatively, or additionally, a new //u switch could turn on the new
> behaviour per-regex.

If necessary, some sort of (?~<unicode8bit,other_feature>...) regex
syntax could be devised to allow features to propagate into qr// without
needing a new flag for each. IMHO having regex semantics 'leak' along
with a qr// is correct.
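
For comparison, a small sketch of semantics travelling with a compiled
pattern. It uses the /u modifier from later perls (5.14 and up is
assumed) rather than the hypothetical (?~...) syntax above; a qr//
object records its own modifiers, so they stay in force when it is
interpolated into a pattern compiled without them:

```perl
use strict;
use warnings;
BEGIN { require 5.014 }    # /u is a syntax error on older perls

my $c = chr(0xe0);         # native (non-UTF-8) string holding U+00E0

# Compiled once with Unicode rules.
my $word_u = qr/\w/u;

# The modifier is part of the stringified form of the object.
my $as_string = "$word_u";

# Outer pattern has no /u: a bare \w uses the default rules, but the
# interpolated qr// keeps the rules it was compiled with.
my $bare   = ($c =~ /^\w$/)       ? 1 : 0;
my $interp = ($c =~ /^$word_u$/)  ? 1 : 0;

print "qr stringifies as $as_string; bare: $bare, interpolated: $interp\n";
```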

Ben


rvtol+usenet at isolution

Dec 15, 2009, 3:07 PM

Post #7 of 10 (452 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

Rafael Garcia-Suarez wrote:

> So, I would think that it would be better to remove legacy, and make
> "unicode8bit" a feature, in the sense of feature.pm. Also, that means
> that it would be loaded by default by C<use 5.12.0>, like other
> features. (The problem here being qr// regexs that leak to other scopes
> in which the feature is not in effect.)
>
> Alternatively, or additionally, a new //u switch could turn on the new
> behaviour per-regex.

And then also a modifier to get POSIX behavior?

--
Ruud


jesse at fsck

Dec 15, 2009, 4:00 PM

Post #8 of 10 (450 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

> I don't think it's feasible to flip the default state of a pragma from
> one release to another -- at least not without a very explicit version
> requirement (as in C<use 5.12.0>). Without that precaution, that will
> lead to too much confusion.
>
> I also think that the advantage of having a legacy.pm in addition to
> feature.pm is only real when the C<use legacy> behaviour is not on by
> default. That's the main factor that differentiates legacy features
> from new features.
>
> So, I would think that it would be better to remove legacy, and make
> "unicode8bit" a feature, in the sense of feature.pm. Also, that means
> that it would be loaded by default by C<use 5.12.0>, like other
> features. (The problem here being qr// regexs that leak to other scopes
> in which the feature is not in effect.)

+1

> Alternatively, or additionally, a new //u switch could turn on the new
> behaviour per-regex.

I think that this is a separate new feature that's a good thing to have
in addition. While I'd love to have it today, I wouldn't shed too many
tears if we don't get it for this release.

-Jesse


rgs at consttype

Dec 16, 2009, 12:29 AM

Post #9 of 10 (447 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

2009/12/16 Dr.Ruud <rvtol+usenet [at] isolution>:
>> Alternatively, or additionally, a new //u switch could turn on the new
>> behaviour per-regex.
>
> And then also a modifier to get POSIX behavior?

Can't we just use a pluggable re engine for that? (I don't expect it
will be used a lot.)


public at khwilliamson

Dec 16, 2009, 2:22 PM

Post #10 of 10 (445 views)
Re: Should Unicode semantics be the default for Latin1 characters in 5.12?

jesse wrote:
>
>> I don't think it's feasible to flip the default state of a pragma from
>> one release to another -- at least not without a very explicit version
>> requirement (as in C<use 5.12.0>). Without that precaution, that will
>> lead to too much confusion.
>>
>> I also think that the advantage of having a legacy.pm in addition to
>> feature.pm is only real when the C<use legacy> behaviour is not on by
>> default. That's the main factor that differentiates legacy features
>> from new features.
>>
>> So, I would think that it would be better to remove legacy, and make
>> "unicode8bit" a feature, in the sense of feature.pm. Also, that means
>> that it would be loaded by default by C<use 5.12.0>, like other
>> features. (The problem here being qr// regexs that leak to other scopes
>> in which the feature is not in effect.)
>
> +1

I have mixed feelings about this. Aesthetically, I don't like the idea
that you have to use 'feature' to get something to work the way it
should have all along. This is a bug fix which wouldn't have been
necessary if previous Perls hadn't tried to shoehorn in Unicode
transparently, which turns out to be impossible.

I also don't understand the problems of flipping the default pragma
states; but I don't think I need to, as it's good enough for me that
Rafael, who has been a champion of this, thinks we should abandon the
previously agreed-upon line of attack.

>
>> Alternatively, or additionally, a new //u switch could turn on the new
>> behaviour per-regex.
>
> I think that this is a separate new feature that's a good thing to have
> in addition. While I'd love to have it today, I wouldn't shed too many
> tears if we don't get it for this release.
>
> -Jesse

It has to be additionally, because of the issue with uc(), etc.

So where do we go from here? And who does what?

The casing component of the problem has been in blead for some time,
except for the user-defined casing overrides, which I don't think we
should do at this time.

A case-sensitive matching patch for this has been submitted, and Yves
thinks it is worthwhile, but doesn't solve the problem of interpolating
an already compiled regex into another one which has the opposite state.
The /u modifier would solve this, but the person who has been looking
at this is tied up for apparently some weeks still.

I have been looking at the case-insensitive matching issue, and it is
not going well. Details in a separate email. But there are a number of
problems with the current mechanism--not as many as I had thought for a
while.

We could ship with just the two fixes. I can get in some fixes for
case-insensitive matching, but it's quite complicated to cover all
those bases.
