nick at ccl4
May 11, 2012, 4:13 AM
Re: [perl #112792] Perl Major Release Doc Flaws (5.8.0->5.10.0)
On Fri, May 11, 2012 at 11:47:16AM +0100, Dave Mitchell wrote:
> On Thu, May 10, 2012 at 06:22:25PM -0700, Linda W wrote:
> > Oh but you did. You acted like an over-the-top ass toward someone trying to
> > document problems as they find them -- this one I was a bit late on.
> > BUT so were a lot of other people -- because for YEARS after 5.8, people
> > told me all I had to do was to have my environment vars set for UTF-8 and
> > perl would respect that.
> > So that misinformation was out there for a long time. Now, multiple
> > years later, I'm getting a different story. It wasn't just the crew of 5.8 /
> > 5.8.1 that messed up. 5.10.0 was the biggest miss. That's where it
> > should have been well publicized. An opportunity to address it in
> > release notes in 5.12/5.14 happened, but other incompatibilities
> > were being introduced there.
> In 5.8.0, a new feature was introduced, whereby a utf8 locale set in the
> environment automatically made the STD filehandles utf8. This seemed like
> a sensible idea at the time; in practice, it broke a lot of code, and was
> quickly removed for 5.8.1, and was *clearly documented* in the 5.8.1
> perldelta as the second item in 'Incompatible Changes'.
And if the people offering advice were not aware of that, they weren't
very competent to be offering advice in the first place. The problem is
that too many people over-estimate their abilities, and it's hard to tell
the over-confident (bluffer or genuine) from the genuinely competent.
Particularly as the genuinely competent are aware of their limitations,
and as a result generally downplay their confidence.
> Note that in general, adding unicode support is *very* hard to do; around
> the time of 5.6.x and 5.8.x no one really knew the best way to make it
> work, and some of the early ideas turned out to be poor. Since then we have
> been fixing some of these implementation choices, and steering a
> difficult course between fixing things that are clearly wrong, and trying
> not to break backwards compatibility; often there is much debate before
> such changes.
Also, IIRC, the idea of enabling things based on a UTF-8 locale was what the
Linux/Unicode folks at the time said was the right thing to do. We were
following the best advice we had, which turned out to be horribly broken in
practice. In particular, some distributions changed to defaulting installs
to give the user a UTF-8 locale, without the user really being aware of
what was being thrust upon them, and even though the rest of the programs
and systems the user was interacting with were unchanged and assumed the
same 8-bit environment as before.
(This stuff is *still* screwed up. Ex-work upgraded a system from really-old
Linux to modern-ish Linux about 18 months ago and something broke. Turned
out that a program written in *Java*, of all things, was "honouring" the
locale setting by using UTF-8 down a network socket that it opened if the
locale was UTF-8. "using UTF-8" in the sense of converting all octets to
UTF-8 on the way out. This is despite the fact that this broke the
content-length and content-type headers that the self same code had just sent
down the same socket. The fix: LC_ALL="C".)
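
A minimal sketch of the failure described above, written in Python purely for illustration (the original program was Java, and its code is not shown here). The payload and the helper name `reencode_as_utf8` are hypothetical; the point is only that a byte count taken before the octets are re-encoded no longer matches what goes down the socket.

```python
# Hypothetical payload: two octets the sender considers final.
body = bytes([0xE9, 0x41])  # 0xE9 is "é" in ISO-8859-1, 0x41 is "A"

# The Content-Length header is computed from the original octets.
content_length = len(body)


def reencode_as_utf8(octets: bytes) -> bytes:
    """Treat each octet as a code point and write it out as UTF-8.

    This mimics "converting all octets to UTF-8 on the way out".
    """
    return octets.decode("latin-1").encode("utf-8")


on_the_wire = reencode_as_utf8(body)

print(content_length)    # 2
print(len(on_the_wire))  # 3
print(on_the_wire)       # b'\xc3\xa9A'
```

Any octet above 0x7F becomes two bytes, so the receiver sees more data than the header promised.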
(Note also the fun Python 3.0 had by assuming that a UTF-8 locale means that
all aspects of the environment are reliably UTF-8, and not coping well when
that was not true. They fixed Python 3.1 to cope with messy realities.)
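
As an illustration of the kind of mess involved, here is a short Python sketch. It assumes that the `surrogateescape` error handler (PEP 383, added in Python 3.1) is part of the fix alluded to above; the message does not say which change is meant. The byte string `raw` and the helper `strict_decode` are hypothetical.

```python
# Hypothetical data: a Latin-1 encoded name seen under a UTF-8 locale.
raw = b"caf\xe9"


def strict_decode(data: bytes):
    """Decode as UTF-8, returning None when the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Trusting the locale and decoding strictly fails on this input.
strict = strict_decode(raw)

# surrogateescape smuggles the undecodable byte through as a lone surrogate,
# so the original bytes can be recovered unchanged on the way back out.
lenient = raw.decode("utf-8", errors="surrogateescape")
round_tripped = lenient.encode("utf-8", errors="surrogateescape")

print(strict)                # None
print(round_tripped == raw)  # True
```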
And that's not the only example of Unicode theory proving to be badly wrong
in practice, when implemented. The Unicode Consortium themselves are having
to rethink their rules on matching inverted character classes, given that
they don't match what people expect, and this

    "ß" =~ /(s)(s)/i

is positively unimplementable.
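
A small Python sketch of why that match is awkward. It shows Python's behaviour, not Perl's: `str.casefold()` applies Unicode full case folding, under which "ß" becomes the two characters "ss", while Python's `re` module uses simple case folding and therefore does not attempt the match at all.

```python
import re

# Full case folding turns the single character "ß" into two characters.
folded = "ß".casefold()

# Python's re module does not apply full case folding, so this is no match.
direct = re.fullmatch(r"(s)(s)", "ß", re.IGNORECASE)

# Matching against the folded text works, but each group then captures an
# "s" that corresponds to only half of the original single character.
after_folding = re.fullmatch(r"(s)(s)", folded, re.IGNORECASE)

print(folded)                  # ss
print(direct)                  # None
print(after_folding.groups())  # ('s', 's')
```

For the original expression to succeed against the unfolded string, each capture group would have to hold half of one character, which is where the trouble lies.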
> It is regarded by some that perl currently has amongst the most complete and
> correct unicode support of comparable programming languages.
Which is also why we have some of the pain of an early adopter.