Mailing List Archive: Perl: porters

unicode question

 

 



perl-diddler at tlinx

Apr 25, 2012, 6:12 PM

Post #1 of 38 (355 views)
Permalink
unicode question

I read this in my 5.14 documentation (in the man page for perlunicode;
my paragraph #'s added).

1) "use encoding" needed to upgrade non-Latin-1 byte strings
By default, there is a fundamental asymmetry in Perl's Unicode
model: implicit upgrading from byte strings to Unicode strings
assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
strings are downgraded with UTF-8 encoding. This happens because
the first 256 codepoints in Unicode happens to agree with Latin-1

1.1) See "Byte and Character Semantics" for more details.

2) Byte and Character Semantics
Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

3) Starting in Perl 5.14, Perl-level operations work with characters
rather than bytes within the scope of a "use feature 'unicode_strings'"
(or equivalently "use 5.012" or higher). (This is not true if bytes
have been explicitly requested by "use bytes", nor necessarily true for
interactions with the platform's operating system.)

------------------------------------

Ok. If I understand the above correctly, then this starts in 5.14 -- but
is triggered by "use 5.012"? Or 5.14.0 (or 5.014?)... and NOT in 5.12?
(?!?! What happens there, if not the same, and why is the trigger 5.12?)

Then does the statement in paragraph 1, about perl having a fundamental
asymmetry problem, no longer apply?


I.e. If I'm on a UTF-8 system, and my env is set for UTF-8:

> locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

-----------

Then the default will NOW be UTF-8??? (and it wasn't before?!?!?!
ARG!!!)...


IF perl properly pays attention to the environment, as I thought it was
documented to do in 5.8, great. If not... er.. *OUCH* (that hurts) --
I've thought perl was fully unicode in a UTF-8 env since 5.8... when I
was told it was... (me <- gullible).


So was I an idiot for drinking the koolaid instead of reading the fine
print (a bit dry to quench thirst)? Is it fixed now?


(and why the weird version stuff... 5.012 or 5.014? ...or is it a triplet
vector?)


ikegami at adaelis

Apr 25, 2012, 7:37 PM

Post #2 of 38 (341 views)
Permalink
Re: unicode question [In reply to]

On Wed, Apr 25, 2012 at 9:12 PM, Linda W <perl-diddler [at] tlinx> wrote:

> I read this in my 5.14 documentation (in man page, for perlunicode, my
> p#\) added).
>
> 1) "use encoding" needed to upgrade non-Latin-1 byte strings
> By default, there is a fundamental asymmetry in Perl's Unicode
> model: implicit upgrading from byte strings to Unicode strings
> assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
> strings are downgraded with UTF-8 encoding. This happens because
> the first 256 codepoints in Unicode happens to agree with Latin-1
>
> 1.1) See "Byte and Character Semantics" for more details.
>
> 2) Byte and Character Semantics
> Beginning with version 5.6, Perl uses logically-wide characters to
> represent strings internally.
>
> 3) Starting in Perl 5.14, Perl-level operations work with characters
> rather than bytes within the scope of a "use feature
> 'unicode_strings'"
> (or equivalently "use 5.012" or higher). (This is not true if bytes
> have been explicitly requested by "use bytes", nor necessarily true
> for
> interactions with the platform's operating system.)
>
> ------------------------------------
>
> Ok. If I understand the above correctly, then starting in 5.14 -- but
> triggered by 5.012? 5.14.0 (or 5.014?)... and NOT in 5.12, (?!?! what
> happens there, if not the same, why is the trigger 5.12?)
>

> Then the statement in paragraph 1 about perl having fundamental assymetric
>
problems, no longer applies?
>

(1) refers to how Perl behaves in response to bugs in user code.

(3) refers to the fixing (when C<< use feature 'unicode_strings'; >> is
used) of most instances of a collection of bugs in Perl known as "The
Unicode Bug".

They are not related.


> I.e. If I'm on a UTF-8 system, and my env is set for UTF-8:
>

Your locale only affects Perl when C<< use locale >> is in effect, and even
then, it doesn't affect file handles. Additionally, there is

use open ':std' => ':locale';

and

use open IO => ':locale';

> Then the default will NOW be UTF-8??? (and it wasn't before?!?!?!
> ARG!!!)...
>

The default then and now is that Perl does not mess with your file handles. Perl
returns the bytes it reads from the file handle as is. If your file handles
are expected to have text of a certain encoding, it's up to you to decode
it or to tell Perl to decode it. Perl has no way of knowing whether a file
handle is used to transmit text or not, and it has no way of knowing the
encoding of that text.
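
For example, if you *know* STDIN carries UTF-8 text (a minimal sketch;
Perl will not assume this for you):

    use strict;
    use warnings;

    # Tell Perl to decode STDIN and encode STDOUT as UTF-8.
    binmode STDIN,  ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    while (my $line = <STDIN>) {
        chomp $line;
        # length() now counts characters, not UTF-8 bytes
        printf "%d characters: %s\n", length($line), $line;
    }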

> IF perl properly pays attention to the environment


...it would corrupt data on many file handles.

> (and why the weird version stuff... 5.012 or 5.014? ...or is it a triplet vector?)
>

I do not understand the question.


fraserbn at gmail

Apr 25, 2012, 8:50 PM

Post #3 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

On Wed, Apr 25, 2012 at 10:12 PM, Linda W <perl-diddler [at] tlinx> wrote:

> I read this in my 5.14 documentation (in man page, for perlunicode, my
> p#\) added).
>
> 1) "use encoding" needed to upgrade non-Latin-1 byte strings
> By default, there is a fundamental asymmetry in Perl's Unicode
> model: implicit upgrading from byte strings to Unicode strings
> assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
> strings are downgraded with UTF-8 encoding. This happens because
> the first 256 codepoints in Unicode happens to agree with Latin-1
>

Side comment, but eep, are the docs still suggesting that people do 'use
encoding'? I don't think they should.


>
> 1.1) See "Byte and Character Semantics" for more details.
>
> 2) Byte and Character Semantics
> Beginning with version 5.6, Perl uses logically-wide characters to
> represent strings internally.
>
> 3) Starting in Perl 5.14, Perl-level operations work with characters
> rather than bytes within the scope of a "use feature
> 'unicode_strings'"
> (or equivalently "use 5.012" or higher). (This is not true if bytes
> have been explicitly requested by "use bytes", nor necessarily true
> for
> interactions with the platform's operating system.)
>
> ------------------------------------
>
> Ok. If I understand the above correctly, then starting in 5.14 -- but
> triggered by 5.012? 5.14.0 (or 5.014?)... and NOT in 5.12, (?!?! what
> happens there, if not the same, why is the trigger 5.12?)
>

unicode_strings did not apply to all operations in 5.12. If I recall
correctly, only the regex engine was affected in 5.12? Anyway, use 5.012;
(or use v5.12;) or later (use 5.014; use 5.016; yadda) will implicitly
enable unicode_strings, as if you had explicitly done 'use feature
':5.12';' or 'use feature qw( say switch state unicode_strings );'.
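
Something like this sketch shows the equivalence (it needs a perl new
enough for the bundle, of course):

    use strict;
    use warnings;

    my $s = "\xE9";   # a single byte, no UTF8 flag set

    # Without unicode_strings: the Unicode Bug, \w doesn't see U+00E9.
    print "plain:   ", ($s =~ /\w/ ? "word" : "not word"), "\n";

    {
        use v5.14;    # pulls in the ':5.14' bundle, including unicode_strings
        print "v5.14:   ", ($s =~ /\w/ ? "word" : "not word"), "\n";
    }
    {
        use feature 'unicode_strings';   # the explicit spelling
        print "feature: ", ($s =~ /\w/ ? "word" : "not word"), "\n";
    }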


>
> Then the statement in paragraph 1 about perl having fundamental assymetric
> problems, no longer applies?
>

In 5.14+ under unicode_strings, that's mostly right. It's all mostly
treated as UTF-8, syscalls aside. See "The Unicode Bug" and "When Unicode
Does not Happen"


>
>
> I.e. If I'm on a UTF-8 system, and my env is set for UTF-8:
>
> > locale
> LANG=en_US.UTF-8
> LC_CTYPE=en_US.UTF-8
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE=C
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=
>
> -----------
>
> The the default will NOW be UTF-8??? (and is wasn't before?!?!?!
> ARG!!!)...
>

Nope. Locales have nothing to do with it. Your locales will only come into
effect if you do 'use locale', or slightly more sanely in 5.16 if you do
'use locale ":not_characters"'. More to the point, after you set all those
vars, what did you expect to be affected? The regex engine? Operations on
strings? IO? How @ARGV is decoded? How the source code is parsed? All/some?


>
>
> IF perl properly pays attention to the environment, as I thought it was
> documented to do in 5.8, great.


PERL_UNICODE=SAD
I have never really used 5.8, but it was my impression that the Unicode
model in 5.8.0 was different from the model of later versions. So when
someone tells you something about 5.8, you ask what sub-version, and by
which vendor. :D


> If not... er..*OUCH* (that hurts) -- I've though perl was fully unicode
> in a UTF-8 env since 5.8... when I was told it was... (me<-gullible).
>
>
> So was a I idiot for drinking the koolaid instead of the fine print (a bit
> dry to quench thirst)? Is it fixed now?
>

We are all idiots drinking the koolaid when it comes to Unicode. In that
regard, http://stackoverflow.com/a/6163129 as well as the perlunicook
manpage (new in 5.16!) are _very_ nice resources.


ikegami at adaelis

Apr 25, 2012, 9:47 PM

Post #4 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

On Wed, Apr 25, 2012 at 11:50 PM, Brian Fraser <fraserbn [at] gmail> wrote:

>
> On Wed, Apr 25, 2012 at 10:12 PM, Linda W <perl-diddler [at] tlinx> wrote:
>
>> Then the statement in paragraph 1 about perl having fundamental assymetric
>> problems, no longer applies?
>>
>
> In 5.14+ under unicode_strings, that's mostly right. It's all mostly
> treated as UTF-8, syscalls aside. See "The Unicode Bug" and "When Unicode
> Does not Happen"
>

No, the asymmetry is still 100% there. If you pass bytes to an op which
expects characters, the bytes will effectively be treated as iso-8859-1. If
you pass non-bytes to print, they will be encoded using utf8.
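
A small sketch of that asymmetry, with no I/O layers set on any handle:

    use strict;
    use warnings;
    use feature 'unicode_strings';

    my $in = "\xE9";        # a byte; character-wise ops treat it as U+00E9,
                            # i.e. as if it had been decoded from iso-8859-1
    print "matches \\w\n" if $in =~ /\w/;

    my $out = "\x{2660}";   # a code point above 0xFF sent to a raw handle
    print $out, "\n";       # warns "Wide character in print" and emits
                            # the UTF-8 bytes for U+2660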


fraserbn at gmail

Apr 25, 2012, 11:49 PM

Post #5 of 38 (343 views)
Permalink
Re: unicode question [In reply to]

On Thu, Apr 26, 2012 at 1:47 AM, Eric Brine <ikegami [at] adaelis> wrote:

> On Wed, Apr 25, 2012 at 11:50 PM, Brian Fraser <fraserbn [at] gmail> wrote:
>
>>
>> On Wed, Apr 25, 2012 at 10:12 PM, Linda W <perl-diddler [at] tlinx> wrote:
>>
>>> Then the statement in paragraph 1 about perl having fundamental
>>> assymetric
>>> problems, no longer applies?
>>>
>>
>> In 5.14+ under unicode_strings, that's mostly right. It's all mostly
>> treated as UTF-8, syscalls aside. See "The Unicode Bug" and "When Unicode
>> Does not Happen"
>>
>
> No, the asymmetry is still 100% there. If you pass bytes to an op which
> expects characters, the bytes will effectively be treated as iso-8859-1. If
> you pass non-bytes to print, they will be encoded using utf8.
>
>
Syscalls and I/O aside (I mistakenly omitted the latter in my previous mail
-- apologies), I think this proves you wrong:

use Devel::Peek;
my $x = "\xdf";
utf8::downgrade($x);

{
    no feature 'unicode_strings';
    Dump uc $x;
}
{
    use feature 'unicode_strings';
    Dump uc $x;
}

The whole point of unicode_strings (and unicode_eval) is making ops work on
characters transparently, regardless of the internal encoding.
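
Or, without Devel::Peek, the visible difference (sketch):

    use strict;
    use warnings;

    my $x = "\xdf";    # LATIN SMALL LETTER SHARP S, stored as a single byte
    {
        no feature 'unicode_strings';
        print uc($x), "\n";   # byte semantics: "\xdf" comes back unchanged
    }
    {
        use feature 'unicode_strings';
        print uc($x), "\n";   # character semantics: prints "SS"
    }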


perl-diddler at tlinx

Apr 26, 2012, 1:26 AM

Post #6 of 38 (343 views)
Permalink
Re: unicode question [In reply to]

Eric Brine wrote:
> On Wed, Apr 25, 2012 at 9:12 PM, Linda W <perl-diddler [at] tlinx
> <mailto:perl-diddler [at] tlinx>> wrote:
>
> I read this in my 5.14 documentation (in man page, for
> perlunicode, my p#\) added).
>
> 1) "use encoding" needed to upgrade non-Latin-1 byte strings
> By default, there is a fundamental asymmetry in Perl's Unicode
> model: implicit upgrading from byte strings to Unicode strings
> assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
> strings are downgraded with UTF-8 encoding. This happens because
> the first 256 codepoints in Unicode happens to agree with Latin-1
>
>
> (1) refers to how Perl behaves in response to bugs in user code.
---
??? Bugs in user code? The first 256 code points don't agree! The first
128 code points agree, but at 0x80 you have to go to a 2-byte encoding to
save everything. I don't understand when you say 'downgraded', as
downgrading implies a loss of information. Whereas UTF-8 can hold all of
unicode, ISO-8859-1 only holds 256 bytes, the latter half of which are not
unicode compatible because they have the high bit set.

If Perl interprets STDIN (not an arbitrary file opened with 'open', but
standard streamed input from an all-UTF-8 environment), then the
assumption should be UTF-8 encoding.

To do otherwise is going to cause problems.


>
> (3) refers to the fixing (when C<< use feature 'unicode_strings'; >>
> is used) of most instances of a collection of bug in Perl known as
> "The Unicode Bug".
>
> They are not related.

What I didn't understand was why it is fixed in 5.14 but with a use 5.12
statement?

I.e. wasn't it fixed in 5.12? If it wasn't fixed until 5.14, then why isn't
it a use 5.14 that triggers the new behavior?



>
>
> I.e. If I'm on a UTF-8 system, and my env is set for UTF-8:
>
>
> Your locale only affects Perl when C<< use locale >> is in effect, and
> even then, it doesn't affect file handles. Additionally, there is
>
> use open ':std' => ':locale';
?!!?!?


>
> The default then and now is that Perl does not mess your file handles.
It shouldn't.

It shouldn't mess with UTF-8 encoded STDIN/STDOUT either.

It shouldn't assume a charset that's about 20 years out of date when
most systems default to UTF-8 encoding (Windows aside)...


> Perl returns the bytes it reads from the file handle as is. If your
> file handles are expected to have text of a certain encoding, it's up
> to you to decode it or to tell Perl to decode it. Perl has no way of
> knowing whether a file handle is used to transmit text or not, and it
> has no way of knowing the encoding of that text.
----
If the encoding is NOT UTF-8, yes, but I thought perl was fully UTF-8
compliant now...?


>
> IF perl properly pays attention to the environment
>
>
> ...it would corrupt data on many file handles.
---
Never mentioned file handles, I'm talking <STDIN> and print
[STDOUT/STDERR].

If there is an "asymmetry", then perl IS messing with the bytes. The
same bytes should be able to go in as come out... asymmetry implies
this isn't the case.


But I see there is confusion about this with others as well...


public at khwilliamson

Apr 26, 2012, 7:17 AM

Post #7 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

On 04/26/2012 02:26 AM, Linda W wrote:
>>
>> (3) refers to the fixing (when C<< use feature 'unicode_strings'; >>
>> is used) of most instances of a collection of bug in Perl known as
>> "The Unicode Bug".
>>
>> They are not related.
>
> What I didn't understand was why is it fixed in 5.14 but with a use 5.12
> statement?
>
> I.e. wasn't it fixed in 5.12? If it wasn't fixed until 5.14, then why isn't
> it a use 5.14 that triggers the new behavior?

feature 'unicode_strings' is part of the use v5.12 feature bundle. Not
all the fixes that it includes were ready in time for 5.12. The
decision was made to put into 5.12 the ones that were ready, and to not
split the functionality into multiple features, with multiple names.

When you say 'use feature :5.12', you get whatever portion of
unicode_strings is implemented on the current version of Perl that you
are running. 5.16 is extending it even more, to change the behavior of
quotemeta.


doy at tozt

Apr 26, 2012, 7:32 AM

Post #8 of 38 (344 views)
Permalink
Re: unicode question [In reply to]

On Thu, Apr 26, 2012 at 01:26:01AM -0700, Linda W wrote:
> Eric Brine wrote:
> >(1) refers to how Perl behaves in response to bugs in user code.
> ---
> ??? Bugs in user code the first 256 code points don't agree! The
> first 127 code points agree. But at encoding 80, you have to go to 2-byte
> encoding, to save everything, -- I don't understand when you say
> 'downgraded', as downgrading implies a loss of information. Where
> as UTF-8 can hold all of
> unicode, ISO-8859-1 only holds 256 bytes, the latter half of which are not
> unicode compatible because they have the high bit set.
>
> If Perl interprets **STDIN**, (not an arbitrary file opened with 'open', but
> standard stream'ed input from an all UTF-8 environment, then the assumption
> should be UTF-8 encoding.
>
> To do otherwise is going to cause problems.

Why are you assuming that text is the only thing that people ever pipe
to a program? Interpreting STDIN as UTF-8 would break something along
the lines of a perl implementation of gzip, for instance. This may not
be a bad thing for a default assuming it can be overridden, but it would
certainly not be backwards compatible.

-doy


ikegami at adaelis

Apr 26, 2012, 9:06 AM

Post #9 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

On Thu, Apr 26, 2012 at 2:49 AM, Brian Fraser <fraserbn [at] gmail> wrote:

> Syscalls and I/O aside (I mistakenly omitted the latter in my previous
> mail -- apologies), I think this prove you wrong:
>

You can't push those aside, since that's the only place automatic encoding
happens.

use feature 'unicode_strings';
my $bytes = "\xE9";
$bytes =~ /\w/;   # Matches. Chars were expected, but you provided bytes,
                  # so it's as if iso-8859-1 decoding happened.

my $char = "\x{2660}";
print $bin_fh $char;   # Outputs utf8. Bytes were expected, but you provided
                       # chars, so it encoded using utf8.

The so-called "default" is still asymmetric, exactly as described in the
quoted paragraph.

- Eric


ikegami at adaelis

Apr 26, 2012, 9:51 AM

Post #10 of 38 (341 views)
Permalink
Re: unicode question [In reply to]

On Thu, Apr 26, 2012 at 4:26 AM, Linda W <perl-diddler [at] tlinx> wrote:

> If Perl interprets STDIN (not an arbitrary file opened with 'open', but
> standard streamed input from an all-UTF-8 environment), then the
> assumption should be UTF-8 encoding.
>

No. At best, it's only valid to assume it's UTF-8 if the handle is known to
be text, and Perl has no way of knowing that.

To do otherwise would corrupt data.


> (3) refers to the fixing (when C<< use feature 'unicode_strings'; >> is
>> used) of most instances of a collection of bug in Perl known as "The
>> Unicode Bug".
>>
>
>
> What I didn't understand was why it is fixed in 5.14 but with a use 5.12
> statement?
>

Some instances were fixed in 5.12, some more in 5.14. Some haven't been
fixed.

>> use open ':std' => ':locale';
>
> ?!!?!?


Tells Perl you're expecting text encoded as per your locale. Does what you
want. Read the docs.
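
i.e. roughly this (a sketch, assuming an en_US.UTF-8 locale like the one
you posted):

    use strict;
    use warnings;
    use open ':std' => ':locale';   # layer STDIN/STDOUT/STDERR (and handles
                                    # opened in this scope) with the locale's
                                    # encoding

    if (defined(my $line = <STDIN>)) {     # already decoded to characters
        chomp $line;
        print "read ", length($line), " characters\n";
    }
    print "\x{263A}\n";             # encoded back out per the locale; no
                                    # "Wide character" warning under UTF-8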


> The default then and now is that Perl does not mess your file handles.
>
> It shouldn't.
>
> It shouldn't mess with UTF-8 encoded STDIN/STDOUT either.
>

Exactly. It doesn't. You get exactly what the file contains.


> It shouldn't assume a charset that's about 20 years out of date when most
> systems default to UTF-8 encoding (Windows aside)...
>

It doesn't. It makes no assumption whatsoever. You get exactly what is on
the other end of the handle. That's the only sane approach. Anything else
would corrupt some data.

> Perl returns the bytes it reads from the file handle as is. If your
> file handles are expected to have text of a certain encoding, it's up to
> you to decode it or to tell Perl to decode it. Perl has no way of knowing
> whether a file handle is used to transmit text or not, and it has no way of
> knowing the encoding of that text.
>
> ----
> If the encoding is NOT UTF-8, yes, but I thought it perl was fully UTF-8
> compliant now...?
>

Perl does indeed support UTF-8 and Unicode. That doesn't mean it'll assume
something is UTF-8 when it has no way to know it is.

IF perl properly pays attention to the environment
>>
>>
>> ...it would corrupt data on many file handles.
>>
> ---
> Never mentioned file handles, I'd talking <[STDIN]> and print
> [STDOUT/STDERR].
>

What do you think those are?!?!?!


> If there is an "asymmetry", then perl IS messing with the bytes. The
> same bytes should be able to go in as come out... asymmetry implies
> this isn't the case.

The same bytes do "go in as come out". There's no such implication.

The asymmetry mentioned regards Perl's behaviour when given buggy code.
Specifically, when you treat those bytes as unicode code points or vice
versa.

- Eric


ikegami at adaelis

Apr 26, 2012, 10:13 AM

Post #11 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

Linda,

When Perl sees, say, C3 A9 coming in from STDIN, it has no way of knowing
whether that means "é" (UTF-8), 43459 (little-endian 16-bit unsigned
integer) or something else. As such, absent instruction such as C<< use
open ':std', ':locale'; >>, it will return those bytes as is.

This is not a bug. This cannot be changed. Other languages do the same
thing, because there is no choice. Take Java for example. Java came out
after Unicode was out and embraced it. Yet, you must still specify the
stream must be decoded.

InputStream stream = System.in;
InputStreamReader byte_reader = new InputStreamReader(stream);
BufferedReader char_reader = new BufferedReader(byte_reader);
char_reader.readLine();

- Eric


jvromans at squirrel

Apr 26, 2012, 10:32 AM

Post #12 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

Jesse Luehrs <doy [at] tozt> writes:

> Why are you assuming that text is the only thing that people ever pipe
> to a program? Interpreting STDIN as UTF-8 would break something along
> the lines of a perl implementation of gzip, for instance.

I expect a Perl program that processes binary information from STDIN to
use an explicit binmode.

-- Johan
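
For instance, a gzip-ish filter would declare its handles binary up
front, and the encoding question goes away (sketch):

    use strict;
    use warnings;

    binmode STDIN;        # or binmode STDIN, ':raw': bytes in,
    binmode STDOUT;       # bytes out, no text layers ever applied

    local $/ = \65536;    # read fixed-size blocks rather than "lines"
    while (defined(my $block = <STDIN>)) {
        print $block;     # a do-nothing filter; a real one would compress here
    }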


tchrist at perl

Apr 26, 2012, 11:15 AM

Post #13 of 38 (345 views)
Permalink
Re: unicode question [In reply to]

Jesse Luehrs <doy [at] tozt> wrote
on Thu, 26 Apr 2012 09:32:59 CDT:

> Why are you assuming that text is the only thing that people ever pipe
> to a program? Interpreting STDIN as UTF-8 would break something along
> the lines of a perl implementation of gzip, for instance. This may not
> be a bad thing for a default assuming it can be overridden, but it would
> certainly not be backwards compatible.

Playing the devil's advocate for a moment, any program that assumes an
unmarked stream to be in binary not text is inherently broken on all
Microsoft-encumbered platforms, as those assume the contrary condition.

--tom

PS: Where Devil=Microsoft


tchrist at perl

Apr 26, 2012, 11:17 AM

Post #14 of 38 (343 views)
Permalink
Re: unicode question [In reply to]

> You can't push those aside, since that's the only place
> automatic encoding happens.

> use feature 'unicode_strings';
> my $bytes = "\xE9";
> $bytes =~ /\w/; # Matches. Chars were expected, but you provided bytes,
> # so it's as if iso-8859-1 decoding happened.

Eh? I see no bytes. Did you say use bytes? Don't think so.

That's just codepoint U+00E9.

But perhaps by byte you actually simply meant code point < 0x100?

--tom


tchrist at perl

Apr 26, 2012, 11:19 AM

Post #15 of 38 (347 views)
Permalink
Re: unicode question [In reply to]

Eric Brine <ikegami [at] adaelis> wrote
on Thu, 26 Apr 2012 12:51:28 EDT:

>> ** If Perl interprets **STDIN**, (not an arbitrary file opened with
>> 'open', but standard stream'ed input from an all UTF-8 environment,
>> then the assumption should be UTF-8 encoding.

> No. At best, it's only valid to assume it's UTF-8 if the handle
> is known to be text, and Perl has no way of knowing that.

> To do otherwise would corrupt data.

Well, yes, kinda.

But I've run with a PERL_UNICODE=SA environment for years, and
can't remember having to disable that.

The annoying thing is having to do PERL_UNICODE=SAD for certain
programs but not others. It really really annoys me that <ARGV>
could ever have a mixed encoding.

--tom
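
(Per perlrun, S covers STDIN/STDOUT/STDERR, A covers @ARGV, and D the
default open() layer. A rough in-script approximation, as a sketch; note
that -C applies the looser :utf8 layer, while this uses the stricter
:encoding(UTF-8):)

    # Roughly what -CSD / PERL_UNICODE=SD buys you, spelled out in the script:
    use open ':std', ':encoding(UTF-8)';   # standard handles, plus the default
                                           # layer for open() in this scope

    # There is no pragma for the A part; done by hand it looks like:
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, Encode::FB_CROAK) } @ARGV;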


tchrist at perl

Apr 26, 2012, 11:26 AM

Post #16 of 38 (341 views)
Permalink
Re: unicode question [In reply to]

Eric Brine <ikegami [at] adaelis> wrote
on Thu, 26 Apr 2012 13:13:23 EDT:

> When Perl sees, say, C3 A9 coming in from STDIN, it has no way of
> knowing whether that means "é" (UTF-8), 43459 (little-endian 16-bit
> unsigned integer) or something else. As such, absent instruction such
> as C<< use open ':std', ':locale'; >>, it will return those bytes as is.

> This is not a bug. This cannot be changed.

No argument at all. We once tried going down that route. It didn't
work. At all. From the perlrun manpage:

"-C" on its own (not followed by any number or option list), or the
empty string "" for the "PERL_UNICODE" environment variable, has the
same effect as "-CSDL". In other words, the standard I/O handles and
the default "open()" layer are UTF-8-fied but only if the locale
environment variables indicate a UTF-8 locale. This behaviour follows
the implicit (and problematic) UTF-8 behaviour of Perl 5.8.0.

It was a Bad Thing.

> Other languages do the same
> thing, because there is no choice. Take Java for example. Java came
> out after Unicode was out and embraced it. Yet, you must still specify
> the stream must be decoded.
>
> InputStream stream = System.in;
> InputStreamReader byte_reader = new InputStreamReader(stream)
> BufferedReader char_reader = new BufferedReader(byte_reader);
> char_reader.readLine();

This is one of the most common bugs with Java code: they don't set
the character encoding, which gets a "platform default" that I promise
you that you don't ever want: it's always 8 bit. Furthermore, the
default Java encoder/decoder code suppresses errors. You get bogus
input and bogus output *all the time*. Here's my standard example
of getting all one's ducks in order in Java:

Process
    slave_process = Runtime.getRuntime().exec("perl -CS script args");

OutputStream
    __bytes_into_his_stdin = slave_process.getOutputStream();

OutputStreamWriter
    chars_into_his_stdin = new OutputStreamWriter(
        __bytes_into_his_stdin,
        /* DO NOT OMIT! */ Charset.forName("UTF-8").newEncoder()
    );

InputStream
    __bytes_from_his_stdout = slave_process.getInputStream();

InputStreamReader
    chars_from_his_stdout = new InputStreamReader(
        __bytes_from_his_stdout,
        /* DO NOT OMIT! */ Charset.forName("UTF-8").newDecoder()
    );

InputStream
    __bytes_from_his_stderr = slave_process.getErrorStream();

InputStreamReader
    chars_from_his_stderr = new InputStreamReader(
        __bytes_from_his_stderr,
        /* DO NOT OMIT! */ Charset.forName("UTF-8").newDecoder()
    );


Blech.

--tom


tchrist at perl

Apr 26, 2012, 11:27 AM

Post #17 of 38 (340 views)
Permalink
Re: unicode question [In reply to]

Johan Vromans <jvromans [at] squirrel> wrote
on Thu, 26 Apr 2012 19:32:52 +0200:

> Jesse Luehrs <doy [at] tozt> writes:

>> Why are you assuming that text is the only thing that people ever pipe
>> to a program? Interpreting STDIN as UTF-8 would break something along
>> the lines of a perl implementation of gzip, for instance.

> I expect a Perl program that processes binary information from STDIN to
> use an explicit binmode.

Ayup.

--tom


perl-diddler at tlinx

Apr 26, 2012, 12:23 PM

Post #18 of 38 (344 views)
Permalink
Re: unicode question [In reply to]

Jesse Luehrs wrote:
> Why are you assuming that text is the only thing that people ever pipe
> to a program? Interpreting STDIN as UTF-8 would break something along
> the lines of a perl implementation of gzip, for instance. This may not
> be a bad thing for a default assuming it can be overridden, but it would
> certainly not be backwards compatible.
>
---
Because historically STDIN/STDOUT were not always 8-bit safe? --
especially over telnet connections?

I don't know if that's the case any more, but even today, there are
some mail handlers that don't do well with 8-bit encoding -- and you
need to 7-bit encode things to send such things "through pipes".

Pipes were designed to manipulate I/O to or from a terminal or a file
containing terminal-displayable data -- whole bunches of messes can
happen on some platforms if you pipe random binary data into any program
that does I/O to STDIN/STDOUT.

Try using cat /etc/bash sometime and see how well your terminal
likes that.

Besides, doesn't perl do default text processing on STDIN/OUT? I
don't know if it would be binary transparent (especially not if there's
asymmetry in I/O).


ikegami at adaelis

Apr 26, 2012, 1:01 PM

Post #19 of 38 (344 views)
Permalink
Re: unicode question [In reply to]

On Thu, Apr 26, 2012 at 2:17 PM, Tom Christiansen <tchrist [at] perl> wrote:

>
> > You can't push those aside, since that's the only place
> > automatic encoding happens.
>
> > use feature 'unicode_strings';
> > my $bytes = "\xE9";
> > $bytes =~ /\w/; # Matches. Chars were expected, but you provided
> bytes,
> > # so it's as if iso-8859-1 decoding happened.
>
> Eh? I see no bytes. Did you say use bytes? Don't think so.
>

>
> That's just codepoint U+00E9.
>

I most assuredly placed a byte in that scalar. I got it from /dev/random.
The match operator requires a code point, but that's not what I passed to
it. Yet, my code is buggy. Then when the asymmetry occurs.


ikegami at adaelis

Apr 26, 2012, 1:03 PM

Post #20 of 38 (344 views)
Permalink
Re: unicode question [In reply to]

On Thu, Apr 26, 2012 at 4:01 PM, Eric Brine <ikegami [at] adaelis> wrote:

> On Thu, Apr 26, 2012 at 2:17 PM, Tom Christiansen <tchrist [at] perl>wrote:
>
>>
>> > You can't push those aside, since that's the only place
>> > automatic encoding happens.
>>
>> > use feature 'unicode_strings';
>> > my $bytes = "\xE9";
>> > $bytes =~ /\w/; # Matches. Chars were expected, but you provided
>> bytes,
>> > # so it's as if iso-8859-1 decoding happened.
>>
>> Eh? I see no bytes. Did you say use bytes? Don't think so.
>>
>
>>
> That's just codepoint U+00E9.
>>
>
> I most assuredly placed a byte in that scalar. I got it from /dev/random.
> The match operator requires a code point, but that's not what I passed to
> it. Yet, my code is buggy. Then when the asymmetry occurs.
>

That should read "That's when the asymmetry occurs."


tchrist at perl

Apr 26, 2012, 1:04 PM

Post #21 of 38 (341 views)
Permalink
Re: unicode question [In reply to]

>I most assuredly placed a byte in that scalar. I got it from /dev/random.

I don't think so. You placed a code point smaller than 256 there.

--tom


zefram at fysh

Apr 26, 2012, 1:17 PM

Post #22 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

Linda W wrote:
> Because historically STDIN/STDOUT were not always 8-bit safe?

Unix file descriptors have always been 8-bit safe, and furthermore
binary-safe. (9-bit binary safe in at least one implementation.)

>especially
>over telnet connections?

TCP is 8-bit safe, and so is telnet. telnet is not by default
*binary*-safe, however, due to its intended purpose as a virtual terminal.
Actual terminals are never properly binary-safe (actually the concept
doesn't properly apply), and many will do something funny with the top
bit of 8-bit bytes.

> I don't know if that's the case any more, but even today, there
>are some mail handlers that don't do well with 8-bit encoding

Mail transport has historically been neither binary-safe nor 8-bit safe.
Today it's commonly 8-bit safe, but that doesn't really matter, because
MIME makes mail messages binary-safe even on old transport infrastructure.

>you need to 7-bit encode things to send such things "through pipes".
>
> Pipes were designed manipulate I/O to or from a terminal or a file
>containing terminal displayable data

Rubbish. Unix pipes were designed to handle any data, and as such have
always been 8-bit (or 9-bit) binary safe.

>happen on some platforms if you pipe random binary text into any
>program that does I/O
>to STDIN/STDOUT.

Unix utility programs have historically often not been binary-safe.
That's a bug in those programs, not present in modern versions, and not
a feature of the OS.

> Try using cat /etc/bash sometime and see how well your terminal
>likes that.

Terminals again; yes, they're not binary-safe. Terminals, as the
name suggests, are not intended to act as transparent pipes to convey
arbitrary data.

You grossly misunderstand Unix by supposing that stdin and stdout
necessarily refer to terminals, or that any other part of the I/O
infrastructure is specific to terminals.

> Besides, doesn't perl do default text processing on STDIN/OUT?

If I understand you correctly, "default text processing" has historically
been null on Unix. A stream can be used equally well for text and for
binary data with no difference in how it is operated, and Unix programs
have always been able to rely on that. Applying a non-identity encoding
layer by default would break this.

-zefram


zefram at fysh

Apr 26, 2012, 1:23 PM

Post #23 of 38 (345 views)
Permalink
Re: unicode question [In reply to]

Tom Christiansen wrote:
[quoting Eric Brine]
>>I most assuredly placed a byte in that scalar.
>
>I don't think so. You placed a code point smaller than 256 there.

In Perl these two are synonymous; Perl aliases them. The difference
between byte and character below U+0100 is only one of intent.
So extensionally Eric did place a byte in the scalar, and intensionally
I think he's canonical about the intent of his code, so contradicting
him is wrong on either count. Your statement that he placed a codepoint
in the scalar is extensionally just as correct as his statement, and
intensionally incorrect if Eric's intent is taken as canonical.

-zefram


perl-diddler at tlinx

Apr 26, 2012, 7:00 PM

Post #24 of 38 (341 views)
Permalink
Re: unicode question [In reply to]

Zefram wrote:
> Linda W wrote:
>
>> Because historically STDIN/STDOUT were not always 8-bit safe?
>>
>
> Unix files descriptors have always been 8-bit safe, and furthermore
> binary-safe. (9-bit binary safe in at least one implementation.)
>
I didn't say unix file descriptors -- argue the straw man will ya?

>
>> especially
>> over telnet connections?
>>
>
> TCP is 8-bit safe, and so is telnet. telnet is not by default
> *binary*-safe, however, due to its intended purpose as a virtual terminal.
>
----
We are talking 8-bit safe in the context of binary, so please don't
confuse people by talking between the lines.


> Actual terminals are never properly binary-safe (actually the concept
> doesn't properly apply), and many will do something funny with the top
> bit of 8-bit bytes.
>
----
We are talking I/O to terminals -- not files.

>
>> I don't know if that's the case any more, but even today, there
>> are some mail handlers that don't do well with 8-bit encoding
>>
>
> Mail transport has historically been neither binary-safe nor 8-bit safe.
> Today it's commonly 8-bit safe, but that doesn't really matter, because
> MIME makes mail messages binary-safe even on old transport infrastructure.
>
====
Um... yeah. That's sorta what I said, though not as detailed.

>
>> you need to 7-bit encode things to send such things "through pipes".
>>
>> Pipes were designed manipulate I/O to or from a terminal or a file
>> containing terminal displayable dat
>>
>
> Rubbish. Unix pipes were designed to handle any data, and as such have
> always been 8-bit (or 9-bit) binary safe.
>
----
Well, double rubbish... yeah, you are right, but I'm thinking of pipes in
a different sense -- not unix pipes as they are now, but as they were
originally created in unix: as a way to chain together various filters
that processed text, filters that DID make assumptions about the data
being TEXTual... so it wasn't safe to blindly pipe binary data through
random programs. Also -- off hand, I don't know of any linux pipes that
handle a 9-bit data type... but I'm sure there have been all sorts of
word sizes...
>
>> happen on some platforms if you pipe random binary text into any
>> program that does I/O
>> to STDIN/STDOUT.
>>
>
> Unix utility programs have historically often not been binary-safe.
> That's a bug in those programs, not present in modern versions, and not
> a feature of the OS.
>
----
This is the whole point -- expecting your STDIN/STDOUT to be binary
safe is not logical -- it's not done.

Perl wasn't written as a binary processor. It was written as
a super "shell+awk+grep+sed+tr" all rolled into one -- those all processed
TEXT... None of those were designed for binary (doesn't mean they might not
be used for such), but Perl was designed for text files.

Text today is Unicode in most environments (UTF-8 in *nix, and usually
UCS-2 in Windows -- though few of their progs really support Unicode past
V2.0; Unicode is at version 6.1, and MS support is somewhere around 3.5 at
most... Idiots... they build roadblocks into their SW when they could just
display the decoded chars by following the formula... but they block chars
they haven't approved of yet (not Unicode -- MS!).


>
>> Try using cat /etc/bash sometime and see how well your terminal
>> likes that.
>>
>
> Terminals again; yes, they're not binary-safe. Terminals, as the
> name suggests, are not intended to act as transparent pipes to convey
> arbitrary data.
>
> You grossly misunderstand Unix by supposing that stdin and stdout
> necessarily refer to terminals, or that any other part of the I/O
> infrastructure is specific to terminals.
>
-----
You grossly misunderstand the problem.

We are talking about using perl to process material on STDIN/STDOUT at
a terminal, in an environment with a standard locale set.

Any other stuff about unix pipes is you confusing the issue..

I know you can transfer binary data over pipes -- but those are pipes
that are not connected to terminals (usually)... sockets, named pipes,
etc. all handle binary data.... But UTF-8 is an encoding designed for
humans to look at -- not machines. We are discussing perl's ability to
decipher and/or encode data to be read directly by humans -- not binary
data. Please don't confuse the issue.


>> Besides, doesn't perl do default text processing on STDIN/OUT?
>>
>
> If I understand you correctly, "default text processing" has historically
> been null on Unix.
You don't understand -- Perl != unix. Unix != Perl. I'm sure Larry
would glow at your equating the two, but they aren't the same.


> A stream can be used equally well for ...
>
---
Looking to fabricate arguments? Bored? Troll much?


perl-diddler at tlinx

Apr 26, 2012, 7:10 PM

Post #25 of 38 (342 views)
Permalink
Re: unicode question [In reply to]

Tom Christiansen wrote:
> Eric Brine <ikegami [at] adaelis> wrote
> on Thu, 26 Apr 2012 13:13:23 EDT:
>
>
>> When Perl sees, say, C3 A9 coming in from STDIN, it has no way of
>> knowing whether that means "é" (UTF-8), 43459 (little-endian 16-bit
>> unsigned integer) or something else. As such, absent instruction such
>> as C<< use open ':std', ':locale'; >>, it will return those bytes as is.
>>
>
>
>> This is not a bug. This cannot be changed.
>>
>
> No argument at all. We once tried going down that route. It didn't
> work. At all. From the perlrun manpage:
>
> "-C" on its own (not followed by any number or option list), or the
> empty string "" for the "PERL_UNICODE" environment variable, has the
> same effect as "-CSDL". In other words, the standard I/O handles and
> the default "open()" layer are UTF-8-fied but only if the locale
> environment variables indicate a UTF-8 locale. This behaviour follows
> the implicit (and problematic) UTF-8 behaviour of Perl 5.8.0.
>
-----
AH HA!....

That was when I remember it becoming UTF-8 compat!... and you are saying
that was undone.... !)!@()$(!)*%#)%T@$^%(&)($%

Why wasn't the "undoing" put behind a "use 5.10" or some such...
Every frickin' time I upgrade perl (5.6, 5.8, 5.10, 5.12, 5.14), multiple
programs break in non-trivial ways.


It sucks. Why bother with deprecation and "use feature" posing, if
people are going to break compat every cycle?


> It was a Bad Thing.
>
----
Worked for me.... I relied on it, and now I am trying to figure out why
another prog broke -- one from well before that era, actually, but it got
a major overhaul then along with a few others, since I've been an early
adopter of Unicode/UTF-8 for some time.


> This is one of the most common bugs with Java code: they don't set
> the character encoding, which gets a "platform default" that I promise
> you that you don't ever want: it's always 8 bit.
That's the bogus part. On linux for the past 10 years, platform defaults
have been UTF-8 -- not 8-bit charsets. The last Windows platform to NOT
have native unicode support was Win98/ME. WinXP and beyond was 16-bit
unicode (later got left behind when they needed UTF-16 and not just
UCS-2)...


Perl wasn't designed to be Java.

Is it now the goal to emulate Java???


> Blech.
>
Java --- blech!... ;-)
