
Mailing List Archive: Perl: porters

on the almost impossibility to write correct XS modules

 

 



schmorp at schmorp

Apr 25, 2008, 5:13 PM

Post #1 of 52 (596 views)
Permalink
on the almost impossibility to write correct XS modules

Hi!

I recently found out that it is almost impossible to write XS modules that
deal with unicode correctly, and here is why:

First, the long-known-issue:

In XS parameters, the type "char *" is utterly useless, as you have no clue
about the encoding of the characters. This even breaks backward compatibility
to existing xs modules, which do not expect character values >255.

A lot of modules on CPAN have been broken by this incompatible change in
5.6 or so.

Now, how about fixing it?

Some modules started to use different typemap entries to work around this
issue, for example:

void LOG (utf8_string msg)

T_OCTETS
$var = SvPVbyte_nolen ($arg)

T_UTF8 // == utf8_string
$var = SvPVutf8_nolen ($arg)

Unfortunately, unlike other, similar, functions (like SvIV, SvPV etc.), this
easily destroys the scalar value:

LOG ("see this object:");
LOG ($obj);
# $obj no longer an object here, it became a string

So unlike other accessor functions such as SvPV, SvPVutf8 changes the
contents of the SV in a very visible way (while SvIV doesn't destroy the
string, for example).

I can understand why it does so, but the problem is, there is simply no good
way to deal with utf-8 in XS as the API is extremely hostile at the moment.

To get it right, I think one has to do something like this (this can be
optimised of course, but that makes it even more complicated):

T_UTF8
$var = SvPVutf8_nolen (sv_mortalcopy ($arg))
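
To illustrate why working on a copy is non-destructive, here is a minimal sketch in standalone C. It is plain C, not the perl API, and the function name latin1_to_utf8_copy is hypothetical. It models roughly what upgrading a byte string to UTF-8 does at the byte level (each byte taken as a code point 0..255), but writes the result into a fresh buffer so the caller's data stays untouched, which is the same idea as applying SvPVutf8_nolen to sv_mortalcopy ($arg).

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical helper: convert a Latin-1 byte string to UTF-8 in a newly
 * allocated buffer. The input is only read, never modified.
 * Returns NULL if allocation fails; the caller frees the result. */
char *latin1_to_utf8_copy(const unsigned char *src, size_t len, size_t *out_len)
{
    /* Worst case: every input byte is >= 0x80 and needs two output bytes. */
    unsigned char *dst = malloc(2 * len + 1);
    size_t i, n = 0;

    if (dst == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            dst[n++] = src[i];                  /* ASCII: one byte, unchanged */
        } else {
            dst[n++] = 0xC0 | (src[i] >> 6);    /* lead byte: top two bits */
            dst[n++] = 0x80 | (src[i] & 0x3F);  /* continuation: low six bits */
        }
    }
    dst[n] = '\0';
    if (out_len != NULL)
        *out_len = n;
    return (char *)dst;
}
```

The cost of this approach is the extra allocation and copy on every call, which is the inefficiency the "this can be optimised" remark refers to.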

I think the situation with unicode and cpan perl modules cannot improve
as long as it is so difficult to do something as simple as get at the string
data in a non-random/godgiven encoding.

Also, even though it is 5.10 now, it should be *seriously* considered to
replace the almost completely useless char * typemap entry by something
that gives you octets (preferably non-destructively). Or somebody explain
to me when "char *" does something useful in current perl versions without
tinkering with retesting ST(x) manually...

Just my 0.02€.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg[at]goof.com
-=====/_/_//_/\_,_/ /_/\_\


rgarciasuarez at gmail

Apr 26, 2008, 8:33 AM

Post #2 of 52 (574 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/4/26 Marc Lehmann <schmorp[at]schmorp.de>:
> Some modules started to use different typemap entries to work around this
> issue, for example:
>
> void LOG (utf8_string msg)
>
> T_OCTETS
> $var = SvPVbyte_nolen ($arg)
>
> T_UTF8 // == utf8_string
> $var = SvPVutf8_nolen ($arg)
>
> Unfortunately, unlike other, similar, functions (like SvIV, SvPV etc.), this
> easily destroys the scalar value:
>
> LOG ("see this object:");
> LOG ($obj);
> # $obj no longer an object here, it became a string

You mean, like "Class=HASH(0xDEADBEEF)", I suppose (I haven't checked).

> So unlike other accessor functions such as SvPV, SvPVutf8 changes the
> contents of the SV in a very visible way (while SvIV doesn't destroy the
> string, for example).
>
> I can understand why it does so, but the problem is, there is simply no good
> way to deal with utf-8 in XS as the API is extremely hostile at the moment.
>
> To get it right, I think one has to do something like this (this can be
> optimised of course, but that makes it even more complicated):
>
> T_UTF8
> $var = SvPVutf8_nolen (sv_mortalcopy ($arg))
>
> I think the situation with unicode and cpan perl modules cannot improve
> as long as it is so difficult to do something as simple as get at the string
> data in a non-random/godgiven encoding.

That's right, and that's probably also why people find it difficult to
handle utf-8 in perl as soon as they begin using XS modules.

I think that this screams for a new macro which would be more or less
the one you suggested here, maybe implemented in a more efficient way if
possible.

> Also, even though it is 5.10 now, it should be *seriously* considered to
> replace the almost completely useless char * typemap entry by something
> that gives you octets (preferably non-destructively). Or somebody explain
> to me when "char *" does something useful in current perl versions without
> tinkering with retesting ST(x) manually...

You mean this one ?
T_PV
$var = ($type)SvPV_nolen($arg)
Here, the result would be dependent on the internal representation of
the string in perl, so I suppose you would like to change this to
something that uses SvPVbyte. I wonder, however, how much code would
break with that change. Actually I suspect that more code would be fixed
than broken, but that's a wild intuition...


schmorp at schmorp

Apr 26, 2008, 8:59 PM

Post #3 of 52 (571 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Sat, Apr 26, 2008 at 05:33:02PM +0200, Rafael Garcia-Suarez <rgarciasuarez[at]gmail.com> wrote:
> > LOG ("see this object:");
> > LOG ($obj);
> > # $obj no longer an object here, it became a string
>
> You mean, like "Class=HASH(0xDEADBEEF)", I suppose (I haven't checked).

Yes, for example references.

> > I think the situation with unicode and cpan perl modules cannot improve
> > as long as it is so difficult to do something as simple as get at the string
> > data in a non-random/godgiven encoding.
>
> That's right, and that's probably also why people find it difficult to
> handle utf-8 in perl as soon as they begin using XS modules.

Yes, it becomes unpredictable. It is also not very helpful when one has to
fight for every bugfix regarding unicode in perl (the bugfix could break
existing code), and when everybody mentally uses and propagates a slightly
different unicode model.

The biggest problem with perl and unicode is that it isn't consistent, and
little is done to make it consistent.

And little can be done from the side of XS to make this work.

Now, to make this mail a bit more helpful, which flags would I need to check
to achieve the following:

1. if the SV can safely be SvPVutf8'd, just do it
2. if not, SvPVutf8 a mortal copy

i.e., how could I check that calling SvPVutf8 (or bytes) on a scalar is
"safe" in the sense of not modifying it w.r.t. perl visibility?

> I think that this screams for a new macro which would be more or less
> the one you suggested here, maybe implemented in a more efficient way if
> possible.

Well, I think that would be a very bad way to solve the problem.

For example, in the reference case, the stringified non-utf-8 reference is
still valid utf-8, so we don't have to create an sv with an utf-8 string.

(if backwards-compatibility is a problem, name the macro differently, but
both solutions would require that).
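
The observation about references can be sketched in standalone C (not the perl API; is_ascii_only is a hypothetical name). A string made only of bytes below 0x80, such as a stringified reference like "Class=HASH(0xDEADBEEF)", has exactly the same bytes in Latin-1 and in UTF-8, so no converted copy of the string data would be needed for it.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte is below 0x80, in which case
 * the buffer is byte-for-byte identical whether read as Latin-1 or as UTF-8.
 * Returns 0 as soon as a byte >= 0x80 is found. */
int is_ascii_only(const char *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if ((unsigned char)s[i] >= 0x80)
            return 0;
    }
    return 1;
}
```

Only strings that fail this check would need the copy-and-convert path; everything else could be handed out as-is.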

> You mean this one ?
> T_PV
> $var = ($type)SvPV_nolen($arg)
> Here, the result would be dependent on the internal representation of
> the string in perl, so I suppose you would like to change this to
> something that uses SvPVbyte. I wonder, however, how much code would
> break with that change. Actually I suspect that more code would be fixed
> than broken, but that's a wild intuition...

That's my point: all modules that haven't been updated would be fixed.

The problem that people out there have with unicode in perl is that it
doesn't work with their favourite module. When it doesn't work, they start to
google and find stuff about this utf-8 flag that they then want to
manipulate, thinking unicode==utf-8 flag, and this doesn't work too well
either, so they get frustrated etc.

I keep giving talks about perl's unicode model, but it is hard to explain it
without also saying that it doesn't work in practice, and everything is
immensely complicated in practice, and....

Now, as for changing T_PV:

what could be done trivially (I'd happily send a patch) would be to provide

typedef char char_bytes;
typedef char char_utf8;

char_utf8 T_PVutf8
char_bytes T_PVbytes

*preferably* with the mortalcopy trick or a more efficient/perlish
solution to the SvPVutf8 problem.

Changing T_PV itself is something that should really be considered,
though. Code that breaks almost certainly exists, but isn't it somewhat
questionable anyway to use "char *" in perl and then get access to the
flag in other ways (e.g. by counting arguments and using ST(n))?

the point is, code that uses char * without looking at the flag in other
ways is simply broken, it has no defined/documentable behaviour on the
perl level.



nick at ccl4

May 8, 2008, 9:04 AM

Post #4 of 52 (546 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Sun, Apr 27, 2008 at 05:59:53AM +0200, Marc Lehmann wrote:

> the point is, code that uses char * without looking at the flag in other
> ways is simply broken, it has no defined/documentable behaviour on the
> perl level.

Agree. The code can't know for sure what it's getting.

I'm still digesting the rest of the ideas in the thread.

Nicholas Clark


davidnicol at gmail

May 9, 2008, 12:38 PM

Post #5 of 52 (540 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Thu, May 8, 2008 at 4:04 PM, Nicholas Clark
>
> The code can't know for sure what it's getting.

IIRC, one of the ideas kicked around when Ingy stunned the world with
Inline::C was that with the smaller set of standard macros, code
written with Inline rather than directly in XS would be more
forward-compatible; a trend would occur (magical thinking!) of
maintainers of XS modules rewriting using Inline instead.


demerphq at gmail

May 14, 2008, 2:55 PM

Post #6 of 52 (527 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/8 Nicholas Clark <nick[at]ccl4.org>:
> On Sun, Apr 27, 2008 at 05:59:53AM +0200, Marc Lehmann wrote:
>
> > the point is, code that uses char * without looking at the flag in other
> > ways is simply broken, it has no defined/documentable behaviour on the
> > perl level.
>
> Agree. The code can't know for sure what it's getting.
>
> I'm still digesting the rest of the ideas in the thread.

This problem crops up in the core as well as in XS. For instance our
internal file apis use char* parameters. (This is relevant for Win32
in particular)

yves




--
perl -Mre=debug -e "/just|another|perl|hacker/"


schmorp at schmorp

May 15, 2008, 4:02 PM

Post #7 of 52 (520 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Wed, May 14, 2008 at 11:55:40PM +0200, demerphq <demerphq[at]gmail.com> wrote:
> > I'm still digesting the rest of the ideas in the thread.
>
> This problem crops in the core as well as in XS. For instance our

Didn't know that, at least in the core it would be "easily" fixable,
though.

The bigger problem seems to be that there is no universal agreement on
how unicode works and what the utf-8 flag means between perl5-porters,
and this issue would have to be tackled first. Violations of the official
model should be treated as bugs. I remember well how hard it was to get
the unpack fix in, and don't really want to repeat the experience.

> internal file apis use char* parameters. (This is relevant for Win32
> in particular)

From what I saw, however, filename handling on win32 is broken beyond
repair, as much of the win32 code overloads the utf-8 flag with the
additional meaning "unicode encoded, not ansi codepage" (see for example
almost all functions dealing with paths in Win32.xs).

In the BDB module, I had to manually convert from ansi to utf-8 *iff* the
utf-8 flag isn't set, as that is the only way to find out the encoding of
the filename. Pray there are no upgrades/downgrades.

This is the code I use, and of course, it doesn't work reliably, but at
least with manual enforcement on the perl level you can access files if
you outwit perl: http://ue.tst.eu/11efe1d0af10d044d9c6fe5061d9ee77.txt

So in practice, this is not a real problem, as non-ascii filenames don't
really work under perl on windows anyways :) *oink*.



jand at activestate

May 15, 2008, 4:31 PM

Post #8 of 52 (518 views)
Permalink
RE: on the almost impossibility to write correct XS modules [In reply to]

On Thu, 15 May 2008, Marc Lehmann wrote:
> On Wed, May 14, 2008 at 11:55:40PM +0200, demerphq <demerphq[at]gmail.com> wrote:
> From what I saw, however, filename handling on win32 is broken beyond
> repair, as much of the win32 code overloads the utf-8 flag with the
> additional meaning "unicode encoded, not ansi codepage" (see for example
> almost all functions dealing with paths in Win32.xs).

Why is this beyond repair? The problem of course is that Perl on Windows
assumes that strings without the utf-8 flag are encoded with Latin-1
encoding whereas they really are ANSI encoded. So once the automatic
upgrading assumes ANSI encoding instead of Latin-1, everything should be
working correctly, no?

Cheers,
-Jan


rurban at x-ray

May 16, 2008, 5:45 AM

Post #9 of 52 (514 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/16 Jan Dubois <jand[at]activestate.com>:
> On Thu, 15 May 2008, Marc Lehmann wrote:
>> On Wed, May 14, 2008 at 11:55:40PM +0200, demerphq <demerphq[at]gmail.com> wrote:
>> From what I saw, however, filename handling on win32 is broken beyond
>> repair, as much of the win32 code overloads the utf-8 flag with the
>> additional meaning "unicode encoded, not ansi codepage" (see for example
>> almost all functions dealing with paths in Win32.xs).
>
> Why is this beyond repair? The problem of course is that Perl on Windows
> assumes that strings without the utf-8 flag are encoded with Latin-1
> encoding whereas they really are ANSI encoded. So once the automatic
> upgrading assumes ANSI encoding instead of Latin-1, everything should be
> working correctly, no?

To me the Win32 attempt also sounds fine, and I'm using it as a template for
the upcoming wide char support within cygwin-1.7.

I make the same assumptions as the Win32 port, so we have to handle
the native IO with wchar_t * and the whole mb(=utf8)<->wide enchilada quirks.
--
Reini Urban
http://phpwiki.org/ http://murbreak.at/


schmorp at schmorp

May 17, 2008, 6:17 AM

Post #10 of 52 (506 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Thu, May 15, 2008 at 04:31:13PM -0700, Jan Dubois <jand[at]activestate.com> wrote:
> encoding whereas they really are ANSI encoded. So once the automatic
> upgrading assumes ANSI encoding instead of Latin-1, everything should be
> working correctly, no?

Uhm.... that one can even suggest such brokenness :)

Of course basically everything will break, you mean, because the
assumption that it's not latin1 of course breaks roughly all code dealing
with unicode in perl, which doesn't expect that perl suddenly uses ANSI
instead of unicode codepoints (they differ!).



jand at activestate

May 17, 2008, 10:50 AM

Post #11 of 52 (503 views)
Permalink
RE: on the almost impossibility to write correct XS modules [In reply to]

On Sat, 17 May 2008, Marc Lehmann wrote:
> On Thu, May 15, 2008 at 04:31:13PM -0700, Jan Dubois <jand[at]activestate.com> wrote:
> > encoding whereas they really are ANSI encoded. So once the automatic
> > upgrading assumes ANSI encoding instead of Latin-1, everything should be
> > working correctly, no?
>
> Uhm.... that one can even suggest such brokenness :)

I see the smiley, but I'm not sure I understand the comment. Surely the
actual strings without SvUTF8 set are encoded in the system default ANSI
codepage: text returned by qx() will be ANSI encoded, filenames returned
by readdir() will be ANSI encoded and so on. This is just the nature of
the 8-bit OS API.

The brokenness right now is that when Perl automatically upgrades this
data to UTF8, it assumes that the data is Latin1 instead of ANSI,
potentially garbling the data if it contained codepoints where the
current ANSI codepage and Latin1 are different.
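
The divergence described here can be made concrete with a small standalone C sketch. It assumes the ANSI codepage is Windows-1252 (the actual codepage depends on the system locale), and the function names are hypothetical. Latin-1 maps every byte to the code point of the same value, while Windows-1252 assigns printable characters to most of the range 0x80..0x9F, so the same byte decodes to different code points.

```c
#include <stdint.h>

/* Marker used by this sketch for bytes Windows-1252 leaves undefined. */
#define CP_UNDEF 0xFFFDu

/* Windows-1252 code points for bytes 0x80..0x9F. */
static const uint16_t cp1252_high[32] = {
    0x20AC, CP_UNDEF, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, CP_UNDEF, 0x017D, CP_UNDEF,
    CP_UNDEF, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, CP_UNDEF, 0x017E, 0x0178
};

/* Latin-1 interpretation: the byte value is the code point. */
uint32_t latin1_to_codepoint(unsigned char b)
{
    return b;
}

/* Windows-1252 interpretation: differs from Latin-1 only in 0x80..0x9F. */
uint32_t cp1252_to_codepoint(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_high[b - 0x80];
    return b;
}
```

For example, byte 0x80 is U+0080 (a control character) under the Latin-1 reading but U+20AC (the euro sign) under the Windows-1252 reading, while bytes such as 0xE9 agree in both.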

How would you want to "fix" this then? Translate all 8-bit data when it
is read from the OS from ANSI to Latin1? That seems a lot harder, and
will also be quite unintuitive.

> Of course basically everything will break, you mean, because the
> assumption that it's not latin1 of course breaks roughly all code dealing
> with unicode in perl, which doesn't expect that perl suddenly uses ANSI
> instead of unicode codepoints (they differ!).

Only code making the explicit assumption that 8-bit strings are encoded
in Latin1 is going to break. All code relying on the implicit conversion
between 8-bit and UTF8 will actually be fixed and not broken by this
change. :)

Cheers,
-Jan


ben at morrow

May 17, 2008, 11:38 AM

Post #12 of 52 (503 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

Quoth jand[at]activestate.com ("Jan Dubois"):
>
> The brokenness right now is that when Perl automatically upgrades this
> data to UTF8, it assumes that the data is Latin1 instead of ANSI,
> potentially garbling the data if it contained codepoints where the
> current ANSI codepage and Latin1 are different.

So you would have

"\xff"

and

substr "\xff\x{100}", 0, 1

be different? Or would you have ord("\xff") != 0xff ? More generally,
how would you handle the relationship between bytes, characters and
Unicode codepoints, if not as it is done now? If integers less than 256
are mapped to their ANSI codepoints rather than their ISO8859-1
codepoints, how do you get those characters in Latin1 that aren't in the
ANSI codepage?
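
The invariant behind these questions can be sketched in standalone C (a model under stated assumptions, not perl internals; first_char and its utf8_flag parameter are hypothetical, and the input is assumed to be well-formed). The first character of "\xff" stored as a single byte and the first character of "\xff\x{100}" stored as UTF-8 must come out as the same code point, 0xFF, for the two representations to be interchangeable.

```c
#include <stdint.h>

/* Hypothetical model: return the code point of the first character.
 * utf8_flag == 0: the buffer holds one byte per character (byte == code point).
 * utf8_flag != 0: the buffer holds well-formed UTF-8 (not validated here). */
uint32_t first_char(const unsigned char *s, int utf8_flag)
{
    if (!utf8_flag || s[0] < 0x80)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)          /* two-byte sequence */
        return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0)          /* three-byte sequence */
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             | (s[2] & 0x3F);
    /* four-byte sequence */
    return ((uint32_t)(s[0] & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         | (s[3] & 0x3F);
}
```

Here the single byte FF with the flag off and the bytes C3 BF (followed by C4 80 for U+0100) with the flag on both yield 0xFF. If the flag-off case were decoded through an ANSI codepage table instead, the two could disagree for bytes where that codepage differs from Latin-1.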

> How would you want to "fix" this then? Translate all 8-bit data when it
> is read from the OS from ANSI to Latin1? That seems a lot harder, and
> will also be quite unintuitive.

I would say Win32 should exclusively use the Unicode APIs, and treat
8-bit strings the same as their upgraded equivalents (that is, as
ISO8859-1). This may break code that reads ANSI-encoded data from a file
under the assumption it will be passed to the 8-bit API, of course.

> Only code making the explicit assumption that 8-bit strings are encoded
> in Latin1 is going to break. All code relying on the implicit conversion
> between 8-bit and UTF8 will actually be fixed and not broken by this
> change. :)

Perl explicitly documents that 8-bit data is treated as ISO8859-1,
except on EBCDIC platforms.

Ben

--
BEGIN{*(=sub{$,=*)=sub{print[at]_};local($#,$;,$/)=@_;for(keys%{ #ben[at]morrow.me.uk
$#}){/m/&&next;**=${$#}{$_};/(\w):/&&(&(($#.$_,$;.$+,$/),next);$/==\$*&&&)($;.$
_)}};*_=sub{for(@_){$|=(!$|||$_||&)(q) )));&((q:\:\::,q,,,\$_);$_&&&)("\n")}}}_
$J::u::s::t, $a::n::o::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $,


demerphq at gmail

May 18, 2008, 2:36 AM

Post #13 of 52 (477 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/17 Ben Morrow <ben[at]morrow.me.uk>:
>
> Quoth jand[at]activestate.com ("Jan Dubois"):
>>
>> The brokenness right now is that when Perl automatically upgrades this
>> data to UTF8, it assumes that the data is Latin1 instead of ANSI,
>> potentially garbling the data if it contained codepoints where the
>> current ANSI codepage and Latin1 are different.
>
> So you would have
>
> "\xff"
>
> and
>
> substr "\xff\x{100}", 0, 1
>
> be different? Or would you have ord("\xff") != 0xff ? More generally,
> how would you handle the relationship between bytes, characters and
> Unicode codepoints, if not as it is done now? If integers less than 256
> are mapped to their ANSI codepoints rather than their ISO8859-1
> codepoints, how do you get those characters in Latin1 that aren't in the
> ANSI codepage?
>
>> How would you want to "fix" this then? Translate all 8-bit data when it
>> is read from the OS from ANSI to Latin1? That seems a lot harder, and
>> will also be quite unintuitive.
>
> I would say Win32 should exclusively use the Unicode APIs, and treat
> 8-bit strings the same as their upgraded equivalents (that is, as
> ISO8859-1). This may break code that reads ANSI-encoded data from a file
> under the assumption it will be passed to the 8-bit API, of course.
>
>> Only code making the explicit assumption that 8-bit strings are encoded
>> in Latin1 is going to break. All code relying on the implicit conversion
>> between 8-bit and UTF8 will actually be fixed and not broken by this
>> change. :)
>
> Perl explicitly documents that 8-bit data is treated as ISO8859-1,
> except on EBCDIC platforms.

I don't know about that. We make such assumptions in the regex engine,
and possibly in terms of the expected encoding of source files
without use locale, but I don't think we actually do mandate that it is
latin-1 generally. And I'm unconvinced that the suggestion made by Jan
is as problematic as either you or Marc have said. If we used Win32
API calls to convert/access system data as widechar (UTF16) and then
converted the result to utf8 then we should be in the clear.

And I don't believe that the problem is in reading data from a *file*.
That type of issue is a) not win32 specific b) extremely common and c)
well solved by the proper application of Encode and friends. (We DO NOT
document that all data files operated on by perl must be in Latin-1).
The problem is how we access the file APIs and other system APIs (the
most commonly used registry code is not widechar aware for instance).
Currently as far as I know there is no way using perl to use the Win32
widechar APIs to create unicode filenames and directories. And if I
understood Jan right then his suggestion would resolve that problem.

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"


jand at activestate

May 18, 2008, 8:40 AM

Post #14 of 52 (474 views)
Permalink
RE: on the almost impossibility to write correct XS modules [In reply to]

On Sat, 17 May 2008, Ben Morrow wrote:
> Quoth jand[at]activestate.com ("Jan Dubois"):
> >
> > The brokenness right now is that when Perl automatically upgrades this
> > data to UTF8, it assumes that the data is Latin1 instead of ANSI,
> > potentially garbling the data if it contained codepoints where the
> > current ANSI codepage and Latin1 are different.
>
> So you would have
>
> "\xff"
>
> and
>
> substr "\xff\x{100}", 0, 1
>
> be different?

Potentially, yes. That's what you get for mixing byte and character semantics.

> Or would you have ord("\xff") != 0xff?

No, "\xff" is guaranteed to have byte semantics for backwards compatibility:

perluniintro.pod:
| Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
| and C<chr(...)> for arguments less than C<0x100> (decimal 256)
| generate an eight-bit character for backward compatibility with older
| Perls. For arguments of C<0x100> or more, Unicode characters are
| always produced. If you want to force the production of Unicode
| characters regardless of the numeric value, use C<pack("U", ...)>
| instead of C<\x..>, C<\x{...}>, or C<chr()>.

> More generally,
> how would you handle the relationship between bytes, characters and
> Unicode codepoints, if not as it is done now?

The same as it is being done now, just using a different encoding for
the 8-bit character strings.

> If integers less than 256
> are mapped to their ANSI codepoints rather than their ISO8859-1
> codepoints, how do you get those characters in Latin1 that aren't in the
> ANSI codepage?

As perluniintro.pod above points out, the only reliable way to do this
is pack("U", $codepoint). Or you can use named characters using charnames.pm.

> > How would you want to "fix" this then? Translate all 8-bit data when it
> > is read from the OS from ANSI to Latin1? That seems a lot harder, and
> > will also be quite unintuitive.
>
> I would say Win32 should exclusively use the Unicode APIs, and treat
> 8-bit strings the same as their upgraded equivalents (that is, as
> ISO8859-1). This may break code that reads ANSI-encoded data from a file
> under the assumption it will be passed to the 8-bit API, of course.

This is exactly the problem: the C runtime library on Windows assumes
that every char* argument is encoded in the system's ANSI codepage, and
not in Latin1.

So every XS extension would have to not only check for the UTF8 flag and
use a Unicode API when available, but also convert all strings passed without
UTF8 flag from Latin1 to ANSI when calling third-party libraries that don't
provide a Unicode API.

> > Only code making the explicit assumption that 8-bit strings are encoded
> > in Latin1 is going to break. All code relying on the implicit conversion
> > between 8-bit and UTF8 will actually be fixed and not broken by this
> > change. :)
>
> Perl explicitly documents that 8-bit data is treated as ISO8859-1,
> except on EBCDIC platforms.

I know that this is the way it works now, but that was not the original
intent. If you read perluniintro, you'll see:

| =head2 Perl's Unicode Model
|
| Perl supports both pre-5.6 strings of eight-bit native bytes, and
| strings of Unicode characters. The principle is that Perl tries to
| keep its data as eight-bit bytes for as long as possible, but as soon
| as Unicodeness cannot be avoided, the data is transparently upgraded
| to Unicode.
|
| Internally, Perl currently uses either whatever the native eight-bit
| character set of the platform (for example Latin-1) is, defaulting to
| UTF-8, to encode Unicode strings. Specifically, if all code points in
| the string are C<0xFF> or less, Perl uses the native eight-bit
| character set. Otherwise, it uses UTF-8.

Note the explicit reference to "whatever the native character set is".

If we expected all external data to always be converted to Latin1, then
we could have saved ourselves the trouble of having 2 different internal
representations and always gone straight to UTF8.

Cheers,
-Jan


Robin.Barker at npl

May 18, 2008, 11:24 AM

Post #15 of 52 (472 views)
Permalink
RE: on the almost impossibility to write correct XS modules [In reply to]

> So you would have
>
> "\xff"
>
> and
>
> substr "\xff\x{100}", 0, 1
>
> be different?

These are already different:

% perl -lwe 'print "\xff" =~ /[[:print:]]/'

% perl -lwe 'print +(substr "\xff\x{100}", 0, 1) =~ /[[:print:]]/'
1

Robin



pagaltzis at gmx

May 18, 2008, 7:50 PM

Post #16 of 52 (465 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

* Marc Lehmann <schmorp[at]schmorp.de> [2008-05-17 15:20]:
> On Thu, May 15, 2008 at 04:31:13PM -0700, Jan Dubois <jand[at]activestate.com> wrote:
> > encoding whereas they really are ANSI encoded. So once the
> > automatic upgrading assumes ANSI encoding instead of Latin-1,
> > everything should be working correctly, no?
>
> Uhm.... that one can even suggest such brokenness :)
>
> Of course basically everything will break, you mean, because
> the assumption that it's not latin1 of course breaks roughly all
> code dealing with unicode in perl, which doesn't expect that
> perl suddenly uses ANSI instead of unicode codepoints (they
> differ!).

Backtracking a bit here, why would this break anything? For
strings coming out of the Win32 API, immediately decode them to
characters; for strings going in, upgrade them to characters if
necessary, then encode them to ANSI at the last moment.

That way, no one ever needs to care that filenames are in ANSI,
because as far as Perl code is concerned it always gets them as
character strings.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


schmorp at schmorp

May 19, 2008, 8:26 AM

Post #17 of 52 (451 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois <jand[at]activestate.com> wrote:
> actual strings without SvUTF8 set are encoded in the system default ANSI
> codepage

Not in perl, no.

Strings in Perl aren't encoded at all. That's a basic fact. They are encoded
*inside* the perl interpreter, but on the Perl level, strings are simply not
encoded in any way.

That's the whole point of unicode (and string handling in general) in Perl.

> text returned by qx() will be ANSI encoded,

Why? programs can output whatever they want, it doesn't need to be
ansi-encoded.

> filenames returned by readdir() will be ANSI encoded and so on.

Possibly (but of course only on windows).

> This is just the nature of the 8-bit OS API.

Well, it isn't. On windows, some parts of the API use encodings, and others
do not. On unix, it's simpler in that low-level OS interfaces generally do
not care for character encodings.

> The brokenness right now is that when Perl automatically upgrades this
> data to UTF8, it assumes that the data is Latin1 instead of ANSI,

Uhm, no, you are totally confused about how character handling is done in
perl, and I cannot blame you (the many bugs and documentation mistakes
combined make it hard to see what is meant).

Strings in perl are simply concatenated characters, which in turn are
represented by numbers.

Perl doesn't store an encoding together with strings, only the programmer
knows the encoding of strings.

This is the correct way to approach unicode because it frees the programmer
from tracking both external and internal encodings.

Perl *cannot* know the encoding of a string unless the user tells it on a
case-by-case basis.

> potentially garbling the data if it contained codepoints where the
> current ANSI codepage and Latin1 are different.

I don't understand this at all.

> How would you want to "fix" this then?

There is nothing to fix. Your suggested change would completely break
perl, it is that simple.

You assume perl knows how strings are encoded, which isn't reasonable
(and certainly not how it is implemented). Strings in perl are just
concatenations of codepoints.

> Only code making the explicit assumption that 8-bit strings are encoded
> in Latin1 is going to break.

No, you don't understand how perl treats strings at all, sorry.

And this is one of the problems with perl5-porters: too many people who have
no clue about the perl unicode model have opinions, and that's why perl is
currently so broken: parts of the API (win32) effectively implement a model
that the perl core doesn't support, and vice versa.

> All code relying on the implicit conversion between 8-bit and UTF8 will
> actually be fixed and not broken by this change. :)

I really don't have the time to lecture on unicode again and again to
the masses who don't understand it. I suggest working with unicode and
encodings in perl for a few years *first*, as I did. I learned a great
deal during that time, as I use these features daily.

And as long as perl5-porters have little clue about unicode, perl will
always stay too buggy for anything but experts to use (the experts
supposedly working around the bugs, which is extremely annoying).

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg[at]goof.com
-=====/_/_//_/\_,_/ /_/\_\


nospam-abuse at bloodgate

May 19, 2008, 9:37 AM

Post #18 of 52 (448 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

Moin,

On Monday 19 May 2008 17:26:55 Marc Lehmann wrote:
> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois
> <jand[at]activestate.com> wrote:
[snip]

> > The brokenness right now is that when Perl automatically upgrades
> > this data to UTF8, it assumes that the data is Latin1 instead of
> > ANSI,
>
> Uhm, no, you are totally confused about how character handling is
> done in perl, and I cannot blame you (the many bugs and documentation
> mistakes combined make it hard to see what is meant).
>
> Strings in perl are simply concatenated characters, which in turn are
> represented by numbers.
>
> Perl doesn't store an encoding together with strings, only the
> programmer knows the encoding of strings.
>
> This is the correct way to approach unicode because it frees the
> programmer from tracking both external and internal encodings.

Uhm, excuse me? I don't think this actually frees the programmer from
tracking internal encodings and especially not tracking of external
encodings.

Perl's "one-encoding-for-all" approach has the real world problem that
you cannot easily mix strings without being very very very very
careful, or you get garbage. Automatically and without warning.

And most of the problems when you want to work with Unicode (even if you
_only_ want to use UTF-8, not even throwing UTF-16 into the mix), is
that it is very very easy to have data that is not encoded in UTF-8 nor
latin1, and you mix it with UTF-8 (or encode it twice or whatever) and
you end up with garbage. Which is usually bad as this very discussion
about ansi shows :)

Or in other words, Perl "frees the programmer from tracking encodings"
by making him carefully track all strings as they come in and go out
and then track which strings internally are in which encoding, and even
then sometimes you mix fire with water unintentionally. I don't think
that is ideal, as shown by the many, many bugs I have found in my own
(supposedly working and bug-free) UTF-8-using Perl code.

Not to mention that you actually lose the information about which original
encoding the string had - "aa" looks the same in latin1 and utf-8, but
depending on which encoding it "has", acts differently. (at least that's
what I remember from regexps discussions)

It would be _much_ easier if all strings in Perl carried their encoding
with them, and Perl would be able to simply mix two strings by
automatically upgrading them according to their encoding. Then you'd
also be able to query the encoding, btw. No more guesswork based upon a
single bit.

The current way (everything is either Latin-1 or UTF-8 and we only have
a single bit to distinguish between these two cases) is just a pain,
especially if you need something else than utf-8.

Here is an example of what bit me today, just in case people think this is
a theoretical discussion:

You have a UTF-8 regexp like the following:

my $skip = qr/Quarantäne/i;

You read in data and manually decode it to utf-8 to match it against the
regexp:

my $data = decode('utf-8',from_file());

# much later in the file
if ($data =~ $skip) { ... do something ... };

Now, some time later (maybe much later, and maybe a different person),
someone replaces the hand-rolled from_file() routine with something that
pre-parses the data. As a side-effect, the data now comes already
decoded in UTF-8 format. The second decode() then destroys the data,
because Perl does not know that the data was already in UTF-8 and
encodes it twice.

Oops, new bug. And this bug could have been prevented entirely if the
string was properly tagged with its encoding, and thus a double
encoding would never have been possible.
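
The scenario can be reproduced in a few lines. This is only a sketch under assumptions: the literal octet string stands in for whatever from_file() returned, and the exact replacement character depends on the Encode version and its default CHECK behaviour.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as they might come from a file: "Quarantaene" with a-umlaut,
# encoded in UTF-8 (0xC3 0xA4).
my $octets = "Quarant\xc3\xa4ne";

# First decode: 11 octets become 10 characters, one of them U+00E4.
my $once = decode('UTF-8', $octets);

# Second decode of the already decoded string: a lone 0xE4 is not
# valid UTF-8, so with Encode's default CHECK value the character is
# substituted (typically by U+FFFD) instead of being preserved.
my $twice = decode('UTF-8', $once);

printf "octets=%d once=%d match_once=%d match_twice=%d\n",
    length($octets), length($once),
    ($once  =~ /Quarant\x{e4}ne/i ? 1 : 0),
    ($twice =~ /Quarant\x{e4}ne/i ? 1 : 0);
```

The regexp matches after one decode() and stops matching after the second, which is the silent breakage described above.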

So while the current situation is "working" somehow, please do not
describe it as "ideal" :)

All the best,

Tels

--
Signed on Mon May 19 18:11:35 2008 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.

"My glasses, my glasses. I cannot see without my glasses."
- "My
glasses, my glasses. I cannot be seen without my glasses."


jand at activestate

May 19, 2008, 11:45 AM

Post #19 of 52 (447 views)
Permalink
RE: on the almost impossibility to write correct XS modules [In reply to]

On Mon, 19 May 2008, Marc Lehmann wrote:
> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois
> <jand[at]activestate.com> wrote:
> > actual strings without SvUTF8 set are encoded in the system default
> > ANSI codepage
>
> Not in perl, no.
>
> Strings in Perl aren't encoded at all. That's a basic fact. They are
> encoded *inside* the perl interpreter, but on the Perl level, strings
> are simply not encoded in any way.

I guess this is where the confusion originates. I'm not talking about
the Perl language level at all, I'm only talking about Perl interpreter
internals. I thought this was obvious from the reference to SvUTF8,
which doesn't exist at the Perl language level.

Inside the Perl internals there is an implicit assumption about the
encoding, at least on operating systems that care about this stuff at
the low level (e.g. Windows). On Windows all 8-bit APIs to the
filesystem expect filenames to be encoded in the CP_ACP (ANSI) codepage
(yes, I know you can switch this for some APIs to CP_OEM, but please
let's ignore that).

Since Perl (interpreter internals) doesn't perform any encoding changes
for filenames passed to and from the operating system APIs, it
therefore follows that strings internally are assumed to be ANSI
encoded. If you call

open(my $fh, "<", $filename) or die;

the octets stored in $filename are interpreted according to the ANSI
codepage. This is documented in perluniintro.pod as using "native"
encoding for 8-bit strings.

With the introduction of the SvUTF8 flag we started having an alternate
encoding in the internals. Problems arise when we have to combine strings
from the native encoding with the UTF8 encoding. Since UTF8 is able to encode
all code points encoded by the native encoding, but not vice versa, we
have to re-encode the native strings into UTF8 before we can perform
operations like concatenation. This re-encoding cannot be done without
knowing (or assuming) the encoding of the native strings.

Currently Perl doesn't have any code to treat the native encoding on
Windows correctly and therefore (incorrectly) assumes that all 8-bit
strings are Latin1 encoded. This is the part that needs to be fixed
if strings with SvUTF8 are going to be correct.

[...]

> I really don't have the time to lecture on unicode again and again to
> the masses who don't understand it.

I guess I don't understand why you are engaging in a discussion about
Unicode implementation on perl5-porters if you don't have the time to
actually argue your points instead of your general handwaving about
how things should work at the high level and then calling people who
point out problems in the actual implementation clueless. Are you
just trolling?

Cheers,
-Jan


schmorp at schmorp

May 19, 2008, 12:36 PM

Post #20 of 52 (441 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Mon, May 19, 2008 at 11:45:49AM -0700, Jan Dubois <jand[at]activestate.com> wrote:
> I guess this is where the confusion originates. I'm not talking about
> the Perl language level at all,

That's bad, because you would break it at the perl language level.

> internals. I thought this was obvious from the reference to SvUTF8,
> which doesn't exist at the Perl language level.

It doesn't matter what *you* are talking about. The change you propose would
break Perl, the language.

> Inside the Perl internals there is an implicit assumption about the
> encoding,

Inside Perl, there is no such assumption.

> at least on operating systems that care about this stuff at
> the low level (e.g. Windows).

Yes, and there this assumption is simply wrong. It doesn't matter why the
win32 parts expect that, but the assumption is wrong.

> On Windows all 8-bit APIs to the filesystem expect filenames to be
> encoded in the CP_ACP (ANSI) codepage (yes, I know you can switch this
> for some APIs to CP_OEM, but please let's ignore that).

Yes, but Perl isn't win32-specific. You cannot change the perl character
handling unilaterally on win32 only and thus break portability to other
systems that don't mangle codepoints on upgrades.

> Since Perl (interpreter internals) doesn't perform any encoding changes
> for filenames passed to and from the operating system APIs,

It doesn't, but it silently upgrades and probably also downgrades them.

> therefore follows that strings internally are assumed to be ANSI
> encoded. If you call

This assumption is untrue. The win32 parts assume in many places that a
string is ANSI encoded *if* the utf8 flag is not set, and unicode otherwise
(not utf-8).

> open(my $fh, "<", $filename) or die;
>
> the octets stored in $filename are interpreted according to the ANSI
> codepage. This is documented in perluniintro.pod as using "native"
> encoding for 8-bit strings.

The problem is not open, but e.g. functions like Win32::SetCwd and lots of
other places where the filename is interpreted as ANSI if SvUTF8 is false,
and unicode otherwise.

This is broken, because it forces the interpretation of the utf-8 flag to be
an encoding for the perl string, which it isn't.

Again: Perl doesn't attach an encoding to a scalar string.

> from the native encoding with the UTF8 encoding. Since UTF8 is able to encode
> all code points encoded by the native encoding,

This is wrong in general.

> but not vice versa, we

This is also wrong in general.

> have to re-encode the native strings into UTF8 before we can perform
> operations like concatenation. This re-encoding cannot be done without
> knowing (or assuming) the encoding of the native strings.

Yes. And your change would force the encoding to be whatever the local
codepage is, and this is simply wrong.

> Currently Perl doesn't have any code to treat the native encoding on
> Windows correctly and therefore (incorrectly) assumes that all 8-bit
> strings are Latin1 encoded.

That's where you are confused: perl doesn't assume anything about latin1 or
unicode. perl uses utf-8 to store codepoints >255 internally, but it does not
treat those strings as unicode, unless forced by usage.

> This is the part that needs to be fixed if strings with SvUTF8 are going
> to be correct.

This doesn't change what I said before, unfortunately. You are simply wrong
in your assumptions in that you believe perl stores or attaches an encoding
to its strings: it does so neither externally (Perl level) nor internally
(perl).

What it does is store strings with codepoints >255 encoded in utf-8,
regardless of the encoding they are in. It also has the ability to store
strings with codepoints <256 in a more space-efficient octet form.
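
A small sketch of the two storage forms, using only the core utf8:: functions (the variable names are invented for illustration). It shows that the internal flag differs while the string, as seen from the Perl level, does not:

```perl
use strict;
use warnings;

my $compact = "caf\xe9";    # all codepoints < 256, one octet each
my $wide    = "caf\xe9";
utf8::upgrade($wide);       # same codepoints, stored as UTF-8 internally

printf "flag: compact=%d wide=%d\n",
    (utf8::is_utf8($compact) ? 1 : 0), (utf8::is_utf8($wide) ? 1 : 0);
printf "equal=%d length=%d/%d ord=%d/%d\n",
    ($compact eq $wide ? 1 : 0), length($compact), length($wide),
    ord(substr($compact, 3)), ord(substr($wide, 3));

# Concatenating with a codepoint > 255 forces the upgrade implicitly;
# the first four characters still compare equal to the original.
my $joined = $compact . "\x{263a}";
printf "prefix_equal=%d\n", (substr($joined, 0, 4) eq $compact ? 1 : 0);
```

Upgrading maps octet value N to codepoint N, so nothing changes at the Perl level; the dispute in this thread is about code that looks at the flag and draws conclusions about encodings from it.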

Nowhere in perl is there an assumption about the actual encoding of perl
scalars themselves. Only in certain places (regex matching would be one,
although I heard there are problems, another one is open, and yet another
one is Win32::SetCwd) does it assume so, and this is only after the user
tells perl to do so.

As long as I don't use a function in perl (such as open) that expects some
encoding, perl does not attach encodings to strings.

As such, your proposal to force the encoding to ANSI would completely
break perl.

> > I really don't have the time to lecture on unicode again and again to
> > the masses who don't understand it.
>
> I guess I don't understand why you are engaging in a discussion about
> Unicode implementation on perl5-porters if you don't have the time to
> actually argue your points instead of your general handwaving about

Because I already have argued my points. Also, I *expect* people engaging
in a discussion about bugs or problems in the current implementation to have
a working understanding of how perl treats strings.

The fact that you continue to make outrageously wrong claims such as perl
assuming octet-encoded strings would be latin1-encoded proves that you

a) haven't researched the topic (see mailinglist archives)
b) don't know how perl stores strings and how it interprets them

> how things should work at the high level and then calling people who
> point out problems in the actual implementation clueless. Are you
> just trolling?

What do you call what you are doing by coming into a discussion without
having working knowledge of perl string handling, despite this being
discussed many times over the past years?

Again: perl does *not*, neither internally nor externally, attach an
encoding to scalars.

I and others have explained this a number of times. I can explain it to
you privately again if you so desire, but this thread is about practical
problems and bugs, and not about how to break perl in incompatible ways on
windows or about spreading FUD like "perl interprets octet-encoded
strings as latin1" or other obviously wrong stuff.



schmorp at schmorp

May 19, 2008, 1:19 PM

Post #21 of 52 (442 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Mon, May 19, 2008 at 04:50:42AM +0200, Aristotle Pagaltzis <pagaltzis[at]gmx.de> wrote:
> > > encoding whereas they really are ANSI encoded. So once the
> > > automatic upgrading assumes ANSI encoding instead of Latin-1,
> > > everything should be working correctly, no?
> >
> > Uhm.... that one can even suggest such brokenness :)
> >
> > Of course basically everything will break, you mean, because
> > the assumption that its not latin1 of course breaks roughly all
> > code dealing with unicode in perl, which doesn't expect that
> > perl suddenly uses ANSI instead of unicode codepoints (they
> > differ!).
>
> Backtracking a bit here, why would this break anything? For
> strings coming out of the Win32 API, immediately decode them to
> characters; for strings going in, upgrade them to characters if
> necessary, then encode them to ANSI at the last moment.

Great idea, but the basic question is "what are characters"? Obviously,
you cannot mean characters in the sense of "letters/glyphs, character
codepoints etc." because perl doesn't store this information (for example,
when you load a jpg image into some scalar, you don't have a string
composed of "characters", but only octets).

If you mean "codepoints/numerical values" with "characters", then you lose
the information about their encoding.

However, *your* idea would mostly work *iff* you only ever used operating
system interfaces when dealing with filenames.

This is, however, not the case: consider prompting the user for a filename,
using a Gtk+ entry to acquire the filename, or using a commandline argument
as a filename.

In all those cases, perl cannot know that those strings are filenames, and
when asked to "open" them, might assume they are encoded in "characters"
(whatever they are), when in fact, they are encoded in "utf-8", "koi8-r",
"euc-jp" or so.

Such a model is workable, but there would need to be a defined way to convert
external filenames (e.g. on the commandline) into something perl's open
understands.

> That way, no one ever needs to care that filenames are in ANSI,
> because as far as Perl code is concerned it always gets them as
> character strings.

If that were possible, sure.

_however_

Note that Jan didn't propose that; he said the "automatic upgrading"
should change its interpretation from the current one, where 0..255 becomes
0..255, to something else, where e.g. character values suddenly change
codepoints (or, equally bad, change their interpretation).

As the name "automatic" implies, perl does this kind of upgrading
automatically, and to my knowledge it is not documented anywhere where this
happens (nor are there any guarantees that it doesn't happen). This is
because "automatic upgrading" is assumed to be something that doesn't change
the string itself.

Jan proposes to actually change the string itself (on the perl level) on
those automatic upgrades, and this is what breaks perl, because suddenly
all the internals are exposed to perl code and, worse, your string
interpretation changes at undocumented points that you have to track
yourself.



schmorp at schmorp

May 19, 2008, 1:25 PM

Post #22 of 52 (441 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Sat, May 17, 2008 at 07:38:16PM +0100, Ben Morrow <ben[at]morrow.me.uk> wrote:
> Perl explicitly documents that 8-bit data is treated as ISO8859-1,
> except on EBCDIC platforms.

If it only were so (from perluniintro):

Internally, Perl currently uses either whatever the native eight-bit
character set of the platform (for example Latin-1) is, defaulting to
UTF-8, to encode Unicode strings. Specifically, if all code points in the
string are 0xFF or less, Perl uses the native eight-bit character set.
Otherwise, it uses UTF-8.

Of course, this isn't even implementable (nor is it even remotely true),
but this is one of the many issues: whoever wrote that part of the manpage
either was confused, or used extremely bad wording (some of it is simply
wrong, other things are maybe badly presented, and still others are illogical).

For example, the "attached to operations" can be easily misunderstood,
as the "bad model" attached wide-characterness to operations, while the
current model attached "unicodeness" to operations (or at least most of
perl uses that interpretation, e.g. open, concatenation etc.), so the wording
"attaching it to operations" is not very helpful.

The people who maintain perl need to agree on a unicode model at some
point, and then implement it with force. The current state of affairs is
extremely damaging, as it's impossible to get bugfixes in because everybody
disagrees on whether it is a bug or not.



schmorp at schmorp

May 19, 2008, 1:40 PM

Post #23 of 52 (442 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Sun, May 18, 2008 at 08:40:33AM -0700, Jan Dubois <jand[at]activestate.com> wrote:
> > So you would have
> > "\xff"
> > substr "\xff\x{100}", 0, 1
> > be different?
>
> Potentially, yes. That's what you get for mixing byte and character semantics.
>
> > Or would you have ord("\xff") != 0xff?
>
> No, "\xff" is guaranteed to have byte semantics for backwards compatibility:

So the \xff in the substr example has different semantics for backwards
compatibility but still you get different results?

Come on, it cannot work that way.
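
For reference, the two expressions from the quoted question can be compared directly. This sketch only shows what a current perl does (the variable names are made up); under the proposed change the two values could potentially differ:

```perl
use strict;
use warnings;

my $plain = "\xff";                          # never upgraded
my $cut   = substr("\xff\x{100}", 0, 1);     # cut out of an upgraded string

printf "equal=%d ord=%d/%d plain_flag=%d cut_flag=%d\n",
    ($plain eq $cut ? 1 : 0), ord($plain), ord($cut),
    (utf8::is_utf8($plain) ? 1 : 0), (utf8::is_utf8($cut) ? 1 : 0);
```

Both strings hold the single codepoint 255 and compare equal; only the internal flag may differ between them.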

> perluniintro.pod:

perluniintro is simply wrong w.r.t. the current implementation, and
inconsistent overall. Quoting it is not helpful.

> > I would say Win32 should exclusively use the Unicode APIs, and treat
> > 8-bit strings the same as their upgraded equivalents (that is, as
> > ISO8859-1). This may break code that reads ANSI-encoded data from a file
> > under the assumption it will be passed to the 8-bit API, of course.
>
> This is exactly the problem: the C runtime library on Windows assumes
> that every char* argument is encoded in the system's ANSI codepage, and
> not in Latin1.

Note that unix basically does the same (figure in "locale"/"user
expectancy" etc. instead of "ANSI").

This issue has nothing to do with windows.

> So every XS extension would have to not only check for the UTF8 flag and
> use a Unicode API when available, but also convert all strings passed without
> UTF8 flag from Latin1 to ANSI when calling third-party libraries that don't
> provide a Unicode API.

The problem is that the utf-8 flag simply has no meaning w.r.t. encoding or
not. It is an internal flag, and forcing the user to track its state is just
braindamaged.

Note that perl doesn't implement this, either: perl's open, for example,
treats filenames without utf-8 flag not as latin1 on unix, but as octet
strings (in the encoding the user wants), while it treats filenames with
the utf-8 flag set as "utf-8" encoded.

This is in direct contradiction to the perluniintro, like so many others.

Quoting perluniintro when it's so fundamentally broken w.r.t. the existing
implementation is not helpful.

> > Perl explicitly documents that 8-bit data is treated as ISO8859-1,
> > except on EBCDIC platforms.
>
> I know that this is the way it works now, but that was not the original
> intent. If you read perluniintro, you'll see:

Again, perluniintro doesn't describe the original model, nor current
reality.

> Note the explicit reference to "whatever the native character set is".

Yes, the reference is there, but the manpage is simply wrong, no matter how
often you quote it.

Here is a typical example of what confuses users when they read such crap:

"well, how do I handle koi8-r data?"

Well, you cannot, because perl, according to that manpage, forces all strings
to either native character encoding or unicode. This is fortunately not so:
manpage wrong.

Another gem:

The principle is that Perl tries to keep its data
as eight-bit bytes for as long as possible, but as soon as
Unicodeness cannot be avoided, the data is transparently upgraded to
Unicode.

"how does perl know how my string is encoded when it transparently upgrades
it?"

Well, it doesn't; this is why perl doesn't change the string w.r.t. the
perl level when it "transparently" upgrades. Oh yes, except in open (where
it suddenly becomes utf-8, although it is not utf-8 in perl), many xs
modules (where you don't know what it does) or in the Win32 module, where
it interprets filenames either as unicode (not utf-8 as open does) or a
local encoding.

perluniintro is confusing, self-contradicting, and not helpful. You can do
a lot with it, but not prove a point.

> If we expected all external data to always be converted to Latin1, then
> we could have saved us the trouble of having 2 different internal
> representations and always gone straight to UTF8.

Welcome to the real world: handling binary data as utf-8 is extremely
inefficient. The trouble was taken so perl is still capable of producing
something that's not extremely slow by design, not to allow different
encodings.



schmorp at schmorp

May 19, 2008, 1:44 PM

Post #24 of 52 (442 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

> I don't know about that. We make such assumptions in the regex engine,
> and possibly in terms of the expected encoding of source files
> without use locale, but I don't think we actually do mandate that it is
> latin-1 generally. And I'm unconvinced that the suggestion made by Jan
> is as problematic as either you or Marc have said. If we used Win32
> API calls to convert/access system data as widechar (UTF16) and then
> converted the result to utf8 then we should be in the clear.

Except that all programs expecting commandline arguments, or filenames
stored in files, will not work and there will be no way to make it work?

see below.

I don't want to get personal, but have you worked with unicode in perl for
a while? Introducing silent encoding changes as proposed by Jan is deadly, as
there is a total lack of documentation on when it happens.

Changing user data is *evil*.

> And I don't believe that the problem is in reading data from a *file*.
> That type of issue is a) not win32 specific b) extremely common and c)
> well solved by the proper application of Encode and friends.

So give us a working example that works on windows, and another that works
on unix (preferably the same, actually). Assume you have a file that stores
a filename in the current locale (unix) or in ansi encoding (windows) and
want to open that file.

Just try it.

> Currently as far as I know there is no way using perl to use the Win32
> widechar apis to create unicode filenames and directories. And if I
> understood Jan right then his suggestion would resolve that problem.

And utterly break perl even more than it currently is.



schmorp at schmorp

May 19, 2008, 2:16 PM

Post #25 of 52 (442 views)
Permalink
Re: on the almost impossibility to write correct XS modules [In reply to]

On Mon, May 19, 2008 at 06:37:11PM +0200, Tels <nospam-abuse[at]bloodgate.com> wrote:
> > This is the correct way to approach unicode because it frees the
> > programmer from tracking both external and internal encodings.
>
> Uhm, excuse me? I don't think this actually frees the programmer from
> tracking internal encodings and especially not tracking of external
> encodings.

I didn't claim that. I said it frees him from tracking *both*.

> Perl's "one-encoding-for-all" approach has the real world problem that
> you cannot easily mix strings without being very very very very
> careful, or you get garbage. Automatically and without warning.

Yes, and these bugs need to be fixed. If I have a string, then its
interpretation must not silently change just because I did some operation
that forced perl to upgrade it internally, without documentation of where
exactly this happens or what I can do about it.

Or to make an example, if I have a string that contains a single character
with codepoint 200, then I do not want it to change in any way (on the
perl level), regardless of any upgrades or downgrades that perl *silently*
applies.

Wherever and whenever this happens, this is a bug.

One could fix this bug by putting the burden on the programmer, by
documenting all functions that cause such silent encoding changes (that
includes xs module documentation).

Another way is to apply the same interpretation everywhere: a string is
simply a concatenation of codepoints, encoding (external encoding) is
supplied by the programmer.

> And most of the problems when you want to work with Unicode (even if you
> _only_ want to use UTF-8, not even throwing UTF-16 into the mix)

One of the major problems in this discussion is that people confuse
"unicode" with an encoding such as "utf-8" or "latin-1": I can encode
(some) unicode in latin1, and that doesn't make it any less unicode; and I
can encode all unicode in utf-8 or utf-16, but that doesn't make the
resulting octets valid unicode codepoints.
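
The distinction between a codepoint and its encodings can be made concrete with the Encode module. This is a minimal sketch; the choice of U+00E4 and of these three encodings is arbitrary:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{e4}";    # one character: codepoint U+00E4

for my $name ('ISO-8859-1', 'UTF-8', 'UTF-16BE') {
    my $octets = encode($name, $char);      # codepoints -> octets
    my $back   = decode($name, $octets);    # octets -> codepoints
    printf "%-10s octets=%s roundtrip=%s\n",
        $name, unpack('H*', $octets), ($back eq $char ? 'ok' : 'FAILED');
}
```

The same codepoint comes out as e4, c3a4 or 00e4 depending on the encoding, and each of those octet strings is, to perl, just another string of small numbers.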

For perl utf-8 and utf-16 are just byte strings. Yes, when you mix them
together, you might shoot yourself in the foot (or not, there are valid
reasons to do so), but the programmer is responsible for it.

Now throw in some of the "transparent ansi encoding" in it, and you will
find that perl sometimes re-interprets your utf-8 bytes as, say, koi8-r
while upgrading it. In extreme cases this transformation might not even be
reversible.

A programmer must not have to track this; this is insane. A programmer
has his hands full tracking his own string encodings, he must not be
bothered to also track what Perl or perl could or could not do to his
strings when he isn't expecting it (because this is not documented).

> Or in other words, Perl "frees the programmer from tracking encodings"
> by making him carefully track all strings as they come in and go out

Exactly. perl itself does _not_ attach an encoding to its strings, that's what
I have been saying all along.

Attaching one (namely ANSI, depending on internal flags) is just plain
broken, as this would make it impossible to handle "the" non-ansi encoding
in a defined way.

> Not to mention that you actually lose the information what original
> encoding the string had - "aa" looks the same in latin1 and utf-8, but
> depending on which encoding it "has", acts differently. (at least thats
> what I remember from regexps discussions)

Exactly, that's why forcing the interpretation to ansi is simply wrong -
my binary data string isn't ansi-encoded, my utf-8 string from that file
isn't either, etc. etc.

> It would be _much_ easier if all strings in Perl carried their encoding
> with them, and Perl would be able to simple mix two strings by
> automatically upgrading them according to their encoding. Then you'd
> also be able to query the encoding, btw. No more guesswork based upon a
> single bit.

That would be interesting, but you would still have to mark all such data
accordingly, so the programmer still has to track changes.

It would also break perl w.r.t. earlier versions completely, as suddenly you
would need to tag your data as to whether it is a string or not.

Perl's current (mostly-implemented) unicode model is to treat strings as
concatenations of codepoints, and it is up to you to interpret them.

Regexes are simply *buggy* when they use different interpretations of my
string data depending on some earlier silent/transparent upgrade operation
that isn't reflected in my code.

> The current way (everything is either Latin-1 or UTF-8 and we only have

This is simply not the current way. If it were, I couldn't handle euc-jp
or binary data in perl, but obviously, I can.

> a single bit to distinguish between these two cases) is just a pain,
> especially if you need something else than utf-8.

It would be a pain, I totally agree, but this doesn't reflect reality.

> You have a UTF-8 regexp like the following:
>
> my $skip = qr/Quarantäne/i;

What is a "utf-8 regexp"? From the code, one cannot tell (is the source
encoded in 8-bit or utf-8, does it use utf-8 or not?)

> You read in data and manually decode it to utf-8 to match it against the
> regexp:

You decode it *from* utf-8, not *to* utf-8 in perl.

> pre-parses the data. As a side-effect, the data now comes already
> decoded in UTF-8 format.

You completely get it backwards, the data starts as utf-8 encoded on the
perl level and stops being encoded as utf-8 after the decode. It simply
isn't utf-8 encoded anymore.

> The second decode() then destroys the data,

Actually, it might also croak because you are decoding utf-8 and this must
not have any characters >255.
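
The direction of decode() can be checked with a short sketch. The octet string is an assumed stand-in for the file contents, and the last step relies on the behaviour of the Encode versions known to me, which is why its result is only printed and not relied upon:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "Quarant\xc3\xa4ne";        # UTF-8 encoded: 11 octets
my $text   = decode('UTF-8', $octets);   # decoded: 10 characters

printf "octets=%d characters=%d ord=0x%x\n",
    length($octets), length($text), ord(substr($text, 7, 1));

# Going back to octets is an explicit encode().
my $again = encode('UTF-8', $text);
printf "roundtrip=%s\n", ($again eq $octets ? 'ok' : 'FAILED');

# A string containing a character > 255 cannot be UTF-8 octets at all;
# Encode is expected to die here rather than return garbage.
my $survived = eval { decode('UTF-8', "\x{263a}"); 1 } ? 'no error' : 'died';
print "decoding a wide character: $survived\n";
```

Before decode() the string holds UTF-8 octets; afterwards it holds codepoints and is no longer "in UTF-8" in any sense visible at the Perl level.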

> because Perl does not know that the data was already in UTF-8 and
> encodes it twice.

But the data isn't already in utf-8.

> Oops, new bug.

No, just a gross misunderstanding on your part. And one cannot blame you,
since so many people get it wrong.

> And this bug could have been prevented entirely if the string was
> properly tagged with its encoding, and thus a double encoding would have
> been never possible.

Indeed, at the cost of losing backwards compatibility to all earlier
versions and all XS modules.

It would also force programmers to declare everything, not a very perlish
way.

But it would solve that issue.

> So while the current situation is "working" somehow, please do not
> describe it as "ideal" :)

I didn't describe the current situation as ideal at all, please read my
postings and you will see that the opposite is the case. If the situation
were ideal I wouldn't ask for lots of changes and wouldn't point out the
problems we have.

If the model as originally planned and mostly implemented in 5.005_x
(_after_ the clearly bad camel model with no flags at all) were completely
implemented, then it would be easy to explain the perl unicode model to
people:

1) strings are basically lists of characters
2) a character is an integer in the range 0..2**63 or wherever
   perl's functions officially stop (note: no mention of utf-8).
3) if you use a function or construct that deals with character
   encodings, then that function defines what happens. examples:

   open ..., $filename        character values must be in ANSI (windows),
                              unicode (possible alternative on windows),
                              or whatever encoding your filesystem/env expects
                              (unix), just as in any other language.
   regex match                perl's regexes interpret your characters as
                              unicode (alternative: unicode or, with use
                              locale, locale-specific encoding).
   JSON::XS::decode_json $s   $s must be in utf-8
   print                      whatever either the file expects (raw)
                              or unicode (when you set an encoding).
   $a = $b . $c               it just concatenates characters, no encoding required
   $a = substr $b, ...        it just gives you a substr, no encoding required
   etc...

this is how it *mostly* works nowadays. while not perfect, it gives you
a very simple string model (basically, in perl 5.005 you could have
characters 0..255, in 5.6 and higher the range was simply extended).
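
A short sketch of that model, with made-up sample strings: the string
operations never mention an encoding, and only the output step does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Strings are just sequences of characters (integers); no encoding involved.
my $x = "Quarant\x{E4}ne";   # every character <= 255
my $y = "\x{263A}";          # one character > 255
my $z = $x . $y;             # concatenation just appends characters

printf "length: %d\n", length $z;
printf "U+%04X U+%04X\n", ord substr($z, 7, 1), ord substr($z, -1);

# Encodings only matter where characters leave the program, e.g. on output.
my $octets = encode('UTF-8', $z);

# The same thing via an I/O layer, here on an in-memory filehandle.
open my $fh, '>:encoding(UTF-8)', \my $buf or die "open: $!";
print {$fh} $z;
close $fh or die "close: $!";

printf "octets via encode: %d, via layer: %d\n", length $octets, length $buf;
```

Here the 11 characters become 14 octets on output, whichever of the two
ways is used to encode them.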

You can explain this to anybody in just a few minutes. Sure, he will need to
find out what open wants on his platform, and how to convert, but it is far
simpler than all the crap perluniintro throws at the unsuspecting user, *and*
would completely get rid of that mysterious utf-8 flag:

The principle is that Perl tries to keep its data as eight-bit
bytes for as long as possible, but as soon as Unicodeness cannot be
avoided, the data is transparently upgraded to Unicode.

"how the hell should I know when it becomes necessary"?

Specifically, if all code points in the string are 0xFF or less,
Perl uses the native eight-bit character set. Otherwise, it uses
UTF-8.

"how about other encodings, or binary data?"
"so if i shave off one character of a string it might suddenly change its
encoding?"

[This] produces a fairly useless mixture of native bytes and UTF-8,
as well as a warning:

perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

"how do i mix latin1 characters such as ä in Quarantäne with unicode
characters outside the latin1 range when it is useless?"
"is this per character or per string?"
"I don't get this warning?" (%ENV!)

etc. etc.
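
The "shave off one character" question can be probed directly. This is only
a sketch with made-up variable names; utf8::is_utf8 reports the internal
flag, not anything about the characters:

```perl
use strict;
use warnings;

my $short = "\x{DF}";            # one character, <= 0xFF
my $long  = "\x{100}\x{DF}";     # two characters, one of them > 0xFF
my $tail  = substr $long, 1;     # "shave off" the first character

# Same characters on the perl level ...
my $equal = $tail eq $short ? 1 : 0;
print "tail eq short: $equal\n";

# ... but different internal storage, visible only through utf8::is_utf8.
my $short_flag = utf8::is_utf8($short) ? 1 : 0;
my $tail_flag  = utf8::is_utf8($tail)  ? 1 : 0;
print "short flagged: $short_flag, tail flagged: $tail_flag\n";
```

The two strings compare equal although one is stored as a native byte and
the other as UTF-8, which is exactly the internal detail a user should not
have to track.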

Note that perluniintro has lots of stuff like the "as long as possible" that
is completely incomprehensible to most perl programmers, yet it still claims
that perl depends on such internal magic when it goes to interpret your string.

This is a totally broken design that cannot be explained to *anybody* because
it isn't logical at all.

So simply extending strings to higher character ranges and still
asking the user to keep track of the encoding he wants (which is limited
to input and output places) is *far* better than forcing "one 8-bit
encoding on everybody" or forcing the user to *also* keep track of what
"as long as possible" means in perl, or when the perl interpreter silently
changes his data from ansi to unicode, with a corresponding change in
interpretation.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg[at]goof.com
-=====/_/_//_/\_,_/ /_/\_\
