Mailing List Archive: Perl: porters

on the almost impossibility to write correct XS modules

 

 



schmorp at schmorp

May 19, 2008, 2:22 PM

Post #26 of 52 (205 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

On Mon, May 19, 2008 at 01:34:13PM -0700, Glenn Linderman <perl[at]NevCal.com> wrote:
> The gist of the problem here is that
>
> 1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because
> it was (a) easy numerically (b) worked well on platforms that use Latin1
> as their native encoding.

Which platform is that? I really don't know *any* such platform.

Note also that the automatic conversion in perl doesn't assume any
encoding *at all*, so this is simply not true.

> 2) Windows assumes ANSI code page for 8-bit data, but Perl on Windows,
> for quite a few releases now, has not... instead, it "assumes" Latin1
> when "automatically" converting 8-bit to UTF-8.

This is not what happens. Perl simply does not assume any encoding. If
you have an 8-bit filename encoded in latin1 then perl doesn't treat it
any different than an 8-bit filename encoded in koi8-r (another "ANSI"
encoding).

upgrading and downgrading doesn't change that, or at least shouldn't
change that. where it does, it affects unix as much as any other platform.

> Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all
> code that attempts to work with the constraints of 1 and 2.

This would probably be true if 1) and 2) were real, but they are not.

> somewhat lower performance than assuming Latin1. And it would possibly
> have prevented, by example of a widely-used platform, the assumption
> throughout lots of Perl code, that all 8-bit data is assumed to be
> Latin1 implicitly.

Perl doesn't do that anywhere on any platform, to my knowledge. Make an
example of a platform that expects filenames as latin1.

(you can select this under unix, yes, but you can do so under windows as
well).

(the rest of the mail is either true, or depends on these critical but
wrong assumptions. It is still use that decodes encoding).

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg[at]goof.com
-=====/_/_//_/\_,_/ /_/\_\


schmorp at schmorp

May 19, 2008, 9:20 PM

Post #27 of 52 (204 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

On Mon, May 19, 2008 at 06:28:12PM -0700, Glenn Linderman <perl[at]NevCal.com> wrote:
> >>1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because
> >>it was (a) easy numerically (b) worked well on platforms that use Latin1
> >>as their native encoding.
> >
> >Which platform is that? I really don't know *any* such platform.
>
> You don't have to know of one to figure out that the present scheme
> works fine on such a platform if it exists.

True, but you stated those platforms were the reason for why the automatic
conversion worked that way.

If no such platform exists, your argument is moot, because nobody would
implement a scheme because it is useful on no platform.

> Since it was done this way, I would assume it must have been useful
> somewhere... but perhaps it was just ASCII platforms for which it worked
> well.

Whatever an ascii platform is would be fine with about any other such
conversion.

> >Note also that the automatic conversion in perl doesn't assume any
> >encoding *at all*, so this is simply not true.
>
> Perl assumes an encoding for various operations; you've stated that. My

Yes, but automatic upgrade is _not_ one of them.

> saying that Perl assumes an encoding, is simply a collection: the set of
> all Perl operations that assume an encoding.

fine, but automatic upgrade, what you were talking about, is not in that
set. Point being?

> The conversion of internal string formats does assume that all the
> characters representable by various numbers in the octet format
> (internal UTF8 flag turned off) convert to the same number in the
> multi-bytes format (internal UTF8 flag turned on).

Yes.

> This is equivalent to converting from Latin1 to Unicode (UTF-8) for the
> range of numbers corresponding to Unicode code points (which applies to
> all the numbers that are representable in the octet format).

No, it is not. If the source data isn't latin1-encoded to begin with then
converting from latin1 to unicode is not a sensible operation to apply.

Automatic upgrade, however, is, and that's because it does not apply any such
interpretation to the scalar. This is a subtle but crucial difference.

> If you are able to disagree with that, then you are simply being
> disagreeable, which doesn't help get the bugs fixed.

"If you don't agree with me you are not helpful"? Now that's a nice strawman
argument :/

> >This is not what happens. Perl simply does not assume any encoding. If
> >you have an 8-bit filename encoded in latin1 then perl doesn't treat it
> >any different than an 8-bit filename encoded in koi8-r (another "ANSI"
> >encoding).
>
> The conversion of numeric characters from an 8-bit representation to a
> UTF8 multi-byte representation within Perl is often referred to as
> "assuming a latin1 encoding" by many discussions on this list.

In an informal way, you may well do that. When talking about unicode
semantics in perl, then being so sloppy will not do, however, because it
is important that the upgrade process works regardless of any encoding
(and is reversible).

> know, and I know, that it is simply two different representations of the
> list of numbers that make up a string.

Unfortunately, perl doesn't really handle it that way. regexes for example
treat the same number on the perl level differently depending on how it's
encoded internally.

And this is a problem.
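
A minimal sketch of that problem, assuming a perl with its default string rules (no "unicode_strings" feature or locale in effect). The variable names are only illustrative. The two scalars hold the same number and compare equal, yet \w may answer differently depending on the internal representation:

```perl
use strict;
use warnings;

# Same code point (233), two internal representations.
my $native   = "\xe9";
my $upgraded = "\xe9";
utf8::upgrade($upgraded);    # changes the storage, not the string value

my $same_string      = $native eq $upgraded ? 1 : 0;
my $native_is_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;

printf "eq: %d, native matches \\w: %d, upgraded matches \\w: %d\n",
    $same_string, $native_is_word, $upgraded_is_word;
```

On such a perl the two match results differ even though "eq" reports the strings as identical.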

> But describing it the other way
> helps other people understand it, and it is not particularly false.

In my (not small) experience in explaining it to people, telling them
"not particularly wrong" things about perl's unicode handling scares them,
because they do not want perl to interpret their, say, koi8-r data as
latin1 in any way.

> you want to convince people of things, you should attempt to use their
> terminology as much as possible, and explain the problems in a way
> they'll understand it, rather than telling them they don't know what
> they are talking about...

Well, some people, like jan, clearly don't understand the issues.

Also, my terminology is their terminology. Perl simply doesn't interpret
your string as latin1 when upgrading. That's a fact. In your or my
terminology.

> >upgrading and downgrading doesn't change that, or at least shouldn't
> >change that. where it does, it affects unix as much as any other platform.
>
> It could; are you referring to a particular version of Unix here?

No, all versions are the same here, right down to good old POSIX, or even
ISO-C.

> And what is its native 8-bit encoding?

Unix, by specification, has no native (or preferred) 8-bit encoding,
just like windows. There really isn't much of a difference in the 8-bit
department, except that unix interprets your data much less than windows
(for example, filesystem interaction neither checks nor cares about character
encodings).

> I can neither agree nor disagree with your statement here, without
> knowing more facts about the unix you are referring to.

There is only one, really, because all work the same.

> >>Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all
> >>code that attempts to work with the constraints of 1 and 2.
> >
> >This would probably be true if 1) and 2) were real, but they are not.
>
> They are real; they are just stated in different terms than you prefer
> to use.

Sorry, but that's bullshit. 1) for example claims this was implemented for the
sake of platforms that don't exist, which is not a sensible argument. This
has nothing to do with terminology.

It also has nothing to do with my person.

> >>have prevented, by example of a widely-used platform, the assumption
> >>throughout lots of Perl code, that all 8-bit data is assumed to be
> >>Latin1 implicitly.
> >
> >Perl doesn't do that anywhere on any platform, to my knowledge. Make an
> >example of a platform that expects filenames as latin1.
>
> Every time Perl alters the internal UTF8 flag, and correspondingly the
> representation of the string data, it makes the assumption that there is
> no numeric difference between the octet encoding and the multi-bytes
> encoding.

Exactly. It makes no assumption about the character encoding itself, because
the function is encoding-agnostic. It doesn't interpret your data as latin1.

> The only character sets for which this is true is Latin1 and
> Unicode, AFAIK

It is true for other encodings as well, such as ascii.

In fact, here is a good example why forcing an encoding interpretation on
upgrading/downgrading is wrong: the assumption of no numeric difference
between upgrading and downgrading is true for *any* 8-bit encoding
and it is also true for *any* codeset, simply because the numbers do not
change.

If you have koi8-r data (which is not compatible with latin1), then
upgrading and downgrading will not alter the fact that it is koi8-r data
(in current perls, and outside e.g. the buggy win32 module which enforces
different interpretation and breaks if strings get upgraded).

This is why your enforcing of such an interpretation is scary, because many
people still handle such data, and they need the safety that perl doesn't
tinker with their characters silently.

The transformation as it is was not chosen because of your two reasons. It
was chosen because it doesn't alter the string on the perl level. If you take
a string and upgrade it and dissect it, it will contain the same codepoints.
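
A small sketch of that property, using only the core utf8::upgrade and utf8::downgrade functions. The byte values are arbitrary and the program merely pretends they are koi8-r text; perl is never told so:

```perl
use strict;
use warnings;

# Three octets the program thinks of as koi8-r; perl attaches no encoding.
my $original = "\xc1\xb1\xff";
my $copy     = $original;

utf8::upgrade($copy);                          # internal UTF-8 storage
my $flag_after_upgrade = utf8::is_utf8($copy) ? 1 : 0;
my @upgraded_codes     = map { ord($_) } split //, $copy;

utf8::downgrade($copy);                        # back to one octet per character
my @roundtrip_codes = map { ord($_) } split //, $copy;

my @original_codes = map { ord($_) } split //, $original;
print "@original_codes | @upgraded_codes | @roundtrip_codes\n";
```

All three lists hold the same numbers, so whatever meaning the program assigns to 177 before the upgrade still applies afterwards.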

Any other transformation (like the one proposed by jan) doesn't have this
property, and since it isn't documented when perl does these upgrades and
downgrades, this is exactly why that proposal is broken by design.

If it was accompanied by making upgrades and downgrades *explicit*, i.e. perl
would die when you concatenated an upgraded and a downgraded string, or would
never silently upgrade/downgrade on its own and you always have to force it
manually, then this model would become workable, at the expense of making
perl strings type-ful, as there will be two incompatible string types.

Of course, this wouldn't be very perl.

> >(you can select this under unix, yes, but you can do so under windows as
> >well).
>
> So there you have answered your own question about platforms.

Yes, all humans are chinese because all chinese are humans. I said as a
special case, you can make it true, but in general it is not.

> issue arises because Perl for Windows does not require Windows to be
> configured to use Latin1 as the default code page;

And neither does it under unix. So your argument is wrong again, because the
issue does not arise because of "anything windows" at all.

> neither does it
> convert to or from Latin1 (or anything else) when calling Windows APIs;

and neither on unix.

> but it does assume numerical equality when converting between octet and
> multibytes strings, and that is only valid for Latin1 and Unicode.

No, numeric equality is true for every encoding in the world that uses
codepoints <256 - think about it. For example, the number 177 means the same
koi8-r character, regardless of whether it was upgraded or not.

This is the property that is useful. Having latin1 is not that useful, and
it is not implemented in perl anyway (see regexes for example).

> Hence, it assumes Latin1 during that conversion.

Wrong. It doesn't do so. The conversion used was chosen because it doesn't
change codepoints - character 177 stays character 177, not because latin1 is
particularly important for any specific platform.

> you read it... but I would be interested, if, setting aside the
> disagreements you stated above, if you think a scheme such as I outlined
> could be a helpful solution for Perl, using your mental model of
> strings, implicit internal format conversions, and such, which I think
> is reasonably accurate, even if it doesn't use the same terminology that
> most people on this forum use.

Well, to me, this is a mailinglist, but maybe my terminology is wrong
there, too. I do think I use the same terminology as everybody else here,
I am just being more exact in what I say, because if you are sloppy, you
fail to communicate the important differences and fall into traps like you
did above, because you couldn't escape the "character encoding" mental
model.

As for your points...

As I outlined, the problem is not so much backwards compatibility - perl 5.6
is totally different from 5.8, and 5.8 differs from 5.10 in many such encoding
issues.

The problem is mainly bugs, so while valid, I don't see how one could keep
compatibility, because the question is what to keep compatibility to -
5.8, 5.6, 5.005, 5.10? choose one, all are different.

I am also not sure whether programs rely on the broken semantics - my
experience is that e.g. reading a filename (%APPDATA%) from an environment
variable and trying to access files that way doesn't work when ansi and
unicode disagree on encoding (which is the case even on my latin1
system, btw.)

But then, perl on windows is differently broken depending on which perl
you use - activestate has a really broken fork for example, and handles
filenames differently than other perls on windows.

I am not sure how many people really rely on that behaviour, and I am not
sure if this couldn't be just fixed by enforcing a single encoding.

But my experience is limited - I know the windows APIs and the problems
associated with not having a single format in which to store filenames.

On unix, this is positively better, as there is only ever a single format
to store filenames in that works regardless of locale (the problems start
when you interpret these filenames).

So I will only comment on E and F.

I think the pragma already exists, namely "use locale".

If I "use locale" in my program, I would expect perl to apply the current
locale to any strings, in regexes or elsewhere (to the extent possible).

If I don't "use locale", then I would expect regexes to interpret my strings
as unicode, regardless of the utf-8 flag, which I can't see in my source.
(the "surprising" behaviour).

Regarding filenames, this is very easy on unix: all filenames are
interpreted as octet strings, no specific encoding (perl cannot know the
encoding of filenames on unix), so the functions all have to downgrade,
and if that fails, we have a bug (filenames are not locale-dependent
on unix, they are simply octet strings where only "/" and \000 are
interpreted).
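
As a rough sketch of "downgrade, and treat failure as a bug": the helper name filename_octets below is hypothetical, not an existing perl function. It relies only on the core utf8::downgrade, which with a true second argument returns false instead of dying when a character above 255 is present:

```perl
use strict;
use warnings;

# Hypothetical helper: return the octets to hand to the OS, or undef when
# the string holds a character above 255 and so cannot be an octet string.
sub filename_octets {
    my ($name) = @_;
    my $octets = $name;    # work on a copy, leave the caller's scalar alone
    return utf8::downgrade($octets, 1) ? $octets : undef;
}

my $upgraded = "caf\xe9.txt";
utf8::upgrade($upgraded);              # same name, UTF-8 internal storage

my $ok  = filename_octets($upgraded);
my $bad = filename_octets("\x{263a}.txt");

printf "downgraded name has %d octets\n", length($ok);
print "wide-character name: ", (defined $bad ? "accepted" : "rejected"), "\n";
```

A name that was merely upgraded comes back as the same eight octets; a name containing a wide character is rejected, which is the detectable case.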

(if it does not fail, it might still be a bug, but we cannot detect this).

I know "use locale" has weird side effects, but it basically boils down to
what perluniintro calls "native 8-bit encoding" (fortunately, it is not
even limited to 8-bit).

even if there were need for a new pragma, I wouldn't call it
"compatibility", because both behaviours are useful. The difference is
that I can control which interpretation is applied to my strings and do
not have to rely on an invisible flag on my scalars.

But then, "locale" maps exactly on the concept of "native encoding",
because my unix process might run in a locale using koi8-r, and then I
would want a way to take advantage of the locale w.r.t. interpreting my
koi8-r data. (do not get confused by the mention of POSIX in the locale
manpage, locales are an ISO-C thing and ought to exist on windows as
well.)

So for me, this is not a compatibility issue - right now, I don't think
anybody relies on the utf-8 flag behaviour in perl (a great deal has
changed between 5.6 and 5.8, and less has changed between 5.8 and 5.10, so
those programs need fixing already).

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg[at]goof.com
-=====/_/_//_/\_,_/ /_/\_\


rgarciasuarez at gmail

May 20, 2008, 1:55 AM

Post #28 of 52 (203 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/20 Marc Lehmann <schmorp[at]schmorp.de>:
> Unfortunately, perl doesn't really handle it that way. regexes for example
> treat the same number on the perl level differently depending on how its
> encoded internally.
>
> And this is a problem.

You could add uc/lc to the list.

> I think the pragma already exists, namely "use locale".
>
> If I "use locale" in my program, I would expect perl to apply the current
> locale to any strings, in regexes or elsewhere (to the extent possible).
>
> If I don't "use locale", then I would expect regexes to interpret my strings
> as unicode, regardless of the utf-8 flag, which I can't see in my source.
> (the "surprising" behaviour).
>
> Regarding filenames, this is very easy on unix: all filenames are
> interpreted as octet strings, no specific encoding (perl cannot know the
> encoding of filenames on unix), so the functions all have to downgrade,
> and if that fails, we have a bug (filenames are not locale-dependent
> on unix, they are simply octet strings where only "/" and \000 are
> interpreted).
>
> (if it does not fail, it might still be a bug, but we cannot detect this).
>
> I know "use locale" has weird side effects, but it basically boils down to
> what perluniintro calls "native 8-bit encoding" (fortunately, it is not
> even limited to 8-bit).
>
> even if there were need for a new pragma, I wouldn't call it
> "compatibility", because both behaviours are useful. The difference is
> that I can control which interpretation is applied to my strings and do
> not have to rely on an invisible flag on my scalars.
>
> But then, "locale" maps exactly on the concept of "native encoding",
> because my unix process might run in a locale using koi8-r, and then I
> would want a way to take advantage of the locale w.r.t. interpreting my
> koi8-r data. (do not get confused by the mention of POSIX in the locale
> manpage, locales are an ISO-C thing and ought to exist on windows as
> well.)

I think we need a *new* pragma. I don't want to mix locales and
Unicode. Their purposes are different, and they come from different
worlds.

A locale is mostly intended to indicate in which language a string is,
and applying language specific rules to it. (UTF8 locales just indicate
that the strings returned by or passed to the C locale API are encoded
in UTF8 instead of latin1 or anything else.) For example, under a
Turkish locale, you'll get different rules for uppercasing "i".

Unicode is a different matter. Now Unicode *also* specifies rules for
collation and casing, and special rules for some languages. Those
special rules are not used in Perl (as far as I know) (But we need
a way to implement them in the language.)

Now, at the perl language level, I think the problem we have is that
we sometimes want uc, lc or //i to have Unicode semantics, and sometimes
not. (other operations here ?)

For those two cases we can:
* Add a pragma that says "in this block, apply Unicode semantics".
Additionally, we can add a regexp flag qr//u, that says "this
regexp matches with Unicode semantics". (I'm thinking out loud
here) (Also, probably any regexp that uses \p should be considered
"in Unicode mode")
* Drop relying on the SvUTF8 flag to choose whether Unicode semantics
should be applied. Big change, not backwards compatible, but IMO
needed for sanity.

But sometimes we want perl to magically switch between Unicode and
non-Unicode semantics depending on the data it's handling. Does that
mean that we need to add a new kind of data to perl, "Unicode SV" ?
Will that solve problems ? What problems will this create ?


rgarciasuarez at gmail

May 20, 2008, 2:22 AM

Post #29 of 52 (203 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/20 Glenn Linderman <perl[at]nevcal.com>:
> On approximately 5/20/2008 1:55 AM, came the following characters from the
> keyboard of Rafael Garcia-Suarez:
>
>> But sometimes we want perl to magically switch between Unicode and
>> non-Unicode semantics depending on the data it's handling. Does that
>> mean that we need to add a new kind of data to perl, "Unicode SV" ?
>> Will that solve problems ? What problems will this create ?
>
>
> Could you elaborate on when such a magical switcharoo is desirable?

Not really. That's an RFC. It's just that I imagine some people would
like it. That last proposal is not here to fix something, but to add
convenience.

It might help XS writers too: if we get a "UnicodeSVPV" (in bytes or
in utf-8, that's orthogonal), and feed it to a C char* function that
expects a GIF image or something, then we know something is wrong in
the calling and we can throw an exception.


ben at morrow

May 20, 2008, 3:33 AM

Post #30 of 52 (202 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Quoth rgarciasuarez[at]gmail.com ("Rafael Garcia-Suarez"):
>
> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".
> Additionally, we can add a regexp flag qr//u, that says "this
> regexp matches with Unicode semantics". (I'm thinking out loud
> here) (Also, probably any regexp that uses \p should be considered
> "in Unicode mode")
> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
> should be applied. Big change, not backwards compatible, but IMO
> needed for sanity.

++

> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling. Does that
> mean that we need to add a new kind of data to perl, "Unicode SV" ?
> Will that solve problems ? What problems will this create ?

This seems sane to me. While we're there we can make the new type (QV?
UPV?) a wchar_t* instead of a utf8-encoded char*. That way we get
autoconversion as needed, with caching. We also get the ability to
declare 'all my 8-bit strings are in $encoding' rather than being fixed
to ISO8859-1. (This must be a *different* option from the one that says
'my source code is in $encoding', though they could default to the same
thing. I don't know what to do about literal strings: reencode them,
probably.)

A lot of care will be needed to get all the cases right. For instance,
what happens when a (POK, UPOK) SV is string-compared with a (POK) SV?
I think the right answer is

- by default, if any argument of a string operation is UPOK then all
of them are upgraded to UPOK and the operation occurs on SvUPV; if
all are !UPOK then they are all upgraded to POK and the operation
occurs on SvPV. (This assumes all numbers can be represented in
the current character set :).)

This 'upgrade' may in fact be a 'downgrade' by current SvUTF8
terminology, from UPOK->POK, in which case any characters that
can't be encoded elicit a warning. Ideally all of Encode's options
should be applicable.

What to do about chr/ord/"\x", especially given that some
encodings have more than 256 characters, I'm not sure. I suspect
the current 'assume numbers <256 are byte values and go in SvPV,
and numbers >255 are Unicode codepoints and go in SvUPV' is a
decent compromise, *given that users can ask for sane semantics if
they want them*.

- under 'use bytes', all string operations upgrade all SVs to POK,
'upgrade' as above. chr stuffs literal bytes into SvPV.

- under 'use unicode', all string operations upgrade all SVs to
UPOK, and chr takes a Unicode codepoint and returns a string that
is UPOK only. This means that the numbers passed to chr mean
different things under 'unicode' and 'bytes'. This is a feature :).

- regexes know which of SvPV and SvUPV they should be matching
against. I think we need two new flags, /u and /U (or maybe /b),
with the default being bytes if use-bytes, unicode if use-unicode,
and guess if neither.

- 'use locale' can probably be made to work again, if it is only
applied to SvPV and never to SvUPV. 'use locale' should probably
imply 'use bytes', and set the current encoding.

This would at least allow users to specify that they understand Unicode
and want consistent semantics, without losing the ability to manipulate
binary data.

Ben

--
We do not stop playing because we grow old;
we grow old because we stop playing.
ben[at]morrow.me.uk


rgarciasuarez at gmail

May 20, 2008, 5:32 AM

Post #31 of 52 (202 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/20 Ben Morrow <ben[at]morrow.me.uk>:
>> But sometimes we want perl to magically switch between Unicode and
>> non-Unicode semantics depending on the data it's handling. Does that
>> mean that we need to add a new kind of data to perl, "Unicode SV" ?
>> Will that solve problems ? What problems will this create ?
>
> This seems sane to me. While we're there we can make the new type (QV?
> UPV?) a wchar_t* instead of a utf8-encoded char*. That way we get
> autoconversion as needed, with cacheing. We also get the ability to
> declare 'all my 8-bit strings are in $encoding' rather than being fixed
> to ISO8859-1. (This must be a *different* option from the one that says
> 'my source code is in $encoding', though they could default to the same
> thing. I don't know what to do about literal strings: reencode them,
> probably.)

You're mixing Unicode and encodings, there.

Here's my position :
- to deal with encodings, use Encode.
- no encoding-aware strings in core perl. (of course, you can still
use magic, ties, etc. to add behaviour)
- the "Unicodeness" of a string would be independent of its SvUTF8 flag.
It will just indicate that <some list of perl built-ins> must apply
Unicode semantics when dealing with it.
- the "unicode" pragma (or whatever name is chosen) will be needed to
say that <same list of perl built-ins> in its scope must apply
Unicode semantics to Perl strings. (as opposed to newfangled Unicode
strings)
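
To illustrate the first point only, here is a minimal sketch of handling an encoding explicitly with the core Encode module rather than through encoding-aware strings. The koi8-r byte is an assumed example input, not taken from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xc1";                       # one koi8-r octet, e.g. read from a file
my $chars  = decode('koi8-r', $octets);    # explicit interpretation: U+0430
my $back   = encode('koi8-r', $chars);     # and back to the same octet

printf "U+%04X round-trips: %s\n", ord($chars),
    ($back eq "\xc1" ? 'yes' : 'no');
```

The interpretation happens only where decode and encode are called, so nothing depends on an invisible flag on the scalar.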

Currently we have :

$ bleadperl -wle 'print uc "ß"'
ß
$ bleadperl -wle 'use utf8; print uc "ß"'
SS

That's wrong: the pragma utf8 indicates internal encoding, but modifies
semantics. What I've in mind is : make those two one-liners output an ß.
Under the "unicode" pragma, make them both output SS.
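
The same effect can be sketched without depending on the source file's encoding by writing the character as "\xdf" and switching the internal representation by hand. This assumes a perl with its default string rules (no locale and no "unicode_strings" feature in effect); the variable names are illustrative:

```perl
use strict;
use warnings;

my $native   = "\xdf";        # U+00DF, stored as a single byte
my $upgraded = "\xdf";
utf8::upgrade($upgraded);     # same character, UTF-8 internal storage

my $uc_native   = uc $native;
my $uc_upgraded = uc $upgraded;

printf "inputs equal: %s\n", ($native eq $upgraded ? 'yes' : 'no');
printf "uc(native) has %d character(s), uc(upgraded) is %s\n",
    length($uc_native), $uc_upgraded;
```

Two scalars that compare equal give different uppercase results, which is the dependence on SvUTF8 that the proposal wants to remove.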

And, assuming we add a new flag on SV (let's call it UPOK like you did
below) for Unicode strings, and that a new quotelike operator qu// is
added to create them, have "uc qu/ß/" return an UPOK SV containing "SS"
in the PV slot. That PV slot could be SvUTF8 or not, that should not
matter and should not be visible from perl. ("SS" is perfectly
representable in pure ASCII so SvUTF8 isn't needed there.)

On the other hand C<use unicode; uc qq/ß/> would return "SS" without
the UPOK flag set.

> A lot of care will be needed to get all the cases right. For instance,
> what happens when a (POK, UPOK) SV is string-compared with a (POK) SV?

Indeed, we'll need a matrix there.

> I think the right answer is
>
> - by default, if any argument of a string operation is UPOK then all
> of them are upgraded to UPOK and the operation occurs on SvUPV; if
> all are !UPOK then they are all upgraded to POK and the operation
> occurs on SvPV. (This assumes all numbers can be represented in
> the current character set :).)

With my proposed outlined implementation, that's upgraded as per
sv_upgrade.

Do we really want this upgrade to be done transparently ? Like, in
concatenating an SV and a USV ? Remember why we needed
encoding::warnings ? Because we can't know what encoding a
Perl string is in.

We could do it the hard way (also known as the python way) : forbid
any mix between Unicode strings and Perl strings. Force people to
write C<$foo = qu/$foo/> to get Unicode strings. (*regardless* of
any encoding issue or PerlIO layer or locale or pragma.) Make qq/$foo/
warn if $foo is a Unicode string (being thus downgraded to a Perl
string).

We could do it the dwimmy way: apply heuristics when mixing POK SVs and
UPOK SVs and guess games about encodings, and end up with complicated
rules that will duplicate the current bugs with the UTF8 flag.

I would prefer the hard way.

> This 'upgrade' may in fact be a 'downgrade' by current SvUTF8
> terminology, from UPOK->POK, in which case any characters that
> can't be encoded elicit a warning. Ideally all of Encode's options
> should be applicable.
>
> What to do about chr/ord/"\x", especially given that some
> encodings have more than 256 characters, I'm not sure. I suspect
> the current 'assume numbers <256 are byte values and go in SvPV,
> and numbers >255 are Unicode codepoints and go in SvUPV' is a
> decent compromise, *given that users can ask for sane semantics if
> they want them*.

You're confusing Unicode and encoding again.

> - under 'use bytes', all string operations upgrade all SVs to POK,
> 'upgrade' as above. chr stuffs literal bytes into SvPV.

chr moduloes its argument under bytes, and I'd like to keep that:

$ bleadperl -Mbytes -le 'print ord chr 258'
2

To my understanding "use bytes" means "don't look at the SvUTF8 flag".
See in utf8.h :

#define IN_BYTES (CopHINTS_get(PL_curcop) & HINT_BYTES)
#define DO_UTF8(sv) (SvUTF8(sv) && !IN_BYTES)

> - under 'use unicode', all string operations upgrade all SVs to
> UPOK, and chr takes a Unicode codepoint and returns a string that
> is UPOK only. This means that the numbers passed to chr mean
> different things under 'unicode' and 'bytes'. This is a feature :).

I still prefer the hard way.

> - regexes know which of SvPV and SvUPV they should be matching
> against. I think we need two new flags, /u and /U (or maybe /b),
> with the default being bytes if use-bytes, unicode if use-unicode,
> and guess if neither.

Ah yes. What about regexps (now type SVt_REGEXP) with the UPOK flag set?

I think that one flag /u is enough. m//u would be equivalent to
C<use unicode; m//>. It would be forbidden to mix qr// and qr//u.

Also, captures would retain the UPOK flag from the matched string.

> - 'use locale' can probably be made to work again, if it is only
> applied to SvPV and never to SvUPV. 'use locale' should probably
> imply 'use bytes', and set the current encoding.

I'd be happy to set locale to rest. In peace.

Gosh, did I just come up with a big plan to save Unicode in perl ?

Now, I've this slight feeling that all this plan might be bullshit
because I've overlooked something obvious. I'll have to think a bit,
read replies, and maybe summarize and post a model proposal cc:ing
all the experts.


juerd at convolution

May 20, 2008, 6:42 AM

Post #32 of 52 (203 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Jan Dubois skribis 2008-05-18 8:40 (-0700):
> No, "\xff" is guaranteed to have byte semantics for backwards compatibility:

"byte semantics" is a dangerous term, partly because different people
use it for different things. Some people use it to refer to functions
and operators acting on the bytes in the PV's buffer regardless of the
SvUTF8 flag's state, but those functions are generally broken and in
need of repair (as announced in perl5100delta, this would break
compatibility).

By default, "\xff" by itself will indeed create a string that
*internally* is a single byte 0xff.

A Perl string is a Unicode string. Or actually, a sequence of almost
arbitrary integer values that most operations ought to interpret as
unicode codepoints. If it contains only characters < 256, it may be
"encoded as latin1" (represented as 8 bit with a straight mapping)
internally, both for efficiency and for backwards compatibility. When
strings are sent or received with system calls, that has to occur in
bytes. If a string only contains characters < 256, it can be used as a
byte string. (Note: I originally believed otherwise and was wrong.)

Still, it can be useful to write your program so that a string that will
be used as a byte string is never upgraded to UTF8
*internally*: upgrading and downgrading it again might be a performance
issue. There should be no difference in semantics, regardless of the
internal encoding of the string. It is a bug that there is.
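
A minimal sketch of that claim, using only the built-in utf8:: functions. The variable names are illustrative. Upgrading changes the internal representation, while comparison, length and ord are expected to see no difference:

```perl
use strict;
use warnings;

# "\xff" is one character with ordinal 255; by default it is stored
# internally as the single byte 0xff.
my $native   = "\xff";
my $upgraded = $native;

# utf8::upgrade changes only the internal representation (to UTF-8).
utf8::upgrade($upgraded);

my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# At the language level the two strings are still the same string.
my $equal  = ($native eq $upgraded) ? 1 : 0;
my $length = length $upgraded;
my $ord    = ord $upgraded;

printf "flags: %d/%d equal: %d length: %d ord: %d\n",
    $native_flag, $upgraded_flag, $equal, $length, $ord;
```

The operations where the flag does leak through are the ones discussed elsewhere in this thread (uc, lc, some character classes).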

I believe that this snippet:

> perluniintro.pod:
> | Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
> | and C<chr(...)> for arguments less than C<0x100> (decimal 256)
> | generate an eight-bit character for backward compatibility with older
> | Perls. For arguments of C<0x100> or more, Unicode characters are
> | always produced. If you want to force the production of Unicode
> | characters regardless of the numeric value, use C<pack("U", ...)>
> | instead of C<\x..>, C<\x{...}>, or C<chr()>.

is misleading. It suggests that Perl has two kinds of strings
technically, which is not true. There is a single string type with two
*internal* representations. The word *internal* is notably missing in
the quoted part of perluniintro.

Let's change "generate an eight-bit character" to "generate a string
that has an eight-bit encoding internally".

In any case, CHARACTERS DO NOT HAVE BITS. Bytes have 8 bits, characters
just have a number.

> As perluniintro.pod above points out, the only reliable way to do this
> is pack("U", $codepoint). Or you can use named characters using
> charnames.pm.

If the remaining bugs in Perl (see also Unicode::Semantics) are fixed,
then there is no longer any *need* for forcing the internal encoding to
UTF8.

This said, I think that pack("U", $codepoint) is not a very good idea.
Without digressing into details, I would like to point out that it's
usually better to associate the upgrade with the buggy operator, rather
than the string itself.

So instead of:

my $char = pack("U", $codepoint);

... # perhaps lots of code here

my $uc = uc($char);

I would suggest using:

my $char = chr($codepoint);

... # perhaps lots of code here

utf8::upgrade($char); # work around bug
my $uc = uc($char);
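
The same workaround as a self-contained sketch. uc_unicode() is a hypothetical helper name, not a core function. It upgrades a copy right before the affected operator, so the caller's string keeps whatever internal representation it had:

```perl
use strict;
use warnings;

# uc_unicode() is a hypothetical helper, shown only to illustrate
# tying the upgrade to the operator instead of to the string.
sub uc_unicode {
    my ($string) = @_;          # a copy; the caller's string is untouched
    utf8::upgrade($string);     # work around the flag-dependent uc
    return uc $string;
}

my $char = chr 0xE4;            # LATIN SMALL LETTER A WITH DIAERESIS
my $uc   = uc_unicode($char);

printf "U+%04X\n", ord $uc;
```

What plain uc($char) returns without the upgrade depends on the perl version and on which pragmas are in effect, so the sketch only relies on the upgraded case.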

> [perluniintro]
> | Internally, Perl currently uses either whatever the native eight-bit
> | character set of the platform (for example Latin-1) is

This is simply not true. Perl uses either latin1 or ebcdic for its
internal eight-bit strings. Not Windows-1252, for example.

> | defaulting to UTF-8, to encode Unicode strings.

defaulting to UTF-8, WITH A WARNING, for strings that could not be
downgraded, i.e. strings that contain characters > 255.

The warning is there for a reason: it says you're doing it wrong. You're
forcing a byte-incompatible string on a byte operation (system call),
and forgot to encode.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 7:15 AM

Post #33 of 52 (202 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Rafael Garcia-Suarez skribis 2008-05-20 10:55 (+0200):
> 2008/5/20 Marc Lehmann <schmorp[at]schmorp.de>:
> > Unfortunately, perl doesn't really handle it that way. regexes for example
> > treat the same number on the perl level differently depending on how its
> > encoded internally.
> > And this is a problem.
> You could add uc/lc to the list.

If you're looking for a list, Unicode::Semantics has documentation that
has such a list. It's probably not complete, but a starting point.

* uc, lc, ucfirst, lcfirst, \U, \L, \u, \l
* \d, \s, \w, \D, \S, \W
* /.../i, (?i:...)
* /[[:posix:]]/

> Now, at the perl language level, I think the problem we have is that
> we sometimes want uc, lc or //i to have Unicode semantics, and sometimes
> not. (other operations here ?)

Er, why "sometimes not"?

Why would you uppercase something that's not text?

I suggest that we keep the possibility to uppercase only the ASCII
character range, and call that ASCII::uc(), while the normal uc() is
made Unicode compliant regardless of the PV's state.

Maybe this should even be called Unicode::uc(), and uc() should
"default" to Unicode, with "use ASCII qw(uc);" and "use Unicode qw(uc);"
as ways to override the default.

> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".

There are three ways of dealing with text data in Perl:

1. The text is unicode (i.e. uc("aä") eq "AÄ")
2. The text is ASCII (i.e. uc("aä") eq "Aä")
3. It's determined by the UTF8 flag.

It is now widely agreed that 3 is wrong. However, many parts of perl use
option 3 now.

I suggest that there'll also be a "use feature" called
unicode_by_default, that does no more than include the new pragma to
enable unicode semantics. That way, "use v5.12;" would include the pragma,
so that you don't forget to request a specific behaviour.

> Additionally, we can add a regexp flag qr//u, that says "this
> regexp matches with Unicode semantics". (I'm thinking out loud
> here)

I have suggested /u(nicode), /a(scii) before. These are "needed" in
addition to the pragma, because of qr//: there must be a way to
stringify the lexically selected behavior so it survives the end of the
lexical scope.

> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
> should be applied. Big change, not backwards compatible, but IMO
> needed for sanity.

Yes!

However, there's also a way to

> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling.

No, we don't want this to happen MAGICALLY. Or at least I really do not
want Perl to do that. This is one place where DWIM heuristics simply
cannot work.

> Does that mean that we need to add a new kind of data to perl,
> "Unicode SV" ? Will that solve problems ? What problems will this
> create ?

Indeed there could be a way to indicate "I intend this string to be a
byte string". I have a module, called BLOB.pm, in the works that makes
this very easy. I'll try to release it really soon so you can have a
look.

Because of the way BLOB works, it could probably be used by XS and core
code too. BLOB assumes that everything is text until explicitly marked
as binary.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 7:22 AM

Post #34 of 52 (202 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Marc Lehmann skribis 2008-05-19 22:19 (+0200):
> Such a model is workable, but there would need to be a defined way to convert
> external filenames (e.g. on the command line) into something perl's open
> understands.

I suggest one way in Message-ID: <20070919101638.GL28915[at]c4.convolution.nl>
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


rgarciasuarez at gmail

May 20, 2008, 7:39 AM

Post #35 of 52 (203 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/20 Juerd Waalboer <juerd[at]convolution.nl>:
>> Now, at the perl language level, I think the problem we have is that
>> we sometimes want uc, lc or //i to have Unicode semantics, and sometimes
>> not. (other operations here ?)
>
> Er, why "sometimes not"?
>
> Why would you uppercase something that's not text?
>
> I suggest that we keep the possibility to uppercase only the ASCII
> character range, and call that ASCII::uc(), while the normal uc() is
> made Unicode compliant regardless of the PV's state.
>
> Maybe this should even be called Unicode::uc(), and uc() should
> "default" to Unicode, with "use ASCII qw(uc);" and "use Unicode qw(uc);"
> as ways to override the default.

Likewise for char-classes ? So that's the unicode pragma I was talking
about, with possible sub-pragmas :
use unicode qw(uc);
no unicode qw(regex);

>> Additionally, we can add a regexp flag qr//u, that says "this
>> regexp matches with Unicode semantics". (I'm thinking out loud
>> here)
>
> I have suggested /u(nicode), /a(scii) before. These are "needed" in
> addition to the pragma, because of qr//: there must be a way to
> stringify the lexically selected behavior so it survives the end of the
> lexical scope.

What about (?u:...) ? What about mixing qr//u and qr//a in the same
match ?

>> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
>> should be applied. Big change, not backwards compatible, but IMO
>> needed for sanity.
>
> Yes!
>
> However, there's also a way to

Missing sentence ?

>> But sometimes we want perl to magically switch between Unicode and
>> non-Unicode semantics depending on the data it's handling.
>
> No, we don't want this to happen MAGICALLY. Or at least I really do not
> want Perl to do that. This is one place where DWIM heuristics simply
> cannot work.

I see your point. Sometimes I'm thick.

>> Does that mean that we need to add a new kind of data to perl,
>> "Unicode SV" ? Will that solve problems ? What problems will this
>> create ?
>
> Indeed there could be a way to indicate "I intend this string to be a
> byte string". I have a module, called BLOB.pm, in the works that makes
> this very easy. I'll try to release it really soon so you can have a
> look.

Didn't you talk about it at one of the Amsterdam.pm meetings ?

> Because of the way BLOB works, it could probably be used by XS and core
> code too. BLOB assumes that everything is text until explicitly marked
> as binary.

Indeed, the symmetrical alternative to a "Unicode SV" would be a "Binary
SV". But Unicode SVs look less appealing to me if we force Unicode
semantics on everything and don't apply heuristics depending on the data
type.


juerd at convolution

May 20, 2008, 8:21 AM

Post #36 of 52 (203 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Rafael Garcia-Suarez skribis 2008-05-20 16:39 (+0200):
> > Maybe this should even be called Unicode::uc(), and uc() should
> > "default" to Unicode, with "use ASCII qw(uc);" and "use Unicode qw(uc);"
> > as ways to override the default.
> Likewise for char-classes ? So that's the unicode pragma I was talking
> about, with possible sub-pragmas :
> use unicode qw(uc);
> no unicode qw(regex);

The problem with a "no unicode" is that it doesn't say what the other
alternative is. I'd rather have two separate (mutually incompatible)
pragmas than one with an overloaded meaning for the unimport case.

> >> Additionally, we can add a regexp flag qr//u, that says "this
> >> regexp matches with Unicode semantics". (I'm thinking out loud
> >> here)
> > I have suggested /u(nicode), /a(scii) before. These are "needed" in
> > addition to the pragma, because of qr//: there must be a way to
> > stringify the lexically selected behavior so it survives the end of the
> > lexical scope.
> What about (?u:...) ? What about mixing qr//u and qr//a in the same
> match ?

(?u:...) is the same as /u of course. That's what I meant with the
requirement of qr// being able to stringify any /u that it got (either
explicitly or through a pragma that makes it default).

I think it's straightforward. A certain set of character classes is
defined to behave in one way under /a, and another under /u. I'd say the
innermost defined one determines, so /foo(?a:\w)/u would match fool but
not fooł.
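
In this thread /u and /a are only a proposal. Modifiers with these names did later ship (perl 5.14 and newer), so, assuming such a perl, the "innermost one wins" rule and the stringification requirement can be sketched like this (variable names are illustrative):

```perl
use strict;
use warnings;

# Outer /u, inner (?a:...): the innermost modifier decides for \w.
my $re = qr/foo(?a:\w)/u;

my $ascii_l  = "fool";
my $stroke_l = "foo\x{142}";    # U+0142 LATIN SMALL LETTER L WITH STROKE

my $matches_ascii   = ($ascii_l  =~ $re)       ? 1 : 0;
my $matches_stroke  = ($stroke_l =~ $re)       ? 1 : 0;
my $matches_plain_u = ($stroke_l =~ /foo\w/u)  ? 1 : 0;

# A compiled regex carries its modifier along when it is stringified,
# so the behaviour survives the end of the lexical scope.
my $stringified = "" . qr/\w/u;

print "$matches_ascii $matches_stroke $matches_plain_u $stringified\n";
```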

> > However, there's also a way to
> Missing sentence ?

I think that was a window focus problem, and this was supposed to be
said on IRC :)

> > Indeed there could be a way to indicate "I intend this string to be a
> > byte string". I have a module, called BLOB.pm, in the works that makes
> > this very easy. I'll try to release it really soon so you can have a
> > look.
> Didn't you talk about it at one of the Amsterdam.pm meetings ?

Perhaps. I don't remember. I initially announced it at the Dutch Perl
Workshop and then lacked tuits to make it a release.

I might release it in its current state this evening. I believe that
although it's not thoroughly tested yet, it'll work well.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 2:24 PM

Post #37 of 52 (189 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Glenn Linderman skribis 2008-05-20 13:44 (-0700):
> >If a string only contains characters < 256, it can be used as a
> >byte string. (Note: I originally believed otherwise and was wrong.)
> I'm glad to see that you have expanded your understanding of strings to
> realize that they are sequences of integer values.

Just for the record: I've believed this for quite some time now, and I
think my documentation patches are consistent with it. If you find any
inconsistency there, please let me know.

> I'm still a bit concerned by your "almost arbitrary" modifier, mostly
> because I'm not sure what you mean by that.

Perl does assume that the values are Unicode codepoints in some places,
and sometimes warns if they are deemed invalid by Unicode.

One example of many:

my $foo = chr 0xffff;

Warns:

Unicode character 0xffff is illegal at -e line 1.

Even though ord($foo) properly returns 65535 afterwards.

This is not consistent with the view that a string is made of characters
which are just integers, with no character set logic implied.

By "almost arbitrary" I mean that while it is possible to use these
values, Perl will complain about it.
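
A small sketch of the round trip. Whether a warning is emitted at this point, and with which wording, depends on the perl version, so the code only collects warnings and makes no assumption about how many there are:

```perl
use strict;
use warnings;

my @warnings;
my $foo;
{
    # Collect warnings instead of printing them; some perls warn here,
    # others do not.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    $foo = chr 0xffff;
}

# The value itself is stored and retrieved unchanged.
my $ord = ord $foo;

printf "ord: %d, warnings seen: %d\n", $ord, scalar @warnings;
```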

See also Chris Hall's insightful posts about this subject.

> >In any case, CHARACTERS DO NOT HAVE BITS. Bytes have 8 bits, characters
> >just have a number.
> Except for the historical, inherited-from-C, concept of an 8-bit char, I
> could agree with this.

It's a modern computing fact that every byte has 8 bits. It has been
different, yes, but to my knowledge no current computer system has
non-8-bit bytes. I'm not calling that a char, by the way. I don't know why you
think I'm using that concept.

> I _do_ agree that it would be good to develop a set of terminology that
> can be well-defined, used throughout the documentation as it is updated,

Have tried that, but it turned out to be impossible to reach consensus
over the terminology.

Specifically, "character" is a good name for what the Perl documentation
tries to communicate. If you want to store arbitrary integers in a
string, that's supported but entirely up to you. The normal Perl way of
doing that would be to use an array. Strings in Perl are mainly used for
text data and binary data. That's a somewhat limited view of this very
useful data type, but it helps to make teaching doable.

> I continue to use "blorf", but it needs a different name, preferably not
> "character" or "char", because those have too many semantics inherited
> from other programming languages and concepts.

"hash" also has a rather different meaning in general computing jargon:
an MD5 hash is not at all a key-value structure. Sometimes it is
practical to re-use existing words. But it needs to be done consistently
and there needs to be a huge corpus that actually uses the term in its
new meaning. Both requirements are met for "character" in Perl's
documentation.

Note that in a distant past, "lists" were called "arrays" in Perl's
documentation, even though they're very different from "arrays" like
@foo. It is possible to change a word, but there must be a very good
reason for it. In my opinion, inconsistency with Perl stuff is a good
reason, but inconsistency with other languages is not.

> chr and ord are inverse operations.

Only for characters, er blorfs, within the supported range.

> Byte strings are a subset of strings that contain only blorfs in the
> range [0..255].

Note that in general the *operation* determines the kind of string.
Operations involving system communication like print and readdir are
used with binary strings (or explicit encoding through encode(),
encode_utf8(), utf8::encode() or :encoding). Some operations don't care
about how the string will be used, and just work on the charac.. blorfs,
like length() and substr(), whereas others are specifically text
related, and impose textness on the string: uc(), /\w/...

In other words: if you use the string "5" as a number, it IS a number.

If you use the string $foo as a number, it is a number.
If you use the string $foo as binary data, it is binary data [**].
If you use the string $foo as text data, it is text data.

Perl handles this for you.

[**] Of course, just like "5" being perfectly usable as a number and
"hello" not even resembling one, using a string as binary data only
makes sense if it meets the condition for that: it has only ch^Wblorfs
with ordinal values that are less than 256.
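
The condition in [**] can be checked explicitly. A rough sketch follows, where is_byte_string() is a hypothetical helper name and the Encode module is assumed to be available (it ships with perl):

```perl
use strict;
use warnings;
use Encode ();

# is_byte_string() is a hypothetical helper: true when every character
# has an ordinal below 256, so the string can be used as binary data.
sub is_byte_string {
    my ($string) = @_;
    return $string !~ /[^\x00-\xff]/ ? 1 : 0;
}

my $blob = "\x00\x7f\xff";
my $text = "smile \x{263a}";    # contains U+263A, ordinal > 255

my $blob_ok = is_byte_string($blob);
my $text_ok = is_byte_string($text);

# Text with larger ordinals has to be encoded before it is used as bytes.
my $encoded     = Encode::encode('UTF-8', $text);
my $encoded_ok  = is_byte_string($encoded);
my $encoded_len = length $encoded;   # 6 ASCII bytes + 3 bytes for U+263A

print "$blob_ok $text_ok $encoded_ok $encoded_len\n";
```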

But indeed it can be very useful to have a name for strings which are
intended to be used as byte strings later on. (Hm, let's call them
blobs!)

> Is there any argument about the above definitions? I think they are
> pretty universally agreed to, at least conceptually. It seems there are
> bugs where chr doesn't accept all legal blorfs (attempting to mix in
> Unicode semantics), and it seems there are cases where chr and ord are
> not inverse operations in the presence of certain "locales". I consider
> these bugs, does anyone disagree?

It's not a property of chr, but a property of Perl, to "not accept" (if
that's the correct phrase) certain characters:

my $bar = "\x{ffff}";

> The following may be a bit more controversial... but I think they are
> consistent, and would produce an easy to explain system...

I tend to believe that anything that's controversial will never be easy
to explain. (That's okay, though. Sometimes it's needed in order to fix
bigger issues.)

> So, all prior character set standards will, hereafter, be referred to as
> "encodings", meaning that they define a subset of Unicode characters,
> and also a way of representing those characters as bytes or byte sequences.

That's what Perl already does, although it's sometimes hard to convince
Yves that this is actually a /good/ idea. :)

> Encodings fall into several categories:

I don't agree that these categories are a useful distinction. It's very
complex, and only people working on the Encode module suite are served
by the level of detail IMO.

Don't get me wrong. Your list is interesting and educational, just not
of much use to most Perl programmers.

Instead I suggest the following two categories:

1. Single byte encodings: every character is a single byte. By
necessity, only a small subset of Unicode is supported.

2. Multibyte encodings: every character is one or more bytes.
2a. Legacy: Only a subset of Unicode is supported.
2b. Unicode: The whole Unicode set is supported.
2c. Full: A larger range than Unicode is supported.

An encoding may or may not be ASCII-compatible.
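
To make the categories concrete, here is a sketch using the Encode module with iso-8859-1 as the single byte example and UTF-8 as the multibyte Unicode example. The choice of characters is arbitrary:

```perl
use strict;
use warnings;
use Encode ();

my $e_acute = "\xe9";       # U+00E9, inside the latin1 range
my $euro    = "\x{20ac}";   # U+20AC, outside the latin1 range

# Single byte encoding: one byte per character.
my $latin1_len = length(Encode::encode('iso-8859-1', $e_acute));

# Multibyte encoding: one or more bytes per character.
my $utf8_len_e    = length(Encode::encode('UTF-8', $e_acute));
my $utf8_len_euro = length(Encode::encode('UTF-8', $euro));

# A single byte encoding supports only a subset of Unicode: with a
# strict CHECK value, encoding an unsupported character dies.
my $euro_copy = $euro;      # a strict CHECK may modify its argument
my $latin1_euro_ok = eval {
    Encode::encode('iso-8859-1', $euro_copy, Encode::FB_CROAK());
    1;
} ? 1 : 0;

print "$latin1_len $utf8_len_e $utf8_len_euro $latin1_euro_ok\n";
```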

> the only [unicode encoding] that has been put to widespread use
> is UTF-8.

Not true. Windows uses UTF-16 internally, and you can't deny that
Windows is widespread :)

> [It should be noted that Decode has a bug: it presently accepts non-byte
> strings, and treats them as byte strings. It should accept either byte
> or non-byte strings, and produce an error if any of the input blorfs are
> unknown to the expected encoding (generally, any blorf value > 256 is
> unknown to most byte-oriented encodings).

Agreed that decode should not accept any string that has a value > 255.

Note your off-by-one error in "> 256". Those are deadly! :)
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 5:49 PM

Post #38 of 52 (185 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Glenn Linderman skribis 2008-05-20 17:16 (-0700):
> >Instead I suggest the following two categories:
> >1. Single byte encodings: every character is a single byte. By
> >necessity, only a small subset of Unicode is supported.
> >2. Multibyte encodings: every character is one or more bytes.
> >2a. Legacy: Only a subset of Unicode is supported.
> >2b. Unicode: The whole Unicode set is supported.
> >2c. Full: A larger range than Unicode is supported.
> >An encoding may or may not be ASCII-compatible.
> There is only one "ASCII-compatible" encoding: ASCII itself. Other
> things are Extended ASCII, which is only somewhat compatible with
> 7-bit ASCII, not 8-bit ASCII. This is a fine point, but I think you
> can accept the term "Extended ASCII" here?

No, extended ASCII is a wildly confusing term, that many will associate
with IBM codepages. Also, 8 bit ASCII does not exist.

Latin1 is ASCII compatible in that every single byte that's possible in
ASCII, has the same meaning in latin1. Same goes for utf8, but not for
utf16. Although it's actually the other way around (ASCII is latin1
compatible), I think this is a non-confusing description.
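
That notion of ASCII compatibility can be checked directly. This sketch assumes the Encode module and uses UTF-16BE as the utf16 variant, to avoid a byte order mark:

```perl
use strict;
use warnings;
use Encode ();

my $ascii_text = "Perl";    # ASCII-only sample text

my $as_latin1 = Encode::encode('iso-8859-1', $ascii_text);
my $as_utf8   = Encode::encode('UTF-8',      $ascii_text);
my $as_utf16  = Encode::encode('UTF-16BE',   $ascii_text);

# ASCII-compatible: the encoded bytes are the ASCII bytes themselves.
my $latin1_compatible = ($as_latin1 eq $ascii_text) ? 1 : 0;
my $utf8_compatible   = ($as_utf8   eq $ascii_text) ? 1 : 0;
my $utf16_compatible  = ($as_utf16  eq $ascii_text) ? 1 : 0;

print "$latin1_compatible $utf8_compatible $utf16_compatible\n";
```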
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


ben at morrow

May 20, 2008, 6:29 PM

Post #39 of 52 (174 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Quoth rgarciasuarez[at]gmail.com ("Rafael Garcia-Suarez"):
> 2008/5/20 Ben Morrow <ben[at]morrow.me.uk>:
> >> But sometimes we want perl to magically switch between Unicode and
> >> non-Unicode semantics depending on the data it's handling. Does that
> >> mean that we need to add a new kind of data to perl, "Unicode SV" ?
> >> Will that solve problems ? What problems will this create ?
> >
> > This seems sane to me. While we're there we can make the new type (QV?
> > UPV?) a wchar_t* instead of a utf8-encoded char*. That way we get
> > autoconversion as needed, with cacheing. We also get the ability to
> > declare 'all my 8-bit strings are in $encoding' rather than being fixed
> > to ISO8859-1. (This must be a *different* option from the one that says
> > 'my source code is in $encoding', though they could default to the same
> > thing. I don't know what to do about literal strings: reencode them,
> > probably.)
>
> You're mixing Unicode and encodings, there.

If you're converting a string from bytes to Unicode or vice versa (as
part of sv_upgrade) you are doing so according to some encoding.
Currently Perl only allows that encoding to be ISO8859-1, for good
reasons; it seemed to me that your proposal allowed that to change, but
I think I may have misunderstood you, given that below you said...

> And, assuming we add a new flag on SV (let's call it UPOK like you did
> below) for Unicode strings, and that a new quotelike operator qu// is
> added to create them, have "uc qu/ß/" return an UPOK SV containing "SS"
> in the PV slot. That PV slot could be SvUTF8 or not, that should not
> matter and should not be visible from perl. ("SS" is perfectly
> representable in pure ASCII so SvUTF8 isn't needed there.)
>
> On the other hand C<use unicode; uc qq/ß/> would return "SS" without
> the UPOK flag set.

...which isn't what I thought you meant at all.

OK, new proposal; this is how I have thought Unicode *ought* to work in
Perl for some time. It's entirely possible there's some serious flaw
with it, of course... I was (assuming you were) intending UPOK to
represent a new entry in the SV, so struct xpv would become something
like

struct xpv {
char * xpv_pv; /* byte string */
STRLEN xpv_cur;
STRLEN xpv_len;

wchar_t * xpv_upv; /* Unicode string */
STRLEN xpv_ucur;
STRLEN xpv_ulen;
}

(or perhaps xpv would stay as-is, and we'd have an xpvupv like that).
POK says the PV slot is valid, UPOK says the UPV slot is valid, and you
can have both valid at once so you don't have to keep converting a given
string between bytes and characters. You have to keep track of which
representation is canonical, of course, exactly as with string<->number
conversions.

Then you have two forms of (say) 'eq', each of which sv_upgrades its
arguments to the appropriate type; except that (for compatibility) if
you specify neither 'bytes' nor 'unicode' it guesses which you wanted. A
new quote-like qu// would be useful, but it would just be a shortcut for
do { use unicode; qq// }; similarly, a qb// would be useful to get
binary strings when under 'unicode'.
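
The string<->number caching that this proposal takes as its model can be observed from Perl with the core B module. This is only an illustration of the existing POK/IOK behaviour, not of the proposed UPOK slot, and flags_of() is a hypothetical helper name:

```perl
use strict;
use warnings;
use B ();

# flags_of() is a hypothetical helper that reports which slots of a
# scalar are currently marked valid.
sub flags_of {
    my ($ref) = @_;
    my $flags = B::svref_2object($ref)->FLAGS;
    return {
        string  => ($flags & B::SVf_POK()) ? 1 : 0,
        integer => ($flags & B::SVf_IOK()) ? 1 : 0,
    };
}

my $value  = "42";
my $before = flags_of(\$value);

my $sum   = $value + 0;         # numeric use caches the integer form
my $after = flags_of(\$value);

printf "before: POK=%d IOK=%d, after: POK=%d IOK=%d\n",
    $before->{string}, $before->{integer},
    $after->{string},  $after->{integer};
```

Both slots end up valid at once, which is the property the UPV idea would rely on for bytes and characters.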

> Here's my position :
> - to deal with encodings, use Encode.
> - no encoding-aware strings in core perl. (of course, you can still
> use magic, ties, etc. to add behaviour)

I agree here: while strings that knew what encoding they started out as
sound like a cool idea, I suspect it would quickly become unmanageable.

> - the "Unicodeness" of a string would be independent of its SvUTF8 flag.
> It will just indicate that <some list of perl built-ins> must apply
> Unicode semantics when dealing with it.

This feels wrong, to me. Perl has always had polymorphic values and
monomorphic operators; allowing the string to choose which version of
the operator it gets seems like going the other way. In an ideal world,
I would advocate a new set of operators: ueq, ult, u., usubstr, and so
on; since this is obviously impractical, a pragma to choose which 'eq'
you want seems like the way to go.

> - the "unicode" pragma (or whatever name is chosen) will be needed to
> say that <same list of perl built-ins> in its scope must apply
> Unicode semantics to Perl strings. (as opposed to newfangled Unicode
> strings)

Converting said strings to Unicode how? ISO8859-1, as perl does now?

> Currently we have :
>
> $ bleadperl -wle 'print uc "ß"'
> ß
> $ bleadperl -wle 'use utf8; print uc "ß"'
> SS
>
> That's wrong: the pragma utf8 indicates internal encoding, but modifies
> semantics. What I've in mind is : make those two one-liners output an ß.
> Under the "unicode" pragma, make them both output SS.

Yes. the conflation of 'source-file encoding' with 'operator semantics'
was clearly a mistake.
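
Later perls (5.12 and newer) have a lexically scoped feature, unicode_strings, that selects operator semantics independently of the source encoding. Assuming such a perl, the separation can be sketched as follows. "\xdf" is used in place of a literal ß so that the result does not depend on the encoding of the source file:

```perl
use strict;
use warnings;

my $sharp_s = "\xdf";   # U+00DF, stored in the 8-bit internal form

my ($plain, $unicode);
{
    no feature 'unicode_strings';   # flag-dependent semantics
    $plain = uc $sharp_s;
}
{
    use feature 'unicode_strings';  # Unicode semantics for this scope
    $unicode = uc $sharp_s;
}

printf "without: %vd  with: %s\n", $plain, $unicode;
```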

> > - by default, if any argument of a string operation is UPOK then all
> > of them are upgraded to UPOK and the operation occurs on SvUPV; if
> > all are !UPOK then they are all upgraded to POK and the operation
> > occurs on SvPV. (This assumes all numbers can be represented in
> > the current character set :).)
>
> With my proposed outlined implementation, that's upgraded as per
> sv_upgrade.

Yes, that was what I meant.

> Do we really want this upgrade to be done transparently ? Like, in
> concatenating an SV and a USV ? Remember why we needed
> encoding::warnings ? Because we can't know what encoding a
> Perl string is in.
>
> We could do it the hard way (also known as the python way) : forbid
> any mix between Unicode strings and Perl strings. Force people to
> write C<$foo = qu/$foo/> to get Unicode strings. (*regardless* of
> any encoding issue or PerlIO layer or locale or pragma.) Make qq/$foo/
> warn if $foo is a Unicode string (being thus downgraded to a Perl
> string).

I would want to forbid $foo = qu/$foo/ as well, assuming $foo was a byte
string to start with. If we do things this way, the only way to convert
between the two types of string should be with Encode.
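
With today's single string type, the explicit conversion through Encode looks roughly like this. It is a sketch: UTF-8 is an arbitrary choice of encoding, and the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xc3\xa4";                        # UTF-8 encoding of U+00E4
my $text   = Encode::decode('UTF-8', $octets);  # bytes -> characters

my $text_len = length $text;
my $text_ord = ord $text;

# characters -> bytes again
my $round_trip = Encode::encode('UTF-8', $text);
my $same = ($round_trip eq "\xc3\xa4") ? 1 : 0;

# Malformed input is rejected when a strict CHECK value is passed.
my $bad = "\xff";
my $rejected = eval {
    Encode::decode('UTF-8', $bad, Encode::FB_CROAK());
    1;
} ? 0 : 1;

print "$text_len $text_ord $same $rejected\n";
```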

<snip my stuff>
> You're confusing Unicode and encoding again.

Any conversion between a Unicode string and a string of bytes involves
an encoding, no? You seem to be saying the two are not related: am I
completely misunderstanding something, or are you simply stating 'Perl's
byte<->character conversions will always use ISO8859-1, as it's a subset
of Unicode[0], and if you want anything else use Encode' as a decision
you've made?

[0] Yes, yes, all character sets are subsets of Unicode... I mean
Unicode-the-numbered-list rather than -the-unordered-set. AFAIK there
isn't a separate name for it.

> chr moduloes its argument under bytes, and I'd like to keep that:
>
> $ bleadperl -Mbytes -le 'print ord chr 258'
> 2

Yes.

> To my understanding "use bytes" means "don't look at the SvUTF8 flag".
> See in utf8.h :
>
> #define IN_BYTES (CopHINTS_get(PL_curcop) & HINT_BYTES)
> #define DO_UTF8(sv) (SvUTF8(sv) && !IN_BYTES)

Well, that's what it means now. But that's *truly* evil: the user should
never be able to see the raw bytes perl happens to use to store
characters. That's like letting you see the bytes that make up a float.
Attempting to apply byte semantics to a Unicode string should either
auto-convert it or fail.
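
What "seeing the raw bytes" means in practice, as a sketch with the existing bytes pragma. The smiley is an arbitrary character above 255:

```perl
use strict;
use warnings;

my $smiley = "\x{263a}";    # one character, ordinal > 255
my $number = 258;

my $chars = length $smiley; # character semantics

my ($octets, $wrapped);
{
    use bytes;
    # Under bytes, length exposes the internal UTF-8 buffer ...
    $octets  = length $smiley;
    # ... and chr reduces its argument modulo 256.
    $wrapped = ord chr $number;
}

print "$chars $octets $wrapped\n";
```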

> > - 'use locale' can probably be made to work again, if it is only
> > applied to SvPV and never to SvUPV. 'use locale' should probably
> > imply 'use bytes', and set the current encoding.
>
> I'd be happy to set locale to rest. In peace.

OK. It's kinda handy to do things like sort correctly, but that kind of
thing is arguably better handled by a module (which can read the
standard locale database if it wants, of course).

Ben

--
All persons, living or dead, are entirely coincidental.
ben[at]morrow.me.uk Kurt Vonnegut


rgarciasuarez at gmail

May 21, 2008, 1:29 AM

Post #40 of 52 (166 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/21 Ben Morrow <ben[at]morrow.me.uk>:
> OK, new proposal; this is how I have thought Unicode *ought* to work in
> Perl for some time. It's entirely possible there's some serious flaw
> with it, of course... I was (assuming you were) intending UPOK to
> represent a new entry in the SV, so struct xpv would become something
> like
>
> struct xpv {
> char * xpv_pv; /* byte string */
> STRLEN xpv_cur;
> STRLEN xpv_len;
>
> wchar_t * xpv_upv; /* Unicode string */

(I'm under the impression that wchar_t is not portable and not suitable
to store Unicode, since its size is implementation-defined. However we
could use UTF-16 or -32 here, if that's more convenient.)

> STRLEN xpv_ucur;
> STRLEN xpv_ulen;
> }
>
> (or perhaps xpv would stay as-is, and we'd have an xpvupv like that).
> POK says the PV slot is valid, UPOK says the UPV slot is valid, and you
> can have both valid at once so you don't have to keep converting a given
> string between bytes and characters. You have to keep track of which
> representation is canonical, of course, exactly as with string<->number
> conversions.
>
> Then you have two forms of (say) 'eq', each of which sv_upgrades its
> arguments to the appropriate type; except that (for compatibility) if
> you specify neither 'bytes' nor 'unicode' it guesses which you wanted. A
> new quote-like qu// would be useful, but it would just be a shortcut for
> do { use unicode; qq// }; similarly, a qb// would be useful to get
> binary strings when under 'unicode'.
>
>> Here's my position :
>> - to deal with encodings, use Encode.
>> - no encoding-aware strings in core perl. (of course, you can still
>> use magic, ties, etc. to add behaviour)
>
> I agree here: while strings that knew what encoding they started out as
> sound like a cool idea, I suspect it would quickly become unmanageable.
>
>> - the "Unicodeness" of a string would be independent of its SvUTF8 flag.
>> It will just indicate that <some list of perl built-ins> must apply
>> Unicode semantics when dealing with it.
>
> This feels wrong, to me. Perl has always had polymorphic values and
> monomorphic operators; allowing the string to choose which version of
> the operator it gets seems like going the other way. In an ideal world,
> I would advocate a new set of operators: ueq, ult, u., usubstr, and so
> on; since this is obviously impractical, a pragma to choose which 'eq'
> you want seems like the way to go.

I now tend to agree with this.
Actually, and moreover, I now tend to agree with Juerd: always apply
default Unicode semantics. Get alternative ops -- or a pragma -- to
get latin-1 semantics. Which makes the whole point of UPOK strings
rather unuseful now.

Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would be
handy, though.

[snipping a lot of thoughtful stuff]


juerd at convolution

May 21, 2008, 5:50 AM

Post #41 of 52 (162 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Rafael Garcia-Suarez skribis 2008-05-21 10:29 (+0200):
> Actually, and moreover, I now tend to agree with Juerd: always apply
> Unicode semantics by default. Get alternative ops -- or a pragma -- to
> get latin-1 semantics. Which makes the whole point of UPOK strings
> rather moot now.
>
> Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would be
> handy, though.

Indeed. I had made BLOB for this, but it can't prevent the upgrades.

Though the thing that upgrades can be made to check ->isa("BLOB"). Would
that be possible; would that be desirable?
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


rgarciasuarez at gmail

May 21, 2008, 5:56 AM

Post #42 of 52 (162 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/21 Juerd Waalboer <juerd[at]convolution.nl>:
> Indeed. I had made BLOB for this, but it can't prevent the upgrades.
>
> Though the thing that upgrades can be made to check ->isa("BLOB"). Would
> that be possible; would that be desirable?

That would be a bit slow, but most strings aren't blessed anyway, so
it's only a matter of checking SvOBJECT in a first step.

However, that would of course imply pulling BLOB into the core.
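
For illustration, the check discussed here (test for blessedness first,
then ->isa) can be sketched in pure Perl. This is only a sketch under
assumptions: My::Blob, mark_binary, is_binary and guarded_upgrade are
hypothetical names, not the API of the BLOB module, and a real check would
presumably live in the C code that performs upgrades (testing SvOBJECT)
rather than in a Perl wrapper.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical marker class standing in for the BLOB idea.
{ package My::Blob; }

# Mark the scalar itself as binary by blessing its referent.
sub mark_binary {
    my ($ref) = @_;
    bless $ref, 'My::Blob';
    return;
}

# Cheap test first (is it blessed at all?), then the slower isa() lookup.
sub is_binary {
    my ($ref) = @_;
    my $class = blessed($ref) or return 0;
    return $class->isa('My::Blob') ? 1 : 0;
}

# Hypothetical guarded upgrade: leave marked strings alone.
sub guarded_upgrade {
    my ($ref) = @_;
    return 0 if is_binary($ref);
    utf8::upgrade($$ref);
    return 1;
}

my $bytes = "\xff\xfe raw";
my $text  = "caf\xe9";
mark_binary(\$bytes);

my $up_bytes = guarded_upgrade(\$bytes);   # refused, string stays as it is
my $up_text  = guarded_upgrade(\$text);    # upgraded in place
```

Unblessed strings only pay for the blessed() test, which corresponds to
the "most strings aren't blessed anyway" argument above.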


davidnicol at gmail

May 21, 2008, 8:02 AM

Post #43 of 52 (156 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

On Wed, May 21, 2008 at 7:56 AM, Rafael Garcia-Suarez
<rgarciasuarez[at]gmail.com> wrote:

>
> However, that would of course imply pulling BLOB into the core.
>

Does this mean extending tie and overload to support BLOB methods?


rgarciasuarez at gmail

May 21, 2008, 8:07 AM

Post #44 of 52 (157 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/21 David Nicol <davidnicol[at]gmail.com>:
> On Wed, May 21, 2008 at 7:56 AM, Rafael Garcia-Suarez
> <rgarciasuarez[at]gmail.com> wrote:
>
>>
>> However, that would of course imply pulling BLOB into the core.
>>
>
> Does this mean extending tie and overload to support BLOB methods?

I'm not sure I see what for?


demerphq at gmail

May 21, 2008, 8:56 AM

Post #45 of 52 (156 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/21 Glenn Linderman <perl[at]nevcal.com>:
> On approximately 5/21/2008 1:29 AM, came the following characters from the
> keyboard of Rafael Garcia-Suarez:
>
>> Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would be
>> handy, though.
>
>
> What's the goal?
>
> If, during the lifetime of a binary string, data gets attached to it that
> makes it get upgraded, and later that data is detached, and the storage
> format is truly transparent, then when the string is used in a context that
> needs bytes, it should be handled properly (if not, let's fix that bug),
> either by downgrading, or by accessing the data and validating that the
> values are each < 256 (which downgrading does as a side effect).

So how would that work exactly? Seriously. Give a general framework
for how it would work. Consider that if it makes things massively
slower, then it's probably not going to fly.

>
> If the goal is to prevent the cost of upgrading and downgrading, well, just
> fix the bug that attached the upgraded data... and the cost of doing so also
> vanishes.

I don't think it's so easy. The code responsible may be very hard to identify.

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"


juerd at convolution

May 21, 2008, 9:06 AM

Post #46 of 52 (155 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

Glenn Linderman skribis 2008-05-21 8:50 (-0700):
> On approximately 5/21/2008 1:29 AM, came the following characters from
> the keyboard of Rafael Garcia-Suarez:
> >Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would be
> >handy, though.
> What's the goal?

Dual:

1. To provide a means of indicating that something is binary rather than
text. This can be useful in encoding-capable DBI drivers/wrappers, for
example, to indicate that a "?" placeholder is already binary, and
should not be text-encoded. (You'd want to do this based on column
introspection but that's very slow and very hard to write portably.)
Another use case involves data serialization for exchange with languages
that have native binary strings, like Java.

2. To prevent programming errors; you should see this as a matter of
strictures. Most silly mistakes made in Unicode programming are related
to people who fail to understand the difference between binary and text
strings, and as a result of that, they sometimes add text strings to
binary strings. While conceptually that's always a mistake, it happens
so often and it's such an easy mistake to make (apparently) that it
would be nice to have language support that changes "upgrade entire
string to SvUTF8" to "add only the new portion as UTF8 (encoded, not
SvUTF8 marked), keep the original as it is"
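
To make the mistake in point 2 concrete, here is a minimal sketch of what
happens when a text string is appended to a binary string, next to the
explicit alternative of encoding the text part first. The variable names
and sample values are illustrative only; encode() is the one from the
Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $binary = "\xc3\xa9";     # two raw bytes
my $text   = "\x{263a}";     # one character above 255

# Naive concatenation: the result is a character string, and the two
# original bytes are now treated as the characters U+00C3 and U+00A9.
my $mixed = $binary . $text;

# Keeping the result binary: encode the text part before appending it.
my $safe = $binary . encode('UTF-8', $text);

# What ends up on the wire if $mixed is encoded later on: the original
# two bytes have doubled in size.
my $mixed_on_wire = encode('UTF-8', $mixed);
```

Here $safe is five bytes long (the two original bytes plus three bytes of
UTF-8), while $mixed_on_wire is seven bytes long because the original
bytes were encoded a second time.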

> If the goal is to prevent the cost of upgrading and downgrading, well,
> just fix the bug that attached the upgraded data... and the cost of
> doing so also vanishes.

Detecting upgrades is hard. There's a module (encoding::warnings) that
enables warnings for it globally, but you often want it on a single
string instead. Indeed the bug where characters >255 are added to the
binary string should be fixed, but finding out where/when that happens
can be a lot of work and currently requires knowledge of internals.
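
As a rough sketch of checking a single string, the following helper
classifies one value using only utf8::is_utf8 and utf8::downgrade with
its FAIL_OK argument. byte_state is a hypothetical name, and the helper
only reports the current state of a string; it does not find the place
where the upgrade happened, which is the hard part described above.

```perl
use strict;
use warnings;

# Hypothetical helper: report the state of one string.
#   'bytes'    - not upgraded internally
#   'upgraded' - upgraded, but every character is < 256
#   'wide'     - contains characters > 255, cannot be a byte string
sub byte_state {
    my ($s) = @_;                      # works on a copy of the argument
    return 'bytes' unless utf8::is_utf8($s);
    return utf8::downgrade($s, 1) ? 'upgraded' : 'wide';
}

my $raw     = "\xff\x00";
my $touched = $raw . "\x{263a}";       # a wide character gets attached ...
my $state_wide = byte_state($touched);

chop $touched;                         # ... and is detached again
my $state_after = byte_state($touched);
```

After the chop, $touched compares equal to $raw but is still upgraded
internally, which is the "attached, then detached" situation raised
earlier in the thread.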
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


demerphq at gmail

May 21, 2008, 9:33 AM

Post #47 of 52 (155 views)
Re: on the almost impossibility to write correct XS modules [In reply to]

2008/5/21 Juerd Waalboer <juerd[at]convolution.nl>:
> Glenn Linderman skribis 2008-05-21 8:50 (-0700):
>> On approximately 5/21/2008 1:29 AM, came the following characters from
>> the keyboard of Rafael Garcia-Suarez:
>> >Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would be
>> >handy, though.
>> What's the goal?
>
> Dual:
>
> 1. To provide a means of indicating that something is binary rather than
> text. This can be useful in encoding-capable DBI drivers/wrappers, for
> example, to indicate that a "?" placeholder is already binary, and
> should not be text-encoded. (You'd want to do this based on column
> introspection but that's very slow and very hard to write portably.)
> Another use case involves data serialization for exchange with languages
> that have native binary strings, like Java.
>
> 2. To prevent programming errors; you should see this as a matter of
> strictures. Most silly mistakes made in Unicode programming are related
> to people who fail to understand the difference between binary and text
> strings, and as a result of that, they sometimes add text strings to
> binary strings. While conceptually that's always a mistake, it happens
> so often and it's such an easy mistake to make (apparently)

It's a natural mistake to make because it's the intended way to work
with binary data in perl. It's just that nobody really appreciated how
this fact and the introduction of widechar would make things complicated.

Cheers,
yves



--
perl -Mre=debug -e "/just|another|perl|hacker/"


juerd at convolution

May 21, 2008, 2:21 PM

Post #48 of 52 (146 views)