
Mailing List Archive: Perl: porters

On the problem of strings and binary data in Perl.

 

 



demerphq at gmail

May 20, 2008, 6:51 AM

Post #1 of 30 (436 views)
On the problem of strings and binary data in Perl.

As we have seen in recent threads we have been somewhat schizophrenic
in how we deal with strings.

I believe I have a proposal which would allow us to bypass these
problems while at the same time maintaining backwards compatibility. I
believe that this solution is compatible with some other proposals
like adding better support for case modifying options and things like
"use unicode semantics" for regexes and stuff.

My proposal is this:

---------------------

Make it such that the utf8 flag on means that the string contains
unicode codepoints encoded as utf8.

When the utf8 flag is off an additional field in the SV would be used
to determine what type of string the data contained. (I guess this
would be a pointer to some struct or an offset into a table)

If a string was not explicitly marked as something else, it would by
default be assumed to be Latin-1. (null pointer or offset=0)

Two strings would only be legally concatenable if they were of the
same type, or if there existed defined conversion routines from both
types to Unicode. In the case of a string type mismatch both would be
upgraded to utf8 according to their type. An exception to this rule
would be a binary string type which would be concatenable with anything,
and which would never be modified nor cause anything else to be
modified when concatenated with it.

We would provide something like bless to mark strings as being of a
particular charset and encoding combination.

WRT Win32:

All strings would be forced to Unicode* and the wide-character APIs
would be used (possibly unless the string was of type ANSI or the
string was of type Binary, in which case the 8-bit APIs would be used).

---------------------
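
The concatenation rule in the proposal above can be modelled in plain Perl to see how it would behave. This is only a sketch under stated assumptions: the `typed`, `to_utf8_bytes` and `concat` helpers and the hash layout are hypothetical, the type tag is simply an Encode charset name (or the made-up tags 'binary' and 'unicode'), and nothing here reflects how a real SV field would work.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical model: a "typed string" is raw bytes plus a type tag.
# The tag is 'binary', 'unicode' (bytes are UTF-8) or an Encode charset name.
sub typed { my ($type, $bytes) = @_; return { type => $type, bytes => $bytes } }

# Convert a typed string to UTF-8 bytes according to its declared type.
# decode() croaks for an unknown charset, which stands in for the
# "no conversion routine defined" error.
sub to_utf8_bytes {
    my ($s) = @_;
    return $s->{bytes} if $s->{type} eq 'unicode';
    return encode('UTF-8', decode($s->{type}, $s->{bytes}));
}

sub concat {
    my ($x, $y) = @_;
    # Binary wins: raw byte copy, nothing is converted.
    if ($x->{type} eq 'binary' or $y->{type} eq 'binary') {
        return typed('binary', $x->{bytes} . $y->{bytes});
    }
    # Same type: plain concatenation, type preserved.
    if ($x->{type} eq $y->{type}) {
        return typed($x->{type}, $x->{bytes} . $y->{bytes});
    }
    # Mismatch: upgrade both to Unicode (UTF-8) according to their types.
    return typed('unicode', to_utf8_bytes($x) . to_utf8_bytes($y));
}

my $latin1 = typed('ISO-8859-1', "caf\xe9");
my $utf16  = typed('UTF-16LE', "!\x00");
my $mixed  = concat($latin1, $utf16);
my $raw    = concat(typed('binary', "\xff\xfe"), $latin1);

printf "%s (%d bytes)\n", $mixed->{type}, length $mixed->{bytes};
printf "%s (%d bytes)\n", $raw->{type},   length $raw->{bytes};
```

The mismatched pair ends up as UTF-8, while the binary case keeps the Latin-1 byte 0xE9 untouched.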

I'm not sure how this would impact XS. I think it would leave existing
XS unchanged, and make new XS easier to write. But I'm open to being
told I'm all wrong. :-)


Yves
* this would throw an error if the string was not of a type that can
be converted to unicode.
PS: I saw the proposal for a UPV type; I'm at a loss to understand how
this would do anything more than make the situation worse.

--
perl -Mre=debug -e "/just|another|perl|hacker/"


rgarciasuarez at gmail

May 20, 2008, 7:30 AM

Post #2 of 30 (422 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/20 demerphq <demerphq[at]gmail.com>:
> Make it such that the utf8 flag on means that the string contains
> unicode codepoints encoded as utf8.
>
> When the utf8 flag is off an additional field in the SV would be used
> to determine what type of string the data contained. (I guess this
> would be a pointer to some struct or an offset into a table)
>
> If a string was not explicitly marked to be something else it would be
> default assumed to be Latin-1. (null pointer or offset=0)

So, you could mark strings as unicode via this mechanism.
Then those strings (SvUTF8_off but yet unicode) would use unicode
semantics for //i, \d (etc.), uc and lc, if I understand correctly.

> Two strings would only be legally concatenable if they were of the
> same type, or if there existed defined conversion routines from both
> types to Unicode. In the case of a string type mismatch both would be
> upgraded to utf8 according to their type.

But how would that solve the problem of:

$ bleadperl -wle '$x="\xdf";print uc $x'
ß
$ bleadperl -wle '$x="\xdf";$x.=chr(0x100);chop $x;print uc $x'
SS
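
The same effect can be reproduced without chr(0x100) and chop by upgrading the string explicitly. This is a minimal sketch, assuming a perl in which neither a locale pragma nor the later `unicode_strings` feature is in effect, since either changes the result.

```perl
use strict;
use warnings;

# Deliberately no "use locale" and no "use feature 'unicode_strings'".
my $native   = "\xdf";        # LATIN SMALL LETTER SHARP S, stored as one byte
my $upgraded = "\xdf";
utf8::upgrade($upgraded);     # same character, now stored internally as UTF-8

printf "same string: %s\n", ($native eq $upgraded ? "yes" : "no");
printf "uc(native) has length %d\n",   length uc $native;
printf "uc(upgraded) has length %d\n", length uc $upgraded;
```

The two strings compare equal, yet uc() leaves the first alone and turns the second into "SS".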

> An exception to this rule
> would be a binary string type which would be concatable with anything,

and the result would be binary? I assume that's binary as in "string
of numbers that can be > 255".

> and which would never be modified nor cause anything else to be
> modified when concatenated with it.
>
> We would provide something like bless to mark strings as being of a
> particular charset and encoding combination.

How would you acquire this flag for string literals? For strings read
via PerlIO layers?

> WRT Win32:
>
> All strings would be forced to unicode* and the widecharacter apis
> would be used (possibly unless the string was of type ANSI or the
> string was of type Binary in which case the 8 bit apis would be used).

Also, see "Virtualize operating system access" in perltodo. Win32 is
just a specific case; we need to fix the problem in the more general
case.

> Im not sure how this would impact XS. I think it would leave existing
> XS unchanged, and make new XS easier to write. But im open to being
> told im all wrong. :-)

That would probably make writing typemaps easier, yes (including a
better default typemap)

> Yves
> * this would throw an error if the string was not of a type that can
> be converted to unicode.
> PS: I saw the proposal for a UPV type; I'm at a loss to understand how
> this would do anything more than make the situation worse.

Waiting for comments. This was just a wild idea -- I'm not sure it
solves a real problem, and can certainly create new ones. It's
obviously inspired by what python calls Unicode strings.


moritz at casella

May 20, 2008, 8:14 AM

Post #3 of 30 (399 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

demerphq wrote:
> Make it such that the utf8 flag on means that the string contains
> unicode codepoints encoded as utf8.
>
> When the utf8 flag is off an additional field in the SV would be used
> to determine what type of string the data contained. (I guess this
> would be a pointer to some struct or an offset into a table)
>
> If a string was not explicitly marked to be something else it would be
> default assumed to be Latin-1. (null pointer or offset=0)

I know the plan is to preserve backwards compatibility, but is a default
of Latin-1 really sane?

> Two strings would only be legally concatenable if they were of the
> same type, or if there existed defined conversion routines from both
> types to Unicode. In the case of a string type mismatch both would be
> upgraded to utf8 according to their type. An exception to this rule
> would be a binary string type which would be concatable with anything,
> and which would never be modified nor cause anything else to be
> modified when concatenated with it.

So you would permit concatenation of text and binary strings? What would
the result be?
IMHO that only makes sense if you downgrade the text string at the same
time, and convert it into a defined character encoding. (What that is
could be defined by a pragma or a special var). I'd also like the option
to make that emit a warning or simply die.


> We would provide something like bless to mark strings as being of a
> particular charset and encoding combination.

If you know the encoding, why not just decode() it?

> WRT Win32:
>
> All strings would be forced to unicode* and the widecharacter apis
> would be used (possibly unless the string was of type ANSI or the
> string was of type Binary in which case the 8 bit apis would be used).

All strings? So no handling of binary data on Windows anymore? I assume
you meant something else; could you please clarify what exactly you meant?

Generally this idea sounds sane*, insofar as the additional meta
information could greatly simplify debugging, especially if you can
ensure that all data from the outside is binary until you associate it
with a character encoding.

* (my understanding of perl core code is quite limited, so this judgment
doesn't really mean anything ;)

Cheers,
Moritz


nospam-abuse at bloodgate

May 20, 2008, 8:39 AM

Post #4 of 30 (418 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

On Tuesday 20 May 2008 15:51:30 demerphq wrote:
> As we have seen in recent threads we have been somewhat schizophrenic
> in how we deal with strings.
>
> I believe I have a proposal which would allow us to bypass these
> problems while at the same time maintaining backwards compatibility.
> I believe that this solution is compatible with some other proposals
> like adding better support for case modifying options and things like
> "use unicode semantics" for regexes and stuff.
>
> My proposal is this:
>
> ---------------------
>
> Make it such that the utf8 flag on means that the string contains
> unicode codepoints encoded as utf8.
>
> When the utf8 flag is off an additional field in the SV would be used
> to determine what type of string the data contained. (I guess this
> would be a pointer to some struct or an offset into a table)
>
> If a string was not explicitly marked to be something else it would
> be default assumed to be Latin-1. (null pointer or offset=0)
>
> Two strings would only be legally concatenable if they were of the
> same type, or if there existed defined conversion routines from both
> types to Unicode. In the case of a string type mismatch both would be
> upgraded to utf8 according to their type. An exception to this rule
> would be a binary string type which would be concatable with
> anything, and which would never be modified nor cause anything else
> to be modified when concatenated with it.
>
> We would provide something like bless to mark strings as being of a
> particular charset and encoding combination.

I think that basically solves my "complaint" in that you cannot mark
strings with the encoding they have and thus all the related problems
that stem from that.


So, Yves++.

All the best,

Tels

--
Signed on Tue May 20 17:38:46 2008 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.

"Und jetzt kommen Sie!"

-- Stenkelfeld


demerphq at gmail

May 20, 2008, 9:06 AM

Post #5 of 30 (418 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/20 Rafael Garcia-Suarez <rgarciasuarez[at]gmail.com>:
> 2008/5/20 demerphq <demerphq[at]gmail.com>:
>> Make it such that the utf8 flag on means that the string contains
>> unicode codepoints encoded as utf8.
>>
>> When the utf8 flag is off an additional field in the SV would be used
>> to determine what type of string the data contained. (I guess this
>> would be a pointer to some struct or an offset into a table)
>>
>> If a string was not explicitly marked to be something else it would be
>> default assumed to be Latin-1. (null pointer or offset=0)
>
> So, you could mark strings as unicode via this mechanism.
> Then those strings (SvUTF8_off but yet unicode) would use unicode
> semantics for //i, \d (etc.), uc and lc, if I understand correctly.

I think overall we agree here. Although my point about it being
compatible with other stuff also involved the possibility of making
those use "English" semantics and complementing them with ones that
followed unicode semantics.

>> Two strings would only be legally concatenable if they were of the
>> same type, or if there existed defined conversion routines from both
>> types to Unicode. In the case of a string type mismatch both would be
>> upgraded to utf8 according to their type.
>
> But how would that solve the problem of :
>
> $ bleadperl -wle '$x="\xdf";print uc $x'
> ß
> $ bleadperl -wle '$x="\xdf";$x.=chr(0x100);chop $x;print uc $x'
> SS

It wouldn't. The whole point of my proposal is that it wouldn't change
existing behaviour, but would provide for the possibility of sane
behaviour if you wanted it.

I think this is probably best solved by deciding that uc() applies
"English" rules of capitalization, and similarly for \d, /i, etc.,
and then introducing complements that provide for other
interpretations, most importantly Unicode ones.

Part of the problem here is that capitalization rules are language
based, not character set based. Accordingly character sets don't
define capitalization rules. Unicode is the exception in this regard,
and even for it there are rules which can only be properly applied if
you know the *language* the text is in. So it may be that if we want
to be really careful in our unicode support we need to be able to
track that as well.

>
>> An exception to this rule
>> would be a binary string type which would be concatable with anything,
>
> and the result would be binary ? I assume that's binary as in "string
> of numbers that can be > 255".

Yes. Exactly. If you had a string that contained Unicode-Utf8 data and
you concatenated it to one containing binary it would be a raw byte
copy resulting in another binary string.
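
With today's Perl the "raw byte copy" has to be spelled out by encoding the text side first. This is a small sketch, assuming the text is wanted as UTF-8 bytes; the sample values are made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text   = "\x{20AC}5";        # a character string: EURO SIGN, then "5"
my $binary = "\x00\xff\x10";     # arbitrary bytes

# Take the UTF-8 bytes of the text explicitly, then append them verbatim.
my $result = $binary . encode('UTF-8', $text);

printf "%d bytes, utf8 flag %s\n",
    length $result, (utf8::is_utf8($result) ? "on" : "off");
```

The binary prefix, including the byte 0xFF, comes through unchanged and the result stays a byte string.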

>
>> and which would never be modified nor cause anything else to be
>> modified when concatenated with it.
>>
>> We would provide something like bless to mark strings as being of a
>> particular charset and encoding combination.
>
> How would you acquire this flag for string literals ? for strings read
> via PerlIO layers ?

I would provide both a mechanism like bless and a pragma interface. I
haven't fully thought out how it would work with PerlIO layers, but I
would have thought they already have a lot of the information we need.

>> WRT Win32:
>>
>> All strings would be forced to unicode* and the widecharacter apis
>> would be used (possibly unless the string was of type ANSI or the
>> string was of type Binary in which case the 8 bit apis would be used).
>
> Also, see "Virtualize operating system access" in perltodo. Win32 is
> just a specific case; we need to fix the problem in the more general
> case.

How should (not necessarily does) this work on unix? What should
happen if someone wants to open a filename that is in big5? On windows
I would say "convert to unicode utf-16 and use the wide character
interface", on *nix I have no idea what the right answer would be.
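
The conversion step for the Windows case can be sketched with Encode alone. This assumes the file name arrives as Big5 bytes and only prepares the NUL-terminated UTF-16LE buffer; the call into a wide-character API is left out, and the sample name is illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A file name as Big5 bytes: two CJK characters followed by ".txt".
my $big5_name = "\xa4\xa4\xa4\xe5.txt";

# Decode strictly, so that malformed input dies instead of being guessed at.
my $chars = decode('big5-eten', $big5_name,
                   Encode::FB_CROAK() | Encode::LEAVE_SRC());

# NUL-terminated UTF-16LE, the form a wide-character interface would take.
my $wide = encode('UTF-16LE', $chars . "\0");

printf "%d characters -> %d bytes of UTF-16LE\n", length $chars, length $wide;
```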

>> Im not sure how this would impact XS. I think it would leave existing
>> XS unchanged, and make new XS easier to write. But im open to being
>> told im all wrong. :-)
>
> That would probably make writing typemaps easier, yes (including a
> better default typemap)
>
>> Yves
>> * this would throw an error if the string was not of a type that can
>> be converted to unicode.
>> PS: I saw the proposal for a UPV type; I'm at a loss to understand how
>> this would do anything more than make the situation worse.
>
> Waiting for comments. This was just a wild idea -- I'm not sure it
> solves a real problem, and can certainly create new ones. It's
> obviously inspired by what python calls Unicode strings.

The UPV idea is fine if it's there "from the beginning", but it doesn't
resolve the issue of how the other type of string behaves. I think it
would just add confusion to an already confused situation, when what we
want is to impose some order on it. If we were going to try to
solve this by adding a new type then I would argue it should be a
binary string type, not a Unicode one. Then we can truly say that
strings hold text of defined natures (depending on the utf8 flag) and
if people want to store or manipulate strings in any other form they
need to use the binary string type. Then at least manipulations of
those strings could be given well-defined behaviour.

But I think both of these would be unperlish. I think that the idea of
actually tracking charset-encoding and possibly language on all
strings, and using Unicode as a conversion point of last resort, fits
better.


--
perl -Mre=debug -e "/just|another|perl|hacker/"


demerphq at gmail

May 20, 2008, 9:11 AM

Post #6 of 30 (418 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/20 Moritz Lenz <moritz[at]casella.verplant.org>:
> demerphq wrote:
>> Make it such that the utf8 flag on means that the string contains
>> unicode codepoints encoded as utf8.
>>
>> When the utf8 flag is off an additional field in the SV would be used
>> to determine what type of string the data contained. (I guess this
>> would be a pointer to some struct or an offset into a table)
>>
>> If a string was not explicitly marked to be something else it would be
>> default assumed to be Latin-1. (null pointer or offset=0)
>
> I know the plan is to preserve backwards compatibility, but is a default
> of Latin-1 really sane?

It's not a question of sanity. It's a question of what has to happen to
a string when it gets converted to Unicode UTF-8 in order to preserve
back-compat.

>> Two strings would only be legally concatenable if they were of the
>> same type, or if there existed defined conversion routines from both
>> types to Unicode. In the case of a string type mismatch both would be
>> upgraded to utf8 according to their type. An exception to this rule
>> would be a binary string type which would be concatable with anything,
>> and which would never be modified nor cause anything else to be
>> modified when concatenated with it.
>
> So you would permit concatenation of text and binary strings? What would
> the result be?

Binary. Raw transcription of bytes.

> IMHO that only makes sense if you downgrade the text string at the same
> time, and convert it into a defined character encoding.

I don't follow you here. We know the charset/encoding of both items, as
that's what my whole proposal is about.

> (What that is
> could be defined by a pragma or a special var). I'd also like the option
> to make that emit a warning or a to simply die.

Well, I guess that would be possible, sure.

>
>
>> We would provide something like bless to mark strings as being of a
>> particular charset and encoding combination.
>
> If you know the encoding, why not just decode() it?

Maybe I don't want to decode it because I'm just going to write it out
later as is and am not going to use any features that would require
converting to Unicode.

>
>> WRT Win32:
>>
>> All strings would be forced to unicode* and the widecharacter apis
>> would be used (possibly unless the string was of type ANSI or the
>> string was of type Binary in which case the 8 bit apis would be used).
>
> All strings? so no handling of binary data in windows anymore? I assume
> you meant something else; could you please clarify what exactly you meant?

No, not all strings. We are talking about API calls here: the argument
to open, for example.

>
> Generally this idea sounds sane*, insofar as the additional meta
> information could greatly simplify debugging, especially if you can
> ensure that all data from the outside is binary until you associate it
> with a character encoding.

I was thinking you would have a pragma for that.


Cheers,
yves



--
perl -Mre=debug -e "/just|another|perl|hacker/"


demerphq at gmail

May 20, 2008, 9:41 AM

Post #7 of 30 (418 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/20 Moritz Lenz <moritz[at]casella.verplant.org>:
> demerphq wrote:
>> 2008/5/20 Moritz Lenz <moritz[at]casella.verplant.org>:
>>> demerphq wrote:
>>>> Two strings would only be legally concatenable if they were of the
>>>> same type, or if there existed defined conversion routines from both
>>>> types to Unicode. In the case of a string type mismatch both would be
>>>> upgraded to utf8 according to their type. An exception to this rule
>>>> would be a binary string type which would be concatable with anything,
>>>> and which would never be modified nor cause anything else to be
>>>> modified when concatenated with it.
>>>
>>> So you would permit concatenation of text and binary strings? What would
>>> the result be?
>>
>> Binary. Raw transcription of bytes.
>>
>>> IMHO that only makes sense if you downgrade the text string at the same
>>> time, and convert it into a defined character encoding.
>>
>> I dont follow you here. We know the charset/encoding of both items, as
>> thats what my whole proposal is about.
>
> I haven't wrapped my head fully around it.
> If we have two strings, one in latin-1 and one in utf-16. Now we
> concatenate them. The result is a unicode string (marked as such), but
> does it have a defined encoding? If so, which?

unicode utf8.

> Because if we allow concatenation of binary data and text strings, all
> text strings have to carry an unambiguous encoding information. So we
> need rules how to propagate those.

Binary + anything is binary.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


moritz at casella

May 20, 2008, 9:51 AM

Post #8 of 30 (399 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

demerphq wrote:
> 2008/5/20 Moritz Lenz <moritz[at]casella.verplant.org>:
>> demerphq wrote:
>>> Two strings would only be legally concatenable if they were of the
>>> same type, or if there existed defined conversion routines from both
>>> types to Unicode. In the case of a string type mismatch both would be
>>> upgraded to utf8 according to their type. An exception to this rule
>>> would be a binary string type which would be concatable with anything,
>>> and which would never be modified nor cause anything else to be
>>> modified when concatenated with it.
>>
>> So you would permit concatenation of text and binary strings? What would
>> the result be?
>
> Binary. Raw transcription of bytes.
>
>> IMHO that only makes sense if you downgrade the text string at the same
>> time, and convert it into a defined character encoding.
>
> I dont follow you here. We know the charset/encoding of both items, as
> thats what my whole proposal is about.

I haven't wrapped my head fully around it.
Say we have two strings, one in latin-1 and one in utf-16. Now we
concatenate them. The result is a unicode string (marked as such), but
does it have a defined encoding? If so, which?
Because if we allow concatenation of binary data and text strings, all
text strings have to carry unambiguous encoding information. So we
need rules for how to propagate it.

Cheers,
Moritz

--
Moritz Lenz
http://moritz.faui2k3.org/ | http://perl-6.de/


juerd at convolution

May 20, 2008, 11:25 AM

Post #9 of 30 (411 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

demerphq skribis 2008-05-20 15:51 (+0200):
> Make it such that the utf8 flag on means that the string contains
> unicode codepoints encoded as utf8.

Is "string" here the internal buffer in the PV, or the language level
conceptual string?

I hope the former, but let's not allow any assumptions.

> When the utf8 flag is off an additional field in the SV would be used
> to determine what type of string the data contained. (I guess this
> would be a pointer to some struct or an offset into a table)
> If a string was not explicitly marked to be something else it would be
> default assumed to be Latin-1. (null pointer or offset=0)

Why do you feel that Perl 5 needs to have multiple string types?

I think that it currently does very well without, and am afraid that
adding types to Perl 5 is very invasive, and opens up lots of
opportunities for new errors.

In my opinion the problem with Unicode in Perl is that semantics are
decided based on the internal state of the string, and that the
documentation is (still) wildly misleading. Strings themselves work
very well and are adequate for most needs, including text and binary
handling.

> Two strings would only be legally concatenable if they were of the
> same type, or if there existed defined conversion routines from both
> types to Unicode.

It helps to have only Unicode text strings and binary strings, for a
very simple reason: mixing text and binary doesn't happen in correct
applications. However, having multiple character sets implemented in
Perl doesn't sound very attractive because it's a complex and almost
academic solution for a rather small problem.

Currently there is one string type and it is used for binary data and
for text data. It performs well and is thoroughly tested. It could help
in some cases to be able to explicitly state your intentions, e.g. "this
string is binary", if Perl can then help you keep that promise by
keeping characters > 255 away from it, but if that can be done
externally, I think that's a more interesting solution than adding a
whole system of types to Perl.
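
An external check of this kind is already possible with utf8::downgrade. The sketch below is one way to do it, not the BLOB implementation; `is_byte_string` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character fits in one byte,
# i.e. the string can still be treated as binary data.
sub is_byte_string {
    my ($s) = @_;                            # a copy; downgrade works in place
    return utf8::downgrade($s, 1) ? 1 : 0;   # second argument: fail quietly
}

my $blob = "\x89PNG\r\n";
print is_byte_string($blob)             ? "ok\n" : "wide character\n";
print is_byte_string($blob . "\x{100}") ? "ok\n" : "wide character\n";
```

The second call reports a wide character because U+0100 cannot be represented in a single byte.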

I believe that it can be done externally except for one slightly
problematic case: automatic upgrading. This, however, can be solved at
end points. See BLOB when it's there. I still intend to release it
really shortly. That is: when I find my working directory :)

> In the case of a string type mismatch both would be
> upgraded to utf8 according to their type. An exception to this rule
> would be a binary string type which would be concatable with anything,
> and which would never be modified nor cause anything else to be
> modified when concatenated with it.

I think this is a good idea, but without the possibility for indicating
arbitrary types. Having two string types, unicode and binary, or text
and binary, suffices. What I'm suggesting with BLOB is a tiny bit
different, and provides hybrid and binary, where "hybrid" is the current
string type as we know it, and "binary" too, but with a definite
indication that it shouldn't ever contain high codepoints.

> We would provide something like bless to mark strings as being of a
> particular charset and encoding combination.

In fact, BLOB does use bless! After all, it is the referent, not the
reference, that is blessed, so normal strings can carry a bit of extra
data like this.
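
That bless marks the referent can be seen with an ordinary scalar. This is only an illustration of the mechanism; the class name `My::BinaryString` is made up and says nothing about how BLOB itself is organised.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

my $data = "\x00\x01\x02";
bless \$data, 'My::BinaryString';   # hypothetical class, used only as a marker

# The scalar itself carries the mark, so any new reference to it sees it ...
print blessed(\$data), "\n";
# ... while its value is still an ordinary string.
print length($data), "\n";

# A copy is a new scalar and does not inherit the mark.
my $copy = $data;
print defined blessed(\$copy) ? "marked\n" : "unmarked\n";
```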

> Im not sure how this would impact XS. I think it would leave existing
> XS unchanged, and make new XS easier to write. But im open to being
> told im all wrong. :-)

I think this is a very sane approach given the vast amounts of XS
already in production use.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 11:27 AM

Post #10 of 30 (411 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

Tels skribis 2008-05-20 17:39 (+0200):
> I think that basically solves my "complaint" in that you cannot mark
> strings with the encoding they have and thus all the related problems
> that stem from that.

I think it is not the string itself that during its life needs to know
its encoding. Instead, when you're receiving the string (data always
comes from somewhere!), decode it to Unicode, and keep it as a unicode
string during your program. No changes to Perl are required for this,
just a tiny change in how you use Perl.

The :encoding layer already provides this functionality, as do many CPAN
modules. Let's NOT change the entire string model AGAIN, please!
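
Decoding at the point of entry with the :encoding layer looks roughly like this. An in-memory file stands in for real input here, and the assumption is that the incoming data is UTF-8.

```perl
use strict;
use warnings;

# Stand-in for external input: the UTF-8 bytes of a five-letter word
# containing U+00EF, followed by a newline.
my $octets = "na\xc3\xafve\n";

# The layer decodes while reading, so the program only sees characters.
open my $in, '<:encoding(UTF-8)', \$octets or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "%d characters from %d bytes\n", length $line, length($octets) - 1;
```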
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 11:31 AM

Post #11 of 30 (411 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

demerphq skribis 2008-05-20 18:41 (+0200):
> Binary + anything is binary.

binary + anything is problematic if not well defined, and no change to
Perl's string model can ever change that.

What is $jpeg_data . $gif_data? What is $pdf_data . $text_string?

In fact, this is very much the same category of problems as "hello" + 5.
Can't do anything about it without losing information, and a warning is
very welcome.

And it doesn't really matter which side wins. Binary + anything = binary
could work, but is NOT backwards compatible because currently Perl does
binary + text = text (it has no "anything"), if you must use those
terms. Of course, in reality, string + string = string is what really
happens.
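
What "string + string = string" does to undecoded data can be shown in a few lines. The values are made up; the point is that the byte string is treated as Latin-1 characters when it meets a wide character, silently.

```perl
use strict;
use warnings;

my $bytes = "\xc3\xa9";        # the UTF-8 encoding of U+00E9, still undecoded
my $text  = "\x{263A}";        # a character above 255

my $joined = $bytes . $text;   # the two bytes become two characters

printf "U+%04X ", ord($_) for split //, $joined;
print "\n";
```

The result has three characters, U+00C3 U+00A9 U+263A, rather than the two the programmer probably had in mind.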
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


nospam-abuse at bloodgate

May 20, 2008, 11:38 AM

Post #12 of 30 (411 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

On Tuesday 20 May 2008 20:27:50 Juerd Waalboer wrote:
> Tels skribis 2008-05-20 17:39 (+0200):
> > I think that basically solves my "complaint" in that you cannot
> > mark strings with the encoding they have and thus all the related
> > problems that stem from that.
>
> I think it is not the string itself that during its life needs to
> know its encoding. Instead, when you're receiving the string (data
> always comes from somewhere!), decode it to Unicode, and keep it as a
> unicode string during your program. No changes to Perl are required
> for this, just a tiny change in how you use Perl.

But what if I _don't_ want to decode it to Unicode? Decoding it can be
wasteful, plus it can happen behind the scenes, and without warning.

All the best,

Tels

--
Signed on Tue May 20 20:38:03 2008 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

"Carpal Tunnel Syndrome is a non-fatal terminal disease."

-- Dr. Alexander Fisher


davidnicol at gmail

May 20, 2008, 12:56 PM

Post #13 of 30 (399 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

On Tue, May 20, 2008 at 9:30 AM, Rafael Garcia-Suarez

> Also, see "Virtualize operating system access" in perltodo. Win32 is
> just a specific case; we need to fix the problem in the more general
> case.

Does Apache Portable Runtime have everything that is required?

http://apr.apache.org/docs/apr/0.9/group__apr__filepath.html
documents support for file names to be encoded utf8, locale, or unknown.


abigail at abigail

May 20, 2008, 1:12 PM

Post #14 of 30 (400 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

On Tue, May 20, 2008 at 08:38:37PM +0200, Tels wrote:
> On Tuesday 20 May 2008 20:27:50 Juerd Waalboer wrote:
> > Tels skribis 2008-05-20 17:39 (+0200):
> > > I think that basically solves my "complaint" in that you cannot
> > > mark strings with the encoding they have and thus all the related
> > > problems that stem from that.
> >
> > I think it is not the string itself that during its life needs to
> > know its encoding. Instead, when you're receiving the string (data
> > always comes from somewhere!), decode it to Unicode, and keep it as a
> > unicode string during your program. No changes to Perl are required
> > for this, just a tiny change in how you use Perl.
>
> But what if I _dont_ want to decode it to Unicode?

Then use C!

> Decoding it can be
> wasteful, plus it can happen behind the scenes, and without warning.

So what? Garbage collection can be wasteful as well, plus it can happen
behind the scenes, and without warning. And you don't have the option to
disable it.

Really, I think that if the internal representation were totally
hidden from the surface of the language (as I think Juerd is suggesting)
a request from the programmer to not decode it to Unicode is nonsensical.

Perl does a *ton* of things behind the scenes, out of the control of the
programmer. If the programmer is uncomfortable with that, (s)he shouldn't
use Perl, but a language that gives him/her the control. C for instance.

I use Perl because I don't have to care what happens behind the scenes.
Perl does the memory management for me. Perl converts strings into numbers
for me. And perl should have some internal representation of strings that
shouldn't concern me when I'm coding Perl. It ought to be transparent.



Abigail


juerd at convolution

May 20, 2008, 1:40 PM

Post #15 of 30 (399 views)
Re: On the problem of strings and binary data in Perl. [In reply to]

Tels skribis 2008-05-20 20:38 (+0200):
> But what if I _dont_ want to decode it to Unicode?

Then don't treat it as text. Treat it as the arbitrary binary data that
it is.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 1:46 PM

Post #16 of 30 (399 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

Moritz Lenz skribis 2008-05-20 18:51 (+0200):
> The result is a unicode string (marked as such), but does it have a
> defined encoding? If so, which? Because if we allow concatenation of
> binary data and text strings, all text strings have to carry an
> unambiguous encoding information. So we need rules how to propagate
> those.

No, the binary string should have encoding information. Text strings,
also called unicode strings, are conceptually not encoded at all, and
there are good reasons to put the encoding information on the binary
side. The most important reason is that the binary side sets the
constraints, and that they will probably stay the same there, while the
text string can be used with multiple different binary receptors.

However, I still do believe that associating encoding with operations
makes more sense than associating it with strings; at least in Perl 5.
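
One way to picture "encoding belongs to the operation" with what Perl 5 already has: the same text string is written through two handles, and each handle's layer decides the octets. This is only a sketch; `octets_written` is a hypothetical helper, and the in-memory handle stands in for a real file or socket.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through a handle whose layer does the
# encoding, and return the octets that reached the "receptor".
sub octets_written {
    my ($text, $encoding) = @_;
    my $buffer = '';
    open my $fh, ">:encoding($encoding)", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $buffer;
}

# One text string (c, a, f, U+00E9), two binary receptors.
my $text = "caf\x{e9}";
for my $encoding (qw(iso-8859-1 UTF-8)) {
    printf "%-10s %s\n", $encoding,
        unpack('H*', octets_written($text, $encoding));
}
```

The text string itself carries no encoding; only the handles do.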
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 2:26 PM

Post #17 of 30 (399 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

Moritz Lenz skribis 2008-05-20 17:14 (+0200):
> I know the plan is to preserve backwards compatibility, but is a default
> of Latin-1 really sane?

Yes. Perl already does this, and uses the nice property of latin1 that
the entire mapping to unicode codepoints is 0=>U+0000 .. 255=>U+00FF,
which means that upgrading this when necessary is both cheap and
predictable.
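
A small sketch of that property, using only `utf8::upgrade`, `utf8::is_utf8` and Encode's `decode`. The variable names are illustrative. Upgrading changes the internal storage but not the string as seen from Perl:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes    = "\xdf";          # one octet, UTF8 flag off
my $upgraded = $bytes;
utf8::upgrade($upgraded);       # same character, now stored as UTF-8

printf "flag before: %s, after: %s\n",
    (utf8::is_utf8($bytes)    ? 'on' : 'off'),
    (utf8::is_utf8($upgraded) ? 'on' : 'off');
printf "still equal: %s, code point U+%04X\n",
    ($bytes eq $upgraded ? 'yes' : 'no'), ord($upgraded);

# Every Latin-1 octet N decodes to code point N.
for my $n (0 .. 255) {
    my $octet = chr $n;
    my $char  = decode('iso-8859-1', $octet);
    die "mismatch at $n" unless ord($char) == $n;
}
```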
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


pagaltzis at gmx

May 20, 2008, 2:31 PM

Post #18 of 30 (399 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

* demerphq <demerphq[at]gmail.com> [2008-05-20 18:10]:
> Yes. Exactly. If you had a string that contained Unicode-Utf8
> data and you concatenated it to one containing binary it would
> be a raw byte copy resulting in another binary string.

Hmm. So concatenating a character string to an empty binary
string would yield the UTF-8-encoded binary string version of the
character string? My initial reaction to this was negative, but
after thinking about it for a while I think I like this.

It does mean that Perl would never ever mangle data.

One very visible effect would be what happens when people have
Unicode in UTF-8-encoded octet strings which they concatenate
with Unicode in character strings. Currently, what happens is
that they get a character string in which the octet string
portion is double-encoded. Yves’ proposal means concatenation
would yield an octet string which contains Unicode in UTF-8
encoding.
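
The double-encoding under current behaviour can be reproduced with a short sketch. The values are chosen for illustration; Encode's `encode` and `decode` are the real functions:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xc3\xa9";     # UTF-8 octets for U+00E9, never decoded
my $chars  = "\x{100}";      # a character string

# Current Perl: the octets are upgraded as if they were Latin-1, so the
# result has three characters: U+00C3, U+00A9, U+0100.
my $mixed = $octets . $chars;
my $wrong = encode('UTF-8', $mixed);

# Decoding at the boundary first gives the intended two characters.
my $right = encode('UTF-8', decode('UTF-8', $octets) . $chars);

printf "mixed:         %s\n", unpack('H*', $wrong);
printf "decoded first: %s\n", unpack('H*', $right);
```

The first line should show c383c2a9c480 (double-encoded), the second c3a9c480.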

I am not sure I like this insofar as it would make people even
more ignorant of charsets than they are now, because it would
hide a whole bunch of breakage where buggy code that deals in
Unicode would appear to work.

OTOH it would expose some other breakage more clearly. F.ex.,
note this consequence:

> 2008/5/20 Rafael Garcia-Suarez <rgarciasuarez[at]gmail.com>:
>> But how would that solve the problem of :
>>
>> $ bleadperl -wle '$x="\xdf";print uc $x'
>> ß
>> $ bleadperl -wle '$x="\xdf";$x.=chr(0x100);chop $x;print uc $x'
>> SS
>
> It wouldn't. The whole point of my proposal is it wouldn't change
> existing behaviour, but provide for the possibility of sane
> behaviour if you wanted it.

It wouldn’t *solve* this, but we can say what would *happen*
here (assuming a terminal that expects Latin-1):

$ hypothetiperl -wle '$x="\xdf";$x.=chr(0x100);chop $x;print uc $x'
Wide character in concatenation to $x at -e line 1.
ßÄ

Because the UTF-8 encoding of U+0100 is 0xC4 0x80, so we start
with the octet string 0xDF, then append 0xC4 0x80, then chop the
last byte, so we get 0xDF 0xC4.
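
The byte arithmetic can be checked in today's Perl by emulating the proposed rule by hand. `octet_concat` is a hypothetical stand-in for what concatenation would do under the proposal, not an existing function:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical: emulate "the octet string wins" by encoding the
# character-string operand to UTF-8 before appending it.
sub octet_concat {
    my ($octets, $chars) = @_;
    return $octets . encode('UTF-8', $chars);
}

my $x = octet_concat("\xdf", chr 0x100);   # DF C4 80
chop $x;                                   # drops the final octet 0x80
print unpack('H*', $x), "\n";              # hex dump of what is left
```

This prints dfc4, the two octets a Latin-1 terminal would show as ßÄ.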

Yes, I am making up a new warning there. I would advocate that
strongly, btw, since that would allow anyone who wants to ensure
correctness to `use warnings FATAL => 'utf8'` and basically get
encoding::warnings for free.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


davidnicol at gmail

May 20, 2008, 3:19 PM

Post #19 of 30 (399 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

On Tue, May 20, 2008 at 4:31 PM, Aristotle Pagaltzis <pagaltzis[at]gmx.de> wrote:

> Because the UTF-8 encoding of U+0100 is 0xC4 0x80, so we start
> with the octet string 0xDF, then append 0xC4 0x80, then chop the
> last byte, so we get 0xDF 0xC4.

why wouldn't chop take the last character, when operating on characters?


pagaltzis at gmx

May 20, 2008, 4:49 PM

Post #20 of 30 (397 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

* David Nicol <davidnicol[at]gmail.com> [2008-05-21 00:25]:
> On Tue, May 20, 2008 at 4:31 PM, Aristotle Pagaltzis <pagaltzis[at]gmx.de> wrote:
> > Because the UTF-8 encoding of U+0100 is 0xC4 0x80, so we
> > start with the octet string 0xDF, then append 0xC4 0x80,
> > then chop the last byte, so we get 0xDF 0xC4.
>
> why wouldn't chop take the last character, when operating on
> characters?

It would, but it’s not operating on characters in the given
example. I presumed, anyway, that `chr 0xDF` would continue
to yield an octet string of length 1.

In Yves’ scheme, any time an octet string is involved, the
operation as a whole yields an octet string also. So
concatenating `chr 0x100` to it, which is a character string,
would yield an octet string, effectively concatenating the octets
that represent the UTF-8 encoding of the character U+0100, ie.
0xC4 0x80.

This is basically the opposite of the current behaviour, where
any time a character string is involved, the operation as a
whole yields a character string also, silently upgrading any
octet strings in the process.
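
For comparison, a sketch of the current behaviour, assuming neither `use locale` nor the `unicode_strings` feature is in effect (the feature is switched off explicitly here so the result does not depend on the surrounding code):

```perl
use strict;
use warnings;
no feature 'unicode_strings';

my $octets = "\xdf";
my $joined = $octets . chr(0x100);   # silently upgraded to a character string
chop $joined;                        # removes U+0100, leaving U+00DF

printf "same contents: %s\n", ($octets eq $joined ? 'yes' : 'no');
printf "uc on octet string:     %s\n", unpack('H*', uc $octets);
printf "uc on character string: %s\n", uc $joined;
```

The two strings compare equal, yet `uc` leaves the first as 0xDF and turns the second into "SS".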

I have not decided whether I really like this, but I think I
already know I like it better than the current approach.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


rgarciasuarez at gmail

May 20, 2008, 11:33 PM

Post #21 of 30 (360 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/21 Glenn Linderman <perl[at]nevcal.com>:
> Your tagging proposal could allow the data to be fully described; however,
> all the operations that understand text semantics, would need to be upgraded
> to understand the text semantics of every possible encoding that might be
> encountered... or else it would have to do implicit conversions to Unicode,
> which would reduce the performance benefit of not en/de-coding the text data
> at the I/O boundaries... and would cause an encoding to be done there again
> on output, likely, even if it was avoided on input.
>
> together with David Nicol's "tree structured strings" proposal,
> concatenating different types could retain their original types. Then the
> operators would have to know how to switch semantics as it scanned the
> string... or again, do implicit conversions.
>
> It seems the implementation cost of enhancing the operators to operate
> without conversion would be extremely complex (hence error-prone to
> implement).
>
> It seems the implementation of the operators to operate by converting to
> Unicode only defers the cost of conversion on input, and causes it on
> output, unless the data is not manipulated using operators with text
> semantics... but if it is not manipulated using operators with text
> semantics, then it can be manipulated as binary anyway, thus also avoiding
> the cost of conversion on I/O.

I must point out SpecialCasing.txt here. If we want to enhance in some way
the uc/lc operations to take into account Unicode special casing rules,
then encoding is not sufficient, you need to provide the language as well.

> This tagging proposal reminds me of various proposals I've seen (even made
> some) over the years to attach "units" to numeric values. Then arithmetic
> could be safer. Tom's recent diatribe about metric system may not be as
> off-topic as he hoped.

But calculus operates on homogeneous quantities. Text can be multi-lingual.
How about uc($chunk_of_a_german_turkish_dictionary) ? It seems that the
rules to apply belong to the operator and not to the string.
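
A rough sketch of "the rule belongs to the operator": the caller passes the language, and the string carries no tag. `uc_in_language` is a hypothetical helper, not a core function, and it handles only the Turkish/Azeri dotted-i rule from SpecialCasing.txt:

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# Hypothetical helper: uppercase $text using the rules of $language.
sub uc_in_language {
    my ($text, $language) = @_;
    if ($language eq 'tr' or $language eq 'az') {
        # In Turkish and Azeri, "i" uppercases to U+0130 (I with dot above).
        $text =~ s/i/\x{130}/g;
    }
    return uc $text;   # default Unicode casing for everything else
}

printf "%vX\n", uc_in_language('izmir', 'tr');   # code points, dot-separated
printf "%vX\n", uc_in_language('izmir', 'de');
```

The same input string gives different results depending only on the argument handed to the operation.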

> I think that Perl should restrict itself to Unicode semantics for operations
> that require text semantics, plus, for compatibility, English semantics
> should be available. But it should not be based on storage format, but
> rather on an option, parameter, or lexical pragma.
>
> I think that would be lots simpler. If the program requires more
> complexity, OO features can be used to provide them (at some performance
> cost). But if the programmer can keep track of it, then there is less
> performance cost, and less overall complexity in the operators.


rgarciasuarez at gmail

May 21, 2008, 1:01 AM

Post #22 of 30 (359 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/21 Glenn Linderman <perl[at]nevcal.com>:
>> But calculus operates on homogeneous quantities. Text can be multi-lingual.
>> How about uc($chunk_of_a_german_turkish_dictionary) ? It seems that the
>> rules to apply belong to the operator and not to the string.
>
>
> I'm not sure what you are saying here... it seems to me that the rules apply
> to the operator, but are affected by the knowledge of the string you wish to
> operate on.
>
> And regarding calculus, I could envision doing vector and/or matrix
> operations on vectors and/or matrices that contain different units for each
> element... which is about the same thing as multi-lingual strings...

But down to the simple element calculus units should match. And to keep
things complex, a true unit system will know that when you multiply,
say, a mass by a surface and divide by the square of a duration, you get
an energy. That's why unit type systems are best kept as specialized
modules, right ? :)

> So for your uc($chunk_of_a_german_turkish_dictionary), I assume that
> $chunk_of_a_german_turkish_dictionary is a "multi-lingual string object",

Actually not what I had in mind. No size fits all. In those cases I
expect the careful programmer to use his knowledge of the format of
the string (XML, whatever) to split it in chunks that are monolingual
and thus upper-caseable separately.
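
As an illustration of that approach, here is a sketch that splits a string on a simple `<locale language="...">` wrapper and upper-cases each chunk on its own. The markup format, the `%upper_for` table and `uc_by_chunk` are all assumptions made for the example; real input would want a proper XML parser:

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# Hypothetical per-language casing rules; anything not listed uses plain uc.
my %upper_for = (
    Turkish => sub { my ($text) = @_; $text =~ s/i/\x{130}/g; return uc $text },
);

# Upper-case the body of each <locale> element according to its language.
sub uc_by_chunk {
    my ($markup) = @_;
    $markup =~ s{(<locale language="([^"]+)">)(.*?)(</locale>)}{
        my ($open, $language, $body, $close) = ($1, $2, $3, $4);
        my $upper = $upper_for{$language} || sub { uc $_[0] };
        $open . $upper->($body) . $close;
    }gse;
    return $markup;
}

my $dictionary =
      qq{<locale language="German">Sprechen Sie Deutsch?</locale>\n}
    . qq{<locale language="Turkish">T\x{fc}rk\x{e7}e biliyor musun?</locale>};

binmode STDOUT, ':encoding(UTF-8)';
print uc_by_chunk($dictionary), "\n";
```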

But then of course CPAN could hold classes that implement multi-lingual
strings :

> containing an ordered collection of text fragments represented as a Perl
> string, and that each fragment has corresponding meta data describing its
> language (and possibly dialect). I assume that uc can be and is overloaded
> by the "multi-lingual string class", and that the native Perl uc operator is
> invoked for the Unicode version of the string and including the appropriate
> pragma, option, or parameter to the operator to make it use the appropriate
> rules from SpecialCasing.txt
>
> The multi-lingual string class may understand how to parse data in some XML
> format such as
>
> <locale language="German">Sprechen Sie Deutsch?</locale>
> <locale language="Turkish">Türkçe biliyor musun?</locale>
>
> and/or other formats such as those defined by ODF, or perhaps even some
> proprietary word processor systems.
>
>
> But most programs and programmers aren't going to need to deal with that;
> they'll set the values of the lexical pragmas to the appropriate language at
> the top of their applications, and their program will appropriately handle
> the language for the default (or configured) locale. No need for
> multi-lingual string classes, overloaded string operators, or
> internally-tagged, tree-structured Perl strings. People that need them can
> make them, or (maybe someday) obtain them from CPAN. CPAN already has tools
> to parse the XML, and even some for ODF, so the job of creating the
> multi-lingual string class is partially complete.
>
> The important job for Perl-the-language is to fix the bugs with the
> presently exposed string storage format, to fix the other bugs where
> character semantics are inappropriately provided (chr, for example), provide
> features and facilities to normalize and validate Unicode for various
> purposes, and enhance the available operators with options, parameters or
> pragmas, that allow them to implement the necessary Unicode semantics for
> the specified language.
>
> If, after all those building blocks are in place, the language wants to
> implement a "multilingual string type" that has internal tags that override
> the pragmas, that might be nice, but it seems to me a poor use of CORE
> development time, when at the moment, we can't even write programs to handle
> one language at a time with proper semantics, and we can't even open and
> read all the files under Windows, because of them having non-ANSI characters
> in their names.
>
> That said, if there is something about multi-lingual strings that can't be
> done as an object, I'm all ears.


davidnicol at gmail

May 21, 2008, 10:32 AM

Post #23 of 30 (349 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

On Wed, May 21, 2008 at 11:09 AM, Glenn Linderman <perl[at]nevcal.com> wrote:
> Sure, but that solution can't be expressed by your code fragment; so I
> invented one that can :) Your solution is much more likely to be
> implemented, mine is probably more general.

Clearly what this project needs is general recognition as a valid
domain for graduate students looking for thesis topics.


rurban at x-ray

May 21, 2008, 1:31 PM

Post #24 of 30 (349 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/20 Moritz Lenz <moritz[at]casella.verplant.org>:
> demerphq wrote:
>> Make it such that the utf8 flag on means that the string contains
>> unicode codepoints encoded as utf8.
>>
>> When the utf8 flag is off an additional field in the SV would be used
>> to determine what type of string the data contained. (I guess this
>> would be a pointer to some struct or an offset into a table)
>>
>> If a string was not explicitly marked to be something else it would be
>> default assumed to be Latin-1. (null pointer or offset=0)
>
> I know the plan is to preserve backwards compatibility, but is a default
> of Latin-1 really sane?

No. As Jan already explained, this assumption is broken for Win32 filenames
(MSWin32 and cygwin).
8-bit Win32 paths are cp-1252 encoded (MS calls them ANSI - see
http://en.wikipedia.org/wiki/Windows-1252) and not Latin-1.
In fact not only filenames. All strings which come from a Win32 API
call and are not utf-8, are cp-1252 encoded and should be marked
such in the future.
All other strings are assumed to be Latin-1.
wchar_t strings are represented as utf-8, which is fine for perl and
space considerations, but slow.
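
The difference is easy to see with Encode, which ships both tables. A small sketch, with octets picked for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same octet decoded under both assumptions.
for my $octet (0x80, 0x99, 0xE9) {
    my $byte   = chr $octet;
    my $latin1 = ord(decode('iso-8859-1', $byte));
    my $cp1252 = ord(decode('cp1252',     $byte));
    printf "0x%02X  Latin-1: U+%04X  cp1252: U+%04X\n",
        $octet, $latin1, $cp1252;
}
```

The two agree on 0xE9 but not on 0x80 and 0x99, which cp1252 maps to U+20AC and U+2122.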

>> Two strings would only be legally concatenable if they were of the
>> same type, or if there existed defined conversion routines from both
>> types to Unicode. In the case of a string type mismatch both would be
>> upgraded to utf8 according to their type. An exception to this rule
>> would be a binary string type which would be concatable with anything,
>> and which would never be modified nor cause anything else to be
>> modified when concatenated with it.
>
> So you would permit concatenation of text and binary strings? What would
> the result be?
> IMHO that only makes sense if you downgrade the text string at the same
> time, and convert it into a defined character encoding. (What that is
> could be defined by a pragma or a special var). I'd also like the option
> to make that emit a warning or to simply die.

ops with multiple strings must silently upgrade any non-utf-8 arg
to utf-8 of course, unless there's an encoding pragma in effect and
1. there's one utf-8 arg, or
2. the args have not all the same encoding. Ignore "asis" which is
never converted.
With an encoding pragma, the pre-concat step would be to upgrade to utf-8,
utf-8 concat, and convert to the target encoding if possible.
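
Sketched in plain Perl, with a tagged string modelled as a hypothetical `[ $encoding, $octets ]` pair and `concat_tagged` as a made-up helper. Falling back to UTF-8 when the target cannot represent the result is an assumption about what "if possible" would mean:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical: concatenate tagged strings via Unicode, then convert to
# $target; fall back to UTF-8 if $target cannot represent the result.
sub concat_tagged {
    my ($target, @parts) = @_;
    my $text = '';
    for my $part (@parts) {
        my ($encoding, $octets) = @$part;
        $text .= decode($encoding, $octets);     # upgrade step
    }
    my $converted = eval {
        encode($target, $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $converted
        ? [ $target, $converted ]
        : [ 'UTF-8', encode('UTF-8', $text) ];
}

my @parts = ([ 'iso-8859-1', "\xe9" ], [ 'cp1252', "\x80" ]);
for my $target (qw(cp1252 iso-8859-1)) {
    my ($encoding, $octets) = @{ concat_tagged($target, @parts) };
    printf "target %-10s -> %-6s %s\n",
        $target, $encoding, unpack('H*', $octets);
}
```

With a cp1252 target the result fits; with a Latin-1 target the euro sign does not, so the sketch keeps UTF-8.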

>> WRT Win32:
>> All strings would be forced to unicode* and the widecharacter apis
>> would be used (possibly unless the string was of type ANSI or the
>> string was of type Binary in which case the 8 bit apis would be used).

For cygwin-1.7 I intend to use the W API only if
1. a string argument is utf-8, or
2. the result could be wide. Which is true for all file and dir functions.

A wide result will be encoded as utf-8 only if the result is really not
representable in the current encoding, because utf-8 string operations
are slow. I would rather use W calls and convert down to the current
encoding once than bother the perl ops with utf8 args.

For non-utf-8 strings as arguments and no chance to return a wide string,
I simply use the A api for speed.

The libwin32 libs and Win32::GUI et al. must be rewritten to support
the W api resp. utf-8 strings also.

> All strings? so no handling of binary data in windows anymore? I assume
> you meant something else; could you please clarify what exactly you meant?

Only strings will have a defined encoding; binary data will not.
Binary data should be encoded asis, without any charset conversion.

Problem:
Which ops return "binary data", i.e. strings without an
encoding (a string encoding called "asis"; in clisp it is called "1:1"),
and which return only the default encoding (empty encoding slot)?

This is important for the separation of binary data / strings:
Which strings should be converted on scoped encoding pragmas,
which not?
--
Reini Urban
http://phpwiki.org/ http://murbreak.at/


rurban at x-ray

May 21, 2008, 1:40 PM

Post #25 of 30 (349 views)
Permalink
Re: On the problem of strings and binary data in Perl. [In reply to]

2008/5/21 Rafael Garcia-Suarez <rgarciasuarez[at]gmail.com>:
> I must point out SpecialCasing.txt here. If we want to enhance in some way
> the uc/lc operations to take into account Unicode special casing rules,
> then encoding is not sufficient, you need to provide the language as well.

Such language cases are usually not stored per variable.
The current language is usually stored only globally (Normally just $ENV{LANG})
and might be overridden per scope with a lexical pragma if really needed.
For perl I would suggest to stay with $ENV{LANG}.
Even Lisp does use a global and uses special helper libs or a :language
optional arg for this special case.
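
A minimal sketch of such a global with a scoped override. `language_from_locale`, `current_language` and `$LANGUAGE` are hypothetical names, the 'en' fallback is an assumption, and `local` gives dynamic rather than truly lexical scoping, so this only approximates a pragma:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the language code out of a POSIX-style locale
# name such as "tr_TR.UTF-8"; returns nothing for "C", "POSIX" or garbage.
sub language_from_locale {
    my ($locale) = @_;
    return unless defined $locale;
    return $locale =~ /\A([a-z]{2,3})(?=[_.\@]|\z)/ ? $1 : ();
}

# Process-wide default, read from the environment once at startup.
our $LANGUAGE = language_from_locale($ENV{LANG}) // 'en';

sub current_language { return $LANGUAGE }

{
    # Scoped override; local() is dynamic, not lexical, scoping.
    local $LANGUAGE = 'tr';
    print "inside block:  ", current_language(), "\n";
}
print "outside block: ", current_language(), "\n";
```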

>> I think that Perl should restrict itself to Unicode semantics for operations
>> that require text semantics, plus, for compatibility, English semantics
>> should be available. But it should not be based on storage format, but
>> rather on an option, parameter, or lexical pragma.
>>
>> I think that would be lots simpler. If the program requires more
>> complexity, OO features can be used to provide them (at some performance
>> cost). But if the programmer can keep track of it, then there is less
>> performance cost, and less overall complexity in the operators.
--
Reini Urban
http://phpwiki.org/ http://murbreak.at/
