
Mailing List Archive: Perl: porters

on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

 

 



schmorp at schmorp

May 19, 2008, 1:03 PM

Post #1 of 19 (306 views)
on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

On Mon, May 19, 2008 at 11:45:49AM -0700, Jan Dubois <jand[at]activestate.com> wrote:
> point out problems in the actual implementation clueless. Are you
> just trolling?

Also, from past discussions, you should know that I know what I am talking
about when talking about unicode in perl.

I am trying to point out real issues and educate people on what is the right
way to tackle problems.

Note that you completely ignored the *real* issues I brought up.

Asking me whether I troll is so endlessly stupid and insulting. What are *you*
doing to fix the unicode problems in perl? You butt in with totally idiotic
plans based on totally wrong assumptions of the perl core string handling.

Go and do something useful instead, even commenting on the issues I bring
up would be more useful than showing off your lack of knowledge regarding
perl internals (and the language).

This is of course symptomatic for perl5-porters regarding unicode handling.
Note how difficult it was for me to get a simple bugfix w.r.t. unpack into
the core (in the meantime, unpack "H*" has also been fixed - very nice).

It took me ages to explain why it's a bug to those people who simply lack
the experience regarding string handling in perl (w.r.t. wide chars).

I simply don't have the stamina to explain it again and again. Just research
a bit :(

If it is so extremely hard to get even simple bugfixes into perl, how hard is
it to get more complicated fixes in (such as the Win32 module)?

I think this is a very bad attitude.

In case you didn't notice, perl has an extremely bad reputation for unicode
handling: most users fear unicode, because it is so complicated in perl.

The reason it is so complicated is because there are so many bugs, and it
takes insanely long discussions to fix those bugs.

If you are in disagreement with me (and also sarathy, which, as I found
out, has exactly the same model in mind as me, or actually vice versa, as he
is the principal architect afaics), then perl5-porters should, as quickly as
possible, find out how they want to implement unicode.

Having part of the codebase assuming that the utf-8 flag means it is utf-8
encoded and not having it set means ANSI/locale/latin1/random garbage, and
the perl core (the other part of the codebase) assuming this is just an
internal flag, as originally designed, will kill perl in the long run.

There is the "correct" model, where encoding is attached to operations
(because in most cases it already is, and perl cannot change this, despite
what garbage perlunicodeintro claims), and the utf-8 flag is only used to
change the internal interpretation of the codepoint encoding.

And there is the "wrong" model, where perl silently upgrades data at
undocumented points and also corrupts your string data while at it
(because a "ü" character might suddenly become a "ѿ") and the user has
to track these undocumented encoding changes. I call this quite openly
"wrong" because it is insanely complicated.

Currently most of the perl core implements the "correct" model.

Regarding the perlunicode manpage, it is basically a helpless case. For
example, it says:

In earlier releases the "utf8" pragma was used to declare that
operations in the current block or file would be Unicode-aware. This
model was found to be wrong, or at least clumsy: the "Unicodeness" is
now carried with the data, instead of being attached to the operations.

This is completely untrue: in earlier releases, "use utf8/use bytes" switched
between interpreting the strings as utf-8 vs. bytes, and did nothing about
unicode-awareness.

Unicodeness is *not* carried with the data currently, as the manpage
wrongly claims, and that is absolutely the correct way.

Encoding *is* a question of operations. Of course, not all operations are
equal: open on unix for example enforces interpretation of the string as
locale-dependent, regardless of whether the data is "unicode" or not: the encoding is
tied to the operation, inherently. The perlunicode manpage is wrong.

I work with users daily, and I lecture people about unicode in perl a lot.
And having *bad* documentation that clashes with the *implementation* is bad.

Perl currently implements a model where encoding is *not* attached to perl
scalars, and neither is *unicodeness* attached to perl scalars.

The fact that some people and some manpages claim otherwise is the source of
the confusion.

Now, even implementing the "wrong" model, where the encoding of a string
changes in undocumented ways during the lifetime of a program would be an
advantage, if it was done fully.

But it isn't: neither the correct nor the wrong model is implemented, the
wrong model isn't, because it is basically unimplementable, and the correct
model isn't because nobody cares enough, and people actively disagree with
it. Leading to broken XS modules and worse.

This is the problem with perl and unicode: it is buggy no matter how you
put it, because some parts use different models than others.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg[at]goof.com
-=====/_/_//_/\_,_/ /_/\_\


tchrist at perl

May 19, 2008, 5:54 PM

Post #2 of 19 (296 views)
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

On "Mon, 19 May 2008 22:03:29 +0200.", Marc Lehmann <schmorp[at]schmorp.de>
in <20080519200329.GA28949[at]schmorp.de> flamed:

> In case you didn't notice, perl has an extremely bad reputation
> for Unicode handling: most users fear Unicode, because it is so
> complicated in perl.

Hm.

> The reason it is so complicated is because there are so many bugs, and
> it takes insanely long discussions to fix those bugs.

"Insanely long"? I'll charitably take that as rhetorical hyperbole rather
than some statement of your mental health--and invite correction should the
contrary case apply.

I shouldn't say that *that* is at all the common concern of those who "fear"
using Unicode. Perhaps for XS writers it may be, but my experience has
been that users have other concerns than what you've just stated. Instead,
one or more of these four problems dominate, listed from highest to lowest,
but all are important.

(0) Knowing *what* all Perl documentation they should be reading about
Unicode, doing that reading, and then making the last bit of sense
out of what they've just read.

Why? Because they can't tell what is and is not important, or even
applicable. Part of this problem is the density of information as
presented, part derives from the important information unevenly
scattered across many documents, and, loth though I am to say this,
there may also be a language-barrier creating interference between
reader and writer.

(1) Understanding I/O Layers and encodings:

eg: encoding vs Encode, ::via::, binmode(), -C,
envariables, triadic open, etc

(2) The troubles of getting Unicodish action on codepoints
in the U+0080 .. U+00FF range.

eg: % perl -E 'say chr(0xdf)'
ß
% perl -E 'say ucfirst chr(0xdf)'
ß
% perl -E 'use utf8; say ucfirst chr(0xdf)'
ß
% perl -E 'use encoding "latin1"; say ucfirst chr(0xdf)'
Ss

(3) Difficulties comparing and matching Unicode data that
hasn't been first laundered, er, normalized, into a
standard canonical form, generally due to combining
characters, ordering, and pre-combined characters.
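A minimal sketch of the comparison problem in item (3), assuming the core Unicode::Normalize module is available. The variable names are illustrative only and do not come from the thread. The two strings render identically but differ codepoint by codepoint until both are normalized to the same form.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of the same visible character:
# one precomposed codepoint versus "e" followed by a combining acute accent.
my $precomposed = "\x{e9}";
my $decomposed  = "e\x{301}";

# A plain string comparison works on codepoints, so these differ.
my $raw_equal = ($precomposed eq $decomposed) ? 1 : 0;

# After normalizing both sides to the same form, they compare equal.
my $nfc_equal = (NFC($precomposed) eq NFC($decomposed)) ? 1 : 0;
my $nfd_equal = (NFD($precomposed) eq NFD($decomposed)) ? 1 : 0;

print "raw=$raw_equal nfc=$nfc_equal nfd=$nfd_equal\n";
```

Which form to normalize to is a design choice; the point is only that both operands go through the same one before eq, sorting, or matching.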

Issues I don't number as lying within the provenance of legit Perl-
related Unicode troubles, but which certainly do occur, include:

* font troubles
* troubles with native system support (LC_* envars)
* problems finding, learning, and using Unicode-aware
editors & tools
* confusing I18N issues with those of G11N

Some fears may come from these, but there's not much we can
do about almost any of them.

Most frequent of all is that they've simply never been
consciously exposed to Unicode, whether at all or whether in a
sufficiently intimate fashion as to get their fingers dirty.

Once these things we can't do anything about (native system
stuff) are discounted, almost always I find any Unicode phobia is
because they are at least one, and usually all, of these things:

* monoglot speakers of English alone

* not usually all that educated in the Humanities
and/or of limited travel experience

* overly accustomed to the very impoverished character
repertoire found in 7-bit ASCII codes, mistakenly
believing it sufficient for writing even English correctly

Such people therefore feel no need to "bother" learning about
Unicode, whether in Perl or anywhere. So *their* fears, if any,
might be more closely related to ethnocentrism, xenophobia, or
neophobia than to what you've described or which I've myself
encountered.

Anybody who fears Unicode because of Perl's internals is clearly
in the minority of an already-minority set. I mean, come on,
many people fear Perl's externals, and most fear its internals--
and that's not even letting Unicode into the picture yet.

And while there *are* people troubled by UTF8 != "utf-8", I
suspect there to be next to none such outside this list--and
even within it, still very few.

> Having part of the codebase assuming that the utf-8 flag means
> it is utf-8 encoded and not having it set means
> ANSI/locale/latin1/random garbage, and the perl core (the other
> part of the codebase) assuming this is just an internal flag,
> as originally designed, will kill perl in the long run.

The end of the world is near, eh?

> Regarding the perlunicode manpage, it is basically a
> helpless case.

"Helpless"?

More rhetorical hyperbole, I presume, since if it is incorrectly
worded, the road to helping it is obvious: just send patches.

> For example, it says:

> In earlier releases the "utf8" pragma was used to declare
> that operations in the current block or file would be Unicode-
> aware. This model was found to be wrong, or at least clumsy:
> the "Unicodeness" is now carried with the data, instead of
> being attached to the operations.

> This is completely untrue: in earlier releases, "use utf8/use
> bytes" switched between interpreting the strings as utf-8 vs.
> bytes, and did nothing about Unicode-awareness.

*That*, I think, is more a matter of casuistry than of correctness.

> Unicodeness is *not* carried with the data currently, as the
> manpage wrongly claims, and that is absolutely the correct way.

Hm...

> Perl currently implements a model where encoding is *not*
> attached to perl scalars,

Right: it's supposed to be attached to the I/O Layer alone.

> and neither is *unicodeness* attached to perl scalars.

And now you've lost me.

If that is universally true, then might you gently explain how
an SV set to chr(500) always has its UTF8 flag turned on?

DB<1> p $]
5.010000
DB<2> use Devel::Peek
DB<3> $x = "string"
DB<4> p Dump ($x)
SV = PV(0x3c267378) at 0x3c367680
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x3c022e10 "string"\0
CUR = 6
LEN = 8
DB<5> $y = chr(500)
DB<6> p Dump($y)
SV = PV(0x3c2f0aa0) at 0x3c367ac0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x3c03d180 "\307\264"\0 [UTF8 "\x{1f4}"]
CUR = 2
LEN = 4

Armed with that output, it seems to me that if you are correct,
then UTF8 is not "unicodeness", but that if it is, then you are
not correct. And I am unable to discern which of those two here
obtains: either, both, or neither, nor in what degree. In both
scenarios, the issue remains unclear, or you do, or I am--and
quite possibly more than one of these may apply.

So let's fix that, shall we?

I politely request that you kindly explain three things:

* Start by explaining just what it is that you are calling unicodeness.

* Now explain the presence and purpose of the UTF8 flag on an SV.

* Finally, please demonstrate, especially in light of my Dump
above and the two answers you just gave, how your last-quoted
statement can in its second half be deemed all of reasonable,
accurate, and correct.

Thank you.

-tom
--

"Those who know more than me will correct me if I'm wrong.
Those who know less than me will correct me if I'm right."


schmorp at schmorp

May 19, 2008, 9:56 PM

Post #3 of 19 (286 views)
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

On Mon, May 19, 2008 at 06:54:39PM -0600, Tom Christiansen <tchrist[at]perl.com> wrote:
[Thanks for summarising all the possible fears :]

> > part of the codebase) assuming this is just an internal flag,
> > as originally designed, will kill perl in the long run.
>
> The end of the world is near, eh?

I meet a lot of people who would like to use unicode in perl, but fail to do
so because they run into the problems mentioned and claim it should be much
easier (yes, it should, and certainly less random).

But almost all of the issues they run into, *iff* they really want to use
unicode and are open to learning a bit about it first originate in the utf-8
flag that they cannot see in their perl sources yet that affects so many
things.

Basically, if you don't have it set right, stuff breaks everywhere, whether
it's perl core functions or xs modules, but the brakage is not universal
(if it were, it would simply be a different model, the problem is the
inconsistency).

> > Regarding the perlunicode manpage, it is basically a
> > helpless case.
>
> "Helpless"?
>
> More rhetorical hyperbole, I presume, since if it is incorrectly
> worded, the road to helping it is obvious: just send patches.

I certainly won't send patches if people tell me before I submit them that
the current manpage is correct. I can waste my time in better ways.

I would probably submit patches if the process to do so would be easier, and
the first step would be an agreement of the existing perl5-porters on how
strings are to be interpreted.

Note I am not asking for agreement on how it *should* be done or what
would be better, but an agreement on which semantics will be acceptable
and which are not.

> > This is completely untrue: in earlier releases, "use utf8/use
> > bytes" switched between interpreting the strings as utf-8 vs.
> > bytes, and did nothing about Unicode-awareness.
>
> *That*, I think, is more a matter of casuistry than of correctness.

Maybe, but I still expect *one* manpage to be consistent with itself - if it
defines operation as one thing and contrasts it with some other meaning of
operation then they better should be the same thing - if you compare, then
apples to apples and oranges to oranges.

> > Perl currently implements a model where encoding is *not*
> > attached to perl scalars,
>
> Right: it's supposed to be attached to the I/O Layer alone.

More correctly to the interfaces communicating with the outside world,
which includes other things as well (for example filenames, or XS
modules).

But the basic theme is indeed "I/O" here - one way to treat characters is
to encode/decode them when they leave/enter perl and use unicode semantics
within.

> > and neither is *unicodeness* attached to perl scalars.
>
> And now you've lost me.
>
> If that is universally true, then might you gently explain how
> an SV set to chr(500) always has its UTF8 flag turned on?

Easy, it is the only way for perl to internally represent characters with a
value of 500.

If that 500 is for example the second character in v5.500, then this might
simply be the perl 5.500 version string (or part of an ip address in the
game "uplink" stored in a compact string form, e.g. v478.321.571.277).

No unicode anywhere in sight.

Note also that I can have the unicode character 32 stored with or without the
UTF8 flag, which doesn't change the fact that it is still the unicode
character 32.
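The claim about character 32 can be checked with a short sketch using the built-in utf8::upgrade and utf8::is_utf8 functions. The variable names are illustrative. The flag changes, while the character value and string equality do not.

```perl
use strict;
use warnings;

my $plain    = chr(32);
my $upgraded = chr(32);
utf8::upgrade($upgraded);    # changes the internal storage only

# Same character as far as Perl-level string semantics are concerned.
my $same_string = ($plain eq $upgraded) ? 1 : 0;
my $same_ord    = (ord($plain) == ord($upgraded)) ? 1 : 0;

# Only the internal flag differs.
my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

print "eq=$same_string ord=$same_ord flags=$flag_plain/$flag_upgraded\n";
```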

Summary: the UTF8 flag says very little about whether the string contains
unicode or not. However, when I do this:

my $s = v5.500;
$s =~ /ü/;

Then one could expect that $s indeed contains a unicode string, because =~
forces the interpretation of the string to "characters" and in this very case
to "unicode".

Of course, this gets you in trouble:

my $s = chr 200; # not unicode, but native 8-bit(??)
substr $s, 0, 0, chr 500;
$s =~ /ü/; # now interpreted as unicode

This is the insane part - I wouldn't expect even an expert perl programmer
to predict how $s gets interpreted here.
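A hedged sketch of why this is hard to predict, using \w instead of a literal "ü" so that the source-file encoding plays no role. It assumes a perl where neither the unicode_strings feature nor a locale is in effect; under those assumptions the first match is expected to fail and the second to succeed, even though the character tested is codepoint 200 both times. The names $word_before, $word_after and $tail are illustrative.

```perl
use strict;
use warnings;

my $s = chr 200;                            # one character, stored as one octet
my $word_before = ($s =~ /\w/) ? 1 : 0;     # matched under native 8-bit rules

substr $s, 0, 0, chr 500;                   # prepending a wide character upgrades $s
my $tail = substr $s, 1, 1;                 # still codepoint 200, now flagged UTF8
my $word_after = ($tail =~ /\w/) ? 1 : 0;   # matched under Unicode rules

printf "codepoint %d: \\w before=%d after=%d\n",
    ord($tail), $word_before, $word_after;
```

Nothing in the source text of the program shows that the second match uses different rules; only the history of the scalar does.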

So, my "end of the world", put in a more verbose and less drastic way, would be
"as long as perl has these totally unpredictable rules on character
interpretation it will not gain wide acceptance for unicode usage".

> Armed with that output, it seems to me that if you are correct,
> then UTF8 is not "unicodeness", but that if it is, then you are
> not correct.

I think the examples above made it clear that the UTF8 flag is not
"unicodeness".

A different example you might like even more :) is this: mark some filehandle
as utf-8 encoded, then print downgraded and upgraded data to it. In both
cases it will be interpreted as unicode, so the utf-8 flag again is no
indicator for "unicodeness".
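The filehandle example can be sketched with an in-memory file and an :encoding(UTF-8) layer. The helper name bytes_written is hypothetical. Both the downgraded and the upgraded copy of the same character produce identical octets on output.

```perl
use strict;
use warnings;

# Hypothetical helper: print a string through a UTF-8 encoding layer
# into an in-memory file and return the octets that were written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer
        or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory file: $!";
    return $buffer;
}

my $downgraded = "\xfc";       # stored as a single octet, flag off
my $upgraded   = "\xfc";
utf8::upgrade($upgraded);      # same character, flag on

my $out_down = bytes_written($downgraded);
my $out_up   = bytes_written($upgraded);

printf "identical output: %s\n", ($out_down eq $out_up ? "yes" : "no");
```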

> I politely request that you kindly explain three things:
>
> * Start by explaining just what it is that you are calling unicodeness.

Not sure what you mean - unicodeness is the state of being unicode.

More explicitly, a perl string contains unicode characters when it contains
unicode characters - the interpreter itself does not know this currently, nor
do I see a way for it to do so except by forcing the user to make this
explicit.

As for operations, "unicodeness" would be how certain operations would
interpret character data.

For example, "open" (at least on unix) cannot support unicodeness, because
the system interfaces do not allow for unicode to be used - you have to
encode it, and it is not clear which encoding is the right one.

So open would be an operation that enforces octet semantics, because that's
what the system interface relies on.

A clearer example would be crypt: crypt has to force octet semantics
because the C interface is only defined in terms of octets (i.e. the
salt).

regex matching would also by default (maybe) apply unicode semantics, but it
would be somewhat important to be able to apply locale interpretation to it.

> * Now explain the presence and purpose of the UTF8 flag on an SV.

The UTF8 flag on an SV specifies whether the codepoints stored in the
string are stored in octet form (only possible if all are <256 of course)
or in perl's variant of utf-8 (which is very similar, but not the same, as
the utf-8 defined by the unicode consortium for example).

To put it differently, the UTF8 flag only states how the character values are
encoded internally. It does not say anything about whether the string contains
unicode data or not.

(This is what is mostly implemented right now in perl, regexes are the
notable exception).

> * Finally, please demonstrate, especially in light of my Dump
> above and the two answers you just gave, how your last-quoted
> statement can in its second half be deemed all of reasonable,
> accurate, and correct.

I think I demonstrated that in my examples already. It is me, the
programmer, who defines what string contains unicode or not.

And in a lot of important cases, this unicodeness of a string only has a very
superficial correspondence to the UTF8 flag - they are not totally
independent. For example if all my unicode data happens to consist of
latin1 characters that I store as such, then perl *might*, using wondrous
optimisations I don't care about as long as it is fast, never hit a string
with the UTF8 flag set.

On the other hand, if I store my unicode data directly as codepoints, and
some of those happen to be >255, then the corresponding scalar will have the
UTF8 flag set. The converse is not true, however, not every scalar having the
UTF8-flag set contains unicode characters.

And lastly, if I store my unicode data in utf-8, then it would still be
unicode data. It would be reasonable to call a string containing utf-8 data
a unicode string (encoded, however), in which case, again, the UTF8 flag
could be set or not (usually not), which has nothing to say about the
unicodeness of the data stored in the string.

> Thank you.

Nice to have you around again.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg[at]goof.com
-=====/_/_//_/\_,_/ /_/\_\


john-perl at o-rourke

May 19, 2008, 10:39 PM

Post #4 of 19 (286 views)
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

Marc Lehmann wrote:
>
>>> part of the codebase) assuming this is just an internal flag,
>>> as originally designed, will kill perl in the long run.
>>>
>> The end of the world is near, eh?
>>
>
> I meet a lot of people who would like to use unicode in perl, but fail to do
> so because they run into the problems mentioned and claim it should be much
> easier (yes, it should, and certainly less random).
>

</lurk>
Just to throw in some more user experience - I'm very happy with perl's
utf8 implementation. I implemented utf8 handling in a large mod_perl
application last year. I knew nothing about unicode beforehand, read
up, pored over the relevant man pages and was very happy with the results.

The worst that happens is my code warns about wide characters or invalid
utf8 in input files, but the default settings handle everything quite
gracefully.

Perhaps it's 20 years of coding experience but I didn't find the utf8
flag difficult to work with - I've never had to touch a Devel::* module
and I couldn't tell you what a dump of an SV looks like - all I use for
utf8 manipulation is:

use utf8; (I ignore the advice on the man page and use it everywhere, so
constant strings have the flag set)
encode_utf8, decode_utf8 (allowing me to force the flag either way and
detect failures)
open($fh, "<:utf8", $file)
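A minimal sketch of the decode-on-input, encode-on-output pattern described above, assuming the core Encode module. It uses the general decode/encode functions with a check argument instead of the decode_utf8/encode_utf8 shortcuts; the helper name decode_strict is hypothetical. Invalid input is detected by asking Encode to croak and catching the failure.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return decoded characters, or undef if the
# octets are not valid UTF-8. LEAVE_SRC keeps the input untouched.
sub decode_strict {
    my ($octets) = @_;
    return eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
}

my $octets = "\xc3\xbc";                  # two octets from the outside world
my $chars  = decode_strict($octets);      # one character inside perl
my $bad    = decode_strict("\xff");       # not valid UTF-8, so undef
my $round  = encode('UTF-8', $chars);     # back to octets for output

printf "chars=%d bad=%s roundtrip=%s\n",
    length($chars),
    (defined $bad ? "decoded" : "rejected"),
    ($round eq $octets ? "ok" : "changed");
```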

The only thing I'd consider a 'gotcha' is that for an easy life you
either implement utf8 across the whole application, or you don't
implement it - but that's common sense really.

perl++

John


tchrist at perl

May 19, 2008, 11:07 PM

Post #5 of 19 (286 views)
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

In his epistle of "Tue, 20 May 2008 06:56:10 +0200."
<20080520045610.GB16896[at]schmorp.de>
Marc Lehmann <schmorp[at]schmorp.de> graciously explained:

> On Mon, May 19, 2008 at 06:54:39PM -0600,
> Tom Christiansen <tchrist[at]perl.com> wrote:

> [Thanks for summarising all the possible fears :]

Oh, there might be more, you know. Haven't thought much on it.
Those were just the ones that came to mind, both the relevant
and the ir-.

>>> part of the codebase) assuming this is just an internal flag,
>>> as originally designed, will kill perl in the long run.

>> The end of the world is near, eh?

> I meet a lot of people who would like to use Unicode in perl, but fail
> to do so because they run into the problems mentioned and claim it
> should be much easier (yes, it should, and certainly less random).

> But almost all of the issues they run into, *iff* they really want to
> use Unicode and are open to learning a bit about it first originate in
> the utf-8 flag that they cannot see in their perl sources yet that
> affects so many things.

I see you've been talking with Phil Harvey again. :-)

> Basically, if you don't have it set right, stuff breaks everywhere,
> whether it's perl core functions or XS modules, but the brakage [.SIC:
> probably meant to read "breakage" unless one's foot hits the brakes
> instead of the accelerator --tchrist] is not universal (if it were, it
> would simply be a different model; the problem is the inconsistency).

But is it a foolish one for little minds to worry about, or a great
one for bigger minds to mull over?

I believe that Phil, for example, due perhaps to such things as you
allude to, tries quite hard to be Unicode-agnostic. By that I mean he
insistently uses byte-interfaces only, even though he sometimes has to
encode or decode byte-data into Unicodepoints.

That said, I always feel there's something *WRONG* if I find
myself having to resort to Encode's encode or decode functions.
Can't quite say why.

>>> Regarding the perlunicode manpage, it is basically a helpless case.

>> "Helpless"?

>> More rhetorical hyperbole, I presume, since if it is incorrectly
>> worded, the road to helping it is obvious: just send patches.

> I certainly won't send patches if people tell me before I submit
> them that the current manpage is correct. I can waste my time in
> better ways.

Wise lesson, that. However, it never stopped me from doing so.
It's like how all change to the world comes from unreasonable people.

> I would probably submit patches if the process to do so would be
> easier, and the first step would be an agreement of the existing perl5-
> porters on how strings are to be interpreted.

That might require further instruction so that we can all be on the
same, um, code page.

> Note I am not asking for agreement on how it *should* be done or what
> would be better, but an agreement on which semantics will be acceptable
> and which are not.

>>> This is completely untrue: in earlier releases, "use utf8/use
>>> bytes" switched between interpreting the strings as utf-8 vs.
>>> bytes, and did nothing about Unicode-awareness.

>> *That*, I think, is more a matter of casuistry than of correctness.

> Maybe, but I still expect *one* manpage to be consistent to itself - if it
> defines operation as one thing and contrasts it with some other meaning of
> operation then they better should be the same thing - if you compare, then
> apples to apples and oranges to oranges.

First, those are hardly category errors. After all, would you not
agree that:

* Both are fruit.
* Both are juicy.
* Both are often served juiced with breakfast.
* Both start green and then usually fall somewhere into red-orange
area of the visible spectrum.
* Both are usually of a similar size.
* Both are of topologically equivalent shape

So for more punch, you might sometimes consider trying the alternate
aphorism of comparing apples with avarice, or oranges with oratory.

Just an idea. :-)

>>> Perl currently implements a model where encoding is *not*
>>> attached to perl scalars,

Second, I am in need of deeper understanding, or more sleep,
to see how my statement regarding casuistry does not apply.

>> Right: it's supposed to be attached to the I/O Layer alone.

> More correctly to the interfaces communicating with the outside world,
> which includes other things as well (for example filenames, or XS
> modules).

Well...

At one level, nearly all meaningful communication "with the outside
world" falls within the category of being I/O, with signals and exit
status being the most common exceptions. And timing attacks don't
count. :-)

But I still think that you are asking a lot if you want to make the
claim that filenames as used to access the system's underlying files
VIA ITS OWN INTERFACES are data rather than metadata. And I don't think
that filesystem metadata is reliably treated as anything but bytes, at
least on systems with which I am conversant.

Sure, their contents are certainly data, but even that has its limits.
The restrictions under the BUGS section of the perl(1) manpage still
apply: you are for the most part at your system's mercy. If it provides
byte-access seeks only, not variable-width utf-8 encoded positions, you
can't do much about that. Well, not much that I'd care to do, at least.

> But the basic theme is indeed "I/O" here - one way to treat characters
> is to encode/decode them when they leave/enter perl and use Unicode
> semantics within.

That sounds sane.

>>> and neither is *Unicodeness* attached to perl scalars.

>> And now you've lost me.

>> If that is universally true, then might you gently explain how
>> an SV set to chr(500) always has its UTF8 flag turned on?

> Easy, it is the only way for perl to internally represent characters
> with a value of 500.

Of course.

> If that 500 is for example the second character in v5.500, then this
> might simply be the perl 5.500 version string (or part of an ip
> address in the game "uplink" stored in a compact string form, e.g.
> v478.321.571.277).

I'm a little bit queasy about v-strings, thank you very much.

> No Unicode anywhere in sight.

Did I say there was?

> Note also that I can have the Unicode character 32 stored with or
> without the UTF8 flag, which doesn't change the fact that it is
> still the Unicode character 32.

Oh, now that I'm not sure I agree with. But I fear we may be back
to casuistry again.

> This is the insane part - I wouldn't expect even an expert perl
> programmer to predict how $s gets interpreted here.

No, neither would I.

> So, my "end of the world" in a more verbose way and less drastic,
> would be "as long as perl has these totally unpredictable rules on
> character interpretation it will not gain wide acceptance for
> Unicode usage".

>> Armed with that output, it seems to me that if you are correct,
>> then UTF8 is not "Unicodeness", but that if it is, then you are
>> not correct.

> I think the examples above made it clear that the UTF8
> flag is not "Unicodeness".

I think I'd better sign off. Perhaps sleep will make your statement
obviously true to me. It isn't now.

>> * Start by explaining just what it is that you are calling
>> Unicodeness.

> Not sure what you mean - Unicodeness is the state of being Unicode.

That's either a trivial tautology of no significance, or something
deeper than I can now fathom.

> More explicitly, a perl string contains Unicode characters when it
> contains Unicode characters

Now *THAT* belongs in the formerly mentioned set.

> - the interpreter itself does not know this currently, nor do I
> see a way for it to do so except by forcing the user to make
> this explicit.

Ever read in a string, or grabbed something from @ARGV or %ENV, that
you had to do this to:

$num = oct($num) if $num =~ /^0/;

And if you did, did this "bother" you?

> For example, "open" (at least on Unix) cannot support Unicodeness,
> because the system interfaces do not allow for Unicode to be used -
> you have to encode it, and it is not clear which encoding is the
> right one.

> So open would be an operation that enforces octet semantics, because
> that's what the system interface relies on.

Well, there you have it then, don't you?

Good night.

/* HIC JACENT VERBA DELETA */

>> Thank you.

> Nice to have you around again.

Oh sure, *now* you say that. Just wait. :-)

Anyway, it's SUMMER, and this is a fluke; I shouldn't even be here.

--tom


ben at morrow

May 20, 2008, 2:48 AM

Post #6 of 19 (284 views)
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

Quoth tchrist[at]perl.com (Tom Christiansen):
>
> But I still think that you are asking a lot if you want to make the
> claim that filenames as used to access the system's underlying files
> VIA ITS OWN INTERFACES are data rather than metadata. And I don't think
> that filesystem metadata is reliably treated as anything but bytes, at
> least on systems with which I am conversant.

But this is exactly where the thread started... Win32 (and, IIUC, other
systems such as VMS) don't treat filenames as (sequences of) bytes, but
as sequences of Unicode characters. Win32 at least also has two sets of
APIs: one takes parameters in some currently-selected encoding and
converts to Unicode for you (the 'ANSI' API), and one which takes
arguments in Unicode (the 'Unicode' API).

This leaves three possibilities. At the moment (I think), all IO happens
through the ANSI API, which leaves Perl in the unfortunate position of
being unable to open files with names that don't fit in the current
character set.

I believe what Jan was suggesting (please correct me if I've
misunderstood) was

- filenames which are !SvUTF8 use the ANSI API,
- filenames which are SvUTF8 use the Unicode API.

However, since this would mean that

my $fn = "\xe0";
open my $F, '<', $fn;

and

my $fn = substr "\xe0\x{100}", 0, 1;
open my $F, '<', $fn;

potentially opened different files, Perl's auto-upgrading on Win32 would
also have to be changed to use the current ANSI encoding instead of
ISO8859-1. My complaint with this is that it means that when a string is
upgraded, the values reported by 'chr' (for instance) will mysteriously
and silently change.
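
To make the two cases above concrete, here is a minimal sketch (the variable names are illustrative, not from the proposal). It assumes only the core utf8::is_utf8 function, and shows that the two filename strings compare equal at the Perl level while differing in their internal flag, which is exactly what a flag-based API choice would key on:

```perl
use strict;
use warnings;

# Two strings holding the same single character, chr(0xE0).
my $native   = "\xe0";                       # stored as one octet
my $upgraded = substr "\xe0\x{100}", 0, 1;   # stored internally as UTF-8

# Perl-level semantics are identical ...
print "equal\n" if $native eq $upgraded;
printf "ord: %d %d\n", ord($native), ord($upgraded);

# ... only the internal representation flag differs.
printf "is_utf8: %d %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
```

If open() chose between the ANSI and Unicode APIs based on that flag alone, these two equal strings could name different files.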

The potential alternative I was proposing was that all filenames should
be upgraded to SvUTF8 (using ISO8859-1, as currently) and then passed to
the Unicode API. This has the advantage of maintaining current in-Perl
string semantics, and the disadvantage of breaking all Win32 programs
that currently use non-ASCII filenames.

I don't think there's any way forward without breaking *something*. The
question is what will cause least damage.

Ben

--
Like all men in Babylon I have been a proconsul; like all, a slave ... During
one lunar year, I have been declared invisible; I shrieked and was not heard,
I stole my bread and was not decapitated.
~ ben[at]morrow.me.uk ~ Jorge Luis Borges, 'The Babylon Lottery'


john.peacock at havurah-software

May 20, 2008, 4:06 AM

Post #7 of 19 (283 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

Marc Lehmann wrote:
> Easy, it is the only way for perl to internally represent characters with a
> value of 500.
>
> If that 500 is for example the second character in v5.500, then this might
> simply be the perl 5.500 version string (or part of an ip address in the
> game "uplink" stored in a compact string form, e.g. v478.321.571.277).

Just to be a pedant, v5.500 is a v-string, and since 5.8.1 has been magical:

$ perl -MDevel::Peek -e '$v = v5.500; Dump($v);'
SV = PVMG(0x81ae810) at 0x816e8b8
REFCNT = 1
FLAGS = (RMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x81874c8 "\5\307\264"\0 [UTF8 "\x{5}\x{1f4}"]
CUR = 3
LEN = 4
MAGIC = 0x818ee80
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 2
MAGIC = 0x81b0b78
MG_VIRTUAL = 0
MG_TYPE = PERL_MAGIC_v-string(V)
MG_LEN = 6
MG_PTR = 0x8184710 "v5.500"

Perl version objects (native in v5.10.0 and available from CPAN for earlier
releases) use a very different storage format:

$ perl -MDevel::Peek -Mversion -e '$v = qv(v5.500); Dump($v);'
SV = PV(0x816eae8) at 0x816dcdc
REFCNT = 1
FLAGS = (ROK,OVERLOAD)
RV = 0x816e8ac
SV = PVHV(0x8172760) at 0x816e8ac
REFCNT = 1
FLAGS = (OBJECT,SHAREKEYS)
IV = 3
NV = 0
STASH = 0x8197220 "version"
ARRAY = 0x81b6c20 (1:1, 2:1)
hash quality = 90.0%
KEYS = 3
FILL = 2
MAX = 1
RITER = 0
EITER = 0x0
Elt "original" HASH = 0xb45e44f2
SV = PV(0x816eba8) at 0x81b4228
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x8197760 "v5.500"\0
CUR = 6
LEN = 8
Elt "qv" HASH = 0x18c4b28a
SV = IV(0x818de44) at 0x816dcf4
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
Elt "version" HASH = 0x68c27e33
SV = RV(0x819a54c) at 0x81c6730
REFCNT = 1
FLAGS = (ROK)
RV = 0x816dd90
SV = PVAV(0x8172c64) at 0x816dd90
REFCNT = 1
FLAGS = ()
IV = 0
NV = 0
ARRAY = 0x81c0988
FILL = 2
MAX = 3
ARYLEN = 0x0
FLAGS = (REAL)
Elt No. 0
SV = IV(0x818e05c) at 0x816e930
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 5
Elt No. 1
SV = IV(0x818e060) at 0x816e900
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 500
Elt No. 2
SV = IV(0x818e068) at 0x81b421c
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 0
PV = 0x816e8ac ""
CUR = 0
LEN = 0

The fact that the former (v-strings) were used for a while as $VERSION
initializers is an historical aberration - a failure of imagination - that
relied on the behavior of Perl's internals in a somewhat unhealthy fashion.
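
For readers without Devel::Peek at hand, a small sketch of what the v-string above contains, seen from Perl space only (variable names are illustrative):

```perl
use strict;
use warnings;

my $v = v5.500;                        # a v-string: chr(5) . chr(500)

my $len    = length $v;                # number of characters
my $second = ord(substr($v, 1, 1));    # ordinal of the second character
my $dotted = sprintf "%vd", $v;        # ordinals joined with "."

print "$len $second $dotted\n";        # prints "2 500 5.500"
```

The second character has ordinal 500, which is why the string has to carry the UTF8 flag in the dump.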

> Of course, this gets you in trouble:
>
> my $s = chr 200; # not unicode, but native 8-bit(??)
> substr $s, 0, 0, chr 500;
> $s =~ /ü/; # now interpreted as unicode
>
> This is the insane part - I wouldn't expect even an expert perl programmer
> to predict how $s gets interpreted here.

This is a contrived example because you are going out of your way to manufacture
bad code. Just because you *can* use chr() with values > 255 and Perl turns on
the UTF8 flag in the supreme hope that you knew what you were doing, doesn't
make this irredeemably broken. You broke $s by mixing your string-types using a
low-level function that has no knowledge of unicode semantics, *nor should it*.

A more realistic example is a PV containing ASCII text that has a UTF8 string
concatenated to it. This works as designed: the original string is upgraded to
UTF8, the second string is appended, and well-formed UTF8 is assured.
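
A minimal sketch of that concatenation case, assuming nothing beyond core Perl (the strings and names are illustrative):

```perl
use strict;
use warnings;

my $s = "abc";                                   # plain ASCII, UTF8 flag off
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

$s .= "\x{263A}";                                # append a wide character
my $flag_after = utf8::is_utf8($s) ? 1 : 0;

printf "flag: %d -> %d, length %d\n", $flag_before, $flag_after, length $s;
# prints "flag: 0 -> 1, length 4"
```

The ASCII part is unaffected in any visible way: substr($s, 0, 3) still equals "abc".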

John


demerphq at gmail

May 20, 2008, 5:10 AM

Post #8 of 19 (279 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

2008/5/20 John Peacock <john.peacock[at]havurah-software.org>:
> Marc Lehmann wrote:
..
>> Of course, this gets you in trouble:
>>
>> my $s = chr 200; # not unicode, but native 8-bit(??)
>> substr $s, 0, 0, chr 500;
>> $s =~ /ü/; # now interpreted as unicode
>>
>> This is the insane part - I wouldn't expect even an expert perl programmer
>> to predict how $s gets interpreted here.
>
> This is a contrived example because you are going out of your way to
> manufacture bad code. Just because you *can* use chr() with values > 255
> and Perl turns on the UTF8 flag in the supreme hope that you knew what you
> were doing, doesn't make this irredeemably broken. You broke $s by mixing
> your string-types using a low-level function that has no knowledge of
> unicode semantics, *nor should it*.
>
> A more realistic example is a PV containing ASCII text has a UTF8 string
> concatenated to it. This works as designed - the original string is
> upgraded to UTF8 and the second string appended and well-formed UTF8 is
> assured.

I think Marc's point was that Perl really has no business assuming the
string is actually latin 1.

As Glen said, part of the problem with dealing with Marc on this
subject is that he doesn't use the terms most commonly used here, or he
uses them in different ways than they tend to be used here, and he
doesn't explain precisely how he is using them until after the debate
has become heated.

Hopefully I can summarize his point, which I think I finally
get (with help from Glen and Ben).

He says: string data has no character set association at all. It is
either an array of octets or it is an array of integers encoded as
utf8. The fact that the string may be encoded using utf8 sequences
does not mean that it actually contains Unicode data.

So for instance, if I took a string that contained the bytes which
represented "Hello World" in Chinese using big5 and concatenated a
string containing chr(256) to it, the octets would be reencoded as
utf8 directly, octet for octet, without an understanding of how big5
actually represents strings, and on an abstract level the string still
contains big5, just now strangely double encoded as utf8.
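
A small sketch of that effect. The four octets below are arbitrary high-bit bytes standing in for big5 data (an assumption for illustration, not a real "Hello World"), and utf8::encode is used only to look at the internal byte sequence:

```perl
use strict;
use warnings;

my $octets = "\xa7\x41\xa6\x6e";       # 4 octets of "binary" data
my $mixed  = $octets . chr(256);       # forces an upgrade of the whole string

# At the Perl level the original octets are still there, one per character.
my $chars        = length $mixed;                        # 5
my $prefix_intact = substr($mixed, 0, 4) eq $octets;     # true

# Internally, every octet >= 0x80 now occupies two bytes.
my $internal = $mixed;
utf8::encode($internal);               # expose the UTF-8 representation
my $bytes = length $internal;          # 8

printf "%d characters, %d internal bytes\n", $chars, $bytes;
# prints "5 characters, 8 internal bytes"
```

Nothing here knows or cares that the octets were meant as big5; each one is simply treated as a code point below 256.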

Where this gets confusing is that Perl does in fact assume Latin-1
semantics for its octet-based strings in a number of common cases,
such as case-insensitive matching, uppercasing and lowercasing, etc. This
is OK because these are places where the programmer explicitly says
"assume that this is character data encoded somehow or another". But
the "auto upgrade" behaviour is dangerous as it means that binary data
is sometimes blindly re-encoded as utf8, even though it may have been
pure binary data.

The core of the problem is that the old C habit of conflating arrays
of octets with strings of characters has carried over to Perl in such
a way that we have a big mess, and it doesn't look easily resolvable.
Although I suspect that we are making a mountain out of a molehill
about the Win32 aspect of this problem.

I think Marc is right, the utf8 flag being off doesn't say "this data
is latin1" and the utf8 flag being on doesn't say "this data is
Unicode". The flag instead says (when off) "this is array of
characters" or "this is an array of integers encoded as utf8" (when
on). The additional step of ascribing a character set to the encoding
is incorrect, and one that evolves out of the heritage of supporting
character set style operations on pure octet encodings.

Basically we have to remember that encoding and character set are
different. ANSI is a character set, Latin-1 is a character set.
Unicode is a character set. Octets are an encoding, and utf8 is an
encoding. We can have Latin-1 data encoded as utf8, indistinguishable
from Unicode encoded as utf8, and we can have ANSI data encoded as
utf8, which is not the same thing as converting ANSI to Unicode stored
as utf8.

It's all very ripe for confusion. I think Marc is right. We should
really think about this. We have different parts of the code base
thinking about these issues in different ways, and a lot of confusion
involved. I personally think that if we can sort them out, even in a
not 100% backwards-compatible way, then we will have made good
progress.

The issues I see are these:

1. We don't have a binary data type. (We don't distinguish character
data from octet data, and it's easy to inadvertently cause one to be
treated as the other, with surprising results.)
2. We don't associate a character set with a string; we associate an
encoding with it. Character set and encoding are orthogonal concepts
despite being related.
3. We use the name of an encoding of Unicode as the name for the
encoding of a string, causing confusion.

I'm not sure how we get out of this mess. Maybe by making PVs store
more information about their character set. With that information we
can convert strings correctly to Unicode when we need to.

Cheers,
yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"


demerphq at gmail

May 20, 2008, 5:13 AM

Post #9 of 19 (279 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

2008/5/20 demerphq <demerphq[at]gmail.com>:
> I think Marc is right, the utf8 flag being off doesn't say "this data
> is latin1" and the utf8 flag being on doesn't say "this data is
> Unicode". The flag instead says (when off) "this is array of
> characters" or "this is an array of integers encoded as utf8" (when
^^^^^^^^^^

I meant octets/bytes there.

> on). The additional step of ascribing a character set to the encoding
> is incorrect, and one that evolves out of the heritage of supporting
> character set style operations on pure octet encodings.

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


juerd at convolution

May 20, 2008, 7:39 AM

Post #10 of 19 (275 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

John ORourke skribis 2008-05-20 6:39 (+0100):
> open($fh, "<:utf8", $file)

Please note that <:utf8 is unsafe and that you should use
:encoding(utf8) instead.

See also
http://www.perlfoundation.org/perl5/index.cgi?the_utf8_perlio_layer and
http://www.perlmonks.org/?node_id=644786
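
A minimal sketch of reading through an :encoding() layer, using an in-memory filehandle so it is self-contained. It uses the strict spelling "UTF-8"; the sample data and variable names are illustrative:

```perl
use strict;
use warnings;

my $bytes = "caf\xc3\xa9\n";           # UTF-8 encoded text, 6 octets

# The layer decodes (and validates) the octets as they are read.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$fh>;
close $fh;
chomp $line;

printf "%d characters, last is U+%04X\n", length($line), ord(substr($line, -1));
# prints "4 characters, last is U+00E9"
```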
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


juerd at convolution

May 20, 2008, 8:01 AM

Post #11 of 19 (276 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

demerphq skribis 2008-05-20 14:10 (+0200):
> Where this gets confusing is that Perl does in fact assume Latin-1
> semantics for its octet based strings in a number of common cases,

I think you mean "ASCII semantics" there. In these cases, the second
half of latin1 is ignored and left alone.

Latin1 was (re)defined as a Unicode encoding in 1998, which means that
0xE9 is no longer just something that looks like é, but defined as
U+00E9 LATIN SMALL LETTER E WITH ACUTE. This does, of course, have the
implication that the letter now has an uppercase variant in U+00C9 LATIN
CAPITAL LETTER E WITH ACUTE which is encoded as the 0xC9 byte in latin1.
Perl ignores this part of the specification, and that's why I think it's
incorrect to call what Perl does "Latin-1 semantics".

In fact, latin1 semantics are pretty hard to describe because uc("\xff")
(\xFF is U+00FF LATIN SMALL LETTER Y WITH DIAERESIS) cannot be expressed
in latin1, because the uppercase of U+00FF is U+0178 which has no
representation in latin1.
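
The behaviour described above can be observed directly. This is a sketch that assumes neither "use locale" nor the unicode_strings feature is in effect; the variable names are illustrative:

```perl
use strict;
use warnings;

my $native   = "\xe9";                 # e-acute, UTF8 flag off
my $upgraded = "\xe9";
utf8::upgrade($upgraded);              # same character, UTF8 flag on

my $y = "\xff";                        # y-diaeresis
utf8::upgrade($y);

printf "native:   U+%04X -> U+%04X\n", ord($native),   ord(uc($native));
printf "upgraded: U+%04X -> U+%04X\n", ord($upgraded), ord(uc($upgraded));
printf "y:        U+%04X -> U+%04X\n", ord($y),        ord(uc($y));
# prints:
#   native:   U+00E9 -> U+00E9
#   upgraded: U+00E9 -> U+00C9
#   y:        U+00FF -> U+0178
```

The first two strings compare equal with eq, yet uc() treats them differently, and the third result no longer fits in a single octet.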

> I think Marc is right, the utf8 flag being off doesn't say "this data
> is latin1" and the utf8 flag being on doesn't say "this data is
> Unicode". The flag instead says (when off) "this is array of
> characters" or "this is an array of integers encoded as utf8" (when
> on).

You're making a distinction between "characters" (SvUTF8 off) and
"integers" (SvUTF8 on) that I don't understand. Could you explain why
there is a difference and what that is?

> Latin-1 is a character set.

Latin-1 is both a character set and an encoding. The character set is
defined as equal to the first 256 characters in Unicode (U+0000 ..
U+00FF), and the encoding is defined as a straightforward 8 bit
encoding: U+0000 => 0x00 .. U+00FF => 0xFF. They even went as far as
describing how the individual bits are to be laid out in the byte. Not
surprisingly, the 8 bits have weights from 128 to 1, where each
subsequent bit is half the value of the one before it :)

The specification uses the term "coded representation" rather than
"encoding".

> The issues i see are this:
> 1. We don't have a binary data type.

I intend to release a module that handles this in Perl space in a way
that is backward compatible to 5.000. Its name is BLOB.

One thing that it doesn't do, is avoid concatenation with non-BLOBs. I'd
like to learn if this can be done at all.

> 3. We use the name of an encoding of Unicode as the name of for the
> encoding of a string causing confusion.

Indeed. Maybe it would be wise to start calling the internal
representation SvUTF8 encoding, rather than UTF8 encoding. Or maybe a
wholly different name.

> Maybe by making PV's store more information about their character set.

The Encode suite treats character sets as properties of encodings; the
user only has to deal with a single character set, namely Unicode. I
think that's the only sane approach. Information about the
charset/encoding does not have to be in the string, but belongs to
operations, as Marc aptly describes in the first post carrying this subject.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


demerphq at gmail

May 20, 2008, 9:03 AM

Post #12 of 19 (274 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

2008/5/20 Juerd Waalboer <juerd[at]convolution.nl>:
> demerphq skribis 2008-05-20 14:10 (+0200):
>> Where this gets confusing is that Perl does in fact assume Latin-1
>> semantics for its octet based strings in a number of common cases,
>
> I think you mean "ASCII semantics" there. In these cases, the second
> half of latin1 is ignored and left alone.

Yeah something like that. English capitalization rules as applied to latin1.

> Latin1 was (re)defined as a Unicode encoding in 1998, which means that
> 0xE9 is no longer just something that looks like é, but defined as
> U+00E9 LATIN SMALL LETTER E WITH ACUTE. This does, of course, have the
> implication that the letter now has an uppercase variant in U+00C9 LATIN
> CAPITAL LETTER E WITH ACUTE which is encoded as the 0xC9 byte in latin1.
> Perl ignores this part of the specification, and that's why I think it's
> incorrect to call what Perl does "Latin-1 semantics".
>
> In fact, latin1 semantics are pretty hard to describe because uc("\xff")
> (\xFF is U+00FF LATIN SMALL LETTER Y WITH DIAERESIS) cannot be expressed
> in latin1, because the uppercase of U+00FF is U+0178 which has no
> representation in latin1.

Aren't charset encoding issues fun?

>> I think Marc is right, the utf8 flag being off doesn't say "this data
>> is latin1" and the utf8 flag being on doesn't say "this data is
>> Unicode". The flag instead says (when off) "this is array of
>> characters" or "this is an array of integers encoded as utf8" (when
>> on).
>
> You're making a distinction between "characters" (SvUTF8 off) and
> "integers" (SvUTF8 on) that I don't understand. Could you explain why
> there is a difference and what that is?

Sorry, I slipped into C-speak there. I meant to say that the SvUTF8
flag just tells us whether we have an array of octets (values 0..255)
or a "stream" of integers (0..N for some large, hypothetically
unbounded value of N). I say stream here because it's not really an
array at that point.

>
>> Latin-1 is a character set.
>
> Latin-1 is both a character set and an encoding. The character set is
> defined as equal to the first 256 characters in Unicode (U+0000 ..
> U+00FF), and the encoding is defined as a straight forward 8 bit
> encoding: U+0000 => 0x00 .. U+00FF => 0xFF. They even went as far as
> describing how the individual bits are to be layed out in the byte. Not
> surprisingly, the 8 bits have weights from 128 to 1, where each
> subsequent bit is half the value of the one before it :)
>
> The specification uses the term "coded representation" rather than
> "encoding".

Ok. Fine. It specifies both. Latin-1 was probably a bad example. I
meant that if you had some arbitrary stream of octets you could encode
those octets as utf8 without losing information. But you wouldn't
(except in the case of latin1) have converted it to Unicode.

>> The issues i see are this:
>> 1. We don't have a binary data type.
>
> I intend to release a module that handles this in Perl space in a way
> that is backward compatible to 5.000. Its name is BLOB.
>
> One thing that it doesn't do, is avoid concatenation with non-BLOBs. I'd
> like to learn if this can be done at all.

Sounds interesting.

>
>> 3. We use the name of an encoding of Unicode as the name of for the
>> encoding of a string causing confusion.
>
> Indeed. Maybe it would be wise to start calling the internal
> representation SvUTF8 encoding, rather than UTF8 encoding. Or maybe a
> wholly different name.
>
>> Maybe by making PV's store more information about their character set.
>
> The Encode suite treats character sets as properties of encodings;

Given how perl works internally does Encode have any other choice?

> the user only has to deal with a single character set, namely Unicode.

Except, er, they don't. As we've been discussing for ages now.

> I think that's the only sane approach. Information about the
> charset/encoding does not have to be in the string, but belongs to
> operations as Marc aptly describes the first post carrying this subject.

I don't really get you. If you don't know what type of data is
contained in a string, how can you know what the correct behaviour is
for it for a given operation?

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"


juerd at convolution

May 20, 2008, 11:05 AM

Post #13 of 19 (268 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

demerphq skribis 2008-05-20 18:03 (+0200):
> > The Encode suite treats character sets as properties of encodings;
> Given how perl works internally does Encode have any other choice?

Sure. Since a string in Perl is just a sequence of numbered characters,
it could theoretically be used to represent any character set, not just
Unicode. We tend to call Perl strings Unicode strings, but in reality
the unicode-ness is not part of the string, but of the operation done on
it. It's a fair coincidence that the multibyte encoding chosen happens
to be a unicode encoding ;)

> > the user only has to deal with a single character set, namely Unicode.
> Except er, they dont. As weve been discussing for ages now.

Encode combines "character set" and "byte encoding" into a single
mapping, which it calls "encoding". Perl users can treat binary data as
encoded text.

A Perl programmer decodes the binary data, and later encodes the text
data back to binary. They only specify the "encoding", and the character
set is handled transparently.

Let's call the latin1 character set "l1cs" and the latin1 encoding
"l1enc". The real transformation from UTF-8 to l1enc would be:

UTF-8 -> unicode -> l1cs -> l1enc

However, Perl provides a unified view of encodings, and bundles the
charset in them. What you're actually doing is

UTF-8 -> (string of unicode codepoints) -> latin1

And you don't have to care about the difference between l1cs and l1enc.
That's what I meant by: the character set is Unicode, and all other
character sets are handled by their encoding implementations.
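
A short sketch of that round trip with the Encode module's decode and encode functions (the sample bytes and variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_bytes = "caf\xc3\xa9";                       # binary input, 5 octets

# UTF-8 -> (string of Unicode code points)
my $text = decode('UTF-8', $utf8_bytes);              # 4 characters

# (string of Unicode code points) -> latin1
my $latin1_bytes = encode('iso-8859-1', $text);       # 4 octets, "caf\xe9"

printf "%d octets in, %d characters, %d octets out\n",
    length($utf8_bytes), length($text), length($latin1_bytes);
# prints "5 octets in, 4 characters, 4 octets out"
```

Only the encoding names appear in the code; the latin1 character set is never mentioned separately.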

> > I think that's the only sane approach. Information about the
> > charset/encoding does not have to be in the string, but belongs to
> > operations as Marc aptly describes the first post carrying this subject.
> I dont get you really. If you dont know what type of a data is
> contained in a string how can you know what the correct behaviour is
> for it for a given operation?

By declaring what you expect, so you don't have to know or guess. Perl
operators would expect unicode text.

uc(), lc(), character classes, etcetera are all text operations. You
don't use them on binary data. Perl assumes that the character set of
the string is Unicode, and uses Unicode semantics. Or, it should.

In fact, I couldn't even *find* any other character set with clearly
defined semantics for things like upper/lower case. Unicode appears to
be unique in that. Oh, and ASCII of course :).
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


davidnicol at gmail

May 20, 2008, 12:27 PM

Post #14 of 19 (266 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

On Tue, May 20, 2008 at 7:10 AM, demerphq <demerphq[at]gmail.com> wrote:

> 1. We don't have a binary data type. (We dont distinguish character
> data from octet data and its easy to inadvertently cause one to be
> treated as the other with surprising results.)

I have been for some time in favor of introducing a
string-of-characters data type which would be a tree of some kind
rather than a contiguous block. So if new data is to be added to the
internal representation of the character-data data type, please include
"make it a tree instead of a contiguous block" when considering
architecture options. Advantages include: concatenation without
copying; cleaner lvalue substring semantics; the possibility of
internally compressing strings with lots of common bits in them; and
certainly more (especially after four cups of coffee, he
self-defeatingly blurted)
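
As a rough illustration of the tree idea in Perl space, here is a toy sketch. The function names and the node layout are hypothetical, and this says nothing about how such a type would be implemented in the core:

```perl
use strict;
use warnings;

# A node is either a plain string (leaf) or [left, right, total_length].
sub rope_length {
    my ($r) = @_;
    return ref($r) ? $r->[2] : length($r);
}

# Concatenation allocates one small node; neither operand is copied.
sub rope_concat {
    my ($l, $r) = @_;
    return [ $l, $r, rope_length($l) + rope_length($r) ];
}

# Fetch the character at a 0-based index by walking down the tree.
sub rope_char_at {
    my ($r, $i) = @_;
    while (ref($r)) {
        my $left_len = rope_length($r->[0]);
        if ($i < $left_len) { $r = $r->[0] }
        else                { $i -= $left_len; $r = $r->[1] }
    }
    return substr($r, $i, 1);
}

# Flatten to an ordinary string only when one is really needed.
sub rope_flatten {
    my ($r) = @_;
    return ref($r) ? rope_flatten($r->[0]) . rope_flatten($r->[1]) : $r;
}

my $rope = rope_concat(rope_concat("abc", "def"), "ghi");
print rope_length($rope), "\n";        # 9
print rope_char_at($rope, 4), "\n";    # e
print rope_flatten($rope), "\n";       # abcdefghi
```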


pagaltzis at gmx

May 20, 2008, 2:34 PM

Post #15 of 19 (267 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

* Tom Christiansen <tchrist[at]perl.com> [2008-05-20 03:00]:
> (2) The troubles of getting Unicodish action on codepoints
> in the U+0080 .. U+00FF range.
>
> eg: % perl -E 'say chr(0xdf)'
> ß
> % perl -E 'say ucfirst chr(0xdf)'
> ß
> % perl -E 'use utf8; say ucfirst chr(0xdf)'
> ß
> % perl -E 'use encoding "latin1"; say ucfirst chr(0xdf)'
> Ss

Just as a sidenote, the consensus in a recent thread was that the
`encoding` pragma is broken and deprecated.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


rvtol+news at isolution

May 20, 2008, 4:52 PM

Post #16 of 19 (264 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

Juerd Waalboer schreef:

> I intend to release a module that handles this in Perl space in a way
> that is backward compatible to 5.000. Its name is BLOB.
>
> One thing that it doesn't do, is avoid concatenation with non-BLOBs.
> I'd like to learn if this can be done at all.

Maybe stringize is as undef? (or as an empty string)


Somewhat related: a bigstring type would be nice. Probably implemented
as an array of hashrefs. (Or as the tree that D.Nicol envisions.)

Each hash contains the value: ( _ => "some text" ), and can contain meta
information for that value: ( mimetype => "plain/ascii", language =>
"English", etc. ).

For joining the strings, there could be a concatenation separator, for
example a path separator like "/".
Each part of the path could then have its own
characterset/contenttype/encoding/semantics/etc.

--
Affijn, Ruud

"Gewoon is een tijger."


juerd at convolution

May 20, 2008, 5:16 PM

Post #17 of 19 (265 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

Dr.Ruud skribis 2008-05-21 1:52 (+0200):
> Maybe stringize is as undef? (or as an empty string)

Can you please rephrase that? I have no idea what you mean.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales[at]convolution.nl>
1;


jim.cromie at gmail

Jun 3, 2008, 7:03 PM

Post #18 of 19 (182 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

David Nicol wrote:
> On Tue, May 20, 2008 at 7:10 AM, demerphq <demerphq[at]gmail.com> wrote:
>
>
>> 1. We don't have a binary data type. (We dont distinguish character
>> data from octet data and its easy to inadvertently cause one to be
>> treated as the other with surprising results.)
>>
>
> I have been for some time in favor of introducing a
> string-of-characters data type which would be a tree of some kind
> rather than a contiguous block. So if new data is to be added to the
>

what you want is Ropes


davidnicol at gmail

Jun 4, 2008, 8:09 AM

Post #19 of 19 (180 views)
Permalink
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs [In reply to]

On Tue, Jun 3, 2008 at 9:03 PM, Jim Cromie <jim.cromie[at]gmail.com> wrote:
>
> what you want is Ropes

Yes. "Perl on Ropes."
