
pagaltzis at gmx
May 12, 2012, 10:13 PM
Post #7 of 10
* Karl Williamson <public [at] khwilliamson> [2012-05-12 18:40]:
> On 05/11/2012 11:49 PM, Aristotle Pagaltzis wrote:
> >* hv [at] crypt<hv [at] crypt> [2012-05-11 01:50]:
> >> From 'perldoc -f vec':
> >> If the string happens to be encoded as UTF-8 internally (and
> >> thus has the UTF8 flag set), this is ignored by "vec", and it
> >> operates on the internal byte string, not the conceptual
> >> character string, even if you only have characters with values
> >> less than 256.
> >
> >Aaaaaaaargh. *Documented* to have the Unicode Bug behaviour.
>
> So what should vec actually do? One possibility is to extend feature
> 'unicode_strings' yet again to cover it, in which case it would be
> best to change our documents in 5.16 to indicate that is coming.

Hold on a moment. The reality is better than I thought… but at the same
time worse. Get a load of this:

    use 5.014;
    use Test::More;
    use Encode ();

    my $p = my $v = chr 0xf0;
    utf8::upgrade $v;
    is Encode::is_utf8($p), !1, 'original string should not have UTF-8 flag';
    is Encode::is_utf8($v), !0, 'upgraded copy should have UTF-8 flag';

    sub bits { join '', map { vec $_[0], $_, 1 } 0..$_[1]-1 }
    is bits($p, 16), bits($v, 16), 'upgrading should not change semantics';

    is Encode::is_utf8($p), !1, 'original string should not have UTF-8 flag';
    is Encode::is_utf8($v), !0, 'upgraded copy should have UTF-8 flag';

    $v = chr 0x100;
    utf8::downgrade $v, 1;
    is Encode::is_utf8($v), !0, 'string with wide chars cannot be downgraded';

    eval { vec $v, 0, 1 };
    ok $@, 'vec refuses to operate on wide-character strings';

    done_testing;

This will yield two failures:

    not ok 5 - upgraded copy should have UTF-8 flag
    #   Failed test 'upgraded copy should have UTF-8 flag'
    #   at t.pl line 14.
    #          got: ''
    #     expected: '1'
    not ok 7 - vec refuses to operate on wide-character strings
    #   Failed test 'vec refuses to operate on wide-character strings'
    #   at t.pl line 21.

So on one hand, `vec` will downgrade strings when they can be
downgraded, so that it operates on a packed octet buffer whenever it
can, no matter whether the input was upgraded or not. This is good!

But – insanely – if it *cannot* downgrade the string, it just shrugs
and plods on blithely, operating on the UTF-8-encoded variable-width
buffer.

This combination is not ultimately any better than if it had just the
plain Unicode Bug. You still have to protect every call to `vec` with
a `utf8::downgrade` with a non-true FAIL_OK flag (at least notionally,
somewhere in the code path that leads to it). But the faux convenience
of `vec` makes it DTRT with a significantly larger range of inputs, so
if you forget a necessary `utf8::downgrade` somewhere, you are much
less likely to realize your mistake. Worse, whether the breakage
manifests will depend on your data, which means it can cause
Heisenbugs.

> So what should vec actually do?

So to come back to this most essential query, the question that I guess
we ultimately have to answer is: What does it mean to look at the bits
of a text string?

Well, what is a text string as a pure concept? It is a sequence of
codepoints, i.e. of platonic ideas of numbers with no particular
representation.

What does `vec` do, as a pure concept? It looks at units of something
represented in a string of bits.

So at a purely semantic level, it actually makes no sense to put these
together: trying to look at the representation of something that has
none is nonsense. From that perspective, `vec` on a text string should
simply throw Does Not Compute.

Unfortunately, in Perl you cannot tell byte strings apart from text
strings without non-Latin1 characters – not even by looking at that
most attractive of nuisances, the UTF8 flag. Byte strings may be
upgraded and text strings may be stored in non-upgraded format.

So the best `vec` can do is throw an exception for non-downgradeable
strings.

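To make the "protect every call" idea concrete, here is a minimal
sketch of such a guard. The name `strict_vec` is hypothetical (it is
not part of Perl or of any module), and it only covers the rvalue use
of `vec`. It relies on `utf8::downgrade` dying when FAIL_OK is not set
and the string contains characters above 0xFF:

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: read one element with
# vec, but refuse any string that cannot be represented as octets.
sub strict_vec {
    my ($str, $offset, $bits) = @_;  # $str is a copy of the caller's string
    utf8::downgrade($str);           # FAIL_OK not set: dies on chars > 0xFF
    return vec($str, $offset, $bits);
}

my $plain    = chr 0xf0;
my $upgraded = chr 0xf0;
utf8::upgrade($upgraded);

# Same character, different internal encoding, same answer.
print strict_vec($plain,    0, 8), "\n";  # prints 240
print strict_vec($upgraded, 0, 8), "\n";  # prints 240

# A wide character cannot be downgraded, so the helper dies
# instead of looking at the UTF-8 encoded bytes.
my $accepted = eval { strict_vec(chr 0x100, 0, 8); 1 };
print $accepted ? "accepted\n" : "rejected\n";  # prints rejected
```

Because the helper works on a copy, the caller's string keeps whatever
internal encoding it had. The lvalue form of `vec` would need the
downgrade applied to the original variable instead.
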
That means that, merely because they fit inside the Latin1 charset,
`vec` will still accept a wide range of text strings that it should
not notionally operate on.

I don’t know what to do about that. This is not the first time I’ve
thought we need some way to mark strings specifically as text vs bytes.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>