
Mailing List Archive: Perl: porters

vec v. ord

 

 



hv at crypt

May 10, 2012, 12:05 PM

Post #1 of 10 (204 views)
Permalink
vec v. ord

Hmm, is this just showing it takes longer to stack 3 arguments than 1,
or does it show room for some optimizations in vec()?

% bleadperl -MBenchmark=cmpthese -wl
$s = "test";
cmpthese(-1, {
"ord" => q{$t=ord($s)},
"vec" => q{$t=vec($s, 0, 8)},
})
__END__
Name "main::s" used only once: possible typo at -e line 1.
         Rate  vec  ord
vec 3044813/s   -- -40%
ord 5041230/s  66%   --
%

(Timings are very unreliable on this machine, but the numbers above
are close to the median I saw.)

It seems to me ord() needs to do extra work (eg to consider utf8), so
vec() ought to be the faster.
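
For reference, a minimal sketch of why the two calls are comparable at all: on a plain byte string such as the "test" used in the benchmark, both return the value of the first octet. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $s = "test";

# For a plain byte string both calls read the first octet.
my $via_ord = ord($s);          # code point of the first character
my $via_vec = vec($s, 0, 8);    # element 0 when the string is viewed as 8-bit units

print "ord: $via_ord, vec: $via_vec\n";   # ord: 116, vec: 116
```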

Hugo


sprout at cpan

May 10, 2012, 1:06 PM

Post #2 of 10 (204 views)
Permalink
Re: vec v. ord [In reply to]

hv asked:
> Hmm, is this just showing it takes longer to stack 3 arguments
> than 1...?

Yes.

$ ./perl -Ilib -MBenchmark=cmpthese -wl
$s = "test";
cmpthese(-1, {
"ord" => q{@t=(0,8,ord($s))},
"vec" => q{@t=vec($s, 0, 8)},
})
__END__
Name "main::s" used only once: possible typo at - line 1.
         Rate  ord  vec
ord 1102769/s   -- -31%
vec 1587376/s  44%   --


ikegami at adaelis

May 10, 2012, 1:27 PM

Post #3 of 10 (204 views)
Permalink
Re: vec v. ord [In reply to]

On Thu, May 10, 2012 at 3:05 PM, <hv [at] crypt> wrote:

> Hmm, is this just showing it takes longer to stack 3 arguments than 1,
> or does it show room for some optimizations in vec()?
>

Hum, are those really the only two options you have in mind?


>          Rate  vec  ord
> vec 3044813/s   -- -40%
> ord 5041230/s  66%   --
>

To put those numbers into perspective, vec takes an extra 130 ns per
iteration (1/3044813 s - 1/5041230 s).
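
The arithmetic can be checked with a few lines of Perl. This is only a sketch; the rates are copied from the quoted benchmark output and the hash names are made up for the example.

```perl
use strict;
use warnings;

# Iterations per second, copied from the quoted benchmark output.
my %rate = (vec => 3044813, ord => 5041230);

# Convert each rate into nanoseconds per iteration.
my %ns = map { $_ => 1e9 / $rate{$_} } keys %rate;

printf "vec: %.0f ns, ord: %.0f ns, difference: %.0f ns\n",
    $ns{vec}, $ns{ord}, $ns{vec} - $ns{ord};
# vec: 328 ns, ord: 198 ns, difference: 130 ns
```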

> It seems to me ord() needs to do extra work (eg to consider utf8), so
> vec() ought to be the faster.
>

Both have to deal with the UTF8 flag, and I doubt either does anything
after quickly ascertaining that the flag is off.

- Eric


hv at crypt

May 10, 2012, 4:20 PM

Post #4 of 10 (195 views)
Permalink
Re: vec v. ord [In reply to]

Eric Brine <ikegami [at] adaelis> wrote:
:On Thu, May 10, 2012 at 3:05 PM, <hv [at] crypt> wrote:
[...]
:> It seems to me ord() needs to do extra work (eg to consider utf8), so
:> vec() ought to be the faster.
:>
:
:Both have to deal with the UTF8 flag, and I doubt either does anything
:after quickly ascertaining that the flag is off.

From 'perldoc -f vec':
    If the string happens to be encoded as UTF-8 internally (and
    thus has the UTF8 flag set), this is ignored by "vec", and it
    operates on the internal byte string, not the conceptual
    character string, even if you only have characters with values
    less than 256.
.. which assertion I had vaguely encoded in my head somewhere.

But oh, it turns out to be untrue as your comment implies:

/* currently converts input to bytes if possible, but doesn't sweat failure */
UV
Perl_do_vecget(pTHX_ SV *sv, I32 offset, I32 size)
[...]
    if (SvUTF8(sv))
        (void) Perl_sv_utf8_downgrade(aTHX_ sv, TRUE);

.. and looks to have been that way since September 2000. The pod paragraph
was added in 2007 (but replaced something at least as wrong).

I'm not sure how perlfunc should better describe this; I'd be tempted
to replace the quoted paragraph with:
    Behaviour if the string contains codepoints greater than 255 is
    undefined.

Hugo


pagaltzis at gmx

May 11, 2012, 10:49 PM

Post #5 of 10 (193 views)
Permalink
Re: vec v. ord [In reply to]

* hv [at] crypt <hv [at] crypt> [2012-05-11 01:50]:
> From 'perldoc -f vec':
> If the string happens to be encoded as UTF-8 internally (and
> thus has the UTF8 flag set), this is ignored by "vec", and it
> operates on the internal byte string, not the conceptual
> character string, even if you only have characters with values
> less than 256.

Aaaaaaaargh. *Documented* to have the Unicode Bug behaviour.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


public at khwilliamson

May 12, 2012, 9:36 AM

Post #6 of 10 (194 views)
Permalink
Re: vec v. ord [In reply to]

On 05/11/2012 11:49 PM, Aristotle Pagaltzis wrote:
> * hv [at] crypt<hv [at] crypt> [2012-05-11 01:50]:
>> From 'perldoc -f vec':
>> If the string happens to be encoded as UTF-8 internally (and
>> thus has the UTF8 flag set), this is ignored by "vec", and it
>> operates on the internal byte string, not the conceptual
>> character string, even if you only have characters with values
>> less than 256.
>
> Aaaaaaaargh. *Documented* to have the Unicode Bug behaviour.
>
> Regards,


So what should vec actually do? One possibility is to extend feature
'unicode_strings' yet again to cover it, in which case it would be best
to change our documents in 5.16 to indicate that is coming.


pagaltzis at gmx

May 12, 2012, 10:13 PM

Post #7 of 10 (192 views)
Permalink
Re: vec v. ord [In reply to]

* Karl Williamson <public [at] khwilliamson> [2012-05-12 18:40]:
> On 05/11/2012 11:49 PM, Aristotle Pagaltzis wrote:
> >* hv [at] crypt<hv [at] crypt> [2012-05-11 01:50]:
> >> From 'perldoc -f vec':
> >> If the string happens to be encoded as UTF-8 internally (and
> >> thus has the UTF8 flag set), this is ignored by "vec", and it
> >> operates on the internal byte string, not the conceptual
> >> character string, even if you only have characters with values
> >> less than 256.
> >
> >Aaaaaaaargh. *Documented* to have the Unicode Bug behaviour.
>
> So what should vec actually do? One possibility is to extend feature
> 'unicode_strings' yet again to cover it, in which case it would be
> best to change our documents in 5.16 to indicate that is coming.

Hold on a moment. The reality is better than I thought… but at the same
time worse. Get a load of this:

use 5.014;
use Test::More;
use Encode ();

my $p = my $v = chr 0xf0;
utf8::upgrade $v;
is Encode::is_utf8($p), !1, 'original string should not have UTF-8 flag';
is Encode::is_utf8($v), !0, 'upgraded copy should have UTF-8 flag';

sub bits { join '', map { vec $_[0], $_, 1 } 0..$_[1]-1 }
is bits($p, 16), bits($v, 16), 'upgrading should not change semantics';

is Encode::is_utf8($p), !1, 'original string should not have UTF-8 flag';
is Encode::is_utf8($v), !0, 'upgraded copy should have UTF-8 flag';

$v = chr 0x100;
utf8::downgrade $v, 1;
is Encode::is_utf8($v), !0, 'string with wide chars cannot be downgraded';
ok !eval { vec $v, 0, 1; 1 }, 'vec refuses to operate on wide-character strings';

done_testing;

This will yield two failures:

not ok 5 - upgraded copy should have UTF-8 flag
# Failed test 'upgraded copy should have UTF-8 flag'
# at t.pl line 14.
# got: ''
# expected: '1'

not ok 7 - vec refuses to operate on wide-character strings
# Failed test 'vec refuses to operate on wide-character strings'
# at t.pl line 21.

So on one hand `vec` will downgrade strings when they can be, so that it
operates on a packed octet buffer whenever it can, no matter whether the
input was upgraded or not.

This is good!

But – insanely – if it *cannot* downgrade the string, it just shrugs and
plods on blithely, operating on the UTF-8-encoded variable width buffer.

This combination is not ultimately any better than if it had just the
plain Unicode Bug. You still have to protect every call to `vec` with
a `utf8::downgrade` with a non-true FAIL_OK flag (at least notionally,
somewhere in the code path that leads to it).
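
A minimal sketch of that kind of guard, assuming a wrapper is acceptable. The name `checked_vec` is hypothetical and not part of core Perl; it works on a copy so the caller's string is left untouched, and it relies on `utf8::downgrade` dying when FAIL_OK is not passed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of core Perl): read one element via vec,
# but refuse strings that cannot be represented as octets.
sub checked_vec {
    my ($str, $offset, $bits) = @_;
    my $octets = $str;          # copy, so the caller's string is left alone
    utf8::downgrade($octets);   # FAIL_OK omitted: dies on code points > 0xFF
    return vec($octets, $offset, $bits);
}

my $upgraded = chr 0xf0;
utf8::upgrade($upgraded);
print checked_vec($upgraded, 0, 8), "\n";   # 240

my $ok = eval { checked_vec(chr 0x100, 0, 8); 1 };
print $ok ? "accepted\n" : "rejected: $@";
```

With this in place a wide-character string is rejected regardless of what the rest of the data looks like, so the failure no longer depends on the input.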

But the faux convenience of `vec` makes it DTRT with a significantly
larger range of inputs, so if you forget a necessary `utf8::downgrade`
somewhere you are much less likely to realize your mistake. Worse,
whether the breakage manifests will depend on your data, which means it
can cause Heisenbugs.

> So what should vec actually do?

So to come back to this most essential query, the question that I guess
we ultimately have to answer is:

What does it mean to look at the bits of a text string?

Well, what is a text string as a pure concept?
It is a sequence of codepoints, i.e. of platonic ideas of numbers
with no particular representation.

What does `vec` do, as a pure concept?
It looks at units of something represented in a string of bits.

So at a purely semantic level, it actually makes no sense to put these
together: trying to look at the representation of something that has
none is nonsense.

From that perspective, `vec` on a text string should simply throw Does
Not Compute.

Unfortunately, in Perl you cannot tell byte strings apart from text
strings without non-Latin1 characters – not even by looking at that most
attractive of nuisances, the UTF8 flag. Byte strings may be upgraded and
text strings may be stored in non-upgraded format.
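
To illustrate with a small sketch (variable names are only for the example): the two strings below are equal as far as Perl-level code is concerned, yet their UTF8 flags differ, so the flag says nothing about whether the content is meant as bytes or as text.

```perl
use strict;
use warnings;

my $raw  = chr 0xf0;      # stored in non-upgraded (one octet per character) form
my $same = $raw;
utf8::upgrade($same);     # same characters, now stored as UTF-8 internally

my $equal    = $raw eq $same ? 1 : 0;
my $raw_flag = utf8::is_utf8($raw)  ? 1 : 0;
my $up_flag  = utf8::is_utf8($same) ? 1 : 0;

print "equal: $equal, flags: $raw_flag $up_flag\n";   # equal: 1, flags: 0 1
```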

So the best `vec` can do is throw an exception for non-downgradeable
strings.

That means that for their mere fit inside the Latin1 charset, `vec` will
still accept a wide range of text strings that it should not notionally
operate on. I don’t know what to do about that.

This is not the first time I’ve thought we need some way to mark strings
specifically as text vs bytes.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


sprout at cpan

May 12, 2012, 10:27 PM

Post #8 of 10 (193 views)
Permalink
Re: vec v. ord [In reply to]

Aristotle wrote:
> So the best `vec` can do is throw an exception for non-downgradeable
> strings.

Or be backward-compatible and do the same thing as print and warn.

(The word "warn" can be interpreted as both a noun and a verb in the
previous sentence. :-)
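
A rough sketch of what that could look like from the outside, written as a wrapper rather than as a change to `vec` itself. The name `warning_vec` and the warning text are assumptions made for this example; the fallback mirrors what `print` does on a handle without an encoding layer, namely warn and then use the UTF-8 octets.

```perl
use strict;
use warnings;

# Hypothetical wrapper: behave like vec on downgradeable strings; otherwise
# warn and operate on the UTF-8 encoded octets.
sub warning_vec {
    my ($str, $offset, $bits) = @_;
    my $octets = $str;                      # work on a copy
    unless (utf8::downgrade($octets, 1)) {  # FAIL_OK set: returns false instead of dying
        warn "Wide character in vec\n";     # message text is an assumption
        utf8::encode($octets);              # characters -> UTF-8 octets
    }
    return vec($octets, $offset, $bits);
}

print warning_vec("test", 0, 8), "\n";      # 116, no warning
print warning_vec(chr 0x100, 0, 8), "\n";   # warns, then 196 (0xC4, first UTF-8 octet)
```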


pagaltzis at gmx

May 12, 2012, 11:14 PM

Post #9 of 10 (193 views)
Permalink
Re: vec v. ord [In reply to]

* Father Chrysostomos <sprout [at] cpan> [2012-05-13 07:30]:
> Aristotle wrote:
> > So the best `vec` can do is throw an exception for non-downgradeable
> > strings.
>
> Or be backward-compatible and do the same thing as print and warn.
>
> (The word "warn" can be interpreted as both a noun and a verb in the
> previous sentence. :-)

That’s good as a non-invasive shim, yes. I’ve never been particularly
happy with what `print` and friends do – but it’s at least something,
if only the maximally minimal thing.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


perl.p5p at rjbs

May 29, 2012, 8:12 AM

Post #10 of 10 (178 views)
Permalink
Re: vec v. ord [In reply to]

* Father Chrysostomos <sprout [at] cpan> [2012-05-13T01:27:31]
> Aristotle wrote:
> > So the best `vec` can do is throw an exception for non-downgradeable
> > strings.
>
> Or be backward-compatible and do the same thing as print and warn.

+1

> (The word "warn" can be interpreted as both a noun and a verb in the
> previous sentence. :-)

+2

--
rjbs
