
Mailing List Archive: Perl: porters

[perl #117355] [lu]cfirst don't respect 'use bytes'

 

 



perlbug-followup at perl

Aug 11, 2013, 7:56 PM

Post #1 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes'

On Sun Jul 14 23:54:35 2013, sprout wrote:
> > From the top of the pod in bytes.pm, added for 5.12.0:
> >
> > =head1 NOTICE
> >
> > This pragma reflects early attempts to incorporate Unicode into perl and
> > has since been superseded. It breaks encapsulation (i.e. it exposes the
> > innards of how the perl executable currently happens to store a string),
> > and use of this module for anything other than debugging purposes is
> > strongly discouraged. If you feel that the functions here within might be
> > useful for your application, this possibly indicates a mismatch between
> > your mental model of Perl Unicode and the current reality. In that case,
> > you may wish to read some of the perl Unicode documentation:
> > L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.
>
> What can we do to upgrade this to a deprecation?

I'm not sure.

The question is: do we propose to allow bytes.pm to become an external library? Can we do
this usefully, since bytes currently works (as I understand it) by tweaking $^H and letting CORE
sort out the rest? Can its behavior be reimplemented as something entirely without core
support? I think so, by making copies and downgrading. (Four-arg substr won't be exactly that
simple, but should be doable.)
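
A minimal sketch of the "making copies" idea for a single function, assuming nothing beyond the core utf8:: functions. The name internal_byte_length is hypothetical; it is not part of bytes.pm or of any replacement proposed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: report how many bytes perl uses internally to store
# a string, without putting 'use bytes' into effect.
sub internal_byte_length {
    my ($copy) = @_;    # work on a copy; the caller's string is untouched
    # For a UTF8-flagged string, utf8::encode turns each character into the
    # bytes perl already stores for it, so the copy's length is the byte count.
    utf8::encode($copy) if utf8::is_utf8($copy);
    return length $copy;
}

my $wide  = "\x{442}\x{435}\x{441}\x{442}";    # four Cyrillic characters
my $plain = "\xf1\xf2\xf3";                    # three bytes, no UTF8 flag

printf "%d characters, %d bytes\n", length($wide),  internal_byte_length($wide);
printf "%d characters, %d bytes\n", length($plain), internal_byte_length($plain);
```

The copy costs time and memory, which matches the expectation above that an ejected bytes.pm may get slower.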

I haven't given this a lot of thought, but I think that if we can make bytes.pm ejectable, we
should do so. It's okay if it gets slower, since we've been telling people for years that it's only a
debugging tool, if that.

Thoughts? Objections?

--
rjbs

---
via perlbug: queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=117355


perlbug-followup at perl

Aug 11, 2013, 10:30 PM

Post #2 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Sun Jul 14 23:22:33 2013, tonyc wrote:
> On Thu Jul 04 18:40:10 2013, tonyc wrote:
> > On Sat May 04 20:03:46 2013, public [at] khwilliamson wrote:
> > > Attached is a patch that fixes the original report. The code it
> > > changes is a small portion of this commit:
> >
> > As much as I despise use bytes, I think this patch could go in, but it
> > would need tests.
> >
> > If no-one else provides tests I'll write some over the next few days.
> >
> > Or not, if people object to the change, in which case they should
> > propose an alternative.
>
> Attached, some very basic tests, bytes.pm doesn't deserve much more :)

Applied as ac993614a7619e3e09c31ed0d7721bede551376a,
93e088e883c48d3aa622b15ae335940abb05a48f,
ae5c28e814c627cae6eff15424529d46decc4366.

I'll leave the ticket open for the deprecation discussion, though
perhaps that belongs in a different ticket.

Tony





perlbug-followup at perl

Aug 11, 2013, 11:22 PM

Post #3 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Sun Aug 11 19:56:53 2013, rjbs wrote:
> The question is: do we propose to allow bytes.pm to become an external
> library? Can we do this usefully, since bytes currently works (as I
> understand it) by tweaking $^H and letting CORE sort out the rest? Can
> its behavior be reimplemented as something entirely without core
> support. I think so, by making copies and downgrading. (four arg substr
> won't be exactly that simple, but should be doable.)
>
> I haven't given this a lot of thought, but I think that if we can make
> bytes.pm ejectable, we should do so. It's okay if it gets slower, since
> we've been telling people for years that it's only a debugging tool,
> if that.
>
> Thoughts? Objections?

How much of it would we reimplement?

If we want to keep its current behaviour, we would end up having to
override almost every op in what would become bytes.xs. Just search for
uses of DO_UTF8 throughout the core. DO_UTF8 means SvUTF8 unless bytes
is turned on, in which case we pretend the flag is not set. That means
"\xff"."\xff" returns "\xff\xc3\xbf" if the rhs is in utf8. So we have
to override concatenation via PL_check hooks, which gets messy. It
seems like a lot of work for preserving broken behaviour.
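
A small sketch of what "pretend the flag is not set" looks like from Perl code, assuming only the documented lexical scoping of the bytes pragma. The length results follow from the bytes.pm documentation. The concatenation result is printed rather than stated, since it depends on how a given perl implements the op.

```perl
use strict;
use warnings;

# A one-character string whose internal storage is forced to UTF-8.
my $flagged = "\xff";
utf8::upgrade($flagged);

# Character semantics: the UTF8 flag is honoured.
my $chars = length $flagged;

# Byte semantics: inside this lexical scope the flag is ignored, so the
# two-byte internal encoding (0xC3 0xBF) shows through.
my $bytes = do { use bytes; length $flagged };

print "characters: $chars, bytes: $bytes\n";

# The concatenation described above, evaluated under 'use bytes'.
my $joined = do { use bytes; "\xff" . $flagged };
print "joined: ", join(" ", map { sprintf "%02x", ord } split //, $joined), "\n";
```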

Are you suggesting just a subset of the behaviour?

--

Father Chrysostomos




perlbug-followup at perl

Aug 12, 2013, 12:35 PM

Post #4 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Sun Aug 11 23:22:25 2013, sprout wrote:
> How much of it would we reimplement?
>
> If we want to keep its current behaviour, we would end up having to
> override almost every op in what would become bytes.xs. Just search for
> uses of DO_UTF8 throughout the core. DO_UTF8 means SvUTF8 unless bytes
> is turned on, in which case we pretend the flag is not set. That means
> "\xff"."\xff" returns "\xff\xc3\xbf" if the rhs is in utf8. So we have
> to override concatenation via PL_check hooks, which gets messy. It
> seems like a lot of work for preserving broken behaviour.

Just deprecating bytes.pm outright, not just deprecating it from core,
would be the easiest route, of course.

In 5.20 and 5.22 it warns ‘Use of bytes.pm is deprecated’.
In 5.24 it dies with ‘Can’t locate bytes.pm in @INC...’.

--

Father Chrysostomos




perlbug-followup at perl

Aug 12, 2013, 1:08 PM

Post #5 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Sun Aug 11 19:56:53 2013, rjbs wrote:
> I haven't given this a lot of thought, but I think that if we can make
> bytes.pm ejectable, we should do so. It's okay if it gets slower, since
> we've been telling people for years that it's only a debugging tool,
> if that.
>
> Thoughts? Objections?


It is possible to use bytes to speed up regexps by 20-40% in some cases:

use warnings;
use strict;
use utf8;
use bytes ();
use Encode;
use Benchmark qw/:all/;

sub test
{
    my ($s) = @_;
    $s = ($s x 400)."z";
    $s =~ /z/ for (1..100);
}

sub try_drop_utf8_flag
{
    Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
        (bytes::length($_[0]) == length($_[0]));
}

my $ascii = "x";
my ($ascii_u, undef) = split(/ /, "x тест");

die unless $ascii_u eq $ascii;

cmpthese(-1,{
    'ascii' => sub { test($ascii); },
    'ascii with utf8 on' => sub { test($ascii_u);},
    'ascii with utf8 bit cleared' => sub {
        my $s = $ascii_u;
        try_drop_utf8_flag($s);
        die if utf8::is_utf8($s);
        test($s);
    },
});

__END__

perl-5.18.0
==========
                               Rate  ascii with utf8 on  ascii with utf8 bit cleared  ascii
ascii with utf8 on          22635/s                  --                         -27%   -28%
ascii with utf8 bit cleared 30919/s                 37%                           --    -2%
ascii                       31508/s                 39%                           2%     --

perl-5.10.0
==========
                               Rate  ascii with utf8 on  ascii with utf8 bit cleared  ascii
ascii with utf8 on          25831/s                  --                         -19%   -21%
ascii with utf8 bit cleared 31717/s                 23%                           --    -4%
ascii                       32881/s                 27%                           4%     --




perlbug-followup at perl

Aug 12, 2013, 1:38 PM

Post #6 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Sun Aug 11 19:56:53 2013, rjbs wrote:
>
> I haven't given this a lot of thought, but I think that if we can make
> bytes.pm ejectable, we should do so. It's okay if it gets slower, since
> we've been telling people for years that it's only a debugging tool,
> if that.
>
> Thoughts? Objections?

Other possible uses of bytes are:

1) Run-time, production-enabled assertions (
http://en.wikipedia.org/wiki/Assertion_%28software_development%29 ).
This is similar to debugging, except that performance matters.

2) Unit tests (sometimes performance matters there too).

The example below contains a bug (from Perl's point of view it can be treated
as not a bug, but from the programmer's point of view it is one).

(The bug is marked with "# THIS LINE CONTAINS A BUG".)

It does not affect anything, not even the program's output, except
performance and memory usage.

bin_u is 7 bytes long, and bin_a is 4 bytes long.

If 7 vs 4 bytes looks unimportant, consider 700 vs 400 MiB of binary files.

This bug can be caught (at run time or in unit tests) if the line "#die if
is_wide_string($bin_u);" is uncommented.

The only possible way to catch this is to use bytes::length (or a
similar function that counts bytes), because the final output is the same with
or without the bug.

=====

use Encode;
use utf8;
use bytes ();
use strict;
use warnings;

sub is_wide_string
{
    defined($_[0]) && utf8::is_utf8($_[0]) &&
        (bytes::length($_[0]) != length($_[0]))
}

my $u = "\x{442}\x{435}\x{441}\x{442}"; # same as "тест"

# plain binary data, for example part of binary file (say, JPEG)
my $bin = "\xf1\xf2\xf3";

my $ascii = "x";
my ($ascii_u, undef) = split(/ /, "$ascii $u");
die unless utf8::is_utf8($ascii_u);

print "original bin length:\t";
print length($bin) . "\t" . bytes::length($bin) ."\n";

my $bin_a = $bin.$ascii;

print "bin_a length:\t";
print length($bin_a) . "\t" . bytes::length($bin_a) ."\n";

my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG

#die if is_wide_string($bin_u);

print "bin_u length:\t";
print length($bin_u) . "\t" . bytes::length($bin_u) ."\n";

open my $f, ">", "file_a.tmp";
binmode $f;
syswrite $f, $bin_a;
close $f;

open $f, ">", "file_u.tmp";
binmode $f;
syswrite $f, $bin_u;
close $f;

system("md5sum file_?.tmp");

__END__
original bin length: 3 3
bin_a length: 4 4
bin_u length: 4 7
33818f4b23aa74cddb8eb625845a459a file_a.tmp
33818f4b23aa74cddb8eb625845a459a file_u.tmp
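
For comparison, here is a sketch of the same assertion written without bytes.pm. It relies on one observation: a UTF8-flagged string occupies more bytes than characters exactly when it contains a character above 0x7F. The name is_wide_string_nobytes is hypothetical and the check still inspects the internal flag through utf8::is_utf8.

```perl
use strict;
use warnings;

# Hypothetical bytes.pm-free variant of is_wide_string from the example above.
sub is_wide_string_nobytes {
    return ( defined($_[0])
          && utf8::is_utf8($_[0])
          && $_[0] =~ /[^\x00-\x7F]/ ) ? 1 : 0;
}

my $bin     = "\xf1\xf2\xf3";    # binary data, UTF8 flag off
my $ascii_u = "x";
utf8::upgrade($ascii_u);         # ASCII content, UTF8 flag on

my $bin_a = $bin . "x";          # stays in byte storage
my $bin_u = $bin . $ascii_u;     # silently upgraded by the concatenation

print "bin_a wide? ", (is_wide_string_nobytes($bin_a) ? "yes" : "no"), "\n";
print "bin_u wide? ", (is_wide_string_nobytes($bin_u) ? "yes" : "no"), "\n";
```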




perlbug-followup at perl

Aug 12, 2013, 1:53 PM

Post #7 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Sun Aug 11 19:56:53 2013, rjbs wrote:

Another edge case where Perl's Unicode handling does not work well is filenames.

The code below prints that two strings are the same. It then tries to open a file with
the name given by one string, and to reopen the file with the name given by the
second string. The second attempt fails.

So that is another case where a user might want to use bytes::xxx,
_is_utf8() etc. to access perl string internals, because the internals
matter in this case.

(By the way, I have a program which has to deal with non-UTF filesystems, and this
makes things even worse. It has to pass _binary_ strings representing
filenames across the whole program, and I have to be very careful never to
merge them with ASCII strings that have the utf-8 bit set or with Unicode strings.)

========

use Encode;
use utf8;
use strict;
use warnings;

my $u = "\x{442}\x{435}\x{441}\x{442}"; # same as "тест"

# plain binary data, for example part of binary file (say, JPEG)
my $bin = "\xf1\xf2\xf3";

my $ascii = "x";
my ($ascii_u, undef) = split(/ /, "$ascii $u");
die unless utf8::is_utf8($ascii_u);

print "original bin length:\t";
print length($bin) . "\t" . bytes::length($bin) ."\n";

my $bin_a = $bin.$ascii;

print "bin_a length:\t";
print length($bin_a) . "\t" . bytes::length($bin_a) ."\n";

my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG

die unless $bin_u eq $bin_a;
print "bin_u and bin_a are same!\n";

open my $f, ">", "$bin_u.tmp";
binmode $f;
syswrite $f, "TEST";
close $f;

open $f, "<", "$bin_a.tmp" or die "file not found $!";
__END__
original bin length: 3 3
bin_a length: 4 4
bin_u and bin_a are same!
file not found No such file or directory at poc5.pl line 39.
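
One way to guard against this, sketched under the assumption that filenames in the program are meant to be byte strings: force byte storage right before the name reaches open(). The helper name byte_filename is hypothetical. utf8::downgrade without its FAIL_OK argument dies on characters above 0xFF, so an accidental merge with a genuinely wide string is caught early.

```perl
use strict;
use warnings;

# Hypothetical helper: return the filename in byte storage, so the name
# passed to open() does not depend on the UTF8 flag.
sub byte_filename {
    my ($name) = @_;
    utf8::downgrade($name);    # dies if the name has characters above 0xFF
    return $name;
}

my $bin     = "\xf1\xf2\xf3";
my $ascii_u = "x";
utf8::upgrade($ascii_u);

my $bin_a = $bin . "x";         # byte storage
my $bin_u = $bin . $ascii_u;    # same characters, UTF-8 storage

my $name_a = byte_filename("$bin_a.tmp");
my $name_u = byte_filename("$bin_u.tmp");

print "same characters: ", ($name_a eq $name_u ? "yes" : "no"), "\n";
print "both byte-stored: ",
    (!utf8::is_utf8($name_a) && !utf8::is_utf8($name_u) ? "yes" : "no"), "\n";
```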




perlbug-followup at perl

Aug 12, 2013, 2:17 PM

Post #8 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Mon Aug 12 13:57:25 2013, ikegami [at] adaelis wrote:
> On Mon, Aug 12, 2013 at 4:08 PM, Victor Efimov via RT <
> perlbug-followup [at] perl> wrote:
> >
> > sub try_drop_utf8_flag
> > {
> > Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
> > (bytes::length($_[0]) == length($_[0]));
> > }
>
>
> That's just C<< utf8::downgrade($_[0], 1) >>

Yes, you are right, except for one small difference.
For characters > 127 but <= 255 it works differently.
Thus it cannot be used when the strings are filenames (as in the example
above, and in another example below).

(By the way, that is exactly how I use it in my program
https://github.com/vsespb/mt-aws-glacier - read millions of filenames,
split, try to drop the utf-8 flags, and process with regexps.)

use bytes ();
use utf8;
binmode STDOUT, ":encoding(utf-8)";
use Devel::Peek;
sub try_drop_utf8_flag
{
    Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
        (bytes::length($_[0]) == length($_[0]));
}
sub do_downgrade
{
    utf8::downgrade($_[0], 1)
}
my $s = "ú";
my $s1 = $s;
try_drop_utf8_flag($s1);
my $s2 = $s;
do_downgrade($s2);
Dump($s1);
Dump($s2);


die unless $s1 eq $s2;

open my $f, ">", "$s1.tmp";
binmode $f;
syswrite $f, "TEST";
close $f;

open $f, "<", "$s2.tmp" or die "file not found $!";


__END__
SV = PVMG(0xfc00a0) at 0xfc1440
REFCNT = 1
FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x1042b90 "\303\272"\0 [UTF8 "\x{fa}"]
CUR = 2
LEN = 8
MAGIC = 0x1094090
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 1
SV = PV(0xfd6538) at 0xfc1488
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0xfdccd0 "\372"\0
CUR = 1
LEN = 8
file not found No such file or directory at bench3-poc.pl line 29.




perlbug-followup at perl

Aug 12, 2013, 2:20 PM

Post #9 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

Sorry, RT corrupted this character in the code (even RT doesn't like Latin1
chars, unlike "wide" chars such as Cyrillic!). I meant this one:
http://www.fileformat.info/info/unicode/char/da/index.htm

On Mon Aug 12 14:17:19 2013, vsespb wrote:
> my $s = "�";






perlbug-followup at perl

Aug 12, 2013, 4:04 PM

Post #10 of 10
[perl #117355] [lu]cfirst don't respect 'use bytes' [In reply to]

On Mon Aug 12 14:02:02 2013, Hugmeir wrote:
> but the way to get it is not necessarily "use bytes", but "if you don't
> want unicode semantics, encode your strings before matching"

Well, utf8::encode($bytes) will change the string. So if

a) I have an ASCII regexp,
b) I have data which is usually 7-bit ASCII and
sometimes Unicode with wide characters,
c) I want the regexp to run fast, at least when the data is ASCII, and
d) I want the code not to break if the data is not ASCII,

then utf8::encode($bytes) won't work as needed. It will damage the string if it's
Unicode. It won't be a character string anymore (and I might want to
process it after the regexp match, or use the regexp match variables).
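
The contrast can be sketched as follows, using only documented utf8:: functions. utf8::downgrade with a true FAIL_OK argument keeps the characters intact and only changes the storage when that is possible. utf8::encode replaces the characters with their UTF-8 bytes. This covers the regexp case in (a)-(d) only; the filename concern about characters 128-255 raised earlier in the thread is a separate matter.

```perl
use strict;
use warnings;

my $ascii_u = "abcz";
utf8::upgrade($ascii_u);    # ASCII data that happens to carry the UTF8 flag
my $wide = "\x{442}\x{435}\x{441}\x{442}z";

for my $s ($ascii_u, $wide) {
    my $before = $s;
    # FAIL_OK is true: returns false instead of dying when the string
    # has characters that do not fit in a byte, and leaves it unchanged.
    my $ok = utf8::downgrade($s, 1);
    printf "downgraded=%s flag=%s same_characters=%s\n",
        ($ok ? "yes" : "no"),
        (utf8::is_utf8($s) ? "on" : "off"),
        ($s eq $before ? "yes" : "no");
}

# utf8::encode changes what the string contains, not just how it is stored.
my $encoded = $wide;
utf8::encode($encoded);
printf "characters before encode: %d, after: %d\n",
    length($wide), length($encoded);
```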

> and as of
> blead, looks like ascii+utf8 now matches just as fast as plain ascii.

Yes, indeed: 5.18 is still slow, but blead is already fast.

