
pagaltzis at gmx
Nov 23, 2009, 8:34 AM
* Carl Johnstone <catalyst [at] fadetoblack> [2009-11-23 15:35]:
> Aristotle Pagaltzis wrote:
> >     # everything should be bytes at this point, but just in case
> >     $response->content_length( bytes::length( $response->body ) );
> >
> > I was shocked to discover this! Any code that uses
> > bytes::length is automatically broken.
>
> Not in this case

Yes in this case.

> the HTTP spec says that the Content-Length header should
> contain the number of octets in the body. If you're sending
> UTF-8 then this is likely different to the number of characters
> in the string.

You’re right about HTTP. But there’s no room for “likelies” here:
that’s programming by coincidence. Either you want it or you
don’t, and in this case you do. But bytes::length doesn’t do
that.

Please, please don’t make statements like “not in this case”
without knowing what the thing you are talking about (in this
case bytes::length) actually does. There are enough
misconceptions about Unicode in Perl already.

Try this:

    use 5.010;
    require utf8;
    require bytes;
    require Data::Dump;
    $a = $b = chr(0xff);
    utf8::upgrade($a);      # variable-width internal format
    utf8::downgrade($b);    # packed byte internal format
    say Data::Dump::pp($a, $b);
    say $a eq $b ? 'ok' : 'not ok';
    say length($a) == length($b) ? 'ok' : 'not ok';
    say bytes::length($a) == bytes::length($b) ? 'ok' : 'not ok';

It will print the following:

    ("\xFF", "\xFF")
    ok
    ok
    not ok

In other words, there are two entirely identical strings here;
their internal buffers just happen to be in different formats:
one is a packed byte array, the other is a variable-width integer
array. And then bytes::length goes and *IGNORES* which is which,
and just blithely looks at the size of the buffer without caring
about the (ill-named) UTF8 flag, even though both strings, when
printed, will produce the *exact same output*. Because they are
IDENTICAL.

In Perl, there are ONLY strings. Semantically, there are no “byte
strings and character strings”. Just strings.
All strings are the same: character sequences, where a character
is an arbitrarily large integer value. That’s *all*.

Now there are, on the level of the perl implementation, two
string formats: packed byte sequence strings (which are fast but
can only store codepoints < 0x100) and variable-width integer
sequence strings (which are slower but can store all codepoints).
However, from the Perl level, there is NO difference between
those two kinds of string.

If you have binary data in a string, then it’s simply a string
that happens to consist of characters all < 0x100. Note how I
didn’t talk about whether it’s a byte array string or a
variable-width integer string? That’s because that doesn’t
matter. Observe:

    my $jpeg = do {
        # :raw keeps line-ending translation out of the picture
        open my $fh, '<:raw', 'some-image.jpeg' or die $!;
        local $/;
        <$fh>;
    };

    utf8::upgrade( $jpeg ); ### <------ note here

    open my $fh, '>:raw', 'output.jpeg' or die $!;
    print $fh $jpeg;
    close $fh or die $!;

If you run this code, the end result will be two EXACTLY
IDENTICAL files. Because the contents in $jpeg mean the SAME
THING after upgrading as they did before. You cannot tell from
just looking at a string whether it contains binary data or text.

However, if you ask for its bytes::length( $jpeg ), you’ll get
the wrong number! Because bytes.pm is broken! As designed!

Note that up- or downgrading a string like this will happen at
pretty random points in your code, and it won’t be obvious where
or why. It’s not actually random of course, but the point where
it happens might be hidden in some module several layers down
your call stack. It might happen only some of the time. Which is
perfectly fine, because the distinction between these two kinds
of strings is an implementation detail in perl! Just like when
you print numbers in Perl, and perl stringifies the scalar,
caches the result of that conversion in the PV slot of the
scalar, and never bothers to let you know. Because you don’t need
to know.
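
As a minimal sketch of what this means for something like
Content-Length (this is not the code under discussion;
octet_length is a hypothetical helper name and the sample string
is made up): encode to octets explicitly, then use plain
length(), which gives the same answer whichever internal format
the string happens to be in. The utf8::downgrade call with a true
second argument is only there as a sanity check that every
character really is < 0x100.

```perl
use strict;
use warnings;
use Encode ();
require bytes;    # loaded only to show what bytes::length reports

# Hypothetical helper: number of octets in a string that is
# supposed to hold octets (all characters < 0x100).
sub octet_length {
    my ($body) = @_;    # a copy, so the caller's string is untouched
    utf8::downgrade( $body, 1 )
        or die "wide character in body; encode it first\n";
    return length $body;
}

my $text   = "caf\x{e9}";                        # made-up sample
my $octets = Encode::encode( 'UTF-8', $text );   # c a f 0xC3 0xA9

my $before = octet_length($octets);              # 5

# Stand-in for some module further down the call stack.
utf8::upgrade($octets);

my $after  = octet_length($octets);              # still 5
my $buffer = bytes::length($octets);             # 7, the internal buffer size

print "before=$before after=$after bytes::length=$buffer\n";
```

The two octet_length results agree because length() respects the
string's internal format; bytes::length reports 7 after the
upgrade because each of the two octets >= 0x80 takes two bytes in
the variable-width buffer.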

So it might happen that you properly Encode::encode’d your
string, but it’s passed to some routine somewhere in the guts of
some module you are using, which still causes it to get upgraded
in the course of some operation. And that’s just fine. It’s not a
bug, just like it’s not a bug that perl silently stringifies
numbers and silently numifies strings. The resulting output will
always be correct in the end because every operation knows to pay
attention to all the IOK, POK, etc. flags in scalars that keep
track of these conversions.

But bytes::length doesn’t! It breaks the fixed-/variable-width
abstraction by blithely ignoring the UTF8 flag. (Which should
have been named UOK, to go with the IOK, POK, etc. flags that
scalars already have.)

It’s as if, when you asked for the length of the number 65, and
the scalar had never been stringified before, Perl didn’t bother
to stringify it, and just looked at the length of the IV slot
(integer value), and because you are running a 32-bit perl, the
answer you got was 4. Whereas if you had stringified the scalar,
then instead the answer would be 2 because "65" is two characters
long. And maybe your code is written such that it sometimes
happens to stringify the scalar (e.g. by printing it in a
diagnostic message) and sometimes not. Then you get to play a
lottery! Fun!

Conclusion of this much longer rant than I planned to write:

If you’re using bytes.pm or any of its functions, your code is
BROKEN. Unconditionally.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

_______________________________________________
List: Catalyst [at] lists
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst [at] lists/
Dev site: http://dev.catalyst.perl.org/