
arnon at back2front
Jun 14, 2012, 9:34 PM
Post #8 of 9
Thanks very much Josh for investigating this - it saved me some time narrowing down the issue. Even so, I spent quite a lot of time working out a solution for my needs, and I still don't think it is generalizable as-is. However, in case someone else wants to give it a crack, I provide details below.

On 2012-06-05 19:30, Josh Chamas wrote:
> doing this is where we have a problem:
>
> <% print Encode::decode('ISO-8859-1',"\xE2"); %>
>
> and immediately in the Apache::ASP::Response::Write() method the data
> has already been converted incorrectly

The fact that such a simple use of Encode causes an issue is a little surprising. Surely others are using Apache::ASP in multi-language environments - is no one using Encode this way? How are others coping with this limitation right now?

> Its as if by merely going through the tied interface that data goes
> through some conversion process.

Not quite, as the same results happen without a tie'd interface. The "use bytes" pragma is what causes the conversion (see test script below).

> Apache::ASP::Response does a "use bytes" which is to deal with the
> output stream correctly I believe this is around content length
> calculations.
> I think this is fine here, and turning this off makes things worse for
> these examples.

It looks like "use bytes" is now strongly discouraged for anything other than debugging, and should indeed be removed. The documentation doesn't mention any trivial substitute. However, this pragma mostly just overrides some built-in functions with byte-oriented versions. So I made the following changes to Response.pm:

- changed use bytes => no bytes (just import the namespace)
- changed all occurrences of length() => bytes::length()

This resolved the mixed-encoding issue originally posted, but introduced a new (more manageable) issue.

For debugging purposes, I peeked at the "UTF-8 flag" (Perl's internal flag marking a string as stored internally as UTF-8, which in practice usually means the string has been decoded).
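For anyone who wants to reproduce the flag peeking and the length() => bytes::length() change outside of Apache::ASP, here is a minimal sketch. The describe() helper is a hypothetical name used only for illustration; it is not part of Response.pm. It assumes nothing beyond core Perl and Encode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();
no bytes;    # makes bytes::length() available without enabling byte semantics

# Hypothetical helper (illustration only): report the character length,
# the internal byte length, and the state of the UTF-8 flag of a string.
sub describe {
    my ($str) = @_;
    return sprintf "chars=%d bytes=%d flag=%s",
        length($str), bytes::length($str),
        ( utf8::is_utf8($str) ? "on" : "off" );
}

my $raw     = "\xE2";                                    # one octet, flag off
my $decoded = Encode::decode( 'ISO-8859-1', "\xE2" );    # same character, flag on

print "raw:     ", describe($raw),     "\n";    # chars=1 bytes=1 flag=off
print "decoded: ", describe($decoded), "\n";    # chars=1 bytes=2 flag=on
```

Both strings hold the same single character, but bytes::length() reports the internal storage size, which is why it differs once the flag is on.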
This flag should be transparent in principle, but it helped make sense of the behaviour of Apache::ASP. Results of testing are summarized as follows:

1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same results with the "use bytes" pragma turned on:
   - For any string with the UTF-8 flag off, output is correctly encoded.
   - Any string with the flag on is (double-)encoded as UTF-8, regardless of the actual output encoding.

2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:
   - The UTF-8 flag does not affect output - it is correctly encoded in every case.
   - However, an interesting test case is that of the double-encoding problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html). This case is indicative of bad code, so is not a concern here, but it illustrates how a tie'd filehandle differs from plain STDOUT. In this case, a single "wide character" double-encodes the entire output (with buffering on, this can be the entire page), instead of just the string.
   - These test cases are demonstrated by the script below.

3. Testing Apache::ASP with "no bytes" produces different results from the command-line (asp-perl) version, as well as different results from Perl/CGI running on Apache. This suggests an interaction effect between Apache and Apache::ASP (both are required to produce these results).
   - With the UTF-8 flag off, output is correctly encoded as before.
   - However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the entire output is double-encoded. This result is similar to the double-encoding problem in the previous test case, except that it doesn't require a "wide character" - any string with the UTF-8 flag on will do.

This test script demonstrates all but the last test case:

    #!/usr/bin/perl
    use Encode;

    foreach ( "STDOUT", "tie_use_bytes", "tie_no_bytes" ) {
        print "$_: ";
        tie *FH, $_ if ! /^S/;
        my $STDOUT = select ( FH ) if ! /^S/;
        print "\x{263a}", Encode::decode('ISO-8859-1',"\xE2"), "\xE2";
        print "\n";
        close ( FH ) if ! /^S/;
        select ( $STDOUT ) if ! /^S/;
    }

    use strict;

    package tie_use_bytes;
    use bytes;
    sub TIEHANDLE { bless {}, shift; }
    sub PRINT { shift()->{out} .= join ( $,, @_ ); }
    sub CLOSE { print STDOUT delete ( shift()->{out} ); }

    package tie_no_bytes;
    no bytes;
    sub TIEHANDLE { bless {}, shift; }
    sub PRINT { shift()->{out} .= join ( $,, @_ ); }
    sub CLOSE { print STDOUT delete ( shift()->{out} ); }

    # Output:
    ##################
    Wide character in print at ...
    STDOUT: ☺ââ          # STDOUT output is correct in all cases
    tie_use_bytes: ☺ââ   # with "use bytes", the UTF-8-flagged 2nd character is double-encoded
    Wide character in print at ...
    tie_no_bytes: ☺ââ    # with "no bytes", the output is correct, but a "wide character" double-encodes the entire string because of the way the tie'd file handle is implemented
    #########################

By the way, if it's getting difficult to wrap your head around this, you're not alone.

At this point, I peeked at the $Response->{out} data buffer, and could see that it was encoded correctly. However, the output from Apache (when the UTF-8 flag is on) was not correct, suggesting that Apache is doing something to encode the string in this case. I decided therefore to address the problem by turning off the UTF-8 flag. The most fault-tolerant method I managed to come up with to do this was the following:

    ${$Response->{BinaryRef}} = Encode::encode ( 'ISO-8859-1', ${$Response->{BinaryRef}},
        sub{ Encode::encode ( 'UTF-8', chr ( shift() ) ) } )
        if ! grep ( /^utf8$/, PerlIO::get_layers ( STDOUT ) );

which can go at the top of the $Response->Flush() method, or in global.asa/Script_OnFlush().

With this solution I can now modify Apache::ASP's output encoding (e.g., using binmode ( STDOUT );), as originally desired, and the output appears correct in all my test cases.
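The flag-clearing step can be tried in isolation with a small standalone script. This is only a sketch of the same Encode::encode call with a coderef fallback, applied to an ordinary string rather than to $Response->{BinaryRef}; the to_latin1_octets() helper is a hypothetical name, not something Apache::ASP provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (illustration only): encode a character string to
# ISO-8859-1 octets. Any character that ISO-8859-1 cannot represent is
# replaced by its UTF-8 octets via the coderef fallback.
sub to_latin1_octets {
    my ($text) = @_;    # work on a copy so the caller's string is untouched
    return Encode::encode( 'ISO-8859-1', $text,
        sub { Encode::encode( 'UTF-8', chr( shift() ) ) } );
}

# Same mix as the test script: a wide character, a flagged "\xE2", a raw "\xE2".
my $buffer = "\x{263a}" . Encode::decode( 'ISO-8859-1', "\xE2" ) . "\xE2";
printf "flag before: %s\n", utf8::is_utf8($buffer) ? "on" : "off";

my $octets = to_latin1_octets($buffer);
printf "flag after:  %s\n", utf8::is_utf8($octets) ? "on" : "off";

# Show the resulting octets in hex for inspection.
printf "octets: %s\n", join ' ', map { sprintf '%02X', ord } split //, $octets;
```

On this sample the result should be a plain octet string with the flag off: the wide character comes out as its three UTF-8 octets (E2 98 BA), followed by one E2 octet for each "\xE2".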
--
Arnon Weinberg
www.back2front.ca