
Mailing List Archive: Catalyst: Users

Avoiding UTF8 in Catalyst

 

 



schaefer at alphanet

Nov 21, 2009, 2:22 PM

Post #1 of 30 (3571 views)
Avoiding UTF8 in Catalyst

Hi,

my goal: no UTF8, in short:

- all the perl code, all the data files, all the template files and
the UNIX locale are all in ISO-8859-1

- the HTML result should be in ISO-8859-1
(Content-Type: text/html; charset=iso-8859-1)

- the Content-Length: should be correct.

First, I modified lib/MyApp/View/TT.pm as follows:

__PACKAGE__->config(TEMPLATE_EXTENSION => '.tt',
                    DEFAULT_ENCODING   => 'ISO-8859-1',
                    WRAPPER            => 'wrapper.tt');

Apparently all diacritic characters are expanded into HTML entities,
which is functional but not optimal. However, with FormFu, this
unnecessary expansion doesn't happen, which is fine.

I got the following result:

- the HTML data is in ISO-8859-1 (or as HTML entities, which is
acceptable as a work-around) as wanted
- however the HTTP header charset is UTF8

After looking at line 45 of
/usr/local/share/perl/5.8.8/Catalyst/Action/RenderView.pm
it looks like the utf-8 charset HTTP header is hardcoded. I have thus modified
my lib/MyApp/Controller/Root.pm to do the following in
end : ActionClass('RenderView'):

$c->response->content_type('text/html; charset=iso-8859-1');

With this, I got the following result:

- the HTML data is in ISO-8859-1 as wanted (no change, logical)
- the HTTP header charset is now the correct iso-8859-1
- however, the Content-Length: sent is wrong.

After investigating, the Content-Length: is off by one for each non-7-bit
character, as if the standard iso-8859-1 byte stream was sent as
is, but was internally converted to UTF-8 just to generate
a wrong byte count. Very strange. Normally that process should really
output something wrong or generate an error in the conversion. It
doesn't.
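
The off-by-one-per-character arithmetic can be reproduced outside Catalyst. This is only a minimal sketch: it assumes the body gets switched to perl's variable-width internal format somewhere along the way, and the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample body: 7 characters, 2 of them above 0x7F (e-acute, a-grave).
my $body = "d\xE9j\xE0 vu";

# Count the non-7-bit characters.
my $high = () = $body =~ /[^\x00-\x7F]/g;

# Same string, but stored in perl's variable-width internal format.
my $upgraded = $body;
utf8::upgrade($upgraded);

# Size of the internal buffer, which is what a byte-level count sees.
my $buffer_size = do { use bytes; length $upgraded };

printf "characters: %d, non-7-bit: %d, internal buffer: %d\n",
    length($upgraded), $high, $buffer_size;
```

The buffer size exceeds the character count by exactly the number of non-7-bit characters, which matches the wrong Content-Length: seen on the wire.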

My questions:

- is there a better way to use the standard charset than to do all
of the above hacks ?

- if not, how to work-around the content length in
end : ActionClass('RenderView') ? Unfortunately, it looks like
$c->result->body is undefined at this point, and that
$c->finalize_body() doesn't do anything useful.

Version info:
Catalyst 5.80007 and 5.80013

PS: I wouldn't have noticed the Content-Length: issue if I hadn't used a
reverse proxy. With that reverse proxy, and the standalone Catalyst
server, you get 5-10 seconds hangs if the Content-Length is too big,
which is what happens with this strange UTF8 behaviour. Without it,
the size is wrong (as seen by wireshark != PageInfo Firefox), but
the WWW client seems to compensate.

PS/2: the http://www.catb.org/~esr/faqs/smart-questions.html URL doesn't
work currently, so maybe my question is unsmart.

_______________________________________________
List: Catalyst [at] lists
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst [at] lists/
Dev site: http://dev.catalyst.perl.org/


moseley at hank

Nov 21, 2009, 5:16 PM

Post #2 of 30 (3475 views)
Re: Avoiding UTF8 in Catalyst

On Sat, Nov 21, 2009 at 2:22 PM, Marc SCHAEFER <schaefer [at] alphanet> wrote:

> Hi,
>
> my goal: no UTF8, in short:
>

Can you explain why? That sounds like something to regret later down the
road.
Even if your templates have latin1 characters you can still output utf8.
Less to worry about with entity encoding, I guess.



>
> - all the perl code, all the data files, all the template files and
> the UNIX locale are all in ISO-8859-1
>

Hopefully you don't have much text in your perl code. Getting the database into
utf8 will be a win long term.



>
> - the HTML result should be in ISO-8859-1
> (Content-Type: text/html; charset=iso-8859-1)
>

You need to encode the body, too. Catalyst::Plugin::Unicode::Encoding was just
updated by Tom, IIRC.



>
> - the Content-Length: should be correct.
>
> First, I modified lib/MyApp/View/TT.pm as follows:
>
> __PACKAGE__->config(TEMPLATE_EXTENSION => '.tt',
> DEFAULT_ENCODING => 'ISO-8859-1',
> WRAPPER => 'wrapper.tt');
>

There's a DEFAULT_ENCODING option? Isn't it just ENCODING?



>
> Apparently all diacritic characters are expanded into HTML entities.
>

Where does that happen?




--
Bill Moseley
moseley [at] hank


pagaltzis at gmx

Nov 22, 2009, 5:10 AM

Post #3 of 30 (3462 views)
Re: Avoiding UTF8 in Catalyst

* Marc SCHAEFER <schaefer [at] alphanet> [2009-11-21 23:30]:
> After investigating, the Content-Length: is one off per non
> 7-bit character. As if the standard iso-8859-1 byte stream was
> sent as is, but was, internally converted to UTF-8 just for
> generating a wrong byte count. Very strange. Normally that
> process should really output something wrong or generate an
> error in the conversion. It doesn't.

No, it was not converted to UTF-8. Its internal representation
was upgraded. That’s not the same thing (and if you think it is,
you have at the very least not understood Unicode in Perl). It
should have no observable effect.

As a quick fix, you want to utf8::downgrade the $c->res->body at
the last moment before emitting the data to the wire. I’m not
sure off hand which method to wrap in the application class to do
that, though.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>



pagaltzis at gmx

Nov 22, 2009, 1:21 PM

Post #4 of 30 (3454 views)
Re: Avoiding UTF8 in Catalyst

Hi Marc,

* Marc SCHAEFER <schaefer [at] alphanet> [2009-11-22 15:05]:
> On Sun, Nov 22, 2009 at 02:10:29PM +0100, Aristotle Pagaltzis wrote:
> > As a quick fix, you want to utf8::downgrade the $c->res->body
> > at the last moment before emitting the data to the wire.
>
> Interestingly, the data arrives on the other side as a stream
> of bytes, which are iso-8859-1. So it means that Perl knows or
> is taught to represent this internal representation correctly
> when print'ing. But not while counting (data is correct
> Content-Length: is too high).
>
> Very funny. I will wait a bit if there are any other comments
> on this issue, and maybe try to see what is really happening
> just before going on the wire, because as I can see, there must
> be something wrong there.

ah, d’oh, I see. You wrote that the count is off, but didn’t say
whether it’s too big or too small, and I assumed the wrong way
around, even though it’s now obvious to me as well that this
doesn’t make any sense given your problem description.

So I went trawling the Catalyst sources and found what appears
to be the offending line. From finalize_headers in Catalyst.pm:

# everything should be bytes at this point, but just in case
$response->content_length( bytes::length( $response->body ) );

I was shocked to discover this! Any code that uses bytes::length
is automatically broken.

It looks like your response body is either upgraded at some point
during the request or starts life as a multibyte string, and then
this code is of course going to count the wrong length.

To work around it for now, in your case, it should suffice to put
a `before` modifier on `finalize_headers` to utf8::downgrade the
response body.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>



catalyst at fadetoblack

Nov 23, 2009, 6:29 AM

Post #5 of 30 (3422 views)
Re: Re: Avoiding UTF8 in Catalyst

Aristotle Pagaltzis wrote:
> # everything should be bytes at this point, but just in case
> $response->content_length( bytes::length( $response->body ) );
>
> I was shocked to discover this! Any code that uses bytes::length
> is automatically broken.

Not in this case, the HTTP spec says that the Content-Length header should
contain the number of octets in the body. If you're sending UTF-8 then this
is likely different to the number of characters in the string.

Carl




moseley at hank

Nov 23, 2009, 7:42 AM

Post #6 of 30 (3423 views)
Re: Avoiding UTF8 in Catalyst

On Mon, Nov 23, 2009 at 2:01 AM, Marc SCHAEFER <schaefer [at] alphanet> wrote:
> On Sat, Nov 21, 2009 at 05:16:24PM -0800, Bill Moseley wrote:
>> > Apparently all diacritic characters are expanded into HTML entities.
>>
>> Where does that happen?
>
> It looks like it's TT::View's htmlentity which does this, not just for
> <> and friends. It's not a big issue for me. Maybe it could even be
> parametered.
>

Still not following. You are talking about Catalyst::View::TT?


BTW -- when looking at C::V::TT I see where you got that DEFAULT_ENCODING
from -- it's documented in C::V::TT.

As far as I know there's no such setting in Template Toolkit. There's
"ENCODING" to specify the encoding of your templates.

If your templates are 8859-1 with 8 bit characters my suggestion would be to
convert them to utf-8 and set ENCODING to utf8 for the templates, and move
toward utf8 everywhere. Make sure you use the plugin to decode and
encode.





--
Bill Moseley
moseley [at] hank


schaefer at alphanet

Nov 23, 2009, 8:18 AM

Post #7 of 30 (3435 views)
Re: Avoiding UTF8 in Catalyst

On Mon, Nov 23, 2009 at 07:42:06AM -0800, Bill Moseley wrote:
> Still not following. You are talking about Catalyst::View::TT?

It appears that the latin1 -> htmlentities conversion is done by
View::TT's html_entity filter, e.g.:

[% FOREACH h IN cols %]<td>[% b.$h | html_entity %]</td>[% END %]

This is perfectly OK, even if not strictly required. I thought it was
something else doing that, but it isn't.

> BTW -- when looking at C::V::TT I see where you got that DEFAULT_ENCODING
> from -- it's documented in C::V::TT.

The simple fact that html_entity above changes é (iso-8859-1) into &eacute;
means that something must have understood I am using iso-8859-1, which
is good. But you seem to be right:

> As far as I know there's no such setting in Template Toolkit. There's
> "ENCODING" to specify the encoding of your templates.

I am using:

package MyApp::View::TT;

use strict;
use base 'Catalyst::View::TT';

__PACKAGE__->config(TEMPLATE_EXTENSION => '.tt',
                    FILTERS            => { 'latex' => \&latex },
                    DEFAULT_ENCODING   => 'iso-8859-1',
                    WRAPPER            => 'wrapper.tt');

You are however right that removing the DEFAULT_ENCODING above
doesn't change anything. Replacing it by ENCODING => 'utf-8'
creates a charset conversion bug (which is expected). Replacing with
ENCODING => 'iso-8859-1' doesn't change anything. So I can safely
assume that as usually expected, iso-8859-1 is the default. I now
removed this specification altogether.

> If your templates are 8859-1 with 8 bit characters my suggestion would be to
> convert them to utf-8 and set ENCODING to utf8 for the templates, and move
> toward utf8 everywhere. Make sure you use the plugin to decode and
> encode.

Again, utf8 is out of the question here: be it in the source file, the
database, or the output. UTF-8 is unacceptable in our environment.

My problem (Catalyst sending iso-8859-1 data to the browser, but having
a wrong Content-Length: as if counting the bytes from the UTF-8
equivalent (or Perl Unicode upgraded string as mentioned in a separate
mail by Aristotle Pagaltzis)) was solved by adding the following to MyApp.pm:

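before 'finalize_headers' => sub {
    my $c = shift;

    if ($c->response) {
        # Switch the body back to the packed-byte internal format
        # before Catalyst computes Content-Length from it.
        my $s = $c->response->body;
        utf8::downgrade($s);
        $c->response->body($s);
    }
};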

There is still apparently something wrong: there is absolutely no reason
why a Perl Unicode string should be used, but I was unable to determine
why it was created (upgraded) in the first place.

The fact is that counting bytes from the Perl Unicode upgraded string is
wrong when using ISO-8859-1.

Maybe Catalyst dropped any support for non UTF-8 charset. By doing that
it also dropped any support for any charset having a bytesize different
than the Perl Unicode upgraded string internal format, apparently.

But I am no expert on this.





pagaltzis at gmx

Nov 23, 2009, 8:34 AM

Post #8 of 30 (3426 views)
Re: Avoiding UTF8 in Catalyst

* Carl Johnstone <catalyst [at] fadetoblack> [2009-11-23 15:35]:
> Aristotle Pagaltzis wrote:
> > # everything should be bytes at this point, but just in case
> > $response->content_length( bytes::length( $response->body ) );
> >
> > I was shocked to discover this! Any code that uses
> > bytes::length is automatically broken.
>
> Not in this case

Yes in this case.

> the HTTP spec says that the Content-Length header should
> contain the number of octets in the body. If you're sending
> UTF-8 then this is likely different to the number of characters
> in the string.

You’re right about HTTP.

But there’s no room for “likelies” here: that’s programming by
coincidence. Either you want it or you don’t, and in this case
you do. But bytes::length doesn’t do that.

Please please don’t make statements like “not in this case”
without knowing what the thing you are talking about, i.e.
in this case bytes::length, does. There are enough misconceptions
about Unicode in Perl already.

Try this:

use 5.010; require utf8; require bytes; require Data::Dump;
$a = $b = chr(0xff);
utf8::upgrade($a);
utf8::downgrade($b);
say Data::Dump::pp $a, $b;
say $a eq $b ? 'ok' : 'not ok';
say length($a) == length($b) ? 'ok' : 'not ok';
say bytes::length($a) == bytes::length($b) ? 'ok' : 'not ok';

It will print the following:

("\xFF", "\xFF")
ok
ok
not ok

In other words, there are two entirely identical strings here,
their internal buffers just happen to be in different formats:
one is a packed byte array, the other is a variable-width integer
array. And then bytes::length goes and *IGNORES* which is which,
and just blithely looks at the size of the buffer without caring
about the (ill-named) UTF8 flag – even though both strings, when
printed, will produce the *exact same output*. Because they are
IDENTICAL.

In Perl, there are ONLY strings. Semantically, there are no “byte
strings and character strings”. Just strings. All strings are the
same: character sequences, where a character is an arbitrarily
large integer value. That’s *all*.

Now there are, on the level of the perl implementation, two
string formats: packed byte sequence strings (which are fast but
can only store codepoints < 0x100) and variable-width integer
sequence strings (which are slower but can store all codepoints).

However, from the Perl level, there is NO difference between
those two kinds of string. If you have binary data in a string,
then it’s simply a string that happens to consist of characters
all < 0x100. Note how I didn’t talk about whether it’s a byte
array string or a variable-width integer string? That’s because
that doesn’t matter. Observe:

my $jpeg = do {
    open my $fh, '<', 'some-image.jpeg' or die $!;
    local $/;
    <$fh>;
};

utf8::upgrade( $jpeg ); ### <------ note here

open my $fh, '>', 'output.jpeg' or die $!;
print $fh $jpeg;

If you run this code, the end result will be two EXACTLY IDENTICAL
files. Because the contents in $jpeg mean the SAME THING after
upgrading as they did before. You cannot tell from just looking
at a string, whether it contains binary data or text.

However, if you ask for its bytes::length( $jpeg ), you’ll get
the wrong number! Because bytes.pm is broken! As designed!

Note that up- or downgrading a string like this will happen at
pretty random points in your code, and it won’t be obvious where
or why. It’s not actually random of course, but the point where
it happens might be hidden in some module several layers down
your call stack. It might happen only some of the time. Which is
perfectly fine, because the distinction between these two kinds
of strings is an implementation detail in perl! Just like when
you print numbers in Perl, and perl stringifies the scalar,
caches the result of that conversion in the PV slot of the
scalar, and never bothers to let you know.

Because you don’t need to know.

So it might happen that you properly Encode::encode’d your
string, but it’s passed to some routine somewhere in the guts of
some module you are using, which still causes it to get upgraded
in the course of some operation. And that’s just fine. It’s not
a bug, just like it’s not a bug that perl silently stringifies
numbers and silently numifies strings. The resulting output will
always be correct in the end because every operation knows to pay
attention to all the IOK, POK, etc flags in scalars that keep
track of these conversions.

But bytes::length doesn’t! It breaks the fixed-/variable width
abstraction by blithely ignoring the UTF8 flag. (Which should
have been named UOK, to go with the IOK, POK, etc flags that
scalars already have.) It’s as if, when you asked for the length
of the number 65, and the scalar had never been stringified
before, Perl didn’t bother to stringify it, and just looked at
the length of the IV slot (integer value), and because you are
running a 32-bit perl, the answer you got was 4. Whereas if you
had stringified the scalar, then instead the answer would be
2 because "65" is two characters long. And maybe your code is
written such that it sometimes happens to stringify the scalar
(eg. by printing it in a diagnostic message) and sometimes not.
Then you get to play a lottery! Fun!

Conclusion of this much longer rant than I planned to write:

If you’re using bytes.pm or any of its functions, your code is
BROKEN. Unconditionally.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>



pagaltzis at gmx

Nov 23, 2009, 8:43 AM

Post #9 of 30 (3420 views)
Re: Avoiding UTF8 in Catalyst

* Marc SCHAEFER <schaefer [at] alphanet> [2009-11-23 17:20]:
> On Mon, Nov 23, 2009 at 07:42:06AM -0800, Bill Moseley wrote:
> >Still not following. You are talking about Catalyst::View::TT?
>
> It appears that the latin1 -> htmlentities conversion is done
> by View:TT's htmlentity, e.g.:
>
> [% FOREACH h IN cols %]<td>[% b.$h | html_entity %]</td>[% END %]
>
> This is perfectly OK, even if not strictly required. I thought
> it was something else doing that, but it isn't.

If you use the `html` filter instead of `html_entity`, it will
escape only the five characters that have to be.

> There is still apparently something wrong: there is absolutely
> no reason why a Perl Unicode string should be used, but I was
> unable to determine why it was created (upgraded) in the first
> place.

There is no reason why such a string should NOT be used either.
The meaning of the string doesn’t change. It’s an implementation
detail in perl whether the string has been upgraded or not.

The bug is that bytes::length is being used to get its length.

> The fact is that counting bytes from the Perl Unicode upgraded
> string is wrong when using ISO-8859-1.

Using bytes::length is ALWAYS wrong. No really, it’s ALWAYS
wrong. (See the long rant in the other mail I just sent for an
explanation.)

> Maybe Catalyst dropped any support for non UTF-8 charset. By
> doing that it also dropped any support for any charset having
> a bytesize different than the Perl Unicode upgraded string
> internal format, apparently.

It’s just plain a bug in Catalyst that it’s using bytes::length.

I had an IRC convo with Tomas Doran last night and explained the
problem to him. He knocked out some tests for the broken
behaviour. It should be all fixed in the next release, and then
you can upgrade and throw away that `before finalize_headers`
workaround.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>



catalyst at fadetoblack

Nov 23, 2009, 9:38 AM

Post #10 of 30 (3421 views)
Re: Re: Avoiding UTF8 in Catalyst

Aristotle Pagaltzis wrote:
> But there’s no room for “likelies” here: that’s programming by
> coincidence.

The "likely" was correct.

When using UTF-8 whether the length of the string is different in bytes and
characters depends entirely on what the contents of the string are. Given a
particular string I could tell you exactly whether they should match, but in
the general case all I can say is that it's *likely* to be different.

In any case that's an argument about English :-)

> Either you want it or you don’t, and in this case
> you do. But bytes::length doesn’t do that.
>
> Please plese don’t make statements like “not in this case”
> without knowing what the thing you are talking about does, i.e.
> in this case bytes::length, does. There are enough misconceptions
> about Unicode in Perl already.

As far as the usage of bytes::length goes: yes, I agree with you that the code is
wrong as it's taking the byte length of perl's internal representation -
which happens to be utf-8 and whilst correct in that case, isn't for any
other character set and shouldn't be relied upon.

You *do* have to take a byte length of the string in the destination
character set though, so I'm interested in what the correct solution would
be.

Carl




peter at dragonstaff

Nov 23, 2009, 9:42 AM

Post #11 of 30 (3430 views)
Re: Avoiding UTF8 in Catalyst

>
> The fact is that counting bytes from the Perl Unicode upgraded string is
> wrong when using ISO-8859-1.
>
> Maybe Catalyst dropped any support for non UTF-8 charset. By doing that
> it also dropped any support for any charset having a bytesize different
> than the Perl Unicode upgraded string internal format, apparently.
>
> But I am no expert on this.
>
>
I would recommend using utf-8 throughout, even if you think you'll never
need it.
The reason is you can accidentally send what appear to be correct bytes even
though they are not and if you are using an English browser you will never
realise that pure chance is saving you.
You sail along sticking it in a database, in files, in templates and it
works... until one day it doesn't.
Then you are in big trouble with data that might-be-latin-1 or
might-be-utf-8 or might-be-double-encoded.
I watched a colleague spend 3 whole months fixing an internal framework like
that. Proving his fixes worked was very difficult and the historic data in
Oracle, well, there was no reliable way to unbork it.
Many of us have been through a lifetime of pain with latin-1 encoding.
Unless you are dead set on it, it's much easier to use utf-8 throughout and
add a few simple utf-8 unit tests at the input/output boundaries of your
system components.

It's not so hard: add use open ':utf8'; at the top of your code, or use
binmode $fh, ':utf8'; on open file handles.
Use the default utf-8 encoding in Template Toolkit.
When you want to print a variable, do: use Encode; print encode_utf8($foo);
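
The two output options above can be sketched as follows, together with the kind of boundary test suggested earlier. This is only an illustration under assumptions: the sample text is made up, an in-memory file stands in for a real one, and ':encoding(UTF-8)' is used as the stricter relative of the ':utf8' layer.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Made-up sample text: 7 characters, 2 of them above 0x7F.
my $text = "d\xE9j\xE0 vu";

# Option 1: put an encoding layer on the handle and print characters.
my $via_layer = '';
open my $out, '>:encoding(UTF-8)', \$via_layer or die "open: $!";
print {$out} $text;
close $out or die "close: $!";

# Option 2: encode by hand and print the resulting octets.
my $by_hand = encode_utf8($text);

# Boundary test: decoding what was written gives back the original text.
my $round_trip = decode_utf8($by_hand);

printf "layer: %d octets, by hand: %d octets, round trip %s\n",
    length($via_layer), length($by_hand),
    $round_trip eq $text ? 'ok' : 'not ok';
```

Either option produces the same octets; the point is to pick one place where encoding happens and test it there.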

Regards, Peter
http://perl.dragonstaff.co.uk


jshirley at gmail

Nov 23, 2009, 9:54 AM

Post #12 of 30 (3424 views)
Re: Re: Avoiding UTF8 in Catalyst

On Mon, Nov 23, 2009 at 8:34 AM, Aristotle Pagaltzis <pagaltzis [at] gmx> wrote:

>
> [huge snip]
>


Aristotle++

This was a fantastic explanation with examples. Even though I *think* I
understand the unicode issues in perl, I still can find myself getting
confused. These examples really help with that.

Thanks for this, it was really great.

-J

--
J. Shirley :: jshirley [at] gmail :: Killing two stones with one bird.
http://our.coldhardcode.com/jshirley/


pagaltzis at gmx

Nov 23, 2009, 10:14 AM

Post #13 of 30 (3430 views)
Re: Avoiding UTF8 in Catalyst

* Carl Johnstone <catalyst [at] fadetoblack> [2009-11-23 18:50]:
> Aristotle Pagaltzis wrote:
> > Please plese don’t make statements like “not in this case”
> > without knowing what the thing you are talking about does,
> > i.e. in this case bytes::length, does. There are enough
> > misconceptions about Unicode in Perl already.
>
> As far as the usage of bytes::length. Yes I agree with you that
> the code is wrong as it's taking the byte length of perl's
> internal representation - which happens to be utf-8 and whilst
> correct in that case, isn't for any other character set and
> shouldn't be relied upon.

No: the internal representation can be either of two formats, and
which of the two you get is not reliable, because it’s purely an
implementation detail. It’s never correct. It just accidentally
works much of the time, getting the right answer by using the
wrong method.

> You *do* have to take a byte length of the string in the
> destination character set though

Yes.

> so I'm interested in what the correct solution would be.

Encode the string to the destination encoding (not just character
set), so that the string represents an encoded octet stream, and
then look at the plain old character length of that string. That
will always give you the right answer, regardless of whether that
string is packed bytes or variable-width integers.
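
A minimal sketch of that approach, using the Encode module. The helper name content_length_for and the sample text are made up for illustration; this is not the code that went into Catalyst.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode to the destination encoding first,
# then take the ordinary length() of the encoded string.
sub content_length_for {
    my ($text, $encoding) = @_;
    my $octets = encode($encoding, $text);   # one character per octet now
    return length $octets;                   # plain length(), no bytes.pm
}

# Made-up sample text: 7 characters, 2 of them above 0x7F.
my $text = "d\xE9j\xE0 vu";

my $latin1 = content_length_for($text, 'iso-8859-1');
my $utf8   = content_length_for($text, 'UTF-8');

# The internal format of the input makes no difference to the answer.
my $upgraded = $text;
utf8::upgrade($upgraded);
my $latin1_upgraded = content_length_for($upgraded, 'iso-8859-1');

print "iso-8859-1: $latin1, UTF-8: $utf8, upgraded input: $latin1_upgraded\n";
```

The count depends only on the text and the destination encoding, never on how perl happens to store the string.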

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>



schaefer at alphanet

Nov 23, 2009, 10:24 AM

Post #14 of 30 (3423 views)
Re: Re: Avoiding UTF8 in Catalyst

On Mon, Nov 23, 2009 at 05:43:25PM +0100, Aristotle Pagaltzis wrote:
> If you use the `html` filter instead of `html_entity`, it will
> escape only the five characters that have to be.

Thank you. It works like a charm.

> I had an IRC convo with Tomas Doran last night and explained the
> problem to him. He knocked out some tests for the broken

Thank you for your time! It's nice to see the responsiveness of the
project.




moseley at hank

Nov 23, 2009, 11:08 AM

Post #15 of 30 (3428 views)
Re: Re: Avoiding UTF8 in Catalyst

On Mon, Nov 23, 2009 at 10:14 AM, Aristotle Pagaltzis <pagaltzis [at] gmx> wrote:

>
>
>
> Encode the string to the destination encoding (not just character
> set), so that the string represents an encoded octet stream, and
> then look at the plain old character length of that string. That
> will always give you the right answer, regardless of whether that
> string is packed bytes or variable-width integers.
>

Correct. And I'd argue that when it's time to set the length it should die
if the utf8 flag is still set.

When calculating the length the content should have already been encoded.

Again, at some point decoding and encoding should be core not just a
plugin. It's an important part of the request cycle.



--
Bill Moseley
moseley [at] hank


pagaltzis at gmx

Nov 23, 2009, 1:08 PM

Post #16 of 30 (3425 views)
Re: Avoiding UTF8 in Catalyst

* Bill Moseley <moseley [at] hank> [2009-11-23 20:10]:
> I'd argue that when it's time to set the length it should die
> if utf8 flag is still set.

I’m of two minds about this… it may well be that a string is
correctly encoded but has gotten upgraded, and such a string will
produce the right output anyhow. It may be too
stringent to demand that the UTF8 flag be off.

However, the string should be *downgradeable* by that time. If
there are wide characters in it at that time, then throwing an
exception is absolutely the right thing to do. But if there
aren’t, then you can’t decide based on the UTF8 flag whether the
string is correct or not.
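
A sketch of such a check, relying on the second (FAIL_OK) argument of utf8::downgrade. The function name octets_or_die is made up for illustration, and this is not what Catalyst does.

```perl
use strict;
use warnings;

# Hypothetical check: accept any body that can be represented as octets,
# whatever its internal format, and reject bodies with wide characters.
sub octets_or_die {
    my ($body) = @_;
    # With a true second argument, utf8::downgrade returns false on
    # failure instead of dying, so the error message can be our own.
    utf8::downgrade($body, 1)
        or die "Wide character in response body; encode it first\n";
    return $body;
}

# Upgraded, but every character fits in one octet: accepted.
my $upgraded = "\xFF";
utf8::upgrade($upgraded);
my $ok = octets_or_die($upgraded);

# Contains a character above 0xFF: rejected.
my $wide   = "\x{263A}";
my $failed = !eval { octets_or_die($wide); 1 };

print "upgraded body accepted, wide body ", ($failed ? "rejected" : "accepted"), "\n";
```

The test is about what the string contains, not about which flag happens to be set on it.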

As I wrote, you can read a binary file, upgrade the string, and
output it right back, and you’ll get an identical copy of the
file out of that, because a string means one and the same thing
regardless of whether it’s upgraded.

> When calculating the length the content should have already
> been encoded.

Yes.

> Again, at some point decoding and encoding should be core not
> just a plugin. It's an important part of the request cycle.

I agree.

Although it’s difficult to make it fully automatic because
browsers suck so bad about telling you what encoding the data
that they’re sending is in.

I am working on a plugin for that, but due to its dependencies
and API I don’t know if it’d be reasonable to make it core.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>



bobtfish at bobtfish

Nov 23, 2009, 2:20 PM

Post #17 of 30 (3412 views)
Re: Re: Avoiding UTF8 in Catalyst

On 23 Nov 2009, at 18:24, Marc SCHAEFER wrote:

> On Mon, Nov 23, 2009 at 05:43:25PM +0100, Aristotle Pagaltzis wrote:

>> I had an IRC convo with Tomas Doran last night and explained the
>> problem to him. He knocked out some tests for the broken
>
> Thank you for your time! It's nice to see the responsiveness of the
> project.

Now fixed in trunk:

http://dev.catalystframework.org/svnweb/Catalyst/revision?rev=11978

Please test it out for me to ensure this does fix your issue as
expected?

Cheers
t0m




bobtfish at bobtfish

Nov 30, 2009, 6:21 PM

Post #18 of 30 (3015 views)
Re: Re: Avoiding UTF8 in Catalyst

On 23 Nov 2009, at 22:20, Tomas Doran wrote:

>
> On 23 Nov 2009, at 18:24, Marc SCHAEFER wrote:
>
>> On Mon, Nov 23, 2009 at 05:43:25PM +0100, Aristotle Pagaltzis wrote:
>
>>> I had an IRC convo with Tomas Doran last night and explained the
>>> problem to him. He knocked out some tests for the broken
>>
>> Thank you for your time! It's nice to see the responsiveness of the
>> project.
>
> Now fixed in trunk:
>
> http://dev.catalystframework.org/svnweb/Catalyst/revision?rev=11978
>
> Please test it out for me to ensure this does fix your issue as
> expected?

Any chance of a confirmation that this is fixed in Catalyst for you
(or not)?

http://search.cpan.org/CPAN/authors/id/B/BO/BOBTFISH/Catalyst-Runtime-5.80014_01.tar.gz
http://search.cpan.org/CPAN/authors/id/B/BO/BOBTFISH/Catalyst-Runtime-5.80014_02.tar.gz

(both contain the fix for this issue).

Cheers
t0m




bobtfish at bobtfish

Nov 30, 2009, 6:24 PM

Post #19 of 30 (3008 views)
Permalink
Re: Re: Avoiding UTF8 in Catalyst [In reply to]

Apologies for replying to myself - late night and I'm getting confused.


On 1 Dec 2009, at 02:21, Tomas Doran wrote:
>
> Any chance of a confirmation that this is fixed in Catalyst for you
> (or not)?
>
> http://search.cpan.org/CPAN/authors/id/B/BO/BOBTFISH/Catalyst-Runtime-5.80014_01.tar.gz
> http://search.cpan.org/CPAN/authors/id/B/BO/BOBTFISH/Catalyst-Runtime-5.80014_02.tar.gz
>
> (both contain the fix for this issue).

This is _LIES_. The fix you need is in _02, but _not_ in _01.

So please test the newer package.

Sorry about the disinformation.
Cheers
t0m




jon at jrock

Dec 7, 2009, 9:34 PM

Post #20 of 30 (2681 views)
Permalink
Re: Re: Avoiding UTF8 in Catalyst [In reply to]

Sorry to dig up a very old thread, but I am very behind on email and
wanted to comment :)

* On Sun, Nov 22 2009, Aristotle Pagaltzis wrote:
> So I went trawling the Catalyst sources and found what appears
> to be the offending line. From finalize_headers in Catalyst.pm:
>
> # everything should be bytes at this point, but just in case
> $response->content_length( bytes::length( $response->body ) );
>
> I was shocked to discover this! Any code that uses bytes::length
> is automatically broken.

FWIW, we did this so that people not using Catalyst::Plugin::Unicode but
that had a Unicode string in memory would get something resembling the
correct result. The next line of code basically copies the char*
backing the SV into the response socket. Also wrong, but works for the
correct case and for many incorrect-but-still-common cases. (I know a lot
of prominent Catalyst developers that had their apps horribly wrong for
years, but still used their website to make millions of dollars. It is
nice to get everything right all the time, but sometimes you don't...)

Basically, if you are doing things right, this code will cause no harm
(as the string will be an octet stream, and bytes::length will return
the length of the octet stream you are about to send). If you are doing
things wrong, you might get the right answer (because you will get the
length of your octet stream that you are about to send, and those octets
happen to represent utf-8 or latin-1, and that's what your content-type
header said you would send). A "you fail" error would be nice... but
could be annoying in a number of cases. HTTP is a binary protocol, but
people need to send text, so there is an impedance mismatch.

Catalyst's Unicode handling has been a nightmare because of the
weird-ass things people do with "Unicode", general misunderstanding, and
backwards compatibility. (I recall someone wanting the URLs in their
app to be EUC_JP-encoded, but the form submissions to be UTF-8.)

When it's possible to break Catalyst backcompat severely, a correct
solution will be implemented. But for now, trying hard to Do The Right
Thing (instead of causing weird web browser errors) is what we're stuck
with.

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"



pagaltzis at gmx

Dec 8, 2009, 12:26 AM

Post #21 of 30 (2671 views)
Permalink
Re: Avoiding UTF8 in Catalyst [In reply to]

* Jonathan Rockway <jon [at] jrock> [2009-12-08 06:40]:
> Basically, if you are doing things right, this code will cause
> no harm

Yes it will, in some cases.

> (as the string will be an octet stream

There is no such thing as an octet stream in Perl. There are only
strings, and strings are sequences of arbitrarily large integers.
You can store an octet stream in a string, which will then be
a string that just happens to be a sequence of integers < 256,
but it’s a string like any other, not specifically an octet
sequence, and any string in Perl can have either form of internal
representation. *Usually* after encoding a string
will be a packed byte array… but that’s an implementation detail.

> and bytes::length will return the length of the octet stream
> you are about to send).

This will work only if the string is using one of the two kinds
of internal representation but not in the other.

The case the OP had was that he wanted to send Latin-1 and his
strings contained sequences of Latin-1 characters, which happen
to be interchangeable with their octet representation. His
strings were getting upgraded in the course of the code, which is
hardly uncommon with Latin-1 strings and in fact is necessary in
some cases.

It should not have mattered that they were upgraded. Their
content was semantically correct. But it did matter, because
Catalyst::Engine used bytes::length, so forced the user to care
about the internal representation.
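
To make the failure concrete, here is a minimal sketch (not taken from the Catalyst sources; the variable names are made up for illustration) of how upgrading a Latin-1 string changes what bytes::length reports while the string itself stays the same:

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling byte semantics

# Four characters, all below 256: Latin-1 "cafe" with an acute accent.
my $body = "caf\xE9";

# Same characters, but stored internally as UTF-8 after the upgrade.
my $upgraded = $body;
utf8::upgrade($upgraded);

print length($body),     " ", bytes::length($body),     "\n";   # prints "4 4"
print length($upgraded), " ", bytes::length($upgraded), "\n";   # prints "4 5"
print $body eq $upgraded ? "equal\n" : "different\n";           # prints "equal"
```

The extra byte per non-ASCII character matches the off-by-one-per-character Content-Length reported at the start of the thread.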

And you know what you said about the internal representation.

> HTTP is a binary protocol, but people need to send text, so
> there is an impedance mismatch.

HTTP is a red herring. *All* forms of I/O have this mismatch.

> But for now, trying hard to Do The Right Thing (instead of
> causing weird web browser errors) is what we're stuck with.

Nice ideal. Unfortunately you can’t. You can merely partially
paper over one set of problems – only by creating another.

I’m not saying that people who have broken apps should be told to
take a hike. It might be nice to provide the old workaround approach
as a plugin for people who depended on that behaviour. It can be
agonising to fix an app after the fact, as I know very well.
I only recently cleaned up $job app in that regard, which still
suffered the legacy of the days of Perl 5.6 and some very old
DBD::mysql versions… and therefore required cleaning a database
that contained arbitrarily mixed doubly- & triply-encoded data.
So I put it off for as long as it could wait; other things took
priority. It’s nice to have the option to wait until an opportune
moment.

But Catalyst shouldn’t in the meantime punish people who haven’t
done anything wrong for the mistakes of other people.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>



moseley at hank

Dec 8, 2009, 7:57 AM

Post #22 of 30 (2651 views)
Permalink
Re: Re: Avoiding UTF8 in Catalyst [In reply to]

On Tue, Dec 8, 2009 at 12:26 AM, Aristotle Pagaltzis <pagaltzis [at] gmx> wrote:

>
> There is no such thing as an octet stream in Perl. There are only
> strings, and strings are sequences of arbitrarily large integers.
>

Help me out here.

What I've stuck in my mind is that the poorly-named utf8 flag on Perl
strings is really the "is_character_data" flag. To get character data
it *must* be decoded on input, and the act of decoding sets that flag. Even
decoding an 8-bit character encoding will set the flag.

$ perl -MEncode -wle '$x=Encode::decode("ASCII", "hello"); print
Encode::is_utf8( $x ) ? "flag set\n" : "no flag\n";'
flag set

$ perl -MEncode -wle '$x=Encode::decode("iso-8859-1", "hello"); print
Encode::is_utf8( $x ) ? "flag set\n" : "no flag\n";'
flag set

And any strings with the flag set *must* be encoded before printing (sending
out of Perl) -- otherwise you are printing abstract "characters" that have
no meaning outside of Perl.

Plus, content_length must be the encoded length. Therefore, it's impossible
to set the content length on character data unless you encode it first.

So the code seems like it must be:

die "no clue how long the body is because it's still characters"
    if Encode::is_utf8( $response->body );
$response->content_length( length( $response->body ) );

That's not very friendly, of course. But, what other choice is there?

The correct thing would be to force all responses to have a defined content
type and then encode the characters at the end of the request (right before
setting content length).
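
As a rough sketch of that idea, assuming a hypothetical helper named encode_body (it is not part of Catalyst) and a charset that would come from the response's content type:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn character data into octets for the given
# charset and report the length of what will actually be sent.
sub encode_body {
    my ($text, $charset) = @_;

    # FB_CROAK dies on characters the charset cannot represent;
    # LEAVE_SRC keeps encode() from modifying $text.
    my $octets = Encode::encode($charset, $text,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # The encoded result is a byte string, so plain length() counts octets.
    return ($octets, length $octets);
}

my $text = "caf\x{E9}";
my (undef, $latin1_len) = encode_body($text, 'iso-8859-1');
my (undef, $utf8_len)   = encode_body($text, 'UTF-8');
print "$latin1_len $utf8_len\n";    # prints "4 5"
```

With this approach the content length depends only on the declared charset, never on how Perl happens to store the string internally.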




--
Bill Moseley
moseley [at] hank


bobtfish at bobtfish

Dec 8, 2009, 6:27 PM

Post #23 of 30 (2621 views)
Permalink
Re: Re: Avoiding UTF8 in Catalyst [In reply to]

On 8 Dec 2009, at 05:34, Jonathan Rockway wrote:
> Sorry to dig up a very old thread, but I am very behind on email and
> wanted to comment :)

No problem. Your insight as to why things are the way they are is
useful :)

>>
>> I was shocked to discover this! Any code that uses bytes::length
>> is automatically broken.
>
> FWIW, we did this so that people not using Catalyst::Plugin::Unicode
> but that had a Unicode string in memory would get something resembling
> the correct result.

When you said 'we did this', I looked at the blame history, and that
code had been there since at or before 5.50.

Did you mean 'it was done like this', 'it was explained to me to be
this way because', or 'we made a conscious decision to keep it this
way because'?

Sorry for the pedantry, but I just 'fixed' this, so I'd like to
clearly establish the grounds on which this may be considered a bad
move :)

Cheers
t0m






jon at jrock

Dec 8, 2009, 7:05 PM

Post #24 of 30 (2631 views)
Permalink
Re: Re: Avoiding UTF8 in Catalyst [In reply to]

* On Tue, Dec 08 2009, Bill Moseley wrote:
> On Tue, Dec 8, 2009 at 12:26 AM, Aristotle Pagaltzis <pagaltzis [at] gmx> wrote:
>
>> There is no such thing as an octet stream in Perl. There are only
>> strings, and strings are sequences of arbitrarily large integers.
>
> Help me out here.
>
> What I've stuck in my mind is that the poorly-named utf8 flag on Perl
> strings is really the "is_character_data" flag. To get character data
> it *must* be decoded on input, and the act of decoding sets that flag.
> Even decoding an 8-bit character encoding will set the flag.

Sorry, it doesn't mean that. latin1 text is character data, but won't
have the UTF8 flag on. The UTF8 flag doesn't mean anything more than
any of the other SV flags. All of these flags are basically performance
hacks and should be considered totally off-limits to user code. They
have absolutely no meaning there.

> And any strings with the flag set *must* be encoded before printing (sending out of
> Perl) -- otherwise you are printing abstract "characters" that have no meaning outside
> of Perl.

Any string without the flag set must also be encoded.

If text ever enters your application, it must do so through a call to
decode. If text ever leaves your application, it must do so through a
call to encode.

Your application must always, without exception, decode and encode all
text data.

It's confusing because this is sometimes done automatically by libraries
that are in use. It's confusing because sometimes it's *not* done by
the libraries that are in use :) If you're not sure if your library is
doing this for you, read the source, or ask someone :)
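
A minimal sketch of that decode-on-input, encode-on-output discipline, using only the core Encode module; the sample data and variable names are invented for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Octets as they might arrive from a Latin-1 client: "naive" with a diaeresis.
my $input_octets = "na\xEFve";

# Input boundary: octets become character data.
my $text = Encode::decode('iso-8859-1', $input_octets);

# Inside the application, work on characters only.
my $shouted = uc $text;

# Output boundary: characters become octets again.
my $output_octets = Encode::encode('iso-8859-1', $shouted);

# Show the resulting octets in hex.
print join(" ", map { sprintf "%02X", ord } split //, $output_octets), "\n";
# prints "4E 41 CF 56 45"
```

Nothing in this sketch inspects the UTF8 flag; the only things the code has to know are the encodings at the two boundaries.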

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"



jon at jrock

Dec 8, 2009, 8:13 PM

Post #25 of 30 (2628 views)
Permalink
Re: Re: Avoiding UTF8 in Catalyst [In reply to]

* On Tue, Dec 08 2009, Aristotle Pagaltzis wrote:

<snip pedantry>

> This will work only if the string is using one of the two kinds
> of internal representation but not in the other.

Exactly my point.

> The case the OP had was that he wanted to send Latin-1 and his
> strings contained sequences of Latin-1 characters, which happen
> to be interchangeable with their octet representation. His
> strings were getting upgraded in the course of the code, which is
> hardly uncommon with Latin-1 strings and in fact is necessary in
> some cases.
>
> It should not have mattered that they were upgraded. Their
> content was semantically correct. But it did matter, because
> Catalyst::Engine used bytes::length, so forced the user to care
> about the internal representation.

It wouldn't have mattered if he had Encode::encode'd to latin-1, right?
He didn't do that, so the app broke. Using length instead of
bytes::length would have fixed his app, but would have broken apps that
are using UTF-8 encodings and also forgot to Encode::encode.
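
The trade-off can be seen in a small sketch (illustrative values only, not code from Catalyst):

```perl
use strict;
use warnings;
use bytes ();
use Encode ();

# Character data containing a wide character, which the app forgot to encode.
my $unencoded = "\x{20AC}10";    # euro sign followed by "10"

# What a correct app would have put on the wire for charset=utf-8.
my $wire = Encode::encode('UTF-8', $unencoded);

print length($unencoded), "\n";          # prints 3: characters, too short for the wire
print bytes::length($unencoded), "\n";   # prints 5: bytes of the internal buffer
print length($wire), "\n";               # prints 5: octets a correct app would send
```

Here bytes::length only happens to give the right number because the internal buffer of such a string is UTF-8; for an upgraded Latin-1 string meant to be sent as Latin-1 it overcounts.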

The fact that either program works is an undocumented side-effect. But
changing that side-effect would break currently-working apps, and we
don't want to do that. Like I said, bytes::length is not there to be a
good example of Modern Perl. It is there so that people don't have to
fix their broken apps today. Catalyst got this wrong from the
beginning, so we are stuck until we have a good way to make everything
work for everyone. (I am experimenting with a new response layer on top
of Plack that should prove useful for this sort of thing.)

Wow, I can't believe I am defending backcompat. Must be that caffeine
powder...

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"

