
Mailing List Archive: ModPerl: ModPerl

CGI and character encoding

 

 



aw at ice-sa

Feb 24, 2011, 1:31 PM

Post #1 of 7 (1127 views)
CGI and character encoding

Hi.

I wonder if someone here can give me a clue as to where to look...

I am using
Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_jk/1.2.26 PHP/5.2.6-1+lenny9 with Suhosin-Patch
mod_ssl/2.2.9 OpenSSL/0.9.8g mod_apreq2-20051231/2.6.0 mod_perl/2.0.4 Perl/v5.10.0

perl -MCGI -e 'print $CGI::VERSION'
3.52

A Perl cgi-bin script running under mod_perl receives posted form parameters from a form
defined as follows:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
....
<body>
<form action="/litfdm/litfdm.pl" name="form"
enctype="multipart/form-data" charset="UTF-8" method="POST">
...
<input name="de-utf8" type="hidden" value="ÄäÖöÜü">
...

(Note: the HTML page itself has been saved as UTF-8 by a UTF-8-aware editor.)


When I retrieve the above hidden field using

my $chars = $cgi->param('de-utf8');

the variable $chars does contain the proper UTF-8 encoded *bytes* for the above string (in
other words, 2 bytes per character for these characters), but it arrives in the script
/without/ the Perl "utf8" flag set.

If I then use this value to print to a filehandle opened as such :

open(FH,'>:utf8',"myfile");
print FH $chars,"\n";

It comes out of course as .. well, I cannot type this on my keyboard, but anyone aware of
double-encoding issues can imagine the "A-tilde Copyright A-tilde squiggle.. " result.

I can of course convert it, by using

$chars = Encode::decode('utf8',$cgi->param('de-utf8'));

but it is a p.i.t.a. and I would like to know if there is a way to retrieve the posted
value directly as decoded UTF-8 characters, and if so what this depends on.
(I cannot find a setting for this in the CGI.pm module documentation, for instance.)
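
The double-encoding effect described above can be reproduced without a web server. The following is only a minimal sketch: the helper name written_bytes and the use of an in-memory filehandle in place of "myfile" are assumptions made for illustration, not part of the original script.

```perl
use strict;
use warnings;
use Encode ();

# "ÄäÖöÜü" as the raw UTF-8 bytes that arrive in the POST body
# (12 bytes, no Perl "utf8" flag).
my $bytes = "\xC3\x84\xC3\xA4\xC3\x96\xC3\xB6\xC3\x9C\xC3\xBC";

# Hypothetical helper: print $text through a :utf8 layer into an
# in-memory file and return the bytes that were actually written.
sub written_bytes {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer) or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

# Undecoded: each byte is treated as a Latin-1 character and encoded again.
my $double = written_bytes($bytes);

# Decoded first: the six characters are encoded exactly once.
my $chars   = Encode::decode('UTF-8', $bytes);
my $correct = written_bytes($chars);

printf "undecoded input: %d bytes written\n", length $double;
printf "decoded input:   %d bytes written\n", length $correct;
```

One related detail: param() returns a list in list context, so when it is passed straight into Encode::decode it is probably safer to write scalar $cgi->param('de-utf8'), so that a second value cannot end up as the CHECK argument.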


Thanks.
André

P.S.
Unfortunately, when the browser (Firefox 3.5.3) is posting this data to the server, it is
posting it as something like

...
Content-Type multipart/form-data; boundary=---------------------------326972172326727
...

-----------------------------326972172326727
Content-Disposition: form-data; name="de-utf8"

ÄäÖöÜü
-----------------------------326972172326727

which means that there is no charset header to the parts either.


mpeters at plusthree

Feb 24, 2011, 1:36 PM

Post #2 of 7 (1118 views)
Re: CGI and character encoding [In reply to]

On 02/24/2011 04:31 PM, André Warnier wrote:

> I wonder if someone here can give me a clue as to where to look...

The CGI.pm documentation talks about the -utf8 import flag which is
probably what you're looking for. But it does caution not to use it for
anything that needs to do file uploads.

--
Michael Peters
Plus Three, LP


lloyd at protectchildren

Feb 24, 2011, 2:32 PM

Post #3 of 7 (1113 views)
RE: CGI and character encoding [In reply to]

FWIW, with CGI.pm I always iterate through the params and Encode::decode each one with the appropriate encoding, with an exception for anything binary (file uploads, etc.).
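
A rough sketch of that approach, working on a plain hash so that it runs without a web request. The helper name decode_params, the field name 'attachment' and the sample values are hypothetical; in a real script the hash could be filled from CGI.pm with something like map { $_ => scalar $cgi->param($_) } $cgi->param.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode every text parameter with the given encoding
# and pass the fields named in @binary_fields through untouched.
sub decode_params {
    my ($params, $encoding, @binary_fields) = @_;
    my %is_binary = map { $_ => 1 } @binary_fields;
    my %decoded;
    for my $name (keys %$params) {
        my $value = $params->{$name};
        $decoded{$name} = $is_binary{$name}
            ? $value                                # file upload etc.: leave as bytes
            : Encode::decode($encoding, $value);    # text: bytes -> characters
    }
    return \%decoded;
}

# Sample data standing in for a posted form (assumed values).
my %raw = (
    'de-utf8'    => "\xC3\x84\xC3\xA4",   # "Ää" as UTF-8 bytes
    'attachment' => "\x89PNG\r\n",         # start of some binary upload
);
my $clean = decode_params(\%raw, 'UTF-8', 'attachment');

printf "de-utf8: %d characters\n", length $clean->{'de-utf8'};
printf "attachment: %d bytes\n",   length $clean->{'attachment'};
```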




ceeshek at gmail

Feb 24, 2011, 2:33 PM

Post #4 of 7 (1115 views)
Re: CGI and character encoding [In reply to]

Hi André,

There is a perlmonks post from a few years ago that explains one way
of automating this with CGI.pm. I've used this for several years now
without problems.

http://www.perlmonks.org/?node_id=651574

Just remember that decoding params is just one part of dealing with
utf-8. You need to worry about any data coming into or going out of
your app (reading files, retrieving from DB, sending HTML out to the
browser, etc...). The following wiki book has some great information
on how to deal with utf-8 in your perl applications (and it also
includes the CGI.pm hack from Rhesa that I linked to above in the
perlmonks link).

http://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8

Cheers,

Cees Hek




aw at ice-sa

Feb 24, 2011, 2:41 PM

Post #5 of 7 (1113 views)
Re: CGI and character encoding [In reply to]

Michael Peters wrote:
> On 02/24/2011 04:31 PM, André Warnier wrote:
>
>> I wonder if someone here can give me a clue as to where to look...
>
> The CGI.pm documentation talks about the -utf8 import flag which is
> probably what you're looking for. But it does caution not to use it for
> anything that needs to do file uploads.
>

Thanks. My workstation version of the CGI documentation is apparently outdated, and did
not mention that "pragma". The CPAN version does.
But yes, I will need file uploads too, and since there is no telling how exactly the -utf8
flag interferes with them, I think I'll stick with the p.i.t.a. method for now.

I wonder why browsers do not put a charset parameter in the multipart/form-data parts..
It would seem like a logical and MIME-conformant thing to do.


mschout at gkg

Feb 24, 2011, 8:17 PM

Post #6 of 7 (1117 views)
Re: CGI and character encoding [In reply to]

On 02/24/2011 03:31 PM, André Warnier wrote:
> Hi.
>
> I wonder if someone here can give me a clue as to where to look...

If you have a fairly recent CGI.pm, it will decode utf-8 properly for
you (even avoiding double-decoding), but there are some caveats. In
addition to what others have already said, if you are running under
mod_perl (which obviously you are), CGI.pm adds a cleanup handler (via
register_cleanup) which resets CGI.pm's global variables. One of the
variables that gets reset is the PARAM_UTF8 variable (which the -utf8
import controls). Because of this, once the cleanup handler gets
called, UTF-8 decoding gets turned off.

You have to work around this by manually making sure $CGI::PARAM_UTF8 =
1 before calling CGI->new.
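
A minimal sketch of that workaround. The wrapper name new_utf8_cgi is made up for illustration; only the -utf8 import and $CGI::PARAM_UTF8 come from the discussion above, and the file upload caveat mentioned earlier in the thread still applies.

```perl
use strict;
use warnings;
use CGI qw(-utf8);    # turns decoding on when the module is first loaded

# Hypothetical wrapper: re-assert the flag before every CGI->new, since
# under mod_perl the cleanup handler may have reset it after the
# previous request.
sub new_utf8_cgi {
    my @args = @_;
    $CGI::PARAM_UTF8 = 1;
    return CGI->new(@args);
}
```

A handler would then call my $cgi = new_utf8_cgi(); at the start of each request instead of calling CGI->new directly.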

Regards,
Michael Schout


aw at ice-sa

Feb 25, 2011, 12:48 PM

Post #7 of 7 (1100 views)
Re: CGI and character encoding [In reply to]

Thanks to Michael, Michael, Lloyd, Cees,

your answers and insights have made things clearer for me.
I think I'll use a combination of all of that for this new application we're writing.

In other words, to program "defensively", I propose to do this :

when sending the html page with the <form> :
- create the page and save it as UTF-8
- have the proper charset indications in it
- include a hidden test field with some known UTF-8 sequence (e.g. "ÄÖÜ")
- make sure that the application and the webserver send out the page with the proper
Content-type and charset (HTTP headers)

But since we still don't know what the browser (and the user) will actually do with this,

upon reception of the POST :
- get the test field and check how it was received
  a) check if it has the "is_utf8()" flag set (probably not)
  b) if not (a), check if at least it has the correct UTF-8 bytes in it (6 bytes, not 3)
  c) if neither (a) nor (b), reject with error (don't know what it is then)
  d) if not (a), but (b), then set a flag 'must_decode'

- get the other parameters, and
  - if the 'must_decode' flag is not set, leave them 'as is'
  - if the flag is set, Encode::decode('utf8',..) all received
    parameters, except for file uploads (*)
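
The check on the test field (steps a to d) could look roughly like the sketch below. The function name check_sentinel and its return values 'as_is' and 'must_decode' are assumptions chosen for illustration, and the test string is taken to be "ÄÖÜ".

```perl
use strict;
use warnings;
use Encode ();

# The known test string "ÄÖÜ" as code points, and as its 6 UTF-8 bytes.
my $EXPECTED_CHARS = "\x{C4}\x{D6}\x{DC}";
my $EXPECTED_BYTES = Encode::encode('UTF-8', $EXPECTED_CHARS);

# Hypothetical helper implementing steps (a) to (d):
# returns 'as_is' if the field arrived already decoded, 'must_decode' if it
# arrived as raw UTF-8 bytes, and dies if it is neither.
sub check_sentinel {
    my ($received) = @_;
    die "test field is missing\n" unless defined $received;

    # (a) flag set and the characters match: already decoded
    return 'as_is'
        if utf8::is_utf8($received) && $received eq $EXPECTED_CHARS;

    # (b) + (d) no flag, but the 6 expected bytes are there
    return 'must_decode'
        if !utf8::is_utf8($received) && $received eq $EXPECTED_BYTES;

    # (c) anything else, e.g. 3 Latin-1 bytes
    die "test field arrived in an unrecognized encoding\n";
}

print check_sentinel("\xC3\x84\xC3\x96\xC3\x9C"), "\n";
print check_sentinel(Encode::decode('UTF-8', "\xC3\x84\xC3\x96\xC3\x9C")), "\n";
```

The result would then decide whether the remaining text parameters go through Encode::decode or are left alone.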

That's of course in the hope that, some day, browsers will send multipart data with the
proper charset indication, and that CGI.pm will take it into account and do the right thing.



(*) although a question then is how a Polish browser would send the filename attribute,
assuming it is originally something like "Qualitätsübersicht.pdf"
