Gossamer Forum

Accented/special characters being converted to multiple characters

I am having another problem with accented characters. I have a Perl script that fetches a remote XML document and extracts two fields per record. It all works perfectly well.

The problem arises when the XML file contains special or accented characters.
Somehow, every special character gets converted to a pair of (or sometimes three) special characters.

For example:
é gets converted to Ã©
É gets converted to Ã‰
È gets converted to Ãˆ
Ê gets converted to ÃŠ
“ (left double quotation mark) gets converted to â€œ
” (right double quotation mark) gets converted to â€

The list goes on...
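
Two or three characters per accented letter is the pattern that typically appears when UTF-8 bytes are displayed as if they were a single-byte encoding such as Windows-1252. Here is a minimal sketch that reproduces it with the core Encode module, assuming the XML document is UTF-8 (the post does not say so explicitly). The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');

my $original   = "\x{E9}";                         # one character: e with acute accent
my $utf8_bytes = encode('UTF-8', $original);       # two bytes: 0xC3 0xA9
my $misread    = decode('cp1252', $utf8_bytes);    # each byte read as its own character

printf "%s became %s (%d characters)\n", $original, $misread, length $misread;
```

If this matches what the script shows, the data itself is probably intact and only the interpretation of the bytes is wrong.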

I need help.

Thanks.

Note: for the purposes of this question, accented and special mean the same thing.
Re: [Crolguvar] Accented/special characters being converted to multiple characters In reply to
Hi. Here is a good post (I use similar code in some of my scripts/plugins, especially when dealing with DMOZ);

http://www.gossamer-threads.com/...i?post=153772#153772

Hope that helps.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] Accented/special characters being converted to multiple characters In reply to
Thanks Andy but I can't use that. The script needs to run to generate part of the content for the index page of a site so speed is an issue (loading all those modules may take too long for me).

I found something that works...kinda:

my $title = pack('C*', unpack('U*', $rawTitle));

$rawTitle is the string with the odd characters and $title is the string I need to display.

The following accomplishes what I want:

Code:

use utf8;
my $rawTitle = "RÈGLEMENT SUR LA CRÉATION D'UNE ZONE SANS FRÊNES";
my $title = pack('C*', unpack('U*', $rawTitle));


This will make $title = "RÈGLEMENT SUR LA CRÉATION D'UNE ZONE SANS FRÊNES"

However (there just had to be a however...)

I get the value of $rawTitle by running a few Regular Expressions on a value I obtain by using:

Code:

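use LWP::UserAgent;
use HTTP::Request::Common qw(GET);   # provides the GET helper used below

my $ua = LWP::UserAgent->new;
my $res = $ua->request(GET $url);
my $content = $res->content;         # raw bytes as received, not decoded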

For some reason, data obtained this way will not encode/decode (or pack/unpack) properly.

I think it has to do with the fact that $rawTitle is treated as UTF8 when I hard code it in the script (because of the use utf8; directive).

I need some way to convert the content I get from $ua->request(GET $url); into something that displays properly.
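
One way to get there with the core Encode module, sketched under the assumption that the remote document is UTF-8 and the page is served as ISO-8859-1. The byte string below is a stand-in for $res->content, and $rawBytes, $text and $pageBytes are illustrative names, not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for $res->content: the UTF-8 bytes of "CREATION" with an accented E.
my $rawBytes = "CR\xC3\x89ATION";

# Turn the bytes into a Perl character string. A string literal under
# "use utf8" is already in this form, which is why it behaved differently.
my $text = decode('UTF-8', $rawBytes);

# Re-encode for a page that is served as ISO-8859-1 (assumption).
my $pageBytes = encode('ISO-8859-1', $text);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $rawBytes, length $text, length $pageBytes;
```

HTTP::Response also has a decoded_content method that decodes according to the charset declared in the response, which may remove the need for the explicit decode step if the server declares it correctly.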


Re: [Crolguvar] Accented/special characters being converted to multiple characters In reply to
Quote:
but I can't use that. The script needs to run to generate part of the content for the index page of a site so speed is an issue (loading all those modules may take too long for me).

Did you actually try benchmarking it?

Code:
$title =~ s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'ISO-8859-1'})/eg;

It doesn't add much overhead. Just give it a go... I think it's going to be your only option :(
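
A rough way to check the load-time cost, as a sketch only. Encode is used here as a stand-in; swap in whichever module is actually in question. Note that a module already loaded earlier in the same process will report close to zero.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $start = time;
require Encode;                 # loaded at run time so the cost can be measured
my $elapsed = time - $start;

printf "Loading Encode took %.4f seconds\n", $elapsed;
```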

Cheers

Andy (mod)
Re: [Andy] Accented/special characters being converted to multiple characters In reply to
Well, I tried it, but there seems to be a problem using Unicode::Map8 on Windows.
I keep getting an error:

Can't locate loadable object for module Unicode::Map8 in @INC (@INC contains: ... )

I did, however, find another way of doing it :)
Code:
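use LWP::UserAgent;
use HTTP::Request::Common qw(GET);
use Unicode::String;

my $ua = LWP::UserAgent->new;
my $rawContent = $ua->request(GET $url);
my $contentAsUTF8 = Unicode::String::utf8($rawContent->content);   # treat the fetched bytes as UTF-8
my $content = $contentAsUTF8->latin1();                              # re-encode as ISO-8859-1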

With this, I only need to use Unicode::String to do the conversion.

Of course, latin1() can only map characters that exist in ISO-8859-1, so I do end up losing some characters, such as the em dash (U+2014) and the left double quotation mark (U+201C). Since the purpose of the script is only to generate a list of titles on the fly, I can live with that. For my purposes, accents are more important than punctuation.
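
For anyone who does need the punctuation, here is a sketch of two possible workarounds using the core Encode module instead of Unicode::String (a swapped-in technique, not what the script above does). It assumes the output is an HTML page, so numeric character references are acceptable, or alternatively that the page can be served as Windows-1252. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Frenes" with a circumflex, an em dash (U+2014), and a word in curly quotes.
my $text = "Fr\x{EA}nes \x{2014} \x{201C}zone\x{201D}";

# Option 1: stay in ISO-8859-1, but emit HTML numeric references for
# anything that encoding lacks.
my $copy     = $text;   # encode() may modify its input when a CHECK value is given
my $asLatin1 = encode('ISO-8859-1', $copy, Encode::FB_HTMLCREF());

# Option 2: Windows-1252 has single-byte slots for the dash and the curly quotes.
my $asCp1252 = encode('cp1252', $text);

print "$asLatin1\n";
```

Option 1 keeps the accents as single Latin-1 bytes and only falls back to references such as &#8212; for the characters that cannot be mapped.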

Thanks for pointing me in the right direction Andy.
Re: [Crolguvar] Accented/special characters being converted to multiple characters In reply to
Glad to hear you figured it out :)

I'm having problems with Greek words not being translated correctly at the moment :( Oh, joy :p

Cheers

Andy (mod)