Mailing List Archive: Bricolage: users

Mixing character sets

 

 



chris.schults at pccsea

Mar 4, 2008, 2:57 PM

Post #1 of 28 (14168 views)
Permalink
Mixing character sets

As I understand it, Bricolage creates files encoded as utf-8 (at least
by default). Is it possible for Bricolage to output in another character
set?

Here's my dilemma:

The current site's files are iso-8859-1 and Apache's AddDefaultCharset
is set as such. The files also have the charset specified within the
document via the meta element. And we'll be migrating content to Bricolage
incrementally, which means we'll have a mix of utf-8 and iso-8859-1
files.

However, when Bricolage publishes to the server as utf-8, special
(unescaped) characters don't display properly. If I change
AddDefaultCharset to utf-8, the characters in the Bricolage-generated
pages look fine, but then the old pages get mucked up.

I think I came up with a solution, which is to set AddDefaultCharset to
'off', assuming all my pages declare the charset from within via the
meta element. And for those that don't, use the AddCharset directive,
like so:

AddDefaultCharset off
AddCharset utf-8 .css
AddCharset utf-8 .inc
AddCharset utf-8 .js

However, I'm wondering if I could simply have Bricolage output as
iso-8859-1.

Chris

--------------------------------

Chris Schults
Web Developer
PCC Natural Markets
206-547-1222 x104
chris.schults [at] pccsea
http://www.pccnaturalmarkets.com


D-Beaudet at NGA

Mar 4, 2008, 4:47 PM

Post #2 of 28 (13738 views)
Permalink
RE: Mixing character sets [In reply to]

There might be a better way than I'm about to suggest, but you could use some of the handy Perl encoding / decoding modules like HTML::Entities and Unicode::UTF8simple and just convert all the content generated by your templates to a character set of your choosing and set the appropriate META tag.

I'm using UTF-8 for the Bric GUI (admin...system...preferences) and storing the data encoded as UTF-8 in the database, but since our public web server is set to run with ISO-8859-1, I've just instructed my content contributors to use HTML entities instead of pasting high-value UTF-8 characters into the GUI -- but I would be wiser to follow my own advice above and convert all content to ISO just in case.




lannings at who

Mar 5, 2008, 1:37 AM

Post #3 of 28 (13726 views)
Permalink
Re: Mixing character sets [In reply to]

On Tue, 4 Mar 2008, Schults, Chris wrote:
> As I understand it, Bricolage creates files encoded as utf-8 (at least
> by default). Is it possible for Bricolage to output in another character
> set?

There's an "encoding" attribute on the $burner object,
so you should in principle be able to $burner->set_encoding($encoding);
http://bricolage.cc/docs/1.10/api/Bric::Util::Burner#Public_Instance_Methods

However.... Looking in lib/Bric/Util/Burner/Mason.pm, sub end_page,

binmode(OUT, ':' . $self->get_encoding || 'utf8') if ENCODE_OK;

(and the other template systems besides Mason have copied that)
it's not really the "encoding" that is being set (that's arguably a bug),
and I think just putting the encoding directly won't work.
Referring to `perldoc PerlIO`, you might have to put something like,

$burner->set_encoding('bytes');

and use Encode (if I understand correctly),
or maybe something like

$burner->set_encoding('encoding(latin1)');
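
To see what such a layer string does outside Bricolage, here is a minimal sketch using only core Perl. The in-memory filehandles and variable names are illustrative assumptions, not anything taken from the Burner code; the point is only that the same character string comes out as different bytes depending on the layer handed to binmode.

```perl
use strict;
use warnings;

# A character string holding "cafe" with an e-acute (U+00E9) at the end.
my $text = "caf\x{e9}";

# Write the same string through two different PerlIO layers, using
# in-memory filehandles so the resulting bytes can be inspected.
open my $utf8_fh, '>', \my $utf8_bytes or die "open failed: $!";
binmode $utf8_fh, ':utf8';
print {$utf8_fh} $text;
close $utf8_fh;

open my $latin1_fh, '>', \my $latin1_bytes or die "open failed: $!";
binmode $latin1_fh, ':encoding(iso-8859-1)';
print {$latin1_fh} $text;
close $latin1_fh;

# Show each result as hex bytes.
printf "utf8 layer:       %s\n",
    join ' ', map { sprintf '%02x', ord($_) } split //, $utf8_bytes;
printf "iso-8859-1 layer: %s\n",
    join ' ', map { sprintf '%02x', ord($_) } split //, $latin1_bytes;
```

The e-acute takes two bytes under the utf8 layer and one under the iso-8859-1 layer, which is the conversion the "encoding" attribute is relied on to perform at burn time.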


chris.schults at pccsea

Mar 5, 2008, 9:34 AM

Post #4 of 28 (13723 views)
Permalink
RE: Mixing character sets [In reply to]

> There's an "encoding" attribute on the $burner object,
> so you should in principle be able to $burner->set_encoding($encoding);
> http://bricolage.cc/docs/1.10/api/Bric::Util::Burner#Public_Instance_Methods

Thanks Scott. I should have reviewed this page first. I'll play around
with this option.

Chris


chris.schults at pccsea

Mar 5, 2008, 10:22 AM

Post #5 of 28 (13757 views)
Permalink
RE: Mixing character sets [In reply to]

> Referring to `perldoc PerlIO`, you might have to put something like,
>
> $burner->set_encoding('bytes');
>
> and use Encode (if I understand correctly),
> or maybe something like
>
> $burner->set_encoding('encoding(latin1)');

Ok, making some progress, but I'm not there yet. In Apache, I switched
my default charset to iso-8859-1.

Using:

$burner->set_encoding('encoding(latin1)');

Or:

$burner->set_encoding('encoding(iso-8859-1)');

Results in, for example, '\x{2019}' in place of right single quotation
marks. But what I think I need outputted is in either decimal NCR form
(&#8217;) or hexadecimal NCR form (&#x2019;).

Any ideas?

Chris


david at kineticode

Mar 5, 2008, 1:08 PM

Post #6 of 28 (13736 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 4, 2008, at 16:47, Beaudet, David P. wrote:

> There might be a better way than I'm about to suggest, but you could
> use some of the handy Perl encoding / decoding modules like
> HTML::Entities and Unicode::UTF8simple and just convert all the
> content generated by your templates to a character set of your
> choosing and set the appropriate META tag.
>
> I'm using UTF-8 for the Bric GUI (admin...system...preferences) and
> storing the data encoded as UTF-8 in the database, but since our
> public web server is set to run with ISO-8859-1, I've just
> instructed my content contributors to use HTML entities instead of
> pasting high-value UTF-8 characters into the GUI -- but I would be
> wiser to follow my own advice above and convert all content to ISO
> just in case.

Everything should be stored in the database in UTF-8. *Everything*.
All text, that is.

To output 8859-1 instead of UTF-8, put this into /autohandler:

$burner->set_encoding('encoding(iso-8859-1)');

That should be all you have to do.

Best,

David


david at kineticode

Mar 5, 2008, 1:11 PM

Post #7 of 28 (13713 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 5, 2008, at 01:37, Scott Lanning wrote:

> There's an "encoding" attribute on the $burner object,
> so you should in principle be able to $burner->set_encoding($encoding);
> http://bricolage.cc/docs/1.10/api/Bric::Util::Burner#Public_Instance_Methods
>
> However.... Looking in lib/Bric/Util/Burner/Mason.pm, sub end_page,
>
> binmode(OUT, ':' . $self->get_encoding || 'utf8') if ENCODE_OK;
>
> (and the other template systems besides Mason have copied that)
> it's not really the "encoding" that is being set (that's arguably a
> bug),
> and I think just putting the encoding directly won't work.

I don't follow you here. It should work. Radio Free Asia has relied on
this behavior for years. I just fixed a bunch of encoding shit in
SVN::Notify, and based on my knowledge (re-)gained, there, this looks
right to me. All the data in Bricolage is in :utf8 (Perl's internal
representation). By setting the io layer in this way using `binmode`,
anything sent to that file handle (all the output of the templates) is
automatically converted to the encoding specified by the `encoding`
attribute.

> Referring to `perldoc PerlIO`, you might have to put something like,
>
> $burner->set_encoding('bytes');

You should only do this if you're outputting binary data (e.g., your
templates generate images).

> and use Encode (if I understand correctly),
> or maybe something like
>
> $burner->set_encoding('encoding(latin1)');

Yes, that should work.

Best,

David


david at kineticode

Mar 5, 2008, 1:16 PM

Post #8 of 28 (13726 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 5, 2008, at 10:22, Schults, Chris wrote:

>> $burner->set_encoding('encoding(iso-8859-1)');
>
> Results in, for example, '\x{2019}' in place of right single quotation
> marks. But what I think I need outputted is in either decimal NCR form
> (&#8217;) or hexadecimal NCR form (&#x2019;).

Don't encode those characters as entities. Leave them as raw :utf8
data. If you're using HTML::Entities, tell it to only encode unsafe
characters:

encode_entities($input, '<>&"');

This will leave any high UTF-8 characters in place, and they will be
converted to latin-1 by the encoding you set.
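
As a small illustration of that call, here is a sketch assuming HTML::Entities is installed (it is not a core module). The sample string is made up; the curly quotes are written as escapes so the source stays plain ASCII.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Illustrative input: markup-significant characters plus curly quotes
# (U+201C and U+201D).
my $input = "\x{201c}Fish & chips\x{201d} <fresh>";

# Limit encoding to the four characters that are unsafe in HTML.
# Called in scalar context, encode_entities returns a copy.
my $escaped = encode_entities($input, '<>&"');

binmode STDOUT, ':utf8';
print "$escaped\n";
```

Only the ampersand and angle brackets become entities; the curly quotes stay in the string as characters, so whatever output layer is set later still has to deal with them.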

Best,

David


chris.schults at pccsea

Mar 5, 2008, 1:38 PM

Post #9 of 28 (13730 views)
Permalink
RE: Mixing character sets [In reply to]

> To output 8859-1 instead of UTF-8, put this into /autohandler:
>
> $burner->set_encoding('encoding(iso-8859-1)');

Okay, I moved this from my story template to /autohandler, which
obviously makes more sense for my case.

> Don't encode those characters as entities. Leave them as raw :utf8
> data. If you're using HTML::Entities, tell it to only encode unsafe
> characters:
>
> encode_entities($input, '<>&"');
>
> This will leave any high UTF-8 characters in place, and they will be
> converted to latin-1 by the encoding you set.

Unfortunately, I'm still seeing the same behavior. Let me back up a
little:

1) On the current site, in Apache the default charset is set to
iso-8859-1 and the HTML document specifies the same. Special characters,
such as curly quotes, display fine.

2) I open the source file from the current site in my code editor, copy
the text and paste it into Bricolage. Then preview.

3) Initially, the default charset for the preview site was set to utf-8.
The characters did not display proper.

4) When the default charset was changed to iso-8859-1, they displayed
fine.

5) After adding $burner->set_encoding('encoding(iso-8859-1)'); to the
autohandler, special characters are displayed with what I assume is
their Unicode value. For example, curly quotes appear as \x{201c} and
\x{201d}. Further, it does not matter what the default charset is in
Apache -- the result is the same.

Please help. My head is spinning from trying to figure this out.

Chris


D-Beaudet at NGA

Mar 5, 2008, 1:46 PM

Post #10 of 28 (13746 views)
Permalink
RE: Mixing character sets [In reply to]

> $burner->set_encoding('encoding(iso-8859-1)');
>
>That should be all you have to do.

Will that convert high-byte UTF-8 characters to html entities at the same time?


chris.schults at pccsea

Mar 5, 2008, 2:55 PM

Post #11 of 28 (13716 views)
Permalink
RE: Mixing character sets [In reply to]

Oops, confused the scenario a bit. See corrections:

> 1) On the current site, in Apache the default charset is set to
> iso-8859-1 and the HTML document specifies the same. Special characters,
> such as curly quotes, display fine.
>
> 2) I open the source file from the current site in my code editor, copy
> the text and paste it into Bricolage. Then preview.
>
> 3) Initially, the default charset for the preview site was set to utf-8.
> The characters did not display proper.

Actually, when utf-8, special characters displayed fine.

> 4) When the default charset was changed to iso-8859-1, they displayed
> fine.

When iso-8859-1, special characters did NOT display correctly.

> 5) After adding $burner->set_encoding('encoding(iso-8859-1)'); to the
> autohandler, special characters are displayed with what I assume is
> their Unicode value. For example, curly quotes appear as \x{201c} and
> \x{201d}. Further, it does not matter what the default charset is in
> Apache -- the result is the same.


david at kineticode

Mar 5, 2008, 6:47 PM

Post #12 of 28 (13764 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 5, 2008, at 13:46, Beaudet, David P. wrote:

>> $burner->set_encoding('encoding(iso-8859-1)');
>>
>> That should be all you have to do.
>
> Will that convert high-byte UTF-8 characters to html entities at the
> same time?

No.

David


david at kineticode

Mar 5, 2008, 6:49 PM

Post #13 of 28 (13726 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 5, 2008, at 14:55, Schults, Chris wrote:

>> 1) On the current site, in Apache the default charset is set to
>> iso-8859-1 and the HTML document specifies the same. Special characters,
>> such as curly quotes, display fine.
>>
>> 2) I open the source file from the current site in my code editor, copy
>> the text and paste it into Bricolage. Then preview.
>>
>> 3) Initially, the default charset for the preview site was set to utf-8.
>> The characters did not display proper.
>
> Actually, when utf-8, special characters displayed fine.

Right, because it's UTF-8 from the database to the output file.

>> 4) When the default charset was changed to iso-8859-1, they displayed
>> fine.
>
> When iso-8859-1, special characters did NOT display correctly.

Right, because they were UTF-8.

>> 5) After adding $burner->set_encoding('encoding(iso-8859-1)'); to the
>> autohandler, special characters are displayed with what I assume is
>> their Unicode value. For example, curly quotes appear as \x{201c} and
>> \x{201d}. Further, it does not matter what the default charset is in
>> Apache -- the result is the same.

So you're saying that `$burner->set_encoding('encoding(iso-8859-1)');`
isn't working. What version of Perl is this?

David


lannings at who

Mar 6, 2008, 1:23 AM

Post #14 of 28 (13729 views)
Permalink
Re: Mixing character sets [In reply to]

On Wed, 5 Mar 2008, David E. Wheeler wrote:
> data in Bricolage is in :utf8 (Perl's internal representation). By setting
> the io layer in this way using `binmode`, anything sent to that file handle
> (all the output of the templates) is automatically converted to the encoding
> specified by the `encoding` attribute.

I'm not saying that setting the IO layer is incorrect or whatever.
What I mean is that it's misleading to call it "encoding",
since it's actually the "IO layer". (I should've avoided
calling it a "bug". It's only a bug in terms of naming.)
So, for example, $burner->set_encoding('iso-8859-1')
is *NOT* what you want to do, despite the fact that
$burner->set_encoding('utf8') is the default. This "utf8"
isn't the encoding, but rather it is the IO layer.

[The rest I know you understand, just for general information
of the template developers on the list.]
$burner->set_encoding is more general than setting the encoding.
For example, in `perldoc PerlIO` there is:

:via
Use ":via(MODULE)" either in open() or binmode() to install a layer
that does whatever transformation (for example compression / decompression,
encryption / decryption) to the filehandle. See PerlIO::via for more
information.

So you could do $burner->set_encoding('via(UnWikify)') if you wanted to
implement the UnWikify "via" module. These PerlIO::via modules seem
quite powerful, including QuotedPrint, Base64, and StripHTML.
So for template developers that are aware of this,
it possibly opens up another approach to solving whatever problems.
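
For template developers curious what such a module involves, here is a minimal sketch of a write-only "via" layer. The PerlIO::via::Shout package is hypothetical and exists only for this example; it follows the PUSHED/WRITE interface described in `perldoc PerlIO::via` and is tested here against an in-memory filehandle rather than a Bricolage burn.

```perl
use strict;
use warnings;

# Hypothetical layer for illustration only: upper-cases everything written.
package PerlIO::via::Shout;

# Called when the layer is pushed onto a filehandle; must return an object.
sub PUSHED {
    my ($class, $mode, $fh) = @_;
    my $state = '';
    return bless \$state, $class;
}

# Called for each write. Passes the transformed data to the layer below
# and reports how much of the original buffer was consumed.
sub WRITE {
    my ($self, $buffer, $fh) = @_;
    return -1 unless print {$fh} uc $buffer;
    return length $buffer;
}

package main;

open my $out, '>', \my $result or die "open failed: $!";
binmode $out, ':via(Shout)' or die "binmode failed: $!";
print {$out} "hello from a via layer\n";
close $out;

print $result;
```

In principle the same layer name could then be passed as 'via(Shout)' to set_encoding, though that combination is an untested assumption here.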


chris.schults at pccsea

Mar 6, 2008, 8:06 AM

Post #15 of 28 (13696 views)
Permalink
RE: Mixing character sets [In reply to]

> So you're saying that `$burner->set_encoding('encoding(iso-8859-1)');`
> isn't working. What version of Perl is this?

According to 'perl -v', we're using 5.8.8.

Chris


david at kineticode

Mar 6, 2008, 12:53 PM

Post #16 of 28 (13734 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 6, 2008, at 01:23, Scott Lanning wrote:

> I'm not saying that setting the IO layer is incorrect or whatever.
> What I mean is that it's misleading to call it "encoding",
> since it's actually the "IO layer". (I should've avoided
> calling it a "bug". It's only a bug in terms of naming.)

Oh, yes, completely agree.

> So, for example, $burner->set_encoding('iso-8859-1')
> is *NOT* what you want to do, despite the fact that
> $burner->set_encoding('utf8') is the default. This "utf8"
> isn't the encoding, but rather it is the IO layer.

Yeah, it was dumb for me to do it that way.

> [The rest I know you understand, just for general information
> of the template developers on the list.]
> $burner->set_encoding is more general than setting the encoding.
> For example, in `perldoc PerlIO` there is:
>
> :via
> Use ":via(MODULE)" either in open() or binmode() to install a layer
> that does whatever transformation (for example compression / decompression,
> encryption / decryption) to the filehandle. See PerlIO::via for more
> information.
>
> So you could do $burner->set_encoding('via(UnWikify)') if you wanted to
> implement the UnWikify "via" module. These PerlIO::via modules seem
> quite powerful, including QuotedPrint, Base64, and StripHTML.
> So for template developers that are aware of this,
> it possibly opens up another approach to solving whatever problems.

I wasn't aware of those layers. Cool.

David


david at kineticode

Mar 6, 2008, 12:53 PM

Post #17 of 28 (13733 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 6, 2008, at 08:06, Schults, Chris wrote:

>> So you're saying that `$burner->set_encoding('encoding(iso-8859-1)');`
>> isn't working. What version of Perl is this?
>
> According to 'perl -v', we're using 5.8.8.

Then I have no idea what's going on. Something is creating those
Unicode characters…

David


chris.schults at pccsea

Mar 6, 2008, 4:12 PM

Post #18 of 28 (13740 views)
Permalink
RE: Mixing character sets [In reply to]

> Then I have no idea what's going on. Something is creating those
> Unicode characters…

Just so I understand ...

1) When I copy a curly/smart quote ' “ ' from Microsoft Word and paste it into Bricolage, it is stored as UTF-8.
2) Normally, it is burned as UTF-8.
3) When burned to a server with a default charset of UTF-8, the curly quote should display fine.
4) If the default charset is something else, say ISO-8859-1, you should see something like 'â€œ'.
5) However, when using '$burner->set_encoding('encoding(iso-8859-1)');', that curly quote should be converted from UTF-8 to ISO-8859-1 during burning and thus display properly.

Correct?

But in my case, instead of the actual character, what is being written to the file is '\x{201c}' (and I'm going nuts trying to solve this).

Chris


D-Beaudet at NGA

Mar 6, 2008, 7:46 PM

Post #19 of 28 (13752 views)
Permalink
RE: Mixing character sets [In reply to]

> 5) However, when using '$burner->set_encoding('encoding(iso-8859-1)');',
> that curly quote should be converted from UTF-8 to ISO-8859-1 during
> burning and thus display properly.

Well, that seems to be identifying the correct (or at least an
acceptable) Unicode character for the quote that was pasted in from
Word...

http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=2000&end=206F

but I believe the problem is that '\x{201c}' is not a syntax that web
browsers understand as representing a unicode character -- which is why
I had previously asked whether set_encoding() also results in the html
entity being generated -- if it generated the html entity syntax
&#x201c; instead, it would work.

Also, this might be worth a quick read:

http://en.wikipedia.org/wiki/Unicode_and_HTML#HTML_document_characters


chris.schults at pccsea

Mar 6, 2008, 8:11 PM

Post #20 of 28 (13832 views)
Permalink
RE: Mixing character sets [In reply to]

David P., thanks for your tips.

After spending hours researching this, I think I might have a potential explanation, but no solution.

I think the issue is with characters, such as smart quotes, derived from Microsoft Word. According to Wikipedia:

<snip>

Word processors have traditionally offered curved quotes to users, because in printed documents curved quotes are preferred to straight ones. Before Unicode was widely accepted and supported, this meant representing the curved quotes in whatever 8-bit encoding the software and underlying operating system were using - but the character sets for Windows and Macintosh used two different pairs of values for curved quotes, and ISO 8859-1 (typically the default character set for the Unices and, until recently, Linux) has no curved quotes, making cross-platform compatibility a nightmare.

Compounding the problem is the "smart quotes" feature mentioned above, which some word processors (including Microsoft Word and OpenOffice.org) use by default. With this feature turned on, users may not have realised that the ASCII-compatible straight quotes they were typing on their keyboards ended up as something entirely different.

</snip>

Source: http://en.wikipedia.org/wiki/Smart_quotes#Quotation_marks_in_electronic_documents

And according to this list of ANSI characters not in ISO-8859-1:

http://www.alanwood.net/demos/charsetdiffs.html#a

The characters in question are not part of ISO-8859-1, and I believe them to be part of Windows-1252 (CP1252). Thus, I'm guessing that when converted to ISO-8859-1 in Bricolage, there is no match, so the Unicode representation is returned in the format '\x{...}'. Does this make sense to y'all?

However, this is not usable to me, so how do I convert '\x{...}' to something useful?
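
For what it's worth, literal '\x{...}' text is what Perl's :encoding layer writes by default when a character has no equivalent in the target encoding: it falls back to a Perl-style escape and warns, rather than dropping the character (the fallback is controlled by $PerlIO::encoding::fallback). A minimal sketch using only core Perl, with an in-memory filehandle and made-up sample text standing in for a burned file:

```perl
use strict;
use warnings;

# Curly double quotes (U+201C, U+201D), which ISO-8859-1 cannot represent.
my $text = "\x{201c}quoted\x{201d}";

my @warnings;
open my $fh, '>', \my $bytes or die "open failed: $!";
binmode $fh, ':encoding(iso-8859-1)';
{
    # The layer warns about each unmappable character; collect the
    # warnings here instead of letting them reach STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    print {$fh} $text;
    close $fh;
}

# The buffer now holds the escapes as literal text, not quote characters.
print "$bytes\n";
```

If this matches what ends up in the published files, the conversion itself is running; the characters simply have nowhere to go in ISO-8859-1.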

I have now read more than I've ever wanted or expected to about Unicode, character sets and character encodings ... and I'm still confused. Sigh.

Chris

P.S. Apologies for the numerous posts.


D-Beaudet at NGA

Mar 6, 2008, 8:43 PM

Post #21 of 28 (13736 views)
Permalink
RE: Mixing character sets [In reply to]

> I think the issue is with characters, such as smart quotes, derived
> from Microsoft Word

Yes, actually, I've run into this before with another application come
to think of it. Win-1252 code page is a real pain. I ended up
downgrading all characters within a certain byte range of 1252 to
alternative low-byte ISO-8859-1 characters instead of allowing a high
byte character to ever be persisted to the database. If you're using a
WYSIWYG editor, this might be something that it can be configured to
clean up / strip / convert for you -- otherwise, you can always try the
paste into notepad, then copy and paste into Bricolage as a work-around.


chris.schults at pccsea

Mar 7, 2008, 10:36 AM

Post #22 of 28 (13743 views)
Permalink
RE: Mixing character sets [In reply to]

> Yes, actually, I've run into this before with another application come
> to think of it. Win-1252 code page is a real pain. I ended up

And this guy has experienced this pain as well:

http://linuxplanet.com/linuxplanet/opinions/3749/1/

He quotes "Perl super-hacker" Tom Christiansen referring to these
characters as "intentional errors designed to destroy the web by
subverting open standards and thus secure Microsoft's hegemony." Heh.

> to think of it. Win-1252 code page is a real pain. I ended up
> downgrading all characters within a certain byte range of 1252 to
> alternative low-byte ISO-8859-1 characters instead of allowing a high
> byte character to ever be persisted to the database. If you're using a
> WYSIWYG editor, this might be something that it can be configured to
> cleanup / strip / convert for you -- otherwise, you can always try the
> paste into notepad, then copy and paste into Bricolage as a work-around.

I instruct people to save to plain text from Word and replace special
characters. But, this is not guaranteed, so I guess it would be safest
to replace these characters with the proper entity.

However, I tested $burner->set_encoding('encoding(windows-1252)'); with
the server default charset as iso-8859-1, and it works (at least for me
on Firefox 2.0 and IE7). Though, I realize this might not work for all
browsers, so I'll probably go with the substitution option.

Chris


david at kineticode

Mar 10, 2008, 3:31 PM

Post #23 of 28 (13719 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 6, 2008, at 16:12, Schults, Chris wrote:

> 1) When I copy a curly/smart quote ' “ ' from Microsoft Word and
> paste it into Bricolage, it is stored as UTF-8.

Everything is stored as UTF-8 in Bricolage, but that doesn't mean that
a ' “ ' pasted from Microsoft Word is stored as ' “ ' in UTF-8. It's
probably CP-1252.

> 2) Normally, it is burned as UTF-8.
> 3) When burned to a server with a default charset of UTF-8, the
> curly quote should display fine.
> 4) If the default charset is something else, say ISO-8859-1, you
> should see something like 'â€œ'.
> 5) However, when using
> '$burner->set_encoding('encoding(iso-8859-1)');', that curly quote
> should be converted from UTF-8 to ISO-8859-1 during burning and thus
> display properly.

Yes, if it's stored as \x{201c}, then it should be burned out as a
Latin-1 character.

> But in my case, instead of the actual character, what is being
> written to the file is '\x{201c}' (and I'm going nuts trying to
> solve this).

That sure sounds like UTF-8. :-(

David


david at kineticode

Mar 10, 2008, 3:37 PM

Post #24 of 28 (13731 views)
Permalink
Re: Mixing character sets [In reply to]

On Mar 6, 2008, at 20:11, Schults, Chris wrote:

> And according to this list of ANSI characters not in ISO-8859-1:
>
> http://www.alanwood.net/demos/charsetdiffs.html#a

Yes, and this is why I wrote Encode::ZapCP1252: to convert those bogus
characters to ASCII. I need to update it to optionally convert them to
UTF-8.

> The characters in question are not part of ISO-8859-1, and believe
> them to be part of Windows-1252 (CP1252). Thus, I'm guessing that
> when converted to ISO-8859-1 in Bricolage, there is no match, so the
> Unicode representation is returned in the format '\x{...}'. Does
> this make sense to y'all.

No, because Bricolage expects UTF-8 to be submitted by the browser,
and it stores the data as UTF-8. So it never converts from CP-1252 to
ISO-8859-1. It converts from CP-1252 to UTF-8, and then later from
UTF-8 to ISO-8859-1. Of course, it only takes that first step if
you've set your character set preference in Bricolage to CP-1252.
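
The two conversion steps can be sketched with the core Encode module. This is an illustration of the pipeline as described, not Bricolage's own code; the single input byte is an assumed example (0x93 is the left curly double quote in CP-1252).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte 0x93 is the left curly double quote in Windows-1252 (CP-1252).
my $cp1252_bytes = "\x93";

# Step 1: decode the CP-1252 bytes into a Perl character string.
my $chars = decode('cp1252', $cp1252_bytes);
printf "code point:       U+%04X\n", ord $chars;

# Step 2: encoding the character as UTF-8 always succeeds.
my $utf8 = encode('UTF-8', $chars);
printf "UTF-8 bytes:      %s\n",
    join ' ', map { sprintf '%02x', ord($_) } split //, $utf8;

# Step 3: ISO-8859-1 has no slot for U+201C, so with the default check
# mode Encode substitutes a question mark.
my $latin1 = encode('iso-8859-1', $chars);
printf "ISO-8859-1 bytes: %s\n",
    join ' ', map { sprintf '%02x', ord($_) } split //, $latin1;
```

The first step gives a proper U+201C, which stores cleanly as UTF-8, but the last step still has no ISO-8859-1 byte to map it to. Characters that exist in both encodings, such as accented Latin letters, pass through both steps unchanged in meaning.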

Ah-ha! That's the bit I've been trying to remember for how we've
recommended handling this issue in the past. Try changing your
character set preference, then create a new story and paste from Word,
and then try to preview it with a template that calls
`$burner->set_encoding('encoding(iso-8859-1)');` and see if it doesn't
properly come out as ISO-8859-1. That should work!

Of course, the only thing I cannot understand is why you continue to
get "\x{201c}", which is a UTF-8 character.

> However, this is not usable to me, so how do I convert '\x{...}' to
> something useful?

I'm a little confused. Are you seeing a curly quote and calling it
"\x{201c}" (which is how you can represent it in a Perl double-quoted
string), or are you seeing the literal string "\x{201c}"?

> I've have now read more than I've ever wanted or expected to about
> Unicode, character sets and character encodings ... and I'm still
> confused. Sigh.

It's all good stuff to know, and will pay off in the long run, believe
me.

Best,

David


chris.schults at pccsea

Mar 10, 2008, 4:14 PM

Post #25 of 28 (13717 views)
Permalink
RE: Mixing character sets [In reply to]

> Yes, and this is why I wrote Encode::ZapCP1252: to convert those bogus
> characters to ASCII. I need to update it to optionally convert them to
> UTF-8.

Ah, that's cool, but instead of approximations, I'm converting to HTML entities from /autohandler.mc:

<%filter>
# replace Microsoft-1252 characters
encode_entities($_, '€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ');
</%filter>
% $burner->set_encoding('encoding(iso-8859-1)');
% $burner->chain_next;

This appears to be working just fine.
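
A variation on the same idea, in case the fixed character list ever misses something: replace every character above U+00FF with a decimal numeric reference and leave the rest alone. This is only a sketch in plain Perl with a made-up sample string; inside a Mason filter the substitution would presumably be applied to $_ instead of a copy.

```perl
use strict;
use warnings;

# Sample text: curly quotes (outside ISO-8859-1) and an e-acute (inside it).
my $text = "\x{201c}Fresh\x{201d} caf\x{e9}";

# Copy the string, then replace each character above U+00FF with a
# decimal numeric character reference. Characters that ISO-8859-1 can
# represent are left untouched for the output layer to encode.
(my $safe = $text) =~ s/([^\x{00}-\x{FF}])/sprintf('&#%d;', ord $1)/ge;

# Every remaining character fits in one byte, so this prints cleanly
# through a Latin-1 layer.
binmode STDOUT, ':encoding(iso-8859-1)';
print "$safe\n";
```

This is broader than the list in the filter above: it catches any character the ISO-8859-1 layer cannot map, and the result still displays correctly whether the page is served as ISO-8859-1 or Windows-1252.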

> I'm a little confused. Are you seeing a curly quote and calling it
> "\x{201c}" (which is how you can represent it in a Perl double-quoted
> string), or are you seeing the literal string "\x{201c}"?

Oh, sorry if I wasn't clear. I'm seeing the literal string.

Chris
