brion at pobox
Feb 21, 2002, 1:47 PM
Post #5 of 8
On Thu, 2002-02-21 at 09:59, Jan Hidders wrote:
> Right now there is a localization problem wrt. indexing. The fulltext index
> indexes single words and defines these as series of letters, numbers, and
> the odd "'" and "_". Since the standard character set of MySQL is ISO 8859-1
> I assume that it knows what are letters in that character set. I really
> don't know how this behaves when the character set of MySQL is changed.
> Available, by the way, are big5, cp1251, cp1257, czech, danish, dec8, dos,
> euc_kr, gb2312, gbk, german1, hebrew, hp8, hungarian, koi8_ru, koi8_ukr,
> latin1, latin2, sjis, swe7, tis620, ujis, usa7, and win1251ukr. But I don't
> think we want to go that way because then (if I understand the documentation
> correctly) we need a separate MySQL server for every character set. Anyway,
> in all cases the indexing breaks down for entities because it doesn't index
> words with '&' and ';' in them, so it sees "G&ouml;del" as "G" and "del"
> with some funny symbols in between that it doesn't index. The indexing also
> has no idea that this has something to do with "Godel".
> Admittedly unaware of any previous discussion of this, I would
> suggest the following:
> 1. Internally, i.e., in the database fields and URLs we use for bodies and
> titles only standard ASCII plus HTML entities. However, to allow indexing we
> encode &#101; as something like '_101_' in the database fields.
> 2. Externally in search and edit boxes the user can type any character the
> browser allows, but internally we always translate the non-ASCII ones to
> entities.
> 3. When a request for a page is made we always translate the entities as
> much as possible to the character set specified in the request, including
> the contents of edit boxes.
Ugh. Doable, though. Presumably the point of this is so that someone can type
ö (actual o-with-umlaut in the display character encoding)
or any number of other alternatives in the edit box and put the same
actual sequence of bytes into the data?
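
A minimal sketch of that normalization, in Python rather than the wiki's own
code, and purely illustrative. The names to_internal and to_index_form are
hypothetical, and the '_NNN_' form follows Jan's '_101_' suggestion above.
Note the sketch also decodes &amp; and &lt;, which a real implementation
would have to leave escaped.

```python
import html
import re

def to_internal(text: str) -> str:
    """Normalize any spelling of a character to ASCII plus decimal entities."""
    # html.unescape resolves named (&ouml;) and numeric (&#246;, &#xF6;) forms.
    # Caveat: it also decodes &amp; and &lt;, which real wiki text must keep.
    decoded = html.unescape(text)
    # ASCII passes through; every other character becomes &#NNN;.
    return decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")

def to_index_form(internal: str) -> str:
    """Rewrite &#NNN; as _NNN_ so a letters/digits/underscore tokenizer
    keeps the whole word together (assumed indexing-friendly form)."""
    return re.sub(r"&#(\d+);", r"_\1_", internal)

# Three ways of typing the same name end up as the same stored string.
variants = ["G&ouml;del", "G&#246;del", "G\u00f6del"]
stored = {to_internal(v) for v in variants}
print(stored)                        # a single-element set
print(to_index_form(stored.pop()))   # form handed to the fulltext index
```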
Also remember that we'll still have to escape entities that _aren't_ in
the display character set in all edit boxes, so that they won't be
disappeared or converted into "?"s when the user hits submit. (I'm
assuming that you don't want to put the raw HTML entities for _every_
non-ASCII character into the edit box appearing as the entity codes? See
my previous message on this subject for why that's a Very Bad Idea.)
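
As a rough illustration of the edit-box rule (again Python, not the wiki's
code): characters the display character set can represent go out raw, and
anything else stays as an entity so it survives the round trip. The function
name for_edit_box is hypothetical, and the sketch assumes the stored text
uses only decimal &#NNN; entities for non-ASCII characters.

```python
import re

def for_edit_box(internal: str, charset: str) -> bytes:
    """Prepare stored text for an edit box in the given display charset."""
    # Turn each decimal entity back into the character it names.
    text = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), internal)
    # Encode for the browser; characters the charset lacks are written
    # back out as &#NNN; instead of being dropped or turned into "?".
    return text.encode(charset, "xmlcharrefreplace")

stored = "G&#246;del &#8364;5"
print(for_edit_box(stored, "latin-1"))  # o-umlaut raw, euro sign kept as entity
print(for_edit_box(stored, "utf-8"))    # both characters raw
```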
> The main thing is to define the translation functions:
> - string encodeEntities ( mb-string external-string, string character-set )
> - mb-string decodeEntities ( string internal-string, string character-set )
> (With mb-string I mean a multi-byte character string.)
cf $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
(Keep in mind that ASCII-with-HTML-entities is for all intents and
purposes a multibyte character encoding. It switches from single-byte to
double-byte mode when encountering a "&", and self-recovers if it is not
followed by a correct multibyte code string ending in ";".)
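
To make the multibyte analogy concrete, here is a small Python sketch that
splits a string into "characters", treating a well-formed entity as one unit
and falling back to a plain "&" when no valid entity follows. The name
split_units and the exact entity syntax accepted are assumptions for
illustration; the check is syntactic only and does not verify that a named
entity actually exists.

```python
import re

# One unit is either a syntactically well-formed entity or any single character.
# The entity alternative is tried first; if it fails, "." consumes just the "&",
# which is the self-recovery described above.
_UNIT = re.compile(
    r"&(?:#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);|.",
    re.DOTALL,
)

def split_units(text):
    """Return the list of logical characters in ASCII-with-entities text."""
    return _UNIT.findall(text)

print(split_units("G&ouml;del"))  # the entity counts as one unit
print(split_units("R&D dept"))    # a bare "&" is just an ordinary character
```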
> For localization we define the following functions:
> - string canonicalTitle ( string internal-string ) translates an internal
> title to its canonical form. It deals with capitalization, for example. If
> two strings are translated to the same canonical form they are formally the
> same title. If a string is translated to an empty string it is not a valid
> title. If you don't want entities in your titles, you can define that here.
> - string urlTitle ( string internal-string ) translates an internal
> canonized title to its URL form. It probably only replaces space characters
> with "+" and escapes ASCII characters that need to be escaped in a URL.
> For these functions we also need to define arrays that associate entities
> with their uppercase equivalents, and vice versa, for the relevant character
> sets.
Easy enough, I can generate that from the Unicode data tables.
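
A hedged sketch of what that could look like, in Python, whose str.upper()
draws on the Unicode case-mapping data. The names build_upper_table,
canonical_title and url_title are hypothetical, the table covers only the
Latin-1 upper half, and "uppercase the first character" is an assumed
capitalization rule, not a settled one.

```python
import re
from urllib.parse import quote_plus

def build_upper_table(first=0xA0, last=0xFF):
    """Map &#NNN; entities to the entity of their uppercase equivalent."""
    table = {}
    for cp in range(first, last + 1):
        upper = chr(cp).upper()
        # Skip characters with no single-character uppercase (e.g. sharp s).
        if len(upper) == 1 and ord(upper) != cp:
            table[f"&#{cp};"] = f"&#{ord(upper)};"
    return table

def canonical_title(internal, upper_table):
    """Collapse whitespace and uppercase the first character or entity.
    An empty result means the title is not valid."""
    title = " ".join(internal.split())
    m = re.match(r"&#\d+;", title)
    if m:
        return upper_table.get(m.group(0), m.group(0)) + title[m.end():]
    return title[:1].upper() + title[1:]

def url_title(canonical):
    """Spaces become '+', everything URL-unsafe is percent-escaped."""
    return quote_plus(canonical)

table = build_upper_table()
print(canonical_title("&#246;sterreich", table))
print(url_title(canonical_title("kurt  g&#246;del", table)))
```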
> Having said all this I also want to emphasize that we first need to have a
> document that describes exactly how we are going to do this, before we code
> another line for localization. We have to realize that we are a real project
Yes, a real project that's already running and has thousands of pages
that don't conform to the as-yet-nonexistent document. Hopefully we can
munge them together!
-- brion vibber (brion @ pobox.com)