
Mailing List Archive: Python: Dev

Unicode locale values in 2.7

 

 



eric at trueblade

Dec 3, 2009, 3:19 AM

Post #1 of 5
Unicode locale values in 2.7

While researching http://bugs.python.org/issue7327, I've come to the
conclusion that trunk handles locales incorrectly with regard to Unicode.
Fixing this would be the first step toward resolving this issue with
float and Decimal locale-aware formatting.

The issue concerns the locale "cs_CZ.UTF-8", and the "thousands_sep"
value (among others). The C struct lconv (in Linux) contains '\xc2\xa0'
for thousands_sep. In py3k this is handled by calling mbstowcs (which is
locale-aware) and then PyUnicode_FromWideChar, so the value is converted
to u"\xa0" (non-breaking space).
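
As a rough illustration of that conversion (a sketch in Python 3, where
bytes stands in for the 2.x str; the variable names are made up for the
example and UTF-8 is assumed as the locale's encoding), decoding the two
bytes yields a single code point:

```python
# Bytes reported for thousands_sep under cs_CZ.UTF-8, as quoted in the post.
raw_thousands_sep = b"\xc2\xa0"

# Decoding with the locale's encoding (assumed to be UTF-8 here) gives one
# character, U+00A0 NO-BREAK SPACE, which is the value py3k ends up with.
decoded = raw_thousands_sep.decode("utf-8")

# Two bytes on the C side, one character on the unicode side.
print(len(raw_thousands_sep), len(decoded))
```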

But in trunk, the value is just used as-is. So when formatting a decimal,
for example, '\xc2\xa0' is just inserted into the result, such as:
>>> format(Decimal('1000'), 'n')
'1\xc2\xa0000'
This doesn't make much sense, and causes an error when converting it to
unicode:
>>> format(Decimal('1000'), u'n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/python/trunk/Lib/decimal.py", line 3609, in __format__
    return _format_number(self._sign, intpart, fracpart, exp, spec)
  File "/root/python/trunk/Lib/decimal.py", line 5704, in _format_number
    return _format_align(sign, intpart+fracpart, spec)
  File "/root/python/trunk/Lib/decimal.py", line 5595, in _format_align
    result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1:
ordinal not in range(128)
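
The failure can be reproduced in isolation. The following is only a
sketch in Python 3, where an explicit decode("ascii") stands in for the
implicit ASCII decoding that unicode(result) performs in 2.x:

```python
# The byte string that format(Decimal('1000'), 'n') produced in the post.
formatted = b"1\xc2\xa0000"

error = None
try:
    # Stand-in for 2.x's unicode(result), which decodes as ASCII by default.
    formatted.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# The first non-ASCII byte (0xc2) sits at index 1, matching the traceback.
print(error)
```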

I believe that the correct solution is to do what py3k does in locale,
which is to convert the struct lconv values to unicode. But since this
would be a disruptive change if universally applied, I'd like to propose
that we only convert to unicode if the values won't fit into a str.

So the algorithm would be something like:
1. call mbstowcs
2. if every value in the result is in the range [32, 126], return a str
3. otherwise, return a unicode
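
The three steps above could be sketched roughly as follows. This is
Python 3 for illustration only: bytes stands in for the 2.x str, a plain
decode() stands in for mbstowcs, and the function name and its encoding
parameter are hypothetical rather than part of any existing API.

```python
def convert_lconv_value(raw, encoding):
    """Return bytes if the value is printable ASCII, otherwise text.

    `raw` is a byte string taken from struct lconv and `encoding` is the
    locale's encoding. Name and signature are hypothetical.
    """
    # Step 1: stand-in for mbstowcs, which decodes using the current locale.
    text = raw.decode(encoding)
    # Step 2: every code point in [32, 126] -> return a byte string (2.x str).
    if all(32 <= ord(ch) <= 126 for ch in text):
        return text.encode("ascii")
    # Step 3: otherwise return the decoded text (2.x unicode).
    return text


print(type(convert_lconv_value(b",", "utf-8")).__name__)
print(type(convert_lconv_value(b"\xc2\xa0", "utf-8")).__name__)
```

The return type depends on the data, so callers would have to be
prepared for either type.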

This would mean that for most locales, the current behavior in trunk
wouldn't change: the locale.localeconv() values would continue to be
str. Only for those locales where the values wouldn't fit into a str
would unicode be returned.

Does this seem like an acceptable change?

Eric.

PS: Thanks to Mark Dickinson and others on irc and on the issue for
helping in formulating this.

_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


solipsis at pitrou

Dec 3, 2009, 3:33 AM

Post #2 of 5
Re: Unicode locale values in 2.7

Eric Smith <eric <at> trueblade.com> writes:
>
> But in trunk, the value is just used as-is. So when formatting a decimal,
> for example, '\xc2\xa0' is just inserted into the result, such as:
> >>> format(Decimal('1000'), 'n')
> '1\xc2\xa0000'
> This doesn't make much sense,

Why doesn't it make sense? It's normal UTF-8.
The same thing happens when the monetary sign is non-ASCII, see
Lib/test/test_locale.py for an example.

> I believe that the correct solution is to do what py3k does in locale,
> which is to convert the struct lconv values to unicode. But since this
> would be a disruptive change if universally applied, I'd like to propose
> that we only convert to unicode if the values won't fit into a str.

This would still be disruptive, because some programs may rely on these values
being bytestrings in the current locale encoding.

I'd say don't try to fix this, and encourage people to use py3k if they really
want safe unicode+locale. Proper unicode behaviour is one of py3k's main
features after all.

Regards

Antoine.




dickinsm at gmail

Dec 3, 2009, 3:55 AM

Post #3 of 5
Re: Unicode locale values in 2.7

On Thu, Dec 3, 2009 at 11:33 AM, Antoine Pitrou <solipsis [at] pitrou> wrote:
> Eric Smith <eric <at> trueblade.com> writes:
>>
>> But in trunk, the value is just used as-is. So when formatting a decimal,
>> for example, '\xc2\xa0' is just inserted into the result, such as:
>> >>> format(Decimal('1000'), 'n')
>> '1\xc2\xa0000'
>> This doesn't make much sense,
>
> Why doesn't it make sense? It's normal UTF-8.
> The same thing happens when the monetary sign is non-ASCII, see
> Lib/test/test_locale.py for an example.

Well, one problem is that it messes up character counts. Suppose
you're aware that the thousands separator might be a single multibyte
character, and you want to produce a unicode result that's zero-padded
to a width of 6. There's currently no sensible way of doing this that
I can see:

format(Decimal('1000'), '06n').decode('utf-8') gives a string of length 5

format(Decimal('1000'), u'06n') fails with a UnicodeDecodeError.
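
A small sketch of the length mismatch, in Python 3 with bytes standing in
for the 2.x str. It assumes the padding width is computed on the byte
string, so the two-byte separator already fills the requested width; the
rjust call is only a simplified stand-in for the zero-padding step.

```python
# Assumed byte-string result for cs_CZ.UTF-8: the separator takes two bytes.
formatted_bytes = b"1\xc2\xa0000"
width = 6

# Padding on bytes is a no-op here, because len() is already 6.
padded = formatted_bytes.rjust(width, b"0")

# After decoding, the separator is one character and the result is too short.
as_text = padded.decode("utf-8")

print(len(padded), len(as_text))
```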

Mark


solipsis at pitrou

Dec 3, 2009, 4:04 AM

Post #4 of 5
Re: Unicode locale values in 2.7

> Well, one problem is that it messes up character counts.

Well, I know it does. That's why py3k is inherently better than 2.x's
bytestrings-by-default behaviour. There's a reason we don't try to
backport py3k's unicode goodness to 2.x: it would be terribly messy to
do so while retaining some measure of backwards compatibility.

(By the way, I would mention that relying on locale to get number
formatting right regardless of the actual user is optimistic, borderline
foolish ;-))

cheers

Antoine.




martin at v

Dec 3, 2009, 5:49 AM

Post #5 of 5
Re: Unicode locale values in 2.7

> But in trunk, the value is just used as-is. So when formatting a decimal,
> for example, '\xc2\xa0' is just inserted into the result, such as:
> >>> format(Decimal('1000'), 'n')
> '1\xc2\xa0000'
> This doesn't make much sense

I agree with Antoine: it makes sense, and is the correct answer, given
the locale definition.

Now, I think that the locale definition is flawed - it's *not* a
property of the Czech language or culture that the "no-break space"
character is the thousands-separator. If anything other than the regular
space should be the thousands separator, it should be "thin space", and
it should be used in all locales on a system that currently use space.
Having it just in the Czech locale is a misconfiguration, IMO.

But if we accept the system's locale definition, then the above is
certainly the right answer.

> and causes an error when converting it to
> unicode:
> >>> format(Decimal('1000'), u'n')

You'll need to decode in the locale's encoding, then it would
work. Unfortunately, that is difficult to achieve.
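
For illustration, decoding with an explicitly chosen encoding might look
like the sketch below (Python 3, bytes standing in for the 2.x str). The
helper name is hypothetical, and falling back to
locale.getpreferredencoding(False) is an assumption about how the
encoding could be looked up, not a claim that it is always reliable.

```python
import locale


def decode_locale_bytes(raw, encoding=None):
    """Decode a locale-supplied byte string; hypothetical helper."""
    if encoding is None:
        # Encoding of the current locale, as reported by the platform.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


# The encoding is passed explicitly so the example does not depend on
# the locale of the machine it runs on.
text = decode_locale_bytes(b"1\xc2\xa0000", "utf-8")
print(len(text))
```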

> I believe that the correct solution is to do what py3k does in locale,
> which is to convert the struct lconv values to unicode. But since this
> would be a disruptive change if universally applied, I'd like to propose
> that we only convert to unicode if the values won't fit into a str.

I think Guido is on record as strongly objecting to that kind of API.

> So the algorithm would be something like:
> 1. call mbstowcs
> 2. if every value in the result is in the range [32, 126], return a str
> 3. otherwise, return a unicode

Not sure what API you are describing here - the algorithm for doing
what?

> This would mean that for most locales, the current behavior in trunk
> wouldn't change: the locale.localeconv() values would continue to be
> str. Only for those locales where the values wouldn't fit into a str
> would unicode be returned.
>
> Does this seem like an acceptable change?

Definitely not. This will be just for 2.7, and I see no point in
producing such an incompatibility. Applications may already perform
the conversion themselves, and that would break under such a change.

Regards,
Martin

