Mailing List Archive: Python: Bugs

[issue7327] format: minimum width: UTF-8 separators and decimal points

 

 



report at bugs

Nov 15, 2009, 2:29 AM

Post #1 of 16 (473 views)

New submission from Stefan Krah <stefan-usenet [at] bytereef>:

This issue affects the format functions of float and decimal.

When calculating the padding necessary to reach the minimum width,
multi-byte UTF-8 separators and decimal points are counted by their byte
length rather than as single characters. This can lead to printed
representations that are shorter than the requested width.


Real world example (separator):

>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> len(s)
19
>>> len(s.decode('utf-8'))
16
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>>
>>>
>>> s = format(-1.5, ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s.decode('utf-8'))
16
>>>
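
The discrepancy in the session above can be reproduced without a Czech
locale installed. The following is a minimal sketch for Python 3 (an
assumption; the report itself uses Python 2), with the byte string from
the report hard-coded as a bytes literal:

```python
# Byte string copied from the report; in Python 2 this was a plain str.
s = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'

byte_length = len(s)                  # what the padding code counted
char_length = len(s.decode('utf-8'))  # what a reader actually sees

print(byte_length, char_length)

# Each of the three separators is 2 bytes but only 1 character,
# so the visible width falls short of the byte count by 3.
shortfall = byte_length - char_length
```

The shortfall equals the number of separators, since each U+00A0
occupies two bytes in UTF-8.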


Constructed example (separator and decimal point):

>>> u = {'decimal_point' : "\xc2\xbf", 'grouping' : [3, 3, 0],
...      'thousands_sep': "\xc2\xb4"}
>>> def get_fmt(x, locale, fmt='n'):
...     return Decimal.__format__(Decimal(x), fmt, _localeconv=locale)
...
>>> s = get_fmt(Decimal("1.5"), u, "020n")
>>> s
'00\xc2\xb4000\xc2\xb4000\xc2\xb4001\xc2\xbf5'
>>> len(s.decode('utf-8'))
16

----------
messages: 95283
nosy: eric.smith, mark.dickinson, skrah
severity: normal
status: open
title: format: minimum width: UTF-8 separators and decimal points

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7327>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 28, 2009, 8:43 AM

Post #2 of 16 (433 views)

Changes by Mark Dickinson <dickinsm [at] gmail>:


----------
assignee: -> mark.dickinson



report at bugs

Nov 28, 2009, 9:53 AM

Post #3 of 16 (432 views)

Matthew Barnett <python [at] mrabarnett> added the comment:

Surely this is to be expected when working with bytestrings. You should
be working in Unicode and using UTF-8 only for input and output.

----------
nosy: +mrabarnett



report at bugs

Nov 30, 2009, 5:04 AM

Post #4 of 16 (422 views)

Stefan Krah <stefan-usenet [at] bytereef> added the comment:

What do you mean by "working with bytestrings"? The UTF-8 separators or
decimal points come directly from struct lconv (man localeconv). The
logical way to reach a minimum width of 19 is to have 19 UTF-8
characters, which can subsequently be converted to other formats.

----------



report at bugs

Dec 1, 2009, 3:19 PM

Post #5 of 16 (418 views)

R. David Murray <rdmurray [at] bitdance> added the comment:

In python3:

>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> len(s)
20
>>> print(s)
-0 000 000 000 001,5

Python3 uses unicode for strings. Python2 uses bytes. To format
unicode in python2, you do:

>>> s2 = locale.format("% 019.18g", Decimal("-1.5"))
>>> len(s2)
19
>>> print s2
-0000000000000001,5

Not quite the same thing, clearly. So, is there a way to access the
python3 unicode format semantics in python2? Just passing format a
unicode format string results in a UnicodeDecodeError.

----------
nosy: +r.david.murray
priority: -> normal
type: -> behavior
versions: +Python 2.6, Python 2.7



report at bugs

Dec 1, 2009, 4:05 PM

Post #6 of 16 (420 views)

Eric Smith <eric [at] trueblade> added the comment:

In 2.7, I get:

$ ./python.exe
Python 2.7a0 (trunk:76501, Nov 24 2009, 14:57:21)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> s
'-0 000 000 000 001,5'
>>> len(s)
20
>>> s = format(Decimal("-1.5"), u' 019.18n')
>>> s
u'-0 000 000 000 001,5'
>>> len(s)
20
>>>

Could you give more details on the UnicodeDecodeError you get? Any
traceback?

----------



report at bugs

Dec 1, 2009, 4:29 PM

Post #7 of 16 (419 views)

R. David Murray <rdmurray [at] bitdance> added the comment:

Interesting. My regular locale is LC_CTYPE=en_US.UTF-8, and here is
what I get:

Python 2.7a0 (trunk:76501, Nov 24 2009, 13:59:01)
[GCC 4.4.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s)
19
>>> print s
-0 000 000 001,5

sys.stdout.encoding gives 'UTF-8'.

And here's the traceback from trying to use unicode:

>>> s = format(Decimal("-1.5"), u' 019.18n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 3609, in __format__
    return _format_number(self._sign, intpart, fracpart, exp, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5704, in _format_number
    return _format_align(sign, intpart+fracpart, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5595, in _format_align
    result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
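
The failure in the traceback comes from decoding a byte string that
contains 0xc2 with the ASCII codec. A rough Python 3 equivalent is
sketched below; using bytes.decode('ascii') as a stand-in for Python 2's
implicit unicode(result) conversion is an assumption made for
illustration, and the byte string is the one shown earlier in the thread:

```python
# Python 3 stand-in for the Python 2 call unicode(result), which decodes
# a byte string with the default ASCII codec.
result = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'

error = None
try:
    result.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the locale actually uses succeeds.
text = result.decode('utf-8')
```

The offending byte sits at index 2 because the sign and the leading zero
occupy indices 0 and 1.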

----------



report at bugs

Dec 1, 2009, 7:00 PM

Post #8 of 16 (417 views)

Eric Smith <eric [at] trueblade> added the comment:

I can duplicate this on Linux. The difference is the values in the
locale for the separators, specifically,
locale.localeconv()['thousands_sep'].

>>> locale.localeconv()['thousands_sep']
'\xc2\xa0'

The question is: since a struct lconv contains char*s, how to interpret
them? The code in decimal interprets them as ascii, apparently. floats
do the same thing, so this isn't strictly a decimal problem. I'll have
to give it some thought.
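
To make the interpretation question concrete, here is a small Python 3
sketch (Python 3 is an assumption; the thread concerns 2.x) that decodes
the same two separator bytes under three different codecs:

```python
# thousands_sep as reported for cs_CZ.UTF-8 earlier in the thread.
sep = b'\xc2\xa0'

as_utf8 = sep.decode('utf-8')      # one character: U+00A0 NO-BREAK SPACE
as_latin1 = sep.decode('latin-1')  # two characters: U+00C2 and U+00A0

# ASCII cannot represent either byte, so decoding fails outright.
try:
    sep.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(as_utf8), len(as_latin1), ascii_ok)
```

Only the encoding of the active locale yields the single character the
locale intends; the bytes alone do not say which encoding that is.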

----------



report at bugs

Dec 2, 2009, 2:42 AM

Post #9 of 16 (417 views)

Stefan Krah <stefan-usenet [at] bytereef> added the comment:

In python3.2, the output of decimal looks good. With float, the
separator is printed as two spaces on my Unicode terminal (export
LC_ALL=cs_CZ.UTF-8).

So decimal (3.2) interprets the separator string as a single UTF-8 char
and the final output is a UTF-8 string. I'd say that in C, this is the
intended way of using struct lconv.

If there is an agreement that the final output should be a UTF-8 string,
this looks correct to me.



Python 3.2a0 (py3k:76081M, Nov 6 2009, 15:23:48)
[GCC 4.1.3 20080623 (prerelease) (Ubuntu 4.1.2-23ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, decimal
>>> locale.setlocale(locale.LC_NUMERIC, 'cs_CZ.UTF-8')
'cs_CZ.UTF-8'
>>> x = format(decimal.Decimal("-1.5"), '019.18n')
>>> y = format(float("-1.5"), '019.18n')
>>> x
'-0\xa0000\xa0000\xa0000\xa0001,5'
>>> y
'-0ᅡᅠ000ᅡᅠ000ᅡᅠ001,5'
>>> print(x)
-0 000 000 000 001,5
>>> print(y)
-0ᅡᅠ000ᅡᅠ000ᅡᅠ001,5
>>>

----------



report at bugs

Dec 2, 2009, 3:53 AM

Post #10 of 16 (417 views)

Mark Dickinson <dickinsm [at] gmail> added the comment:

So when the format string has type 'str' (as in Stefan's original example)
rather than type 'unicode', I'd say Python is doing the right thing
already: everything in sight, including the separators coming from
localeconv(), has type 'str', so trying to interpret things as unicode
seems a bit of a stretch.

If the '\xc2\xa0' from localeconv()['thousands_sep'] is to be interpreted
as a single unicode character, shouldn't it be a unicode
string already?

However, if localeconv()['thousands_sep'] *were* to give a unicode string,
then I suppose Decimal.__format__ should be returning a unicode result; I
don't think it currently does this. (Should this be true even if the
number being formatted is so short that no thousands separators actually
appear in it?)

----------



report at bugs

Dec 2, 2009, 5:15 AM

Post #11 of 16 (414 views)

Eric Smith <eric [at] trueblade> added the comment:

I don't see any documentation that a struct lconv should be interpreted
as UTF-8. In fact Googling "struct lconv utf-8" gives this bug report as
the first hit.

lconv.thousands_sep is char*. It's never been clear to me if this means
"pointer to a single char", or "pointer to a null terminated string of
char". In Objects/stringlib/localeutil.h I treat it as a string of char.

----------



report at bugs

Dec 2, 2009, 5:58 AM

Post #12 of 16 (414 views)

Eric Smith <eric [at] trueblade> added the comment:

In trunk, Modules/_localemodule.c also treats these as "string of char",
so at least we're consistent.

In py3k, mbstowcs is used and the result passed to PyUnicode_FromWideChar.

I'm not sure how you'd address this in locale in trunk, or if we want to
do something similar in localeutil.h in trunk (for the Unicode case).
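
For comparison, the visible effect of the py3k approach can be checked
from Python 3 itself. This is only a sketch of what a caller sees; it
uses whatever LC_NUMERIC locale is currently active and does not inspect
the C code:

```python
import locale

# Values for the currently active LC_NUMERIC locale.
conv = locale.localeconv()
sep = conv['thousands_sep']
point = conv['decimal_point']

# In Python 3 both entries are already decoded text (str), not bytes.
print(type(sep).__name__, repr(sep), repr(point))
```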

----------



report at bugs

Dec 2, 2009, 6:10 AM

Post #13 of 16 (414 views)

Stefan Krah <stefan-usenet [at] bytereef> added the comment:

Googling "multi-byte thousands separator" gives better results. From
those results, it is clear to me that decimal_point and thousands_sep
are strings that may be interpreted as multi-byte characters. The Czech
separator appears to be a no-break space multi-byte character.


http://sourceware.org/ml/libc-hacker/2007-01/msg00005.html
http://drupal.org/node/353897


My point is that if a multi-byte character appears, it should be
counted as a single character for the purposes of calculating
min-width. Otherwise, the printed representation is too short.
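
As an illustration of counting characters rather than bytes, the sketch
below pads a UTF-8 byte string until its decoded length reaches the
requested width. The helper pad_to_width is hypothetical and is not part
of decimal or locale; it left-pads with a single-byte fill instead of
reproducing the zero padding with grouping that format() performs.
Python 3 is assumed.

```python
def pad_to_width(encoded, width, fill=b' ', encoding='utf-8'):
    """Left-pad `encoded` so its decoded length is at least `width`.

    Hypothetical helper for illustration only. `fill` is assumed to
    encode exactly one character.
    """
    shortfall = width - len(encoded.decode(encoding))
    return fill * max(shortfall, 0) + encoded


# Byte string from the original report: 19 bytes, 16 characters.
s = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
padded = pad_to_width(s, 19)

print(len(padded), len(padded.decode('utf-8')))
```

The padded result is longer than 19 bytes, but it is 19 characters wide
once decoded, which is the width a reader sees.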

----------



report at bugs

Dec 3, 2009, 3:19 AM

Post #14 of 16 (414 views)

Mark Dickinson <dickinsm [at] gmail> added the comment:

Reassigning to Eric.

----------
assignee: mark.dickinson -> eric.smith



report at bugs

Dec 3, 2009, 3:21 AM

Post #15 of 16 (411 views)

Eric Smith <eric [at] trueblade> added the comment:

I've raised the issue with unicode and locale on python-dev:
http://mail.python.org/pipermail/python-dev/2009-December/094408.html

Pending the outcome of that decision, I'll move forward on this issue.

----------



report at bugs

Dec 4, 2009, 6:25 AM

Post #16 of 16 (396 views)

Eric Smith <eric [at] trueblade> added the comment:

See the discussion on python-dev, in particular Martin's comment at
http://mail.python.org/pipermail/python-dev/2009-December/094412.html

The solutions to this seem too complex for 2.x. It is not a problem in 3.x.

----------
resolution: -> wont fix
status: open -> closed
