New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()


__peter__ at web

Aug 19, 2012, 12:43 AM

Steven D'Aprano wrote:

> On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
>
>> "a" will be stored as 1 byte/codepoint.
>>
>> Adding "é", it will still be stored as 1 byte/codepoint.
>
> Wrong. It will be 2 bytes, just like it already is in Python 3.2.
>
> I don't know where people are getting this myth that PEP 393 uses Latin-1
> internally, it does not. Read the PEP, it explicitly states that 1-byte
> formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51)
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200

I infer that

(1) both ASCII and Latin-1 strings require one byte per character.
(2) Latin-1 strings have a constant overhead of 24 bytes (on a 64-bit system)
over ASCII-only strings.
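
The differencing trick used in the session above can be wrapped in a small helper. This is only a sketch: the name bytes_per_char is made up for illustration, and the results assume CPython 3.3 or later, where sys.getsizeof reflects the PEP 393 layout. Other implementations may report different numbers.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character by differencing two string sizes.

    Subtracting the sizes of two strings that differ only in length
    cancels the fixed per-object header, leaving only the character data.
    (Hypothetical helper; assumes CPython 3.3+.)
    """
    short = ch * 10
    longer = ch * (10 + n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) / n

for label, ch in [("ASCII", "e"), ("Latin-1", "\xe9"),
                  ("BMP", "\u20ac"), ("astral", chr(0x10000))]:
    print(label, bytes_per_char(ch))
```

The absolute sizes depend on the Python version and on whether the build is 32-bit or 64-bit, which is why the helper compares differences rather than raw getsizeof values.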

--
http://mail.python.org/mailman/listinfo/python-list


steve+comp.lang.python at pearwood

Aug 19, 2012, 1:56 AM


On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:

> Steven D'Aprano wrote:

>> I don't know where people are getting this myth that PEP 393 uses
>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>> that 1-byte formats are only used for ASCII strings.
>
> From
>
> Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51)
> [GCC 4.6.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> [sys.getsizeof("é"*i) for i in range(10)]
> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]


py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]


On re-reading the PEP more closely, it looks like I did misunderstand the
internal implementation, and strings which fit exactly in Latin-1 will
also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and
4-byte data. So I stand corrected.
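
A quick way to see the 1-byte, 2-byte and 4-byte forms in action is to check how the widest character in a string affects the storage of all the others. The sketch below is an illustration only: the helper name storage_width is invented here, and the numbers assume CPython 3.3 or later.

```python
import sys

def storage_width(s):
    """Estimate bytes used per character of s (hypothetical helper).

    Compares two freshly built repetitions of s so that the fixed
    object header cancels out. Assumes CPython 3.3+ (PEP 393).
    """
    return (sys.getsizeof(s * 3) - sys.getsizeof(s * 2)) / len(s)

base = "a" * 100
print(storage_width(base))                 # pure ASCII
print(storage_width(base + "\xe9"))        # one Latin-1 character added
print(storage_width(base + "\u20ac"))      # one BMP character added
print(storage_width(base + chr(0x10000)))  # one astral character added
```

On such a build, a single wide character widens the storage of every character in the string, so the width is a property of the whole string rather than of individual characters.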


--
Steven


wxjmfauth at gmail

Aug 19, 2012, 2:24 AM


On Sunday, 19 August 2012 10:56:36 UTC+2, Steven D'Aprano wrote:
>
> internal implementation, and strings which fit exactly in Latin-1 will
>

And this is the crucial point. Latin-1 is an obsolete and unusable
coding scheme (especially for European languages).

We come back to the point I mentioned above. Microsoft knows this, as do
Apple, "TeX", and the foundries.
Even "ISO" has recognized its error and produced ISO-8859-15.

The question: why is it still used?
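
For readers who want to see the gap between the two encodings mentioned here, the following sketch checks which characters each codec can represent. The helper can_encode is a made-up name for illustration; the codec names are the standard Python ones. Note that this concerns Latin-1 as an encoding for text interchange, which is a separate matter from how an interpreter stores code points in memory.

```python
def can_encode(text, encoding):
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# The euro sign and the oe ligature are missing from Latin-1
# but present in ISO-8859-15.
for ch in ["\xe9", "\u20ac", "\u0153"]:
    print(repr(ch),
          "latin-1:", can_encode(ch, "latin-1"),
          "iso-8859-15:", can_encode(ch, "iso-8859-15"))
```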

jmf


