Mailing List Archive: Python: Dev

please consider changing --enable-unicode default to ucs4

 

 



zookog at gmail

Sep 20, 2009, 7:02 AM

Post #1 of 7 (3956 views)
please consider changing --enable-unicode default to ucs4

Dear Pythonistas:

This issue causes serious problems. Users occasionally get binaries built for a
compatible Linux and Python version but with a different UCS2-vs-UCS4 setting,
and those users get mysterious memory corruption errors which are hard to
diagnose. It is possible that these situations also open up security
vulnerabilities. A couple such instances are documented on
http://bugs.python.org/setuptools/issue78, but you can find more by googling.
I would like to get this problem fixed!

In order to help address this issue I sampled what UCS size is used by python
executables in the wild. I instrumented a few buildslaves that are contributed
by various people to the Tahoe-LAFS project to print out their platform, python
version, and sys.maxunicode. The full results are appended below. maxunicode:
1114111 means that python executable was configured with --enable-unicode=ucs4,
and maxunicode: 65535 means that python executable was configured with
--enable-unicode=ucs2 or just with --enable-unicode. The only incompatibilities
that I found are because some packagers have deliberately set UCS4
configuration and other packagers have left the default setting.
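
For reference, the kind of report described above can be produced with a few
lines of Python. This is a minimal sketch, not the script that actually ran on
the buildslaves; the function name describe_unicode_build is made up for
illustration. Note that since Python 3.3 (PEP 393) every build reports
1114111, so the narrow/wide distinction only shows up on older interpreters.

```python
import platform
import sys


def describe_unicode_build():
    """Collect the platform, Python version, and Unicode width of this interpreter."""
    # 1114111 (0x10FFFF) indicates a UCS4 ("wide") build; 65535 a UCS2 ("narrow") one.
    width = "ucs4" if sys.maxunicode > 0xFFFF else "ucs2"
    return {
        "platform": platform.platform(),
        "python": sys.version.replace("\n", " "),
        "maxunicode": sys.maxunicode,
        "width": width,
    }


info = describe_unicode_build()
print("%(platform)s: python: %(python)s, maxunicode: %(maxunicode)d" % info)
```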

In the three cases where someone configured python with UCS2, one of the three
is certainly an accident (a custom-built python executable on an Ubuntu server).
The other two just use the default instead of specifically configuring ucs2
in their configurations of Python. I suspect that they don't know the
difference and that it was an accident that they built a Python incompatible
with other distributions of their operating system.

In sum, while it would be good to add the unicode setting to the platform's ABI
(as discussed in setuptools ticket #78), it would also be good to make the
default value be UCS4 instead of UCS2. This would fix all three of the potential
incompatibilities that I found (listed below), and once we have proper inclusion
of the unicode setting in the ABI in order to prevent the memory corruption,
defaulting to UCS4 would increase the likelihood that a binary built on one
distribution would be usable on another.
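
To make "adding the unicode setting to the platform's ABI" concrete, here is a
rough sketch of what a packaging tool could do. The tag format and the names
unicode_abi_flag, platform_tag, and can_load are assumptions made up for this
example; they are not what setuptools or distutils actually implement.

```python
import sys


def unicode_abi_flag(maxunicode=sys.maxunicode):
    """Return 'ucs4' for wide builds and 'ucs2' for narrow builds."""
    return "ucs4" if maxunicode > 0xFFFF else "ucs2"


def platform_tag(base_tag, maxunicode=sys.maxunicode):
    """Append the Unicode width to a platform tag (hypothetical tag format)."""
    return "%s-%s" % (base_tag, unicode_abi_flag(maxunicode))


def can_load(binary_tag, interpreter_tag):
    """A binary is only considered usable when the full tags match."""
    return binary_tag == interpreter_tag


built_on_narrow = platform_tag("linux-i686-py2.6", 65535)
running_on_wide = platform_tag("linux-i686-py2.6", 1114111)
# The mismatch is caught before the extension module is ever imported.
print(can_load(built_on_narrow, running_on_wide))
```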

I'm sure that someone can come up with a reason why UCS2 is better than UCS4,
but I'm also sure that the benefits of compatibility outweigh any benefits of
UCS2 encoding, and that the widespread use of UCS4 demonstrates that there is
nothing fatally wrong with it, and that people who really value UCS2 encoding
more than compatibility can choose that for themselves by explicitly setting
UCS2.

Let me restate that I am not suggesting taking away anyone's options, only
making the setting for people who don't specify default to the compatible
option. Hm, I guess that means that it should default to UCS2 on Windows and
Mac and to UCS4 on Linux and Solaris.

Regards,

Zooko

Ubuntu 6.10 "edgy" i386: python: 2.4.4c1 (#2, Mar 7 2008, 03:03:38) [GCC 4.1.2
20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)], maxunicode: 1114111
Ubuntu 7.04 "feisty": python: 2.5.1 (r251:54863, Jul 31 2008, 22:53:39) [GCC
4.1.2 (Ubuntu 4.1.2-0ubuntu4)], maxunicode: 1114111
Ubuntu 7.10 "gutsy" i386: python: 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)], maxunicode: 1114111
Ubuntu 8.04 "hardy" amd64: python: 2.5.2 (r252:60911, Jul 22 2009, 15:33:10)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 8.04 "hardy" i386: *custom* python: 2.6 (r26:66714, Oct 2 2008,
13:40:28) [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)], maxunicode: 65535
Ubuntu 8.04 "hardy" i386: python: 2.5.2 (r252:60911, Jul 22 2009, 15:35:03)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 9.04 "jaunty" amd64: *custom* python: 2.6.2 (release26-maint, Apr 19
2009, 01:58:18) [GCC 4.3.3], maxunicode: 1114111

Debian 4.0 "etch" i386: python: 2.4.4 (#2, Oct 22 2008, 19:52:44) [GCC 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)], maxunicode: 1114111
Debian 5.0 "lenny" i386: python: 2.5.2 (r252:60911, Jan 4 2009, 17:40:26) [GCC
4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" amd64: python: 2.5.2 (r252:60911, Jan 4 2009, 21:59:32)
[GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" armv5tel: python: 2.5.2 (r252:60911, Jan 5 2009, 02:00:00)
[GCC 4.3.2], maxunicode: 1114111
Debian unstable "squeeze/sid" i386: python: 2.5.4 (r254:67916, Feb 17 2009,
20:16:45) [GCC 4.3.3], maxunicode: 1114111

Fedora 11 "leonidas" amd64: python: 2.6 (r26:66714, Jul 4 2009, 17:37:13) [GCC
4.4.0 20090506 (Red Hat 4.4.0-4)], maxunicode: 1114111

ArchLinux: python: 2.6.2 (r262:71600, Jul 20 2009, 02:23:30) [GCC 4.4.0
20090630 (prerelease)], maxunicode: 65535

NetBSD 4: python: 2.5.2 (r252:60911, Mar 20 2009, 14:00:07) [GCC 4.1.2 20060628
prerelease (NetBSD nb2 20060711)], maxunicode: 65535

OpenSolaris SunOS-5.11-i86pc-i386-32bit: python: 2.4.4 (#1, Mar 10 2009,
09:35:36) [C], maxunicode: 65535
Nexenta NCP1 SunOS-5.11-i86pc-i386-32bit: python: 2.4.3 (#2, May 3 2006,
19:12:42) [GCC 4.0.3 (GNU_OpenSolaris 4.0.3-1nexenta4)], maxunicode: 1114111

Mac OS 10.6 "snow leopard" i386: python: 2.6.1 (r261:67515, Jul 7 2009,
23:51:51) [GCC 4.2.1 (Apple Inc. build 5646)], maxunicode: 65535
Mac OS 10.5 "leopard" i386: python: 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)], maxunicode: 65535
Mac OS 10.4 "tiger" *custom* python: 2.5.4 (release25-maint:72153M, Apr 30 2009,
12:28:20) [GCC 4.0.1 (Apple Computer, Inc. build 5367)], maxunicode: 65535

Cygwin CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE: python: 2.5.2
(r252:60911, Dec 2 2008, 09:26:14) [GCC 3.4.4 (cygming special, gdc 0.12,
using dmd 0.125)], maxunicode: 65535

Windows: python: 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit
(Intel)], maxunicode: 65535
Windows: python: 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)], maxunicode: 65535
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


benjamin at python

Sep 20, 2009, 7:06 AM

Post #2 of 7 (3825 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

2009/9/20 Zooko O'Whielacronx <zookog [at] gmail>:
> Dear Pythonistas:
>
> This issue causes serious problems.  Users occasionally get binaries built for a
> compatible Linux and Python version but with a different UCS2-vs-UCS4 setting,
> and those users get mysterious memory corruption errors which are hard to
> diagnose.  It is possible that these situations also open up security
> vulnerabilities.  A couple such instances are documented on
> http://bugs.python.org/setuptools/issue78, but you can find more by googling.
> I would like to get this problem fixed!

You may want to have a look at the archives of the last time this was
extensively discussed:
http://mail.python.org/pipermail/python-dev/2008-July/080886.html


--
Regards,
Benjamin


zookog at gmail

Sep 20, 2009, 7:16 AM

Post #3 of 7 (3807 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

I'm sorry, I should have mentioned that I did read those archives
before I posted my letter. That discussion was all about whether UCS2
or UCS4 is better. I consider that question to be mostly irrelevant
to this issue, which is about compatibility for people who don't
choose to configure that setting themselves. Platforms or people who
prefer UCS2 will continue to use it as appropriate. UCS4 is clearly
good enough for the vast majority of Linux users, and having fewer
mysterious segfaults and potential security vulnerabilities would be
an important improvement to the user experience of Python on Linux.

I should mention that the reason I'm spending time on this right now
is that it is currently blocking me from being able to distribute
binaries of Python packages which will work for all of my Linux users.

Regards,

Zooko


mal at egenix

Sep 20, 2009, 11:28 AM

Post #4 of 7 (3805 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Zooko O'Whielacronx wrote:
> On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis [at] pitrou> wrote:
>>
>> What "binaries" are you talking about?
>
> I mean extension modules with native code, which means .so shared
> library files on unix.

Those will not load unless they are for the right UCS-version of
Python. The extensions will give an ImportError if they are
using any Unicode APIs - we go to great lengths in the
Unicode API to make sure that you cannot mix UCS2 and UCS4 APIs.
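
The protection mentioned here works by renaming symbols: in Python 2 the
header unicodeobject.h uses macros to map each public Unicode function to a
UCS2- or UCS4-specific name (for example, PyUnicode_FromUnicode becomes
PyUnicodeUCS2_FromUnicode on a narrow build). The Python sketch below only
mimics that renaming to show why the symbol lookup fails on a mismatched
interpreter; mangled_symbol and exported_symbols are illustrative names, not
part of any real API.

```python
def mangled_symbol(api_name, maxunicode):
    """Mimic the UCS2/UCS4 renaming applied to PyUnicode_* functions."""
    width = "UCS4" if maxunicode > 0xFFFF else "UCS2"
    prefix = "PyUnicode_"
    if not api_name.startswith(prefix):
        return api_name  # non-Unicode APIs are not renamed
    return "PyUnicode" + width + "_" + api_name[len(prefix):]


def exported_symbols(api_names, maxunicode):
    """Symbols an interpreter of the given width would export."""
    return set(mangled_symbol(name, maxunicode) for name in api_names)


api = ["PyUnicode_FromUnicode", "PyUnicode_AsUTF8String"]
ucs4_interpreter = exported_symbols(api, 1114111)

# An extension compiled against a UCS2 Python asks for the UCS2 name,
# which a UCS4 interpreter does not provide, so the import fails.
needed = mangled_symbol("PyUnicode_FromUnicode", 65535)
missing = needed not in ucs4_interpreter
print("%s missing: %s" % (needed, missing))
```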

I'm not exactly sure what you are trying to achieve by making
UCS4 the default... if you build extensions using the system
Python version, distutils will automatically build the right
UCS-version for you.

>> AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
>
> That would be an improvement! Unfortunately we instead get mysterious
> misbehavior of the module, e.g.:
>
> http://bugs.python.org/setuptools/msg309
> http://allmydata.org/trac/tahoe/ticket/704#comment:5

Those don't appear to be related to UCS2 vs. UCS4 but rather
some problem with the UTF-8 data those users are trying to load.

That setuptools completely ignores the fact that Python UCS2 and
UCS4 are two different Python builds is not really a Python Unicode
problem, but a problem with the setuptools design, so you should
probably complain there.

>> For information, all Mandriva versions I've used until now have had their
>> Python's built with UCS2 (maxunicode == 65535).
>
> Thank you for the data point. This means that binary extension
> modules built on Mandriva can't be ported to Ubuntu or vice versa.
> However, is this an argument for or against changing the default
> setting to UCS4? Changing the default setting wouldn't interfere with
> Mandriva's decision, right?

Depends on what you mean by "ported": of course you can port a
source RPM between UCS2 and UCS4 builds. This just requires a
recompile.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Sep 20 2009)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/


martin at v

Sep 20, 2009, 12:27 PM

Post #5 of 7 (3802 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Zooko O'Whielacronx wrote:
> I'm sorry, I should have mentioned that I did read those archives
> before I posted my letter. That discussion was all about whether UCS2
> or UCS4 is better. I consider that question to be mostly irrelevant
> to this issue, which is about compatibility for people who don't
> choose to configure that setting themselves.

You surely must have missed the sentence

"For that reason I think it's also better that the configure script
continues to default to UTF-16 -- this will give the UTF-16 support
code the necessary exercise."

This is effectively a BDFL pronouncement. Nothing has changed the
validity of the premise of the statement, so the conclusion remains
valid, as well.

Regards,
Martin


rhamph at gmail

Oct 7, 2009, 5:10 PM

Post #6 of 7 (3685 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

On Sun, Sep 20, 2009 at 10:17, Zooko O'Whielacronx <zookog [at] gmail> wrote:
> On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis [at] pitrou> wrote:
>> AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
>
> That would be an improvement!  Unfortunately we instead get mysterious
> misbehavior of the module, e.g.:
>
> http://bugs.python.org/setuptools/msg309
> http://allmydata.org/trac/tahoe/ticket/704#comment:5

The real issue here is getting confused because python's option is
misnamed. We support UTF-16 and UTF-32, not UCS-2 and UCS-4. This
means that when decoding UTF-8, any scalar value outside the BMP will
be split into a pair of surrogates on UTF-16 builds; if we were using
UCS-2 that'd be an error instead (and *nothing* would understand
surrogates.)
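
The surrogate split described above is simple arithmetic. The sketch below
shows it in Python; to_surrogate_pair is an illustrative helper, not a
function from the interpreter.

```python
def to_surrogate_pair(code_point):
    """Split a scalar value outside the BMP into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)     # U+1D11E MUSICAL SYMBOL G CLEF
print("%04X %04X" % (high, low))           # prints "D834 DD1E"
```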

Yet we are getting an error here. However, if you look at the details
you'll notice it's on a 6-byte UTF-8 code unit sequence, corresponding
in the second link to U+6E657770. Although UTF-8 originally left
open the possibility of encoding up to 31 bits (or U+7FFFFFFF), this
was removed in RFC 3629 and is now strictly prohibited. The modern
Unicode character set itself also imposes that restriction. There is
nothing beyond U+10FFFF. Nothing should create such a high code
point, and even if it happened internally an RFC 3629-conformant UTF-8
encoder must refuse to pass it through.
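
To illustrate the point, the sketch below encodes and decodes the obsolete
6-byte form (1111110x followed by five continuation bytes) and checks the
result against the U+10FFFF limit. The helper names are made up for this
example, and the byte sequence is derived from the code point quoted above,
not copied from the linked tickets.

```python
MAX_SCALAR = 0x10FFFF


def encode_legacy_six_byte(code_point):
    """Encode a 31-bit value in the obsolete 6-byte form (pre-RFC 3629)."""
    if not 0x4000000 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not need the 6-byte form")
    shifts = (24, 18, 12, 6, 0)
    lead = 0xFC | (code_point >> 30)
    tail = [0x80 | ((code_point >> shift) & 0x3F) for shift in shifts]
    return bytearray([lead] + tail)


def decode_legacy_six_byte(seq):
    """Decode one sequence in the obsolete 6-byte form."""
    data = bytearray(seq)
    if len(data) != 6 or (data[0] & 0xFE) != 0xFC:
        raise ValueError("not a 6-byte sequence")
    value = data[0] & 0x01
    for byte in data[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def is_valid_scalar(code_point):
    """True for Unicode scalar values: up to U+10FFFF, excluding surrogates."""
    return 0 <= code_point <= MAX_SCALAR and not 0xD800 <= code_point <= 0xDFFF


seq = encode_legacy_six_byte(0x6E657770)
decoded = decode_legacy_six_byte(seq)
print("U+%X valid: %s" % (decoded, is_valid_scalar(decoded)))

# A strict decoder rejects the sequence outright.
try:
    bytes(seq).decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```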

Something more subtle must be going on. Possibly several bugs (such
as a non-conformant encoder or garbage being misinterpreted as UTF-8).


--
Adam Olsen, aka Rhamphoryncus


mal at egenix

Oct 8, 2009, 12:47 AM

Post #7 of 7 (3685 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Adam Olsen wrote:
> On Sun, Sep 20, 2009 at 10:17, Zooko O'Whielacronx <zookog [at] gmail> wrote:
>> On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis [at] pitrou> wrote:
>>> AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
>>
>> That would be an improvement! Unfortunately we instead get mysterious
>> misbehavior of the module, e.g.:
>>
>> http://bugs.python.org/setuptools/msg309
>> http://allmydata.org/trac/tahoe/ticket/704#comment:5

I agree that a better error message would help. I'm just not sure
how to achieve that.

The error message you currently see gets generated by the dynamic
linker trying to resolve a Python Unicode API symbol: the API names
are mangled to ensure that you cannot mix UCS2 interpreters and UCS4
extensions (and vice-versa).

We could try to scan the linker error message for 'Py.*UCS.'
and then replace the message with a more helpful one (in importdl.c),
but I'm not sure how portable that is.
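
The idea could look roughly like the following, written in Python for
brevity even though the real change would live in C in importdl.c. The
function name explain_linker_error and the sample message are assumptions
for illustration; the exact wording of linker errors differs between
platforms, which is the portability concern.

```python
import re

# Matches mangled Unicode API names such as PyUnicodeUCS2_FromUnicode.
_UCS_SYMBOL = re.compile(r"Py\w*UCS([24])\w*")


def explain_linker_error(message, interpreter_maxunicode):
    """Append a hint when a linker error mentions a UCS2/UCS4-mangled symbol."""
    match = _UCS_SYMBOL.search(message)
    if match is None:
        return message  # unrelated error: leave it untouched
    wanted = "UCS" + match.group(1)
    have = "UCS4" if interpreter_maxunicode > 0xFFFF else "UCS2"
    hint = "extension was built for a %s Python; this interpreter is %s" % (wanted, have)
    return "%s (%s)" % (message, hint)


sample = "undefined symbol: PyUnicodeUCS2_FromUnicode"  # illustrative message
print(explain_linker_error(sample, 1114111))
```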

--
Marc-Andre Lemburg
eGenix.com

