
Mailing List Archive: Python: Dev

Re: please consider changing --enable-unicode default to ucs4

 

 



solipsis at pitrou

Sep 20, 2009, 7:27 AM

Post #1 of 17 (2475 views)
Re: please consider changing --enable-unicode default to ucs4

Zooko O'Whielacronx <zookog <at> gmail.com> writes:
>
> Users occasionally get binaries built for a
> compatible Linux and Python version but with a different UCS2-vs-UCS4 setting,
> and those users get mysterious memory corruption errors which are hard to
> diagnose.

What "binaries" are you talking about?
AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
That's the reason we have all those #define's in unicodeobject.h: the actual
function names end up being different and, therefore, are not found when linking.

> In order to help address this issue I sampled what UCS size is used by python
> executables in the wild.

For information, all Mandriva versions I've used until now have had their
Pythons built with UCS2 (maxunicode == 65535).

Regards

Antoine.


_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


zookog at gmail

Sep 20, 2009, 9:17 AM

Post #2 of 17 (2389 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis [at] pitrou> wrote:
>
> What "binaries" are you talking about?

I mean extension modules with native code, which means .so shared
library files on unix.

> AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.

That would be an improvement! Unfortunately we instead get mysterious
misbehavior of the module, e.g.:

http://bugs.python.org/setuptools/msg309
http://allmydata.org/trac/tahoe/ticket/704#comment:5

> For information, all Mandriva versions I've used until now have had their
> Pythons built with UCS2 (maxunicode == 65535).

Thank you for the data point. This means that binary extension
modules built on Mandriva can't be ported to Ubuntu or vice versa.
However, is this an argument for or against changing the default
setting to UCS4? Changing the default setting wouldn't interfere with
Mandriva's decision, right?

Regards,

Zooko


zookog at gmail

Sep 20, 2009, 9:33 AM

Post #3 of 17 (2389 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis [at] pitrou> wrote:
> For information, all Mandriva versions I've used until now have had their
> Pythons built with UCS2 (maxunicode == 65535).

By the way, I was investigating this, and discovered an issue on the
Mandriva tracker which suggests that they intend to switch to UCS4 in
the next release in order to avoid compatibility problems like these.
(Not because they think that UCS4 is better than UCS2.)

https://qa.mandriva.com/show_bug.cgi?id=48570

Regards,

Zooko


mal at egenix

Sep 28, 2009, 1:25 AM

Post #4 of 17 (2359 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Zooko O'Whielacronx wrote:
> Folks:
>
> I'm sorry, I think I didn't make my concern clear. My users, and lots
> of other users, are having a problem with incompatibility between
> Python binary extension modules. One way to improve the situation
> would be if the Python devs would use their "bully pulpit" -- their
> unique position as a source respected by all Linux distributions --
> and say "We recommend that Linux distributions use UCS4 for
> compatibility with one another". This would not abrogate anyone's
> ability to choose their preferred setting nor, as far as I can tell,
> would it interfere with the ongoing development of Python.

-1

Please note that we did not choose to ship Python as UCS4 binary
on Linux - the Linux distributions did.

The Python default is UCS2 for a good reason: it's a good trade-off
between memory consumption, functionality and performance.

As already mentioned, I also don't understand how changing
the Python default on Linux would help your users in any way -
if you let distutils compile your extensions, it's automatically
going to use the right Unicode setting for you (as well as your
users).

Unfortunately, this automatic support doesn't help you when
shipping e.g. setuptools eggs, but this is a tool problem,
not one of Python: setuptools completely ignores the fact
that there are two ways to build Python.

I'd suggest you ask the tool maintainers to adjust their tools
to support the Python Unicode option.

> Here are the details:
>
> I'm the maintainer of several Python packages. I work hard to make it
> easy for users, even users who don't know anything about Python, to
> use my software. There have been many pain points in this process and
> I've spent a lot of time on it for about three years now working on
> packaging, including the tools such as setuptools and distutils and
> the new "distribute" tool. Python packaging has been improving during
> these years -- things are looking up.
>
> One of the remaining pain points is that I can distribute binaries of
> my Python extension modules for Windows or Mac, but if I distribute a
> binary Python extension module on Linux, then if the user has a
> different UCS2/UCS4 setting then they won't be able to use the
> extension module. The current de facto standard for Linux is UCS4 --
> it is used by Debian, Ubuntu, Fedora, RHEL, OpenSuSE, etc. The
> vast majority of Linux users in practice have UCS4, and most binary
> Python modules are compiled for UCS4.
>
> That means that a few folks will get left out. Those folks, from my
> experience, are people who built their python executable themselves
> without specifying an override for the default, and the smaller Linux
> distributions who insist on doing whatever upstream Python devs
> recommend instead of doing whatever the other Linux distros are doing.
> One of the data points that I reported was a Python interpreter that
> was built locally on an Ubuntu server. Since the person building it
> didn't know to override the default setting of --enable-unicode, he
> ended up with a Python interpreter built for UCS2, even though all the
> Python extension modules shipped by Ubuntu were built with UCS4.

People building their own Python version will usually also build
their own extensions, so I don't really believe that the above
scenario is very common.

Also note that Python will complain loudly when you try to load
a UCS2 extension in a UCS4 build and vice-versa. We've made sure
that any extension using the Python Unicode C API has to be built
for the same UCS version of Python. This is done by using different
names for the C APIs at the C level.
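
As a rough illustration of the name mangling described above, the
sketch below scans the raw bytes of an extension module for the
PyUnicodeUCS2_ and PyUnicodeUCS4_ prefixes that show up in the
"undefined symbol" errors quoted in this thread. The function names
are made up for this example, and a plain byte scan is an assumption
chosen for brevity; reading the symbol table would be more precise.

```python
def guess_ucs_flavour(data):
    """Guess which Unicode build an extension binary was compiled for.

    `data` is the raw content of a .so file as bytes. Returns 'UCS2',
    'UCS4', 'mixed', or None when no mangled Unicode C API name is
    present (in which case this check cannot tell).
    """
    has_ucs2 = b"PyUnicodeUCS2_" in data
    has_ucs4 = b"PyUnicodeUCS4_" in data
    if has_ucs2 and has_ucs4:
        return "mixed"
    if has_ucs2:
        return "UCS2"
    if has_ucs4:
        return "UCS4"
    return None


def guess_ucs_flavour_of_file(path):
    """Convenience wrapper: read `path` in binary mode and inspect it."""
    with open(path, "rb") as handle:
        return guess_ucs_flavour(handle.read())
```

A result of None only means the module does not reference the Unicode
C API by name, so the linker-level protection does not apply to it.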

> These are not isolated incidents. The following google searches
> suggest that a number of people spend time trying to figure out why
> Python extension modules fail on their linux systems:
>
> http://www.google.com/search?q=PyUnicodeUCS4_FromUnicode+undefined+symbol
> http://www.google.com/search?q=+PyUnicodeUCS2_FromUnicode+undefined+symbol
> http://www.google.com/search?q=_PyUnicodeUCS2_AsDefaultEncodedString+undefined+symbol

Perhaps we should add a FAQ entry for these linker errors
(which are caused by the mentioned C API changes to prevent
mixing UCS version) ?!

Here's a quick way to determine your Python Unicode build type:

python -c "import sys;print((sys.maxunicode<66000)and'UCS2'or'UCS4')"

Perhaps we should include this info as well as a 32/64-bit indicator
and the processor type in the Python startup line:

# python
Python 2.6 (r26:66714, Feb 3 2009, 20:49:49, UCS4, 64-bit, x86_64)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.

This would help users find the right binaries to install as
extension.
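
A longer form of the one-liner, which also gathers the 32/64-bit
indicator and processor type suggested for the startup line, might
look like the sketch below. The helper names unicode_build and
build_summary are invented for this example; sys.maxunicode,
struct.calcsize and platform.machine are standard library calls.

```python
import platform
import struct
import sys


def unicode_build(maxunicode=None):
    """Return 'UCS2' for a narrow build (maxunicode 65535), else 'UCS4'."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "UCS2" if maxunicode <= 0xFFFF else "UCS4"


def build_summary():
    """Return a string such as 'UCS4, 64-bit, x86_64' for this interpreter."""
    bits = struct.calcsize("P") * 8  # size of a C pointer, in bits
    return "%s, %d-bit, %s" % (unicode_build(), bits, platform.machine())


print(build_summary())
```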

> Another data point is the Mandriva Linux distribution. It is probably
> much smaller than Debian, Ubuntu, or RedHat, but it is still one of
> the major, well-known distributions. I requested of the Python
> maintainer for Mandriva, Michael Scherer, that they switch from UCS2
> to UCS4 in order to reduce compatibility problems like these. His
> answer as I understood it was that it is best to follow the
> recommendations of the upstream Python devs by using the default
> setting instead of choosing a setting for himself.

Which is IMHO what all Linux distributions should have done.

Distributions should really not be put in charge of upstream
coding design decisions.

Regards,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Sep 28 2009)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/


mal at egenix

Sep 28, 2009, 1:36 AM

Post #5 of 17 (2345 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

M.-A. Lemburg wrote:
> Also note that Python will complain loudly when you try to load
> a UCS2 extension in a UCS4 build and vice-versa. We've made sure
> that any extension using the Python Unicode C API has to be built
> for the same UCS version of Python. This is done by using different
> names for the C APIs at the C level.
>
>> These are not isolated incidents. The following google searches
>> suggest that a number of people spend time trying to figure out why
>> Python extension modules fail on their linux systems:
>>
>> http://www.google.com/search?q=PyUnicodeUCS4_FromUnicode+undefined+symbol
>> http://www.google.com/search?q=+PyUnicodeUCS2_FromUnicode+undefined+symbol
>> http://www.google.com/search?q=_PyUnicodeUCS2_AsDefaultEncodedString+undefined+symbol
>
> Perhaps we should add a FAQ entry for these linker errors
> (which are caused by the mentioned C API changes to prevent
> mixing UCS version) ?!

There already is one:

http://www.python.org/doc/faq/extending/#when-importing-module-x-why-do-i-get-undefined-symbol-pyunicodeucs2

I wonder why it doesn't show up in the Google searches.

> Here's a quick way to determine your Python Unicode build type:
>
> python -c "import sys;print((sys.maxunicode<66000)and'UCS2'or'UCS4')"
>
> Perhaps we should include this info as well as a 32/64-bit indicator
> and the processor type in the Python startup line:
>
> # python
> Python 2.6 (r26:66714, Feb 3 2009, 20:49:49, UCS4, 64-bit, x86_64)
> [GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> This would help users find the right binaries to install as
> extension.

Regards,
--
Marc-Andre Lemburg
eGenix.com



foom at fuhm

Sep 28, 2009, 8:13 AM

Post #6 of 17 (2343 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

On Sep 28, 2009, at 4:25 AM, M.-A. Lemburg wrote:
> Distributions should really not be put in charge of upstream
> coding design decisions.

I don't think you can blame distros for this one....

From PEP 0261:
It is also proposed that one day --enable-unicode will just
default to the width of your platforms wchar_t.

On linux, wchar_t is 4 bytes.
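
The width of wchar_t on a given machine can be checked from Python
through ctypes, as in the sketch below. The mapping from that width
to a configure value only illustrates the PEP's proposal; it is an
assumption for this example and not something configure does.

```python
import ctypes


def wchar_width_bytes():
    """Return sizeof(wchar_t) for the C compiler Python was built with."""
    return ctypes.sizeof(ctypes.c_wchar)


def proposed_default(width):
    """Hypothetical mapping from wchar_t width to an --enable-unicode value."""
    return {2: "ucs2", 4: "ucs4"}.get(width)


print(wchar_width_bytes(), proposed_default(wchar_width_bytes()))
```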

If there's a consensus amongst python upstream that all the distros
should be shipping Python with UCS2 unicode strings, you should reach
out to them and say this, in a rather more clear fashion. Currently,
most signs point towards UCS4 builds as being the better option.

Or, one might reasonably wonder why UCS-4 is an option at all, if
nobody should enable it.

> People building their own Python version will usually also build
> their own extensions, so I don't really believe that the above
> scenario is very common.

I'd just like to note that I've run into this trap multiple times. I
built a custom python, and expected it to work with all the existing,
installed, extensions (same major version as the system install, just
patched). And then had to build it again with UCS4, for it to actually
work. Of course building twice isn't the end of the world, and I'm
certainly used to having to twiddle build options on software to get
it working, but, this *does* happen, and *is* a tiny bit irritating.

James


mal at egenix

Sep 28, 2009, 9:12 AM

Post #7 of 17 (2340 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

James Y Knight wrote:
> On Sep 28, 2009, at 4:25 AM, M.-A. Lemburg wrote:
>> Distributions should really not be put in charge of upstream
>> coding design decisions.
>
> I don't think you can blame distros for this one....
>
> From PEP 0261:
> It is also proposed that one day --enable-unicode will just
> default to the width of your platforms wchar_t.
>
> On linux, wchar_t is 4 bytes.

The PEP also has this to say:

This has the effect of doubling the size of most Unicode
strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow the 4-byte implementation as a
build-time option. Users can choose whether they care about
wide characters or prefer to preserve memory.

And that's still true today. It was the main reason for not
making it the default in those days. Today, Python 3.x
uses Unicode for all strings, so while the RAM situation has
changed somewhat since Python 2.2, a switch to UCS4 would have a much
wider effect on Python's memory footprint than it did in late 2001.

> If there's a consensus amongst python upstream that all the distros
> should be shipping Python with UCS2 unicode strings, you should reach
> out to them and say this, in a rather more clear fashion. Currently,
> most signs point towards UCS4 builds as being the better option.

UCS4 is the better option if you use lots of non-BMP code points
and if you have to regularly interface with C APIs using wchar_t
on Unix.

> Or, one might reasonably wonder why UCS-4 is an option at all, if nobody
> should enable it.

See above: there are use cases where this does make a lot of sense.

E.g. non-BMP code points can only be represented using surrogates on
UCS2 builds and these can be tricky to deal with (or at least
many people feel like it's tricky to deal with them ;-).
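
To make the surrogate mechanism concrete, here is a minimal sketch of
the standard UTF-16 arithmetic that maps a non-BMP code point to the
two code units a UCS2 build stores for it. The function names are
made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

On a UCS2 build such a character occupies two positions in the
string, which is where the indexing and slicing surprises come from.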

>> People building their own Python version will usually also build
>> their own extensions, so I don't really believe that the above
>> scenario is very common.
>
> I'd just like to note that I've run into this trap multiple times. I
> built a custom python, and expected it to work with all the existing,
> installed, extensions (same major version as the system install, just
> patched). And then had to build it again with UCS4, for it to actually
> work. Of course building twice isn't the end of the world, and I'm
> certainly used to having to twiddle build options on software to get it
> working, but, this *does* happen, and *is* a tiny bit irritating.

Which is why I think that Python should include some more information
on the type of build being used, e.g. by placing the information
prominently on the startup line.

I still don't believe the above use case is a common one, though.

That said, Zooko's original motivation for the proposed change
is making installation of extensions easier for users. That's
a tools question much more than a Python Unicode one.

Aside: --enable-unicode is gone in Python 3.x. You now only
have the choice to use the default (which is UCS2) or switch on
the optional support for UCS4 by using --with-wide-unicode.

--
Marc-Andre Lemburg
eGenix.com



martin at v

Sep 28, 2009, 7:18 PM

Post #8 of 17 (2339 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

James Y Knight wrote:
> On Sep 28, 2009, at 4:25 AM, M.-A. Lemburg wrote:
>> Distributions should really not be put in charge of upstream
>> coding design decisions.
>
> I don't think you can blame distros for this one....
>
> From PEP 0261:
> It is also proposed that one day --enable-unicode will just
> default to the width of your platforms wchar_t.
>
> On linux, wchar_t is 4 bytes.
>
> If there's a consensus amongst python upstream that all the distros
> should be shipping Python with UCS2 unicode strings, you should reach
> out to them and say this, in a rather more clear fashion. Currently,
> most signs point towards UCS4 builds as being the better option.

There is no such consensus. Linux distributions really should build
Python in UCS-4 mode, and I would be in favor of making it the default
to match wchar_t.

Regards,
Martin


bjourne at gmail

Sep 29, 2009, 8:17 AM

Post #9 of 17 (2341 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

2009/9/28 James Y Knight <foom [at] fuhm>:
>> People building their own Python version will usually also build
>> their own extensions, so I don't really believe that the above
>> scenario is very common.
>
> I'd just like to note that I've run into this trap multiple times. I built a
> custom python, and expected it to work with all the existing, installed,
> extensions (same major version as the system install, just patched). And
> then had to build it again with UCS4, for it to actually work. Of course
> building twice isn't the end of the world, and I'm certainly used to having
> to twiddle build options on software to get it working, but, this *does*
> happen, and *is* a tiny bit irritating.

I've also encountered this trap multiple times. Obviously, the problem
is not rebuilding Python, which is quick, but figuring out the correct
configure option to use (--enable-unicode=ucs4). Others have also
spent some time scratching their heads over the strange
PyUnicodeUCS4_FromUnicode error the misconfiguration results in, as
Zooko's links show.

If Python can't infer the unicode setting from the width of the
platform's wchar_t, then perhaps it should be mandatory to specify to
configure whether you want UCS2 or UCS4? For someone clueless like me,
it would be easier to deal with the problem upfront than (much)
further down the line. Explicit being better than implicit and all
that.


--
mvh Björn


zookog at gmail

Sep 29, 2009, 10:03 AM

Post #10 of 17 (2335 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Dear MAL and python-dev:

I failed to explain the problem that users are having. I will try
again, and this time I will omit my ideas about how to improve things
and just focus on describing the problem.

Some users are having trouble using Python packages containing binary
extensions on Linux. I want to provide such binary Python packages
for Linux for the pycryptopp project
(http://allmydata.org/trac/pycryptopp ) and the zfec project
(http://allmydata.org/trac/zfec ). I also want to make it possible
for users to install the Tahoe-LAFS project (http://allmydata.org )
without having a compiler or Python header files. (You'd be surprised
at how often Tahoe-LAFS users try to do this on Linux. Linux is no
longer only for people who have the knowledge and patience to compile
software themselves.) Tahoe-LAFS also depends on many packages that
are maintained by other people and are not packaged or distributed by
me -- pyOpenSSL, simplejson, etc.

There have been several hurdles in the way that we've overcome, and no
doubt there will be more, but the current hurdle is that there are two
"formats" for Python extension modules that are used on Linux -- UCS2
and UCS4. If a user gets a Python package containing a compiled
extension module which was built for the wrong UCS2/4 setting, he will
get mysterious (to him) "undefined symbol" errors at import time.
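
One way to make that error less mysterious is to translate it into a
plain-language hint. The sketch below is only an illustration: the
function name and the wording of the hints are assumptions, and it
relies solely on the PyUnicodeUCS2/PyUnicodeUCS4 prefixes that appear
in the error messages linked earlier in this thread.

```python
def explain_import_error(message):
    """Return a hint for a UCS2/UCS4 mismatch, or None if unrelated.

    An undefined PyUnicodeUCS2_* symbol means the extension was built
    against a UCS2 Python but loaded into one that does not export
    those names (and the reverse for PyUnicodeUCS4_*).
    """
    if "PyUnicodeUCS2" in message:
        return "extension was built for a UCS2 Python; this one is not UCS2"
    if "PyUnicodeUCS4" in message:
        return "extension was built for a UCS4 Python; this one is not UCS4"
    return None


def import_with_hint(name):
    """Import `name`, adding a mismatch hint to the error when possible."""
    try:
        return __import__(name)
    except ImportError as exc:
        hint = explain_import_error(str(exc))
        if hint is None:
            raise
        raise ImportError("%s (%s)" % (exc, hint))
```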

On Mon, Sep 28, 2009 at 2:25 AM, M.-A. Lemburg <mal [at] egenix> wrote:
>
> The Python default is UCS2 for a good reason: it's a good trade-off
> between memory consumption, functionality and performance.

I'm sure you are right about this. At some point I will try to
measure the performance implications in the context of our
application. I don't think it will be an issue for us, as so far no
users have complained about any performance or functionality problems
that were traceable to the choice of UCS2/4.

> As already mentioned, I also don't understand how changing
> the Python default on Linux would help your users in any way -
> if you let distutils compile your extensions, it's automatically
> going to use the right Unicode setting for you (as well as your
> users).

My users are using some Python packages built by me and some built by
others. The binary packages they get from others could have the
incompatible UCS2/4 setting. Also some of my users might be using a
python configured with the opposite setting of the python interpreter
that I use to build packages.

> Unfortunately, this automatic support doesn't help you when
> shipping e.g. setuptools eggs, but this is a tool problem,
> not one of Python: setuptools completely ignores the fact
> that there are two ways to build Python.

This is the setuptools/distribute issue that I mentioned:
http://bugs.python.org/setuptools/issue78 . If that issue were solved
then if a user tried to install a specific package, for example with a
command-line like "easy_install
http://allmydata.org/source/tahoe/deps/tahoe-dep-eggs/pyOpenSSL-0.8-py2.5-linux-i686.egg",
then instead of getting an undefined symbol error at import time, they
would get an error message to the effect of "This package is not
compatible with your Python interpreter." at install time. That would
be good because it would be less confusing to the users.
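
A minimal sketch of such an install-time check follows. It assumes
the binary package carries a metadata field naming the Unicode width
it was built for; that field, the exception class and the function
names are all hypothetical, since per this thread setuptools does not
currently take the build type into account.

```python
class IncompatibleBuildError(Exception):
    """Raised when a binary package does not match the interpreter."""


def interpreter_ucs(maxunicode):
    """Map sys.maxunicode to 'UCS2' or 'UCS4'."""
    return "UCS2" if maxunicode <= 0xFFFF else "UCS4"


def check_binary_package(package_ucs, maxunicode):
    """Refuse a binary package built for the other Unicode width.

    `package_ucs` is the hypothetical metadata value ('UCS2' or 'UCS4');
    `maxunicode` would normally be sys.maxunicode of the installing Python.
    """
    wanted = interpreter_ucs(maxunicode)
    if package_ucs != wanted:
        raise IncompatibleBuildError(
            "This package is not compatible with your Python interpreter "
            "(package: %s, interpreter: %s)." % (package_ucs, wanted))
```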

However, if they were using the default setuptools/distribute
dependency-satisfaction feature, e.g. because they are installing a
package and that package is marked as
"install_requires=['pyOpenSSL']", then setuptools/distribute would do
its fallback behavior in which it attempts to compile the package from
source when it can't find a compatible binary package. This would
probably confuse the users at least as much as the undefined symbol
error currently does.

In any case, improving the tools to handle incompatible packages
nicely would not make more packages compatible. Let's do both!
Improve tools to handle incompatible packages nicely, and encourage
everyone who compiles python on Linux to use the same UCS2/4
setting.

Thank you for your attention.

Regards,

Zooko


mal at egenix

Oct 7, 2009, 11:05 AM

Post #11 of 17 (2241 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Zooko O'Whielacronx wrote:
> Dear MAL and python-dev:
>
> I failed to explain the problem that users are having. I will try
> again, and this time I will omit my ideas about how to improve things
> and just focus on describing the problem.
>
> Some users are having trouble using Python packages containing binary
> extensions on Linux. I want to provide such binary Python packages
> for Linux for the pycryptopp project
> (http://allmydata.org/trac/pycryptopp ) and the zfec project
> (http://allmydata.org/trac/zfec ). I also want to make it possible
> for users to install the Tahoe-LAFS project (http://allmydata.org )
> without having a compiler or Python header files. (You'd be surprised
> at how often Tahoe-LAFS users try to do this on Linux. Linux is no
> longer only for people who have the knowledge and patience to compile
> software themselves.) Tahoe-LAFS also depends on many packages that
> are maintained by other people and are not packaged or distributed by
> me -- pyOpenSSL, simplejson, etc.
>
> There have been several hurdles in the way that we've overcome, and no
> doubt there will be more, but the current hurdle is that there are two
> "formats" for Python extension modules that are used on Linux -- UCS2
> and UCS4. If a user gets a Python package containing a compiled
> extension module which was built for the wrong UCS2/4 setting, he will
> get mysterious (to him) "undefined symbol" errors at import time.

Zooko, I really fail to see the reasoning here:

Why would people who know how to build their own Python interpreter
on Linux and expect it to work like the distribution-provided one,
have a problem looking up the distribution-used configuration
settings ?

This is like compiling your own Linux kernel without using
the same configuration as the distribution kernel and still
expecting the distribution kernel modules to load without
problems.

Note that this has nothing to do with compiling your own
Python extensions. Python's distutils will automatically
use the right settings for compiling those, based on the
configuration of the Python interpreter used for running
the compilation - which will usually be the distribution
interpreter.

Your argument doesn't really justify the consequences
of switching to UCS4.

Just as a data point: eGenix has been shipping binaries for
Python packages for several years and while we do occasionally
get reports about UCS2/UCS4 mismatches, those are really
in the minority.

I'd also question using the UCS4 default only on Linux.

If we do go for a change, we should use sizeof(wchar_t)
as basis for the new default - on all platforms that
provide a wchar_t type.

However, before we can make such a decision, we need more
data about the consequences. That is:

* memory footprint changes

* performance changes

For both Python 2.x and 3.x. After all, UCS4 uses twice
as much memory for all Unicode objects as UCS2.
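
The factor of two can be seen by comparing the character payload at
2 and 4 bytes per unit, here approximated with the UTF-16 and UTF-32
codecs. This is a sketch only: the function name is made up, and it
ignores the fixed per-object overhead, so the real ratio for short
strings is below two.

```python
def payload_sizes(text):
    """Return (narrow_bytes, wide_bytes) for the characters of `text`.

    narrow: 2 bytes per UTF-16 code unit (as on a UCS2 build)
    wide:   4 bytes per code point (as on a UCS4 build)
    """
    narrow = len(text.encode("utf-16-le"))
    wide = len(text.encode("utf-32-le"))
    return narrow, wide


narrow, wide = payload_sizes("hello")
print(narrow, wide)
```

A non-BMP character is the exception: it needs a surrogate pair on a
narrow build, so it takes 4 bytes either way.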

Since Python 3.x uses Unicode for all strings, I'd expect
such a change to have more impact there.

We'd also need to look into possible problems with different
compilers using different wchar_t sizes on the same platform
(I doubt that there are any).

On Windows, the default is fixed since Windows uses
UTF-16 for everything Unicode, so UCS2 will for a long
time be the only option on that platform.

That said, it'll take a while for distributions to
upgrade, so you're always better off getting the tools
you're using to deal with the problem for you and your
users, since those are easier to upgrade.

--
Marc-Andre Lemburg
eGenix.com



solipsis at pitrou

Oct 7, 2009, 11:25 AM

Post #12 of 17 (2248 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Zooko O'Whielacronx <zookog <at> gmail.com> writes:
>
> I accidentally sent this letter just to MAL when I intended it to
> python-dev. Please read it, as it explains why the issue I'm raising
> is not just the "we should switch to ucs4 because it is better" issue
> that was previously settled by GvR.

For what it's worth, with stringbench under py3k, a UCS2 build is roughly 8%
faster than a UCS4 build (190 s total against 206 s).
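
For reference, the percentage follows from the two totals quoted
above; the variable names below are just for this worked example.

```python
ucs2_seconds = 190.0
ucs4_seconds = 206.0

# Extra time the UCS4 build needs, relative to the UCS2 build.
ucs4_slowdown = (ucs4_seconds - ucs2_seconds) / ucs2_seconds

# Time the UCS2 build saves, relative to the UCS4 build.
ucs2_saving = (ucs4_seconds - ucs2_seconds) / ucs4_seconds

print("UCS4 is %.1f%% slower; UCS2 saves %.1f%%"
      % (ucs4_slowdown * 100, ucs2_saving * 100))
```

Either way of expressing it rounds to about 8%.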





ronaldoussoren at mac

Oct 7, 2009, 12:13 PM

Post #13 of 17 (2245 views)
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

On 7 Oct, 2009, at 20:05, M.-A. Lemburg wrote:
>
>
> If we do go for a change, we should use sizeof(wchar_t)
> as basis for the new default - on all platforms that
> provide a wchar_t type.

I'd be -1 on that. sizeof(wchar_t) is 4 on OSX, but all non-Unix APIs
that deal with Unicode text use UTF-16.
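
For anyone who wants to compare the two values on their own machine, a small sketch along these lines should do. It assumes ctypes is available. Note that on Python 3.3 and later sys.maxunicode is always 1114111, so the narrow/wide distinction only shows up on the older builds discussed in this thread.

```python
import ctypes
import sys

# Size in bytes of the platform's C wchar_t, as seen through ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# sys.maxunicode is 65535 on narrow (UCS2) builds and 1114111 on wide
# (UCS4) builds. Python 3.3+ always reports 1114111, so the "wide" label
# below is only meaningful on the older interpreters.
build = "narrow (UCS2)" if sys.maxunicode == 0xFFFF else "wide (UCS4)"

print("sizeof(wchar_t) =", wchar_size)
print("sys.maxunicode  =", sys.maxunicode, "->", build)
```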

Ronald


mal at egenix

Oct 7, 2009, 1:13 PM

Post #14 of 17 (2238 views)
Permalink
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Ronald Oussoren wrote:
>
> On 7 Oct, 2009, at 20:05, M.-A. Lemburg wrote:
>>
>>
>> If we do go for a change, we should use sizeof(wchar_t)
>> as basis for the new default - on all platforms that
>> provide a wchar_t type.
>
> I'd be -1 on that. sizeof(wchar_t) is 4 on OSX, but all non-Unix APIs
> that deal with Unicode text use UTF-16.

Is that true for non-Carbon APIs as well ?

This is what I found on the web (in summary):

Apple chose to go with UTF-16 at about the same time as Microsoft did
and used sizeof(wchar_t) == 2 for Mac OS. When they moved to Mac OS X,
they switched wchar_t to sizeof(wchar_t) == 4.

--
Marc-Andre Lemburg
eGenix.com



ronaldoussoren at mac

Oct 7, 2009, 1:24 PM

Post #15 of 17 (2244 views)
Permalink
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

On 7 Oct, 2009, at 22:13, M.-A. Lemburg wrote:

> Ronald Oussoren wrote:
>>
>> On 7 Oct, 2009, at 20:05, M.-A. Lemburg wrote:
>>>
>>>
>>> If we do go for a change, we should use sizeof(wchar_t)
>>> as basis for the new default - on all platforms that
>>> provide a wchar_t type.
>>
>> I'd be -1 on that. sizeof(wchar_t) is 4 on OSX, but all non-Unix APIs
>> that deal with Unicode text use UTF-16.
>
> Is that true for non-Carbon APIs as well ?
>
> This is what I found on the web (in summary):
>
> Apple chose to go with UTF-16 at about the same time as Microsoft did
> and used sizeof(wchar_t) == 2 for Mac OS. When they moved to Mac OS X,
> they switched wchar_t to sizeof(wchar_t) == 4.
>

Both Carbon and the modern APIs use UTF-16.

What I don't quite get in the UTF-16 vs. UTF-32 discussion is why
UTF-32 would be useful, because if you want to do generic Unicode
processing you have to look at sequences of composed characters (base
characters + composing marks) anyway instead of separate code points.
Not that I'm a unicode expert in any way...
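
To illustrate the point about composed characters, here is a minimal sketch using the standard unicodedata module (the sample strings are chosen only for illustration). It shows that one user-perceived character can span several code points regardless of how wide the internal representation is.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: one perceived character,
# two code points, on UCS2 and UCS4 builds alike.
decomposed = "e\u0301"

# NFC normalization folds this pair into the precomposed U+00E9 ...
composed = unicodedata.normalize("NFC", decomposed)

# ... but "q" plus the same mark has no precomposed form, so it stays
# two code points even after normalization.
no_precomposed = unicodedata.normalize("NFC", "q\u0301")

print(len(decomposed), len(composed), len(no_precomposed))
print(unicodedata.combining("\u0301"))  # non-zero for combining marks
```

So even with one code point per storage unit, indexing by code point does not give indexing by character.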

Ronald


mal at egenix

Oct 7, 2009, 2:21 PM

Post #16 of 17 (2239 views)
Permalink
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Ronald Oussoren wrote:
>
> On 7 Oct, 2009, at 22:13, M.-A. Lemburg wrote:
>
>> Ronald Oussoren wrote:
>>>
>>> On 7 Oct, 2009, at 20:05, M.-A. Lemburg wrote:
>>>>
>>>>
>>>> If we do go for a change, we should use sizeof(wchar_t)
>>>> as basis for the new default - on all platforms that
>>>> provide a wchar_t type.
>>>
>>> I'd be -1 on that. sizeof(wchar_t) is 4 on OSX, but all non-Unix APIs
>>> that deal with Unicode text use UTF-16.
>>
>> Is that true for non-Carbon APIs as well ?
>>
>> This is what I found on the web (in summary):
>>
>> Apple chose to go with UTF-16 at about the same time as Microsoft did
>> and used sizeof(wchar_t) == 2 for Mac OS. When they moved to Mac OS X,
>> they switched wchar_t to sizeof(wchar_t) == 4.
>>
>
> Both Carbon and the modern APIs use UTF-16.

Thanks for that data point. So UTF-16 would be the more
natural choice on Mac OS X, despite the choice of sizeof(wchar_t).

> What I don't quite get in the UTF-16 vs. UTF-32 discussion is why UTF-32
> would be useful, because if you want to do generic Unicode processing
> you have to look at sequences of composed characters (base characters +
> composing marks) anyway instead of separate code points. Not that I'm a
> unicode expert in any way...

Very true.

It's one of the reasons why I'm not much of a UCS4-fan - it only
helps with surrogates and that's about it.

Combining characters, various types of control code points
(e.g. joiners, bidirectional marks, breaks, non-breaks, annotations),
context-sensitive casing and other such
features found in scripts cause very similar problems - often
much harder to solve, since they are not as easily identifiable
as surrogate high and low code points.
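
As a rough illustration of why surrogates are the easy case, the sketch below spots them with a plain range check on each UTF-16 code unit. The helper names are hypothetical and not part of any Python API.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<{}H".format(len(data) // 2), data))


def is_high_surrogate(unit):
    # High (lead) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Low (trail) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16 needs
# a surrogate pair to represent it.
units = utf16_code_units("a\U0001D11E")
for unit in units:
    print(hex(unit), is_high_surrogate(unit), is_low_surrogate(unit))
```

Nothing comparably simple identifies a combining sequence or a context-sensitive casing rule.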

--
Marc-Andre Lemburg
eGenix.com



nyamatongwe at gmail

Oct 7, 2009, 3:55 PM

Post #17 of 17 (2242 views)
Permalink
Re: please consider changing --enable-unicode default to ucs4 [In reply to]

Ronald Oussoren:

> Both Carbon and the modern APIs use UTF-16.

If Unicode size standardization is seen as sufficiently beneficial
then UTF-16 would be more widely applicable than UTF-32. Unix mostly
uses 8-bit APIs which are either explicitly UTF-8 (such as GTK+) or
can accept UTF-8 when the locale is set to UTF-8. They don't accept
UTF-32. It is possible that Unix could move towards UTF-32 but that
hasn't been the case up to now and with both OS X and Windows being
UTF-16, it is more likely that UTF-16 APIs will become more popular on
Unix.
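
To make the three encodings concrete, here is a small sketch comparing how many bytes the same text takes in each. The sample string is arbitrary, and the "-le" codec variants are used so that no byte order mark is counted.

```python
# Sample text: six BMP code points plus U+1D11E, which lies outside the BMP.
text = "na\u00efve \U0001D11E"

# Encoded size in bytes under each scheme.
sizes = {
    enc: len(text.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

for enc, size in sizes.items():
    print("{:10s} {:2d} bytes".format(enc, size))
```

Whichever form an API expects, text held in one of the other forms has to be converted at the boundary.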

Neil
