
Mailing List Archive: Python: Dev

Unicode 5.1.0

 

 



guido at python

Aug 21, 2008, 1:35 PM

Post #1 of 33 (3709 views)
Permalink
Unicode 5.1.0

I was just paid a visit by my Google colleague Mark Davis, co-founder
of the Unicode project and the president of the Unicode Consortium. He
would like to see improved Unicode support for Python. (Well duh. :-)
On his list of top priorities are:

1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including
Boolean combinations
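As a rough illustration of item 3, the sketch below emulates a "property" character class with Boolean combinations on top of the existing re and unicodedata modules. The helper name char_class and the predicates are hypothetical and exist only for this example; nothing here is proposed regex syntax, and the approach (scanning every code point once) is an assumption chosen for brevity rather than speed.

```python
import re
import sys
import unicodedata


def char_class(predicate):
    """Build a regex character class matching every code point
    for which predicate(char) is true (hypothetical helper)."""
    parts = []
    start = None
    # One extra iteration past sys.maxunicode closes a trailing run.
    for cp in range(sys.maxunicode + 2):
        inside = cp <= sys.maxunicode and predicate(chr(cp))
        if inside and start is None:
            start = cp
        elif not inside and start is not None:
            end = cp - 1
            parts.append(re.escape(chr(start)))
            if end > start:
                parts.append("-" + re.escape(chr(end)))
            start = None
    return "[" + "".join(parts) + "]"


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


def is_lowercase(ch):
    return unicodedata.category(ch) == "Ll"


# Boolean combination: "letter AND NOT lowercase letter".
non_lower_letter = re.compile(
    char_class(lambda ch: is_letter(ch) and not is_lowercase(ch))
)
print(non_lower_letter.findall("Hello W\u00f6rld, \u1e9e"))
```

A predicate that matches nothing would yield an empty class, which re rejects, so a real implementation would need to handle that case.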

I've tried to explain our release schedule and
no-new-features-in-point-releases policies to him, and he understands
that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will
have to wait for 2.7 and 3.1, respectively. However, I've kept the
door slightly ajar for adding #1 -- it can't be too much work and it
can't have too much impact. Or can it? I don't actually know what the
impact would be, so I'd like some input from developers who are
closer to the origins of the unicodedata module.

The two, quite separate, questions, then, are (a) how much work would
it be to upgrade to version 5.1.0 of the database; and (b) would it be
acceptable to do this post-beta3 (but before rc1). If the answer to
(b) is positive, Google can help with (a).

In general, Google has needs in this area that can't wait for 2.7/3.1,
so what we may end up doing is create internal implementations of all
three features (compatible with Python 2.4 and later), publish them as
open source on Google Code, and fold them into core Python at the
first opportunity, which would likely be 2.7 and 3.1.

Comments?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-Dev mailing list
Python-Dev[at]python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


mal at egenix

Aug 21, 2008, 2:26 PM

Post #2 of 33 (3667 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On 2008-08-21 22:35, Guido van Rossum wrote:
> I was just paid a visit by my Google colleague Mark Davis, co-founder
> of the Unicode project and the president of the Unicode Consortium. He
> would like to see improved Unicode support for Python. (Well duh. :-)
> On his list of top priorities are:
>
> 1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
> 2. Extend the unicodedata module with some additional properties
> 3. Add support for Unicode properties to the regex syntax, including
> Boolean combinations
>
> I've tried to explain our release schedule and
> no-new-features-in-point-releases policies to him, and he understands
> that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will
> have to wait for 2.7 and 3.1, respectively. However, I've kept the
> door slightly ajar for adding #1 -- it can't be too much work and it
> can't have too much impact. Or can it? I don't actually know what the
> impact would be, so I'd like some input from developers who are
> closer to the origins of the unicodedata module.
>
> The two, quite separate, questions, then, are (a) how much work would
> it be to upgrade to version 5.1.0 of the database; and (b) would it be
> acceptable to do this post-beta3 (but before rc1). If the answer to
> (b) is positive, Google can help with (a).
>
> In general, Google has needs in this area that can't wait for 2.7/3.1,
> so what we may end up doing is create internal implementations of all
> three features (compatible with Python 2.4 and later), publish them as
> open source on Google Code, and fold them into core Python at the
> first opportunity, which would likely be 2.7 and 3.1.
>
> Comments?

There are two things to consider:

unicodedata is just an optimized database for accessing code
point properties of a specific Unicode version (currently 4.1.0
and 3.2.0). Adding support for a new version needs some work on
the generation script, perhaps keeping the 4.1.0 version of it
like we did for 3.2.0, but that's about it.
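To make the "database of a specific Unicode version" point concrete, here is a small sketch using attributes the unicodedata module exposes in current Python releases. The version printed for the main database depends on the interpreter; the choice of U+1E9E as the sample character is ours, picked because it was added in Unicode 5.1.

```python
import unicodedata

# Version of the main database (varies with the Python release).
print(unicodedata.unidata_version)

# The frozen older database kept alongside it.
print(unicodedata.ucd_3_2_0.unidata_version)

# U+1E9E LATIN CAPITAL LETTER SHARP S was added in Unicode 5.1,
# so the two databases disagree about its general category.
ch = "\u1e9e"
print(unicodedata.category(ch))            # assigned in the main database
print(unicodedata.ucd_3_2_0.category(ch))  # unassigned ("Cn") in 3.2.0
```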

However, there are other implications to consider when moving to
Unicode 5.1.0.

Just see the top of http://www.unicode.org/versions/Unicode5.1.0/
for a summary of changes compared to 5.0, plus
http://www.unicode.org/versions/Unicode5.0.0/ for changes between
4.1.0 and 5.0.

So while we could say: "we provide access to the Unicode 5.1.0
database", we cannot say: "we support Unicode 5.1.0", simply because
we have not reviewed all the necessary changes and implications.

I think it's better to look through all the changes and then come
up with proper support for 2.7/3.1. If Google wants to contribute
to this, even better. To avoid duplication of work or heading in
different directions, it may be a good idea to create a
unicode-sig to discuss things.

Offline 'til next week-ly,
--
Marc-Andre Lemburg
eGenix.com



tjreedy at udel

Aug 21, 2008, 3:30 PM

Post #3 of 33 (3672 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

Guido van Rossum wrote:
> I was just paid a visit by my Google colleague Mark Davis, co-founder
> of the Unicode project and the president of the Unicode Consortium. He
> would like to see improved Unicode support for Python. (Well duh. :-)
> On his list of top priorities are:
>
> 1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
> 2. Extend the unicodedata module with some additional properties
> 3. Add support for Unicode properties to the regex syntax, including
> Boolean combinations
>
> I've tried to explain our release schedule and
> no-new-features-in-point-releases policies to him, and he understands
> that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will
> have to wait for 2.7 and 3.1, respectively. However, I've kept the
> door slightly ajar for adding #1 -- it can't be too much work and it
> can't have too much impact. Or can it? I don't actually know what the
> impact would be, so I'd like some input from developers who are
> closer to the origins of the unicodedata module.
>
> The two, quite separate, questions, then, are (a) how much work would
> it be to upgrade to version 5.1.0 of the database; and (b) would it be
> acceptable to do this post-beta3 (but before rc1). If the answer to
> (b) is positive, Google can help with (a).

http://www.unicode.org/versions/Unicode5.1.0/
"Unicode 5.1.0 contains over 100,000 characters, and provides
significant additions and improvements..." to existing features,
including new files and upgrades to existing files. Sounds close to
adding features ;-)

> In general, Google has needs in this area that can't wait for 2.7/3.1,
> so what we may end up doing is create internal implementations of all
> three features (compatible with Python 2.4 and later), publish them as
> open source on Google Code, and fold them into core Python at the
> first opportunity, which would likely be 2.7 and 3.1.

If possible, I would suggest going a bit further and release a '3rd'
party replacement/extension package, including a Windows installer, that
is also listed on PyPI. Revised releases could and might need to be
done even more rapidly than the bugfix release schedule would allow.
(This could be done with other proposed new/revised modules also.)

What would need to be done now, I believe, if possible and acceptable,
is to slightly repackage the core to put unicode (3.0 strings) and _re*
code in a separate library so that they can be drop-in replaced or masked.

Terry Jan Reedy



guido at python

Aug 21, 2008, 6:25 PM

Post #4 of 33 (3662 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On Thu, Aug 21, 2008 at 2:26 PM, M.-A. Lemburg <mal[at]egenix.com> wrote:
> On 2008-08-21 22:35, Guido van Rossum wrote:
>>
>> I was just paid a visit by my Google colleague Mark Davis, co-founder
>> of the Unicode project and the president of the Unicode Consortium. He
>> would like to see improved Unicode support for Python. (Well duh. :-)
>> On his list of top priorities are:
>>
>> 1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
>> 2. Extend the unicodedata module with some additional properties
>> 3. Add support for Unicode properties to the regex syntax, including
>> Boolean combinations
>>
>> I've tried to explain our release schedule and
>> no-new-features-in-point-releases policies to him, and he understands
>> that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will
>> have to wait for 2.7 and 3.1, respectively. However, I've kept the
>> door slightly ajar for adding #1 -- it can't be too much work and it
>> can't have too much impact. Or can it? I don't actually know what the
>> impact would be, so I'd like some input from developers who are
>> closer to the origins of the unicodedata module.
>>
>> The two, quite separate, questions, then, are (a) how much work would
>> it be to upgrade to version 5.1.0 of the database; and (b) would it be
>> acceptable to do this post-beta3 (but before rc1). If the answer to
>> (b) is positive, Google can help with (a).
>>
>> In general, Google has needs in this area that can't wait for 2.7/3.1,
>> so what we may end up doing is create internal implementations of all
>> three features (compatible with Python 2.4 and later), publish them as
>> open source on Google Code, and fold them into core Python at the
>> first opportunity, which would likely be 2.7 and 3.1.
>>
>> Comments?
>
> There are two things to consider:
>
> unicodedata is just an optimized database for accessing code
> point properties of a specific Unicode version (currently 4.1.0
> and 3.2.0). Adding support for a new version needs some work on
> the generation script, perhaps keeping the 4.1.0 version of it
> like we did for 3.2.0, but that's about it.
>
> However, there are other implications to consider when moving to
> Unicode 5.1.0.
>
> Just see the top of http://www.unicode.org/versions/Unicode5.1.0/
> for a summary of changes compared to 5.0, plus
> http://www.unicode.org/versions/Unicode5.0.0/ for changes between
> 4.1.0 and 5.0.
>
> So while we could say: "we provide access to the Unicode 5.1.0
> database", we cannot say: "we support Unicode 5.1.0", simply because
> we have not reviewed all the necessary changes and implications.

Mark's response to this was:

"""
I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)

More seriously, I don't think this is a roadblock -- I doubt that
there are real differences between U5.1.0 and U4.1.0 in terms of
conformance that would be touched by Python -- the conformance changes
tend to be either completely backward compatible or very esoteric.
What I can do is to review the Python support to see if and where
there are any problems, but I wouldn't anticipate any.
"""

Which suggests that he believes that the differences in the database
are very minor, and that upgrading just the database would not cause
any problems for code that worked well with the 4.1.0 database.

> I think it's better to look through all the changes and then come
> up with proper support for 2.7/3.1. If Google wants to contribute
> to this, even better. To avoid duplication of work or heading in
> different directions, it may be a good idea to create a
> unicode-sig to discuss things.

Not me. :-)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


fredrik at pythonware

Aug 22, 2008, 3:47 AM

Post #5 of 33 (3657 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On Fri, Aug 22, 2008 at 3:25 AM, Guido van Rossum <guido[at]python.org> wrote:

>> So while we could say: "we provide access to the Unicode 5.1.0
>> database", we cannot say: "we support Unicode 5.1.0", simply because
>> we have not reviewed all the necessary changes and implications.
>
> Mark's response to this was:
>
> """
> I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)

is the suggestion to *replace* the 4.1.0 database with a 5.1.0
database, or to add yet another database in that module?

(how's the 3.2/4.1 dual support implemented? do we have two distinct
datasets, or are the differences encoded in some clever way? would it
make sense to split the unicodedata module into three separate
modules, one for each major Unicode version?)

</F>


facundobatista at gmail

Aug 22, 2008, 6:42 AM

Post #6 of 33 (3653 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

2008/8/21 Guido van Rossum <guido[at]python.org>:

> The two, quite separate, questions, then, are (a) how much work would
> it be to upgrade to version 5.1.0 of the database; and (b) would it be
> acceptable to do this post-beta3 (but before rc1). If the answer to
> (b) is positive, Google can help with (a).

Two thoughts:

- In view of jumping to a new standard at *this* point, what I'd like
to have is a comprehensive test suite for unicodedata in a similar
sense to what happens with Decimal... It would be great to have from
the Unicode Consortium a series of test cases (in Python, or in
something we could process), to verify that we support Unicode
properly.

- We always could have a beta4 if it's necessary...

Just my two pesos cents.

Regards,

--
. Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/


solipsis at pitrou

Aug 22, 2008, 7:54 AM

Post #7 of 33 (3651 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

Facundo Batista <facundobatista <at> gmail.com> writes:
>
> Two thoughts:
>
> - In view of jumping to a new standard at *this* point, what I'd like
> to have is a comprehensive test suite for unicodedata in a similar
> sense to what happens with Decimal... It would be great to have from
> the Unicode Consortium a series of test cases (in Python, or in
> something we could process), to verify that we support Unicode
> properly.
>

And another question: would it be hard for Google to maintain this separately
until at least it's integrated to 3.1?

> - We always could have a beta4 if it's necessary...

If we go this route there are lots of attractive things that might justify yet
and yet another beta :-)

Just my two over-evaluated euro cents.

Regards

Antoine.




guido at python

Aug 22, 2008, 7:59 AM

Post #8 of 33 (3651 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On Fri, Aug 22, 2008 at 3:47 AM, Fredrik Lundh <fredrik[at]pythonware.com> wrote:
> On Fri, Aug 22, 2008 at 3:25 AM, Guido van Rossum <guido[at]python.org> wrote:
[MAL]
>>> So while we could say: "we provide access to the Unicode 5.1.0
>>> database", we cannot say: "we support Unicode 5.1.0", simply because
>>> we have not reviewed all the necessary changes and implications.
>>
>> Mark's response to this was:
>>
>> """
>> I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)
>
> is the suggestion to *replace* the 4.1.0 database with a 5.1.0
> database, or to add yet another database in that module?

That's up to us. I don't know what the reason was for keeping the
3.2.0 database around -- does anyone here recall ever using it? For
what?

I think Mark believes that 5.1.0 is very much backwards compatible
with 4.1.0 so that there is no need to retain access to 4.1.0; but as
I said I don't know the use case so who knows.

> (how's the 3.2/4.1 dual support implemented? do we have two distinct
> datasets, or are the differences encoded in some clever way? would it
> make sense to split the unicodedata module into three separate
> modules, one for each major Unicode version?)

The current API looks fine to me: unicodedata is the latest version
whereas unicodedata.ucd_3_2_0 is the older version. The APIs are the
same; there's a tiny bit of code in the generated _db.h file that
expresses the differences:

static const change_record* get_change_3_2_0(Py_UCS4 n)
{
    int index;
    if (n >= 0x110000) index = 0;
    else {
        index = changes_3_2_0_index[n>>7];
        index = changes_3_2_0_data[(index<<7)+(n & 127)];
    }
    return change_records_3_2_0+index;
}

static Py_UCS4 normalization_3_2_0(Py_UCS4 n)
{
    switch(n) {
    case 0x2f868: return 0x2136A;
    case 0x2f874: return 0x5F33;
    case 0x2f91f: return 0x43AB;
    case 0x2f95f: return 0x7AAE;
    case 0x2f9bf: return 0x4D57;
    default: return 0;
    }
}
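
For readers unfamiliar with the index[n>>7] / data[(index<<7)+(n & 127)] idiom in get_change_3_2_0, the following is a small Python model of a two-stage lookup table. It is a sketch only: build_tables and lookup are hypothetical names, the block size is fixed at 128 to mirror the shift of 7 in the C code, and the sample records are made up for the demonstration.

```python
SHIFT = 7
BLOCK = 1 << SHIFT          # 128 code points per block
LIMIT = 0x110000            # one past the last Unicode code point


def build_tables(records):
    """records maps code point -> small record number (0 = no change).

    Identical 128-entry blocks are stored once in `data`; `index`
    says which stored block each 128-code-point range uses.
    """
    index, data, seen = [], [], {}
    for start in range(0, LIMIT, BLOCK):
        block = tuple(records.get(cp, 0) for cp in range(start, start + BLOCK))
        pos = seen.get(block)
        if pos is None:
            pos = len(data) // BLOCK
            seen[block] = pos
            data.extend(block)
        index.append(pos)
    return index, data


def lookup(index, data, n):
    """Same arithmetic as the C function: two array reads per lookup."""
    if n >= LIMIT:
        return 0
    i = index[n >> SHIFT]
    return data[(i << SHIFT) + (n & (BLOCK - 1))]


# Made-up sample: two code points carry a change record.
index, data = build_tables({0x0221: 1, 0x1E9E: 2})
print(len(index), len(data))
print(lookup(index, data, 0x0221), lookup(index, data, 0x41))
```

Because almost every block is all zeros and shared, the data array stays tiny even though the index covers the whole code space, which is why the per-version tables cost so little.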

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


guido at python

Aug 22, 2008, 8:05 AM

Post #9 of 33 (3655 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On Fri, Aug 22, 2008 at 6:42 AM, Facundo Batista
<facundobatista[at]gmail.com> wrote:
> - In view of jumping to a new standard at *this* point, what I'd like
> to have is a comprehensive test suite for unicodedata in a similar
> sense to what happens with Decimal... It would be great to have from
> the Unicode Consortium a series of test cases (in Python, or in
> something we could process), to verify that we support Unicode
> properly.

Unicode conformance isn't specified in the same way as Decimal
conformance. While there are certain algorithms that can be tested
(e.g. normalization, encoding, decoding), much of the conformance
requirements (AFAIK) are expressed in lots of words about providing
certain facilities etc. I don't actually think putting lots of effort
into this is well-spent; given the mechanical nature of the
translation from the unicode database files into C code (see
Tools/unicode/makeunicodedata.py) a bug in the translation is likely
to result in either bad C code or a systematic error that is easily
spotted.
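
Normalization is one of the parts that can be checked mechanically, as noted above. The sketch below assumes the five-column layout of the Unicode Consortium's NormalizationTest.txt (source; NFC; NFD; NFKC; NFKD) and the invariants stated in that file's header. The single SAMPLE line is written in that format for illustration; a real run would read the whole data file instead. The function names are hypothetical.

```python
import unicodedata

# One line in NormalizationTest.txt format (illustrative sample).
SAMPLE = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"


def parse_line(line):
    """Turn the five hex-encoded columns into five strings."""
    fields = line.split("#", 1)[0].split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in field.split()) for field in fields]


def conforms(line, ucd=unicodedata):
    """Check the normalization invariants for one test line."""
    c1, c2, c3, c4, c5 = parse_line(line)
    columns = (c1, c2, c3, c4, c5)

    def norm(form, s):
        return ucd.normalize(form, s)

    return (
        all(norm("NFC", c) == c2 for c in (c1, c2, c3))
        and all(norm("NFC", c) == c4 for c in (c4, c5))
        and all(norm("NFD", c) == c3 for c in (c1, c2, c3))
        and all(norm("NFD", c) == c5 for c in (c4, c5))
        and all(norm("NFKC", c) == c4 for c in columns)
        and all(norm("NFKD", c) == c5 for c in columns)
    )


print(conforms(SAMPLE))
```

Passing ucd=unicodedata.ucd_3_2_0 would run the same check against the older database.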

> - We always could have a beta4 if it's necessary...

No way.

On Fri, Aug 22, 2008 at 7:54 AM, Antoine Pitrou <solipsis[at]pitrou.net> wrote:
> And another question: would it be hard for Google to maintain this separately
> until at least it's integrated to 3.1?

That's the plan.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


fredrik at pythonware

Aug 22, 2008, 8:13 AM

Post #10 of 33 (3646 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On Fri, Aug 22, 2008 at 4:59 PM, Guido van Rossum <guido[at]python.org> wrote:

>> (how's the 3.2/4.1 dual support implemented? do we have two distinct
>> datasets, or are the differences encoded in some clever way? would it
>> make sense to split the unicodedata module into three separate
>> modules, one for each major Unicode version?)
>
> The current API looks fine to me: unicodedata is the latest version
> whereas unicodedata.ucd_3_2_0 is the older version. The APIs are the
> same; there's a tiny bit of code in the generated _db.h file that
> expresses the differences:
>
> static const change_record* get_change_3_2_0(Py_UCS4 n)
> {
>     int index;
>     if (n >= 0x110000) index = 0;
>     else {
>         index = changes_3_2_0_index[n>>7];
>         index = changes_3_2_0_data[(index<<7)+(n & 127)];
>     }
>     return change_records_3_2_0+index;
> }

there's a bunch of data tables as well, but they don't seem to be very
large. looks like Martin did a thorough job here.

... digging digging digging ...

yes, the generator script produces difference tables between the main
version and a list of older versions. I'd say it's worth running the
script on the 5.1.0 tables, and if it doesn't choke, compare the
resulting table with the corresponding table for 4.1.0 (a simple loop
fetching the main properties for all code points). if the differences
look reasonably small, switch to 5.1.0 and keep the others.
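
The "simple loop fetching the main properties for all code points" could look roughly like the sketch below. It compares two database objects that expose the unicodedata API; since a 4.1.0 build is not available in a current interpreter, the example uses the bundled ucd_3_2_0 object as the older side, which is an assumption made purely so the code runs. The function name is hypothetical.

```python
import sys
import unicodedata


def changed_code_points(old, new, prop="category"):
    """Yield (code point, old value, new value) wherever the two
    databases disagree on the given single-argument property."""
    old_get = getattr(old, prop)
    new_get = getattr(new, prop)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        before, after = old_get(ch), new_get(ch)
        if before != after:
            yield cp, before, after


diffs = list(changed_code_points(unicodedata.ucd_3_2_0, unicodedata))
print(len(diffs), "code points differ in general category")

# How many of the differences are plain new assignments (Cn -> something)?
newly_assigned = sum(1 for _, before, _ in diffs if before == "Cn")
print(newly_assigned, "of them were unassigned in the older database")
```

Other single-argument properties such as bidirectional, combining or east_asian_width can be compared by passing a different prop.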

I can tinker a little with this over the weekend, unless Martin tells
me not to ;-)

</F>


fredrik at pythonware

Aug 22, 2008, 8:15 AM

Post #11 of 33 (3646 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

when did Python-Dev turn into a members only list, btw?

---

Your mail to 'Python-Dev' with the subject

Re: Unicode 5.1.0

Is being held until the list moderator can review it for approval.

The reason it is being held:

Post by non-member to a members-only list

---


guido at python

Aug 22, 2008, 9:12 AM

Post #12 of 33 (3644 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

2008/8/22 Fredrik Lundh <fredrik[at]pythonware.com>:
> On Fri, Aug 22, 2008 at 4:59 PM, Guido van Rossum <guido[at]python.org> wrote:
>
>>> (how's the 3.2/4.1 dual support implemented? do we have two distinct
>>> datasets, or are the differences encoded in some clever way? would it
>>> make sense to split the unicodedata module into three separate
>>> modules, one for each major Unicode version?)
>>
>> The current API looks fine to me: unicodedata is the latest version
>> whereas unicodedata.ucd_3_2_0 is the older version. The APIs are the
>> same; there's a tiny bit of code in the generated _db.h file that
>> expresses the differences:
>>
>> static const change_record* get_change_3_2_0(Py_UCS4 n)
>> {
>>     int index;
>>     if (n >= 0x110000) index = 0;
>>     else {
>>         index = changes_3_2_0_index[n>>7];
>>         index = changes_3_2_0_data[(index<<7)+(n & 127)];
>>     }
>>     return change_records_3_2_0+index;
>> }
>
> there's a bunch of data tables as well, but they don't seem to be very
> large. looks like Martin did a thorough job here.
>
> ... digging digging digging ...
>
> yes, the generator script produces difference tables between the main
> version and a list of older versions. I'd say it's worth running the
> script on the 5.1.0 tables, and if it doesn't choke, compare the
> resulting table with the corresponding table for 4.1.0 (a simple loop
> fetching the main properties for all code points). if the differences
> look reasonably small, switch to 5.1.0 and keep the others.

Right, that's my hope as well. I believe the changes between 3.2 and 4.1
were much larger than more recent changes. (Yay convergence! :-)

> I can tinker a little with this over the weekend, unless Martin tells
> me not to ;-)

That would be great!

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


guido at python

Aug 22, 2008, 9:51 AM

Post #13 of 33 (3645 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

I think it's an anti-spam measure. Anybody can be a member though. :-)

On Fri, Aug 22, 2008 at 8:15 AM, Fredrik Lundh <fredrik[at]pythonware.com> wrote:
> when did Python-Dev turn into a members only list, btw?
>
> ---
>
> Your mail to 'Python-Dev' with the subject
>
> Re: Unicode 5.1.0
>
> Is being held until the list moderator can review it for approval.
>
> The reason it is being held:
>
> Post by non-member to a members-only list

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


amk at amk

Aug 22, 2008, 9:52 AM

Post #14 of 33 (3644 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On Fri, Aug 22, 2008 at 07:59:46AM -0700, Guido van Rossum wrote:
> That's up to us. I don't know what the reason was for keeping the
> 3.2.0 database around -- does anyone here recall ever using it? For
> what?

RFC 3491, one of the internationalized domain name RFCs, explicitly
requires Unicode 3.2.0, so Lib/stringprep.py needs to use the old
database and we have to keep 3.2.0 available. Maybe no specs depend
on 4.1.0, so it could simply be replaced by 5.1.0.
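
The dependency can be observed from Python itself. The sketch below uses stringprep.in_table_a1, which implements RFC 3454 table A.1 (code points unassigned in Unicode 3.2). The sample character U+20B9 INDIAN RUPEE SIGN is our choice; it was added in Unicode 6.0, so the example assumes an interpreter whose main database is at least that recent.

```python
import stringprep
import unicodedata

ch = "\u20b9"  # INDIAN RUPEE SIGN, added in Unicode 6.0

# The main database knows the character...
print(unicodedata.category(ch))

# ...but stringprep still answers from the 3.2.0 point of view,
# where this code point is unassigned (table A.1).
print(stringprep.in_table_a1(ch))

# A character that already existed in Unicode 3.2 is not in table A.1.
print(stringprep.in_table_a1("A"))
```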

--amk


martin at v

Aug 24, 2008, 12:35 PM

Post #15 of 33 (3492 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

> is the suggestion to *replace* the 4.1.0 database with a 5.1.0
> database, or to add yet another database in that module?

I would replace it.

> (how's the 3.2/4.1 dual support implemented?

The compiler needs data files for all supported versions, with
old_versions listing the, well, old versions. It then computes
deltas, expecting that they should mostly consist of new
assignments (i.e. characters unassigned in 3.2 might be assigned
in newer versions). It detects all differences, but might not be
able to represent all changes.

> do we have two distinct
> datasets, or are the differences encoded in some clever way?

The latter. It doesn't really need to be that clever: primarily
just a compressed list of "new" characters is needed, per version.
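
A toy version of that idea, under the assumption that each version is available as a mapping from code point to general category: find the code points assigned only in the newer version, then compress them into ranges. The two dictionaries are made-up miniature tables, not real UCD data, and the function names are hypothetical; the actual generator stores its deltas differently.

```python
def new_assignments(old, new):
    """Code points present in `new` but missing from `old`, sorted."""
    return sorted(cp for cp in new if cp not in old)


def compress(code_points):
    """Collapse a sorted list of code points into inclusive ranges."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [tuple(r) for r in ranges]


# Made-up miniature "databases" for illustration only.
old_version = {0x41: "Lu", 0x42: "Lu"}
new_version = {0x41: "Lu", 0x42: "Lu", 0x100: "Lu", 0x101: "Ll", 0x200: "Lu"}

delta = compress(new_assignments(old_version, new_version))
print(delta)
```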

> would it
> make sense to split the unicodedata module into three separate
> modules, one for each major Unicode version?)

You couldn't use the space savings then, I suppose.

Regards,
Martin


martin at v

Aug 24, 2008, 12:40 PM

Post #16 of 33 (3492 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

> That's up to us. I don't know what the reason was for keeping the
> 3.2.0 database around -- does anyone here recall ever using it? For
> what?

It's needed for IDNA. The IDNA RFC requires that Unicode 3.2 is used
for performing IDNA (in particular, for determining what a valid domain
name is).

The IDNA people consider it security-relevant that it is
really the 3.2 database, and would probably consider it a serious
security bug if newer Python versions suddenly started to use newer
Unicode databases for IDNA.

At some point, IDNA might get updated to a newer version of the Unicode
spec; we can then drop 3.2 (and stick with whatever the RFC then
specifies).

Regards,
Martin



martin at v

Aug 24, 2008, 12:44 PM

Post #17 of 33 (3502 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

> I can tinker a little with this over the weekend, unless Martin tells
> me not to ;-)

Go ahead; I can't work on this at the moment, anyway. I would also be
confident that a mere replacement of 4.1 with 5.1 should be easy, and
I see no reason to keep the 4.1 version.

Perhaps makeunicodedata should list *why* certain old versions remain
supported; for 3.2, the use case is IDNA.

Regards,
Martin


barry at python

Aug 25, 2008, 5:50 AM

Post #18 of 33 (3473 views)
Permalink
Re: Unicode 5.1.0 [In reply to]


I was away for the weekend and am struggling to catch up on my email.
Since I haven't digested this entire thread, I'll refrain for the
moment from giving my opinion; however, this comment jumped out at me.

On Aug 22, 2008, at 9:42 AM, Facundo Batista wrote:

> - We always could have a beta4 if it's necessary...

I do not want to slip the schedule if at all possible. If serious
security issues, performance problems, show stopper bugs crop up, then
we will obviously slip so that we don't have to put a brown bag over
our heads. Slipping to get yet one more feature in is not (IMO)
acceptable.

An incentive for keeping the schedule: If we hit our October 1st
deadline, then 2.6 and 3.0 will almost certainly be included in some
upcoming major new OS releases. If we slip, then it's unlikely to
happen.

-Barry



ismail at namtrac

Aug 25, 2008, 6:43 AM

Post #19 of 33 (3466 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

Hi,

On Thu, Aug 21, 2008 at 23:35, Guido van Rossum <guido[at]python.org> wrote:
> I was just paid a visit by my Google colleague Mark Davis, co-founder
> of the Unicode project and the president of the Unicode Consortium. He
> would like to see improved Unicode support for Python. (Well duh. :-)
> On his list of top priorities are:
>
> 1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
> 2. Extend the unicodedata module with some additional properties
> 3. Add support for Unicode properties to the regex syntax, including
> Boolean combinations

Adding support for the SpecialCasing rules[0] would also be good for
full Unicode support. It would fix the i/I case-conversion problems
that currently occur with the Turkish locale.

[0] http://unicode.org/Public/UNIDATA/SpecialCasing.txt
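
As a rough illustration of the i/I problem (a sketch under stated
assumptions, not the SpecialCasing.txt algorithm itself): Python's
str.upper() and str.lower() are locale-independent, so "i" always
uppercases to "I", whereas Turkish expects U+0130 (capital I with dot
above). The helper names turkish_upper and turkish_lower are
hypothetical; they handle only the four Turkish i letters and ignore
the context rules involving the combining dot above (U+0307).

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def turkish_upper(text):
    """Uppercase text using the Turkish i/I pairing (hypothetical helper)."""
    # Map the two lowercase i letters first, then let the default
    # mapping handle every other character.
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def turkish_lower(text):
    """Lowercase text using the Turkish i/I pairing (hypothetical helper)."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


# The default mapping pairs i with I regardless of language.
print(ascii("i".upper()), ascii("I".lower()))
# The tailored helpers pair i with U+0130 and U+0131 with I.
print(ascii(turkish_upper("istanbul")), ascii(turkish_lower("ISPARTA")))
```

A complete solution would read the language-tagged entries from
SpecialCasing.txt rather than hard-coding the two pairs.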

Regards,
ismail

--
Programmer Excuse #17: The processor stack spring has worn out.


mal at egenix

Aug 25, 2008, 7:49 AM

Post #20 of 33 (3473 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On 2008-08-22 03:25, Guido van Rossum wrote:
> On Thu, Aug 21, 2008 at 2:26 PM, M.-A. Lemburg <mal[at]egenix.com> wrote:
>> On 2008-08-21 22:35, Guido van Rossum wrote:
>>> I was just paid a visit by my Google colleague Mark Davis, co-founder
>>> of the Unicode project and the president of the Unicode Consortium. He
>>> would like to see improved Unicode support for Python. (Well duh. :-)
>>> On his list of top priorities are:
>>>
>>> 1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
>>> 2. Extend the unicodedata module with some additional properties
>>> 3. Add support for Unicode properties to the regex syntax, including
>>> Boolean combinations
>>>
>>> I've tried to explain our release schedule and
>>> no-new-features-in-point-releases policies to him, and he understands
>>> that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will
>>> have to wait for 2.7 and 3.1, respectively. However, I've kept the
>>> door slightly ajar for adding #1 -- it can't be too much work and it
>>> can't have too much impact. Or can it? I don't actually know what the
>>> impact would be, so I'd like some input from developers who are
>>> closer to the origins of the unicodedata module.
>>>
>>> The two, quite separate, questions, then, are (a) how much work would
>>> it be to upgrade to version 5.1.0 of the database; and (b) would it be
>>> acceptable to do this post-beta3 (but before rc1). If the answer to
>>> (b) is positive, Google can help with (a).
>>>
>>> In general, Google has needs in this area that can't wait for 2.7/3.1,
>>> so what we may end up doing is create internal implementations of all
>>> three features (compatible with Python 2.4 and later), publish them as
>>> open source on Google Code, and fold them into core Python at the
>>> first opportunity, which would likely be 2.7 and 3.1.
>>>
>>> Comments?
>> There are two things to consider:
>>
>> unicodedata is just an optimized database for accessing code
>> point properties of a specific Unicode version (currently 4.1.0
>> and 3.2.0). Adding support for a new version needs some work on
>> the generation script, perhaps keeping the 4.1.0 version of it
>> like we did for 3.2.0, but that's about it.
>>
>> However, there are other implications to consider when moving to
>> Unicode 5.1.0.
>>
>> Just see the top of http://www.unicode.org/versions/Unicode5.1.0/
>> for a summary of changes compared to 5.0, plus
>> http://www.unicode.org/versions/Unicode5.0.0/ for changes between
>> 4.1.0 and 5.0.
>>
>> So while we could say: "we provide access to the Unicode 5.1.0
>> database", we cannot say: "we support Unicode 5.1.0", simply because
>> we have not reviewed all the necessary changes and implications.
>
> Mark's response to this was:
>
> """
> I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)
>
> More seriously, I don't think this is a roadblock -- I doubt that
> there are real differences between U5.1.0 and U4.1.0 in terms of
> conformance that would be touched by Python -- the conformance changes
> tend to be either completely backward compatible or very esoteric.
> What I can do is to review the Python support to see if and where
> there are any problems, but I wouldn't anticipate any.
> """
>
> Which suggests that he believes that the differences in the database
> are very minor, and that upgrading just the database would not cause
> any problems for code that worked well with the 4.1.0 database.

Fine with me.
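
For reference, the database versions discussed above can be inspected
at runtime. This is a minimal sketch: unicodedata.unidata_version and
unicodedata.ucd_3_2_0 are real module attributes, while the helper
name describe is made up for illustration. U+1E9E (LATIN CAPITAL
LETTER SHARP S) is used because it is one of the characters added in
Unicode 5.1.0, so the two databases disagree about it.

```python
import unicodedata


def describe(ch):
    """Report how the current and the 3.2.0 database classify one character."""
    old = unicodedata.ucd_3_2_0
    return {
        "current_version": unicodedata.unidata_version,
        "current_category": unicodedata.category(ch),
        "old_version": old.unidata_version,
        "old_category": old.category(ch),
    }


info = describe("\u1e9e")
for key, value in info.items():
    print(key, "=", value)
```

On an interpreter whose database is 5.1.0 or later, the current
category is an uppercase letter while the 3.2.0 view treats the code
point as unassigned.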

>> I think it's better to look through all the changes and then come
>> up with proper support for 2.7/3.1. If Google wants to contribute
>> to this, even better. To avoid duplication of work or heading in
>> different directions, it may be a good idea to create a
>> unicode-sig to discuss things.
>
> Not me. :-)

I would really like to see more Unicode support in Python, e.g.
for collation, compression, indexing based on graphemes and
code points, better support for special casing situations (to
cover e.g. the dotted vs. non-dotted i in the Turkish scripts),
etc.

There are also a few changes that we'd need to incorporate into
the UTF codecs, e.g. warn about more ill-formed byte sequences.

Would Google be willing to contribute such support or part
of it ?
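
To make the codec point concrete, here is a small sketch of what
"ill-formed byte sequences" means for UTF-8. It only relies on
bytes.decode() in strict mode raising UnicodeDecodeError; the helper
name is_well_formed_utf8 and the sample table are assumptions for
illustration, and the behaviour shown is that of a current Python 3
decoder, not necessarily that of the codecs at the time of writing.

```python
# Sample inputs that the Unicode standard classifies as ill-formed UTF-8.
ILL_FORMED_SAMPLES = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "truncated three-byte sequence": b"\xe2\x82",
    "code point beyond U+10FFFF": b"\xf4\x90\x80\x80",
}


def is_well_formed_utf8(data):
    """Return True if data decodes as UTF-8 in strict mode."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


for label, raw in ILL_FORMED_SAMPLES.items():
    print(label, "->", is_well_formed_utf8(raw))

# A well-formed sequence for comparison: the euro sign U+20AC.
print("euro sign ->", is_well_formed_utf8("\u20ac".encode("utf-8")))
```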

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Aug 25 2008)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611


guido at python

Aug 25, 2008, 9:04 AM

Post #21 of 33 (3456 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

2008/8/25 M.-A. Lemburg <mal[at]egenix.com>:
> I would really like to see more Unicode support in Python, e.g.
> for collation, compression, indexing based on graphemes and
> code points, better support for special casing situations (to
> cover e.g. the dotted vs. non-dotted i in the Turkish scripts),
> etc.
>
> There are also a few changes that we'd need to incorporate into
> the UTF codecs, e.g. warn about more ill-formed byte sequences.
>
> Would Google be willing to contribute such support or part
> of it ?

That depends purely on how much need Google itself has for these features.
I'll ask around, but for now I wouldn't bet on anything beyond the three
points I raised at the start of this thread:

1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including
Boolean combinations
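
To give a feel for what point 3 is asking for, here is a rough sketch
of property matching with Boolean combinations, built on
unicodedata.category() because the standard re module has no syntax
for Unicode properties. All helper names (in_category, both, negate,
find_runs) are hypothetical, and only the General_Category property is
covered; this is not a proposal for the eventual regex syntax.

```python
import unicodedata


def in_category(prefix):
    """Predicate: the character's General_Category starts with prefix."""
    def predicate(ch):
        return unicodedata.category(ch).startswith(prefix)
    return predicate


def both(p, q):
    """Boolean AND of two character predicates."""
    return lambda ch: p(ch) and q(ch)


def negate(p):
    """Boolean NOT of a character predicate."""
    return lambda ch: not p(ch)


def find_runs(text, predicate):
    """Return the maximal runs of characters that satisfy predicate."""
    runs, current = [], []
    for ch in text:
        if predicate(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


# "Any letter that is not a lowercase letter": L and not Ll.
letter_not_lower = both(in_category("L"), negate(in_category("Ll")))
runs = find_runs("\u00dcn\u00efcode \u03a3\u03a9 5.1", letter_not_lower)
print([ascii(run) for run in runs])
```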

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


tjreedy at udel

Aug 25, 2008, 10:13 AM

Post #22 of 33 (3461 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

Guido van Rossum wrote:
> 2008/8/25 M.-A. Lemburg <mal[at]egenix.com>:
> > I would really like to see more Unicode support in Python, e.g.
> > for collation, compression, indexing based on graphemes and
> > code points, better support for special casing situations (to
> > cover e.g. the dotted vs. non-dotted i in the Turkish scripts),
> > etc.
> >
> > There are also a few changes that we'd need to incorporate into
> > the UTF codecs, e.g. warn about more ill-formed byte sequences.
> >
> > Would Google be willing to contribute such support or part
> > of it ?
>
> That depends purely on how much need Google itself has for these
> features. I'll ask around, but for now I wouldn't bet on anything beyond
> the three points I raised at the start of this thread:
>
> 1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
> 2. Extend the unicodedata module with some additional properties
> 3. Add support for Unicode properties to the regex syntax, including
> Boolean combinations

I think an "Improve Unicode Support" PEP would be a good way to collect
(and get approval, or not, for) ideas from various people, even if
Google implements only part of the PEP.



barry at python

Aug 25, 2008, 10:34 AM

Post #23 of 33 (3450 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
>
> http://www.unicode.org/versions/Unicode5.1.0/
> "Unicode 5.1.0 contains over 100,000 characters, and provides
> significant additions and improvements..." to existing features,
> including new files and upgrades to existing files. Sounds close to
> adding features ;-)

I agree. This seriously feels like new, potentially high risk code to
be adding this late in the game. The BDFL can always override, but
unless someone is really convincing that this is low risk high
benefit, I'd vote no for 2.6/3.0.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSLLtMnEjvBPtnXfVAQKg0wP+LJ1XYXhEQHUAvT3fPbPzStCN8Lb+D7XG
hZOANnTCbPGaeCY19B8mYZbXkvjkCBptauKGB5yGOAnb1KCkSaQWx0wCInkeyIFE
mVMupGZCUsdsO7KreEwvyhBpOJ/HNY0+eacv8GZKCwC9xW3WmhaOjry7sZFhjffw
hAX1AuxaPWA=
=2j8a
-----END PGP SIGNATURE-----


musiccomposition at gmail

Aug 25, 2008, 10:52 AM

Post #24 of 33 (3460 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

On Mon, Aug 25, 2008 at 12:34 PM, Barry Warsaw <barry[at]python.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
>>
>> http://www.unicode.org/versions/Unicode5.1.0/
>> "Unicode 5.1.0 contains over 100,000 characters, and provides significant
>> additions and improvements..." to existing features, including new files and
>> upgrades to existing files. Sounds close to adding features ;-)
>
> I agree. This seriously feels like new, potentially high risk code to be
> adding this late in the game. The BDFL can always override, but unless
> someone is really convincing that this is low risk high benefit, I'd vote no
> for 2.6/3.0.

+1

Something else I think we should consider is the 2.7/3.1 release
cycle. I propose that we shorten it to ~1 year from the 2.6/3.0 release,
with our main aim being to bind 2.x and 3.x more closely together. This
would get the new Unicode features out fairly quickly, without another
2.5-year wait like the one between 2.5 and 2.6.



--
Cheers,
Benjamin Peterson
"There's no place like 127.0.0.1."


fredrik at pythonware

Aug 25, 2008, 10:53 AM

Post #25 of 33 (3455 views)
Permalink
Re: Unicode 5.1.0 [In reply to]

Barry Warsaw wrote:

> I agree. This seriously feels like new, potentially high risk code to
> be adding this late in the game. The BDFL can always override, but
> unless someone is really convincing that this is low risk high benefit,
> I'd vote no for 2.6/3.0.

at least two Unicode experts have stated that they don't think the
changes are that important. determining exactly what the changes to the
*core* character database are was the whole point of my offer to tinker
with this.

(I got distracted due to compiler issues and certain other things to be
announced later, but I expect to have some results later this week).

</F>

