
I was just paid a visit by my Google colleague Mark Davis, co-founder of the Unicode project and the president of the Unicode Consortium. He would like to see improved Unicode support for Python. (Well duh. :-) On his list of top priorities are:

1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations

I've tried to explain our release schedule and no-new-features-in-point-releases policies to him, and he understands that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will have to wait for 2.7 and 3.1, respectively. However, I've kept the door slightly ajar for adding #1 -- it can't be too much work and it can't have too much impact. Or can it? I don't actually know what the impact would be, so I'd like some input from developers who are closer to the origins of the unicodedata module.

The two, quite separate, questions, then, are (a) how much work would it be to upgrade to version 5.1.0 of the database; and (b) would it be acceptable to do this post-beta3 (but before rc1). If the answer to (b) is positive, Google can help with (a).

In general, Google has needs in this area that can't wait for 2.7/3.1, so what we may end up doing is to create internal implementations of all three features (compatible with Python 2.4 and later), publish them as open source on Google Code, and fold them into core Python at the first opportunity, which would likely be 2.7 and 3.1.

Comments?

--Guido van Rossum (home page: http://www.python.org/~guido/)

On 2008-08-21 22:35, Guido van Rossum wrote:
I was just paid a visit by my Google colleague Mark Davis, co-founder of the Unicode project and the president of the Unicode Consortium. He would like to see improved Unicode support for Python. (Well duh. :-) On his list of top priorities are:
1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations
I've tried to explain our release schedule and no-new-features-in-point-releases policies to him, and he understands that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will have to wait for 2.7 and 3.1, respectively. However, I've kept the door slightly ajar for adding #1 -- it can't be too much work and it can't have too much impact. Or can it? I don't actually know what the impact would be, so I'd like some input from developers who are closer to the origins of the unicodedata module.
The two, quite separate, questions, then, are (a) how much work would it be to upgrade to version 5.1.0 of the database; and (b) would it be acceptable to do this post-beta3 (but before rc1). If the answer to (b) is positive, Google can help with (a).
In general, Google has needs in this area that can't wait for 2.7/3.1, so what we may end up doing is to create internal implementations of all three features (compatible with Python 2.4 and later), publish them as open source on Google Code, and fold them into core Python at the first opportunity, which would likely be 2.7 and 3.1.
Comments?
There are two things to consider:

unicodedata is just an optimized database for accessing code point properties of a specific Unicode version (currently 4.1.0 and 3.2.0). Adding support for a new version needs some work on the generation script, perhaps keeping the 4.1.0 version of it like we did for 3.2.0, but that's about it.

However, there are other implications to consider when moving to Unicode 5.1.0. Just see the top of http://www.unicode.org/versions/Unicode5.1.0/ for a summary of changes compared to 5.0, plus http://www.unicode.org/versions/Unicode5.0.0/ for changes between 4.1.0 and 5.0.

So while we could say: "we provide access to the Unicode 5.1.0 database", we cannot say: "we support Unicode 5.1.0", simply because we have not reviewed all the necessary changes and implications.

I think it's better to look through all the changes and then come up with proper support for 2.7/3.1. If Google wants to contribute to this, even better. To avoid duplication of work or heading in different directions, it may be a good idea to create a unicode-sig to discuss things.

Offline 'til next week-ly,

-- Marc-Andre Lemburg
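(For concreteness, a minimal sketch of the unicodedata API described above, assuming a stock build of this era in which the current database is 4.1.0 and the 3.2.0 snapshot is bundled alongside it:)

import unicodedata

print unicodedata.unidata_version             # '4.1.0' on such a build
print unicodedata.ucd_3_2_0.unidata_version   # '3.2.0'
print unicodedata.name(u'\u00e9')             # LATIN SMALL LETTER E WITH ACUTE
print unicodedata.category(u'\u00e9')         # 'Ll', i.e. a lowercase letter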

On Thu, Aug 21, 2008 at 2:26 PM, M.-A. Lemburg <mal@egenix.com> wrote:
On 2008-08-21 22:35, Guido van Rossum wrote:
I was just paid a visit by my Google colleague Mark Davis, co-founder of the Unicode project and the president of the Unicode Consortium. He would like to see improved Unicode support for Python. (Well duh. :-) On his list of top priorities are:
1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations
I've tried to explain our release schedule and no-new-features-in-point-releases policies to him, and he understands that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will have to wait for 2.7 and 3.1, respectively. However, I've kept the door slightly ajar for adding #1 -- it can't be too much work and it can't have too much impact. Or can it? I don't actually know what the impact would be, so I'd like some input from developers who are closer to the origins of the unicodedata module.
The two, quite separate, questions, then, are (a) how much work would it be to upgrade to version 5.1.0 of the database; and (b) would it be acceptable to do this post-beta3 (but before rc1). If the answer to (b) is positive, Google can help with (a).
In general, Google has needs in this area that can't wait for 2.7/3.1, so what we may end up doing is to create internal implementations of all three features (compatible with Python 2.4 and later), publish them as open source on Google Code, and fold them into core Python at the first opportunity, which would likely be 2.7 and 3.1.
Comments?
There are two things to consider:
unicodedata is just an optimized database for accessing code point properties of a specific Unicode version (currently 4.1.0 and 3.2.0). Adding support for a new version needs some work on the generation script, perhaps keeping the 4.1.0 version of it like we did for 3.2.0, but that's about it.
However, there are other implications to consider when moving to Unicode 5.1.0.
Just see the top of http://www.unicode.org/versions/Unicode5.1.0/ for a summary of changes compared to 5.0, plus http://www.unicode.org/versions/Unicode5.0.0/ for changes between 4.1.0 and 5.0.
So while we could say: "we provide access to the Unicode 5.1.0 database", we cannot say: "we support Unicode 5.1.0", simply because we have not reviewed all the necessary changes and implications.
Mark's response to this was:

"""
I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)

More seriously, I don't think this is a roadblock -- I doubt that there are real differences between U5.1.0 and U4.1.0 in terms of conformance that would be touched by Python -- the conformance changes tend to be either completely backward compatible or very esoteric. What I can do is to review the Python support to see if and where there are any problems, but I wouldn't anticipate any.
"""

Which suggests that he believes that the differences in the database are very minor, and that upgrading just the database would not cause any problems for code that worked well with the 4.1.0 database.
I think it's better to look through all the changes and then come up with proper support for 2.7/3.1. If Google wants to contribute to this, even better. To avoid duplication of work or heading in different directions, it may be a good idea to create a unicode-sig to discuss things.
Not me. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)

On Fri, Aug 22, 2008 at 3:25 AM, Guido van Rossum <guido@python.org> wrote:
So while we could say: "we provide access to the Unicode 5.1.0 database", we cannot say: "we support Unicode 5.1.0", simply because we have not reviewed all the necessary changes and implications.
Mark's response to this was:
""" I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)
is the suggestion to *replace* the 4.1.0 database with a 5.1.0 database, or to add yet another database in that module?

(how's the 3.2/4.1 dual support implemented? do we have two distinct datasets, or are the differences encoded in some clever way? would it make sense to split the unicodedata module into three separate modules, one for each major Unicode version?)

</F>

On Fri, Aug 22, 2008 at 3:47 AM, Fredrik Lundh <fredrik@pythonware.com> wrote:
On Fri, Aug 22, 2008 at 3:25 AM, Guido van Rossum <guido@python.org> wrote: [MAL]
So while we could say: "we provide access to the Unicode 5.1.0 database", we cannot say: "we support Unicode 5.1.0", simply because we have not reviewed all the necessary changes and implications.
Mark's response to this was:
""" I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)
is the suggestion to *replace* the 4.1.0 database with a 5.1.0 database, or to add yet another database in that module?
That's up to us. I don't know what the reason was for keeping the 3.2.0 database around -- does anyone here recall ever using it? For what? I think Mark believes that 5.1.0 is very much backwards compatible with 4.1.0 so that there is no need to retain access to 4.1.0; but as I said I don't know the use case so who knows.
(how's the 3.2/4.1 dual support implemented? do we have two distinct datasets, or are the differences encoded in some clever way? would it make sense to split the unicodedata module into three separate modules, one for each major Unicode version?)
The current API looks fine to me: unicodedata is the latest version whereas unicodedata.ucd_3_2_0 is the older version. The APIs are the same; there's a tiny bit of code in the generated _db.h file that expresses the differences:

static const change_record* get_change_3_2_0(Py_UCS4 n)
{
    int index;
    if (n >= 0x110000)
        index = 0;
    else {
        index = changes_3_2_0_index[n>>7];
        index = changes_3_2_0_data[(index<<7)+(n & 127)];
    }
    return change_records_3_2_0+index;
}

static Py_UCS4 normalization_3_2_0(Py_UCS4 n)
{
    switch(n) {
    case 0x2f868: return 0x2136A;
    case 0x2f874: return 0x5F33;
    case 0x2f91f: return 0x43AB;
    case 0x2f95f: return 0x7AAE;
    case 0x2f9bf: return 0x4D57;
    default: return 0;
    }
}

--Guido van Rossum (home page: http://www.python.org/~guido/)
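(A small illustration of the two-version behavior described above: code points assigned after 3.2 resolve in the current database but not in the snapshot. U+1900 LIMBU VOWEL-CARRIER LETTER is used on the assumption that it was added in Unicode 4.0:)

import unicodedata
from unicodedata import ucd_3_2_0

ch = u'\u1900'
print unicodedata.name(ch)    # resolves against the current (4.1.0) data
try:
    ucd_3_2_0.name(ch)        # name() raises ValueError for unassigned code points
except ValueError:
    print 'unassigned in the 3.2.0 snapshot'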

On Fri, Aug 22, 2008 at 4:59 PM, Guido van Rossum <guido@python.org> wrote:
(how's the 3.2/4.1 dual support implemented? do we have two distinct datasets, or are the differences encoded in some clever way? would it make sense to split the unicodedata module into three separate modules, one for each major Unicode version?)
The current API looks fine to me: unicodedata is the latest version whereas unicodedata.ucd_3_2_0 is the older version. The APIs are the same; there's a tiny bit of code in the generated _db.h file that expresses the differences:
static const change_record* get_change_3_2_0(Py_UCS4 n)
{
    int index;
    if (n >= 0x110000)
        index = 0;
    else {
        index = changes_3_2_0_index[n>>7];
        index = changes_3_2_0_data[(index<<7)+(n & 127)];
    }
    return change_records_3_2_0+index;
}
there's a bunch of data tables as well, but they don't seem to be very large. looks like Martin did a thorough job here.

... digging digging digging ...

yes, the generator script produces difference tables between the main version and a list of older versions. I'd say it's worth running the script on the 5.1.0 tables, and if it doesn't choke, compare the resulting table with the corresponding table for 4.1.0 (a simple loop fetching the main properties for all code points). if the differences look reasonably small, switch to 5.1.0 and keep the others.

I can tinker a little with this over the weekend, unless Martin tells me not to ;-)

</F>
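(A rough sketch of the comparison loop described above. A stock interpreter only bundles one current database plus the 3.2.0 snapshot, so this diffs those two as a stand-in; the real check would run the same loop against 4.1.0 and 5.1.0 builds:)

import unicodedata
from unicodedata import ucd_3_2_0

def props(db, ch):
    # the "main properties" for one code point
    return (db.category(ch), db.bidirectional(ch),
            db.combining(ch), db.decomposition(ch))

changed = [cp for cp in xrange(0x10000)     # BMP only, so narrow builds work too
           if props(unicodedata, unichr(cp)) != props(ucd_3_2_0, unichr(cp))]
print len(changed), 'code points differ between the two databases'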

when did Python-Dev turn into a members only list, btw?

---
Your mail to 'Python-Dev' with the subject

    Re: Unicode 5.1.0

Is being held until the list moderator can review it for approval.

The reason it is being held:

    Post by non-member to a members-only list
---

I think it's an anti-spam measure. Anybody can be a member though. :-)

On Fri, Aug 22, 2008 at 8:15 AM, Fredrik Lundh <fredrik@pythonware.com> wrote:
when did Python-Dev turn into a members only list, btw?
---
Your mail to 'Python-Dev' with the subject
Re: Unicode 5.1.0
Is being held until the list moderator can review it for approval.
The reason it is being held:
Post by non-member to a members-only list
--Guido van Rossum (home page: http://www.python.org/~guido/)

2008/8/22 Fredrik Lundh <fredrik@pythonware.com>:
On Fri, Aug 22, 2008 at 4:59 PM, Guido van Rossum <guido@python.org> wrote:
(how's the 3.2/4.1 dual support implemented? do we have two distinct datasets, or are the differences encoded in some clever way? would it make sense to split the unicodedata module into three separate modules, one for each major Unicode version?)
The current API looks fine to me: unicodedata is the latest version whereas unicodedata.ucd_3_2_0 is the older version. The APIs are the same; there's a tiny bit of code in the generated _db.h file that expresses the differences:
static const change_record* get_change_3_2_0(Py_UCS4 n)
{
    int index;
    if (n >= 0x110000)
        index = 0;
    else {
        index = changes_3_2_0_index[n>>7];
        index = changes_3_2_0_data[(index<<7)+(n & 127)];
    }
    return change_records_3_2_0+index;
}
there's a bunch of data tables as well, but they don't seem to be very large. looks like Martin did a thorough job here.
... digging digging digging ...
yes, the generator script produces difference tables between the main version and a list of older versions. I'd say it's worth running the script on the 5.1.0 tables, and if it doesn't choke, compare the resulting table with the corresponding table for 4.1.0 (a simple loop fetching the main properties for all code points). if the differences look reasonably small, switch to 5.1.0 and keep the others.
Right, that's my hope as well. I believe the changes between 3.2 and 4.1 were much larger than more recent changes. (Yay convergence! :-)
I can tinker a little with this over the weekend, unless Martin tells me not to ;-)
That would be great!

--Guido van Rossum (home page: http://www.python.org/~guido/)

I can tinker a little with this over the weekend, unless Martin tells me not to ;-)
Go ahead; I can't work on this at the moment, anyway. I would also be confident that a mere replacement of 4.1 with 5.1 should be easy, and I see no reason to keep the 4.1 version. Perhaps makeunicodedata should list *why* certain old versions remain supported; for 3.2, the use case is IDNA. Regards, Martin

On Fri, Aug 22, 2008 at 07:59:46AM -0700, Guido van Rossum wrote:
That's up to us. I don't know what the reason was for keeping the 3.2.0 database around -- does anyone here recall ever using it? For what?
RFC 3491, one of the internationalized domain name RFCs, explicitly requires Unicode 3.2.0, so Lib/stringprep.py needs to use the old database and we have to keep 3.2.0 available. Maybe no specs depend on 4.1.0, so it could simply be replaced by 5.1.0. --amk

That's up to us. I don't know what the reason was for keeping the 3.2.0 database around -- does anyone here recall ever using it? For what?
It's needed for IDNA. The IDNA RFC requires that Unicode 3.2 is used for performing IDNA (in particular, for determining what a valid domain name is). The IDNA people consider it security-relevant that it is really the 3.2 database, and would probably consider it a serious security bug if newer Python versions suddenly started to use newer Unicode databases for IDNA. At some point, IDNA might get updated to a newer version of the Unicode spec; we can then drop 3.2 (and stick with whatever the RFC then specifies). Regards, Martin
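(A minimal illustration of the dependency Martin describes: Lib/stringprep.py is generated against the 3.2.0 tables -- it literally does `from unicodedata import ucd_3_2_0 as unicodedata` -- and the idna codec builds on it:)

import stringprep

print u'python.org'.encode('idna')       # nameprep runs against the Unicode 3.2 data
# RFC 3454 table A.1 lists code points unassigned *in Unicode 3.2*; U+0221
# (assigned in a later version) is assumed here to be one of them.
print stringprep.in_table_a1(u'\u0221')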

Martin v. Löwis <martin <at> v.loewis.de> writes:
It's needed for IDNA. The IDNA RFC requires that Unicode 3.2 is used for performing IDNA (in particular, for determining what a valid domain name is).
Speaking of which, Martin, did you take a look at http://bugs.python.org/issue3232 ? I suppose the fix is trivial, but I don't know what it should be :-) Regards Antoine.

is the suggestion to *replace* the 4.1.0 database with a 5.1.0 database, or to add yet another database in that module?
I would replace it.
(how's the 3.2/4.1 dual support implemented?
The compiler needs data files for all supported versions, with old_versions listing the, well, old versions. It then computes deltas, expecting that they should mostly consist of new assignments (i.e. characters unassigned in 3.2 might be assigned in newer versions). It detects all differences, but might not be able to represent all changes.
do we have two distinct datasets, or are the differences encoded in some clever way?
The latter. It doesn't really need to be that clever: primarily just a compressed list of "new" characters is needed, per version.
would it make sense to split the unicodedata module into three separate modules, one for each major Unicode version?)
You couldn't use the space savings then, I suppose. Regards, Martin
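(A toy version of the delta computation Martin describes, assuming UnicodeData.txt files for both versions sit in the current directory under the hypothetical names used below; the real logic lives in Tools/unicode/makeunicodedata.py:)

def assigned(path):
    # Collect the code points assigned in one UnicodeData.txt.
    # NB: big ranges (CJK, Hangul, ...) appear as "First>"/"Last>" record
    # pairs; a real parser must expand them. Skipped here for brevity.
    points = set()
    for line in open(path):
        fields = line.split(';')
        points.add(int(fields[0], 16))
    return points

old = assigned('UnicodeData-4.1.0.txt')
new = assigned('UnicodeData-5.1.0.txt')
print 'newly assigned:', len(new - old)
print 'removed (should be empty):', len(old - new)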

On 2008-08-22 03:25, Guido van Rossum wrote:
On Thu, Aug 21, 2008 at 2:26 PM, M.-A. Lemburg <mal@egenix.com> wrote:
On 2008-08-21 22:35, Guido van Rossum wrote:
I was just paid a visit by my Google colleague Mark Davis, co-founder of the Unicode project and the president of the Unicode Consortium. He would like to see improved Unicode support for Python. (Well duh. :-) On his list of top priorities are:
1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations
I've tried to explain our release schedule and no-new-features-in-point-releases policies to him, and he understands that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will have to wait for 2.7 and 3.1, respectively. However, I've kept the door slightly ajar for adding #1 -- it can't be too much work and it can't have too much impact. Or can it? I don't actually know what the impact would be, so I'd like some input from developers who are closer to the origins of the unicodedata module.
The two, quite separate, questions, then, are (a) how much work would it be to upgrade to version 5.1.0 of the database; and (b) would it be acceptable to do this post-beta3 (but before rc1). If the answer to (b) is positive, Google can help with (a).
In general, Google has needs in this area that can't wait for 2.7/3.1, so what we may end up doing is to create internal implementations of all three features (compatible with Python 2.4 and later), publish them as open source on Google Code, and fold them into core Python at the first opportunity, which would likely be 2.7 and 3.1.
Comments?

There are two things to consider:
unicodedata is just an optimized database for accessing code point properties of a specific Unicode version (currently 4.1.0 and 3.2.0). Adding support for a new version needs some work on the generation script, perhaps keeping the 4.1.0 version of it like we did for 3.2.0, but that's about it.
However, there are other implications to consider when moving to Unicode 5.1.0.
Just see the top of http://www.unicode.org/versions/Unicode5.1.0/ for a summary of changes compared to 5.0, plus http://www.unicode.org/versions/Unicode5.0.0/ for changes between 4.1.0 and 5.0.
So while we could say: "we provide access to the Unicode 5.1.0 database", we cannot say: "we support Unicode 5.1.0", simply because we have not reviewed all the necessary changes and implications.
Mark's response to this was:
""" I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)
More seriously, I don't think this is a roadblock -- I doubt that there are real differences between U5.1.0 and U4.1.0 in terms of conformance that would be touched by Python -- the conformance changes tend to be either completely backward compatible or very esoteric. What I can do is to review the Python support to see if and where there are any problems, but I wouldn't anticipate any. """
Which suggests that he believes that the differences in the database are very minor, and that upgrading just the database would not cause any problems for code that worked well with the 4.1.0 database.
Fine with me.
I think it's better to look through all the changes and then come up with proper support for 2.7/3.1. If Google wants to contribute to this, even better. To avoid duplication of work or heading in different directions, it may be a good idea to create a unicode-sig to discuss things.
Not me. :-)
I would really like to see more Unicode support in Python, e.g. for collation, compression, indexing based on graphemes and code points, better support for special casing situations (to cover e.g. the dotted vs. non-dotted i in the Turkish scripts), etc.

There are also a few changes that we'd need to incorporate into the UTF codecs, e.g. warn about more ill-formed byte sequences.

Would Google be willing to contribute such support or part of it?

-- Marc-Andre Lemburg

2008/8/25 M.-A. Lemburg <mal@egenix.com>:
I would really like to see more Unicode support in Python, e.g. for collation, compression, indexing based on graphemes and code points, better support for special casing situations (to cover e.g. the dotted vs. non-dotted i in the Turkish scripts), etc.
There are also a few changes that we'd need to incorporate into the UTF codecs, e.g. warn about more ill-formed byte sequences.
Would Google be willing to contribute such support or part of it?
That depends purely on how much need Google itself has for these features. I'll ask around, but for now I wouldn't bet on anything beyond the three points I raised at the start of this thread:

1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
2008/8/25 M.-A. Lemburg <mal@egenix.com <mailto:mal@egenix.com>>:
I would really like to see more Unicode support in Python, e.g. for collation, compression, indexing based on graphemes and code points, better support for special casing situations (to cover e.g. the dotted vs. non-dotted i in the Turkish scripts), etc.
There are also a few changes that we'd need to incorporate into the UTF codecs, e.g. warn about more ill-formed byte sequences.
Would Google be willing to contribute such support or part of it?
That depends purely on how much need Google itself has for these features. I'll ask around, but for now I wouldn't bet on anything beyond the three points I raised at the start of this thread:
1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations
I think an Improve Unicode Support PEP would be a good idea to collect (and get approval or not for) various ideas from various people, even if Google only implements part of the PEP.

Guido van Rossum wrote:
I was just paid a visit by my Google colleague Mark Davis, co-founder of the Unicode project and the president of the Unicode Consortium. He would like to see improved Unicode support for Python. (Well duh. :-) On his list of top priorities are:
1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations
I've tried to explain our release schedule and no-new-features-in-point-releases policies to him, and he understands that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will have to wait for 2.7 and 3.1, respectively. However, I've kept the door slightly ajar for adding #1 -- it can't be too much work and it can't have too much impact. Or can it? I don't actually know what the impact would be, so I'd like some input from developers who are closer to the origins of the unicodedata module.
The two, quite separate, questions, then, are (a) how much work would it be to upgrade to version 5.1.0 of the database; and (b) would it be acceptable to do this post-beta3 (but before rc1). If the answer to (b) is positive, Google can help with (a).
http://www.unicode.org/versions/Unicode5.1.0/ "Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements..." to existing features, including new files and upgrades to existing files. Sounds close to adding features ;-)
In general, Google has needs in this area that can't wait for 2.7/3.1, so what we may end up doing is to create internal implementations of all three features (compatible with Python 2.4 and later), publish them as open source on Google Code, and fold them into core Python at the first opportunity, which would likely be 2.7 and 3.1.
If possible, I would suggest going a bit further and releasing a '3rd party' replacement/extension package, including a Windows installer, that is also listed on PyPI. Revised releases could and might need to be done even more rapidly than the bugfix release schedule would allow. (This could be done with other proposed new/revised modules also.)

What would need to be done now, I believe, if possible and acceptable, is to slightly repackage the core to put unicode (3.0 strings) and _re* code in a separate library so that they can be drop-in replaced or masked.

Terry Jan Reedy

On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
http://www.unicode.org/versions/Unicode5.1.0/ "Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements..." to existing features, including new files and upgrades to existing files. Sounds close to adding features ;-)
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.

-Barry

On Mon, Aug 25, 2008 at 12:34 PM, Barry Warsaw <barry@python.org> wrote:
On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
http://www.unicode.org/versions/Unicode5.1.0/ "Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements..." to existing features, including new files and upgrades to existing files. Sounds close to adding features ;-)
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.
+1

Something I think we should also be considering is the 2.7/3.1 release cycle. I propose that we shorten it to ~1 year from 2.6/3.0's release with our main aim being binding 2.x and 3.x more closely. This would get the new unicode features out fairly quickly without having to wait another 2.5 years like 2.5 -> 2.6.

--
Cheers,
Benjamin Peterson
"There's no place like 127.0.0.1."

On Mon, Aug 25, 2008 at 10:52 AM, Benjamin Peterson <musiccomposition@gmail.com> wrote:
On Mon, Aug 25, 2008 at 12:34 PM, Barry Warsaw <barry@python.org> wrote:
On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
http://www.unicode.org/versions/Unicode5.1.0/ "Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements..." to existing features, including new files and upgrades to existing files. Sounds close to adding features ;-)
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.
+1
Something I think we should also be considering is the 2.7/3.1 release cycle. I propose that we shorten it to ~1 year from 2.6/3.0's release with our main aim being binding 2.x and 3.x more closely. This would get the new unicode features out fairly quickly without having to wait another 2.5 years like 2.5 -> 2.6.
I was never proposing to support any new features in 2.6/3.0. I was only proposing to update the data files that we already support to the versions provided by 5.1.0. Those data files should have the same format, just slightly improved content: some new characters, some corrected properties. Fredrik says it best:
at least two Unicode experts have stated that they don't think the changes are that important. determining exactly what the changes are to the *core* character database was the whole point of my offer to tinker with this.
(I got distracted due to compiler issues and certain other things to be announced later, but I expect to have some results later this week).
--Guido van Rossum (home page: http://www.python.org/~guido/)

On Mon, Aug 25, 2008 at 10:56 AM, Guido van Rossum <guido@python.org> wrote:
On Mon, Aug 25, 2008 at 10:52 AM, Benjamin Peterson <musiccomposition@gmail.com> wrote:
On Mon, Aug 25, 2008 at 12:34 PM, Barry Warsaw <barry@python.org> wrote:
On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
http://www.unicode.org/versions/Unicode5.1.0/ "Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements..." to existing features, including new files and upgrades to existing files. Sounds close to adding features ;-)
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.
+1
Something I think we should also be considering is the 2.7/3.1 release cycle. I propose that we shorten it to ~1 year from 2.6/3.0's release with our main aim being binding 2.x and 3.x more closely. This would get the new unicode features out fairly quickly without having to wait another 2.5 years like 2.5 -> 2.6.
I was never proposing to support any new features in 2.6/3.0. I was only proposing to update the data files that we already support to the versions provided by 5.1.0. Those data files should have the same format, just slightly improved content: some new characters, some corrected properties. Fredrik says it best:
at least two Unicode experts have stated that they don't think the changes are that important. determining exactly what the changes are to the *core* character database was the whole point of my offer to tinker with this.
(I got distracted due to compiler issues and certain other things to be announced later, but I expect to have some results later this week).
Plus the Europeans who probably use Unicode more than the dissenting Americans also seem to think it's a good idea. It's just a data table, and it's auto-generated, *and* one of the main guys from the Unicode Consortium is willing to help. I say let the change go in. -Brett

Barry Warsaw wrote:
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.
at least two Unicode experts have stated that they don't think the changes are that important. determining exactly what the changes are to the *core* character database was the whole point of my offer to tinker with this. (I got distracted due to compiler issues and certain other things to be announced later, but I expect to have some results later this week.)

</F>

On Aug 25, 2008, at 1:53 PM, Fredrik Lundh wrote:
Barry Warsaw wrote:
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.
at least two Unicode experts have stated that they don't think the changes are that important. determining exactly what the changes are to the *core* character database was the whole point of my offer to tinker with this.
You don't mean the experts claimed they weren't important, right? Unimportant changes definitely don't need to go in now <wink>.

-Barry

On Aug 25, 2008, at 3:17 PM, Fredrik Lundh wrote:
Barry Warsaw wrote:
You don't mean the experts claimed they weren't important, right? Unimportant changes definitely don't need to go in now <wink>.
Well, at least Guido managed to figure out what I was trying to say ;-)
Yeah, I was just being curmudgeonly. :)

-B

On 2008-08-25 19:34, Barry Warsaw wrote:
On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
http://www.unicode.org/versions/Unicode5.1.0/ "Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements..." to existing features, including new files and upgrades to existing files. Sounds close to adding features ;-)
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.
The above quote from the Unicode site is misleading in this context. Guido's request was just for updating the Unicode database with the data from 5.1 - without adding new support for properties or changing the interfaces.

See this page for a list of changes to the Unicode database:

http://www.unicode.org/Public/UNIDATA/UCD.html

The main file used for the unicodedata module is called "UnicodeData.txt".

-- Marc-Andre Lemburg
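(For reference, a sketch of what one record in that file looks like -- semicolon-separated, fifteen fields per line; the U+0041 record is shown here as an illustration, not as a spec:)

record = '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'
fields = record.split(';')
print fields[1]    # LATIN CAPITAL LETTER A (character name)
print fields[2]    # Lu -- general category: uppercase letter
print fields[13]   # 0061 -- the lowercase mapping, i.e. 'a'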

On Aug 25, 2008, at 2:15 PM, M.-A. Lemburg wrote:
Guido's request was just for updating the Unicode database with the data from 5.1 - without adding new support for properties or changing the interfaces.
See this page for a list of changes to the Unicode database:
http://www.unicode.org/Public/UNIDATA/UCD.html
The main file used for the unicodedata module is called "UnicodeData.txt".
That's much less scary.

-Barry

Barry Warsaw wrote:
On Aug 21, 2008, at 6:30 PM, Terry Reedy wrote:
http://www.unicode.org/versions/Unicode5.1.0/ "Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements..." to existing features, including new files and upgrades to existing files. Sounds close to adding features ;-)
I agree. This seriously feels like new, potentially high risk code to be adding this late in the game. The BDFL can always override, but unless someone is really convincing that this is low risk high benefit, I'd vote no for 2.6/3.0.
But it's [the] wafer-thin [end of the wedge] ... The difficulties with subprocess suggest there's plenty to do without adding yet one more tiny little task.

regards
Steve

--
Steve Holden
Holden Web LLC
http://www.holdenweb.com/

2008/8/21 Guido van Rossum <guido@python.org>:
The two, quite separate, questions, then, are (a) how much work would it be to upgrade to version 5.1.0 of the database; and (b) would it be acceptable to do this post-beta3 (but before rc1). If the answer to (b) is positive, Google can help with (a).
Two thoughts:

- In view of jumping to a new standard at *this* point, what I'd like to have is a comprehensive test suite for unicodedata in a similar sense to what happens with Decimal... It would be great to have from the Unicode Consortium a series of test cases (in Python, or in something we could process), to verify that we support Unicode properly.

- We always could have a beta4 if it's necessary...

Just my two pesos cents.

Regards,

-- Facundo
Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

Facundo Batista <facundobatista <at> gmail.com> writes:
Two thoughts:
- In view of jumping to a new standard at *this* point, what I'd like to have is a comprehensive test suite for unicodedata in a similar sense to what happens with Decimal... It would be great to have from the Unicode Consortium a series of test cases (in Python, or in something we could process), to verify that we support Unicode properly.
And another question: would it be hard for Google to maintain this separately until at least it's integrated to 3.1?
- We always could have a beta4 if it's necessary...
If we go this route there are lots of attractive things that might justify yet and yet another beta :-) Just my two over-evaluated euro cents. Regards Antoine.

On Fri, Aug 22, 2008 at 6:42 AM, Facundo Batista <facundobatista@gmail.com> wrote:
- In view of jumping to a new standard at *this* point, what I'd like to have is a comprehensive test suite for unicodedata in a similar sense to what happens with Decimal... It would be great to have from the Unicode Consortium a series of test cases (in Python, or in something we could process), to verify that we support Unicode properly.
Unicode conformance isn't specified in the same way as Decimal conformance. While there are certain algorithms that can be tested (e.g. normalization, encoding, decoding), much of the conformance requirements (AFAIK) are expressed in lots of words about providing certain facilities etc. I don't actually think putting lots of effort into this is well-spent; given the mechanical nature of the translation from the unicode database files into C code (see Tools/unicode/makeunicodedata.py) a bug in the translation is likely to result in either bad C code or a systematic error that is easily spotted.
- We always could have a beta4 if it's necessary...
No way.

On Fri, Aug 22, 2008 at 7:54 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
And another question: would it be hard for Google to maintain this separately until at least it's integrated to 3.1?
That's the plan.

--Guido van Rossum (home page: http://www.python.org/~guido/)

I was away for the weekend and am struggling to catch up on my email. Since I haven't digested this entire thread, I'll refrain for the moment from giving my opinion, however this comment jumped out to me.

On Aug 22, 2008, at 9:42 AM, Facundo Batista wrote:
- We always could have a beta4 if it's necessary...
I do not want to slip the schedule if at all possible. If serious security issues, performance problems, or show stopper bugs crop up, then we will obviously slip so that we don't have to put a brown bag over our heads. Slipping to get yet one more feature in is not (IMO) acceptable.

An incentive for keeping the schedule: If we hit our October 1st deadline, then 2.6 and 3.0 will almost certainly be included in some upcoming major new OS releases. If we slip, then it's unlikely to happen.

-Barry

Hi,

On Thu, Aug 21, 2008 at 23:35, Guido van Rossum <guido@python.org> wrote:
I was just paid a visit by my Google colleague Mark Davis, co-founder of the Unicode project and the president of the Unicode Consortium. He would like to see improved Unicode support for Python. (Well duh. :-) On his list of top priorities are:
1. Upgrade the unicodedata module to the Unicode 5.1.0 standard
2. Extend the unicodedata module with some additional properties
3. Add support for Unicode properties to the regex syntax, including Boolean combinations
Adding support for SpecialCasing rules[0] would be good for full Unicode support too. It would fix the i/I problems that currently come up with the Turkish locale.

[0] http://unicode.org/Public/UNIDATA/SpecialCasing.txt

Regards, ismail

-- Programmer Excuse #17: The processor stack spring has worn out.
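(A small sketch of the problem described above: Python's built-in case mappings are locale-independent and one-to-one, so the Turkish dotted/dotless i pairs come out wrong:)

print repr(u'i'.upper())    # u'I' -- Turkish expects U+0130 (I WITH DOT ABOVE)
print repr(u'I'.lower())    # u'i' -- Turkish expects U+0131 (dotless i)
# SpecialCasing.txt support implies a language-aware API, e.g. a
# hypothetical u'i'.upper(lang='tr'); no such parameter exists today.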