[Python-Dev] Unicode 5.1.0

Mon Aug 25 16:49:45 CEST 2008

On 2008-08-22 03:25, Guido van Rossum wrote:
> On Thu, Aug 21, 2008 at 2:26 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>> On 2008-08-21 22:35, Guido van Rossum wrote:
>>> I was just paid a visit by my Google colleague Mark Davis, co-founder
>>> of the Unicode project and the president of the Unicode Consortium. He
>>> would like to see improved Unicode support for Python. (Well duh. :-)
>>> On his list of top priorities are:
>>>
>>> 1. Upgrade the unicodata module to the Unicode 5.1.0 standard
>>> 2. Extende the unicodedata module with some additional properties
>>> 3. Add support for Unicode properties to the regex syntax, including
>>> Boolean combinations
>>>
>>> I've tried to explain our release schedule and
>>> no-new-features-in-point-releases policies to him, and he understands
>>> that it's too late to add #2 or #3 to 2.6 and 3.0, and that these will
>>> have to wait for 2.7 and 3.1, respectively. However, I've kept the
>>> door sligthtly ajar for adding #1 -- it can't be too much work and it
>>> can't have too much impact. Or can it? I don't actually know what the
>>> impact would be, so I'd like some impact from developers who are
>>> closer to the origins of the unicodedata module.
>>>
>>> The two, quite separate, questions, then, are (a) how much work would
>>> it be to upgrade to version 5.1.0 of the database; and (b) would it be
>>> acceptable to do this post-beta3 (but before rc1). If the answer to
>>> (b) is positive, Google can help with (a).
>>>
>>> In general, Google has needs in this area that can't wait for 2.7/3.1,
>>> so what we may end up doing is create internal implementations of all
>>> three features (compatible with Python 2.4 and later), publish them as
>>> open source on Google Code, and fold them into core Python at the
>>> first opportunity, which would likely be 2.7 and 3.1.
>>>
>>> Comments?
>> There are two things to consider:
>>
>> unicodedata is just an optimized database for accessing code
>> point properties of a specific Unicode version (currently 4.1.0
>> and 3.2.0). Adding support for a new version needs some work on
>> the generation script, perhaps keeping the 4.1.0 version of it
>> like we did for 3.2.0, but that's about it.
>>
>> However, there are other implications to consider when moving to
>> Unicode 5.1.0.
>>
>> Just see the top of http://www.unicode.org/versions/Unicode5.1.0/
>> for a summary of changes compared to 5.0, plus
>> http://www.unicode.org/versions/Unicode5.0.0/ for changes between
>> 4.1.0 and 5.0.
>>
>> So while we could say: "we provide access to the Unicode 5.1.0
>> database", we cannot say: "we support Unicode 5.1.0", simply because
>> we have not reviewed the all the necessary changes and implications.
> 
> Mark's response to this was:
> 
> """
> I'd suspect that you'll be as conformant to U5.1.0 as you were to U4.1.0 ;-)
> 
> More seriously, I don't think this is a roadblock -- I doubt that
> there are real differences between U5.1.0 and U4.10 in terms of
> conformance that would be touched by Python -- the conformance changes
> tend to be either completely backward compatible or very esoteric.
> What I can do is to review the Python support to see if and where
> there are any problems, but I wouldn't anticipate any.
> """
> 
> Which suggests that he believes that the differences in the database
> are very minor, and that upgrading just the database would not cause
> any problems for code that worked well with the 4.1.0 database.

Fine with me.

>> I think it's better to look through all the changes and then come
>> up with proper support for 2.7/3.1. If Google wants to contribute
>> to this, even better. To avoid duplication of work or heading in
>> different directions, it may be a good idea to create a
>> unicode-sig to discuss things.
> 
> Not me. :-)

I would really like to see more Unicode support in Python, e.g.
for collation, compression, indexing based on graphemes and
code points, better support for special casing situations (to
cover e.g. the dotted vs. non-dotted i in the Turkish scripts),
etc.

There are also a few changes that we'd need to incorporate into
the UTF codecs, e.g. warn about more ill-formed byte sequences.

Would Google be willing to contribute such support or part
of it ?

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 25 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611