[Python-Dev] Python and the Unicode Character Database

Mon Nov 29 20:38:46 CET 2010

On Mon, Nov 29, 2010 at 1:33 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Mon, 29 Nov 2010 08:22:46 +0100
> "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> > The former ensures that literals in code are always readable; the latter
>> > allows users to enter numbers in their own number system. How could that
>> > be a bad thing?
>>
>> It's YAGNI, feature bloat. It gives the illusion of supporting something
>> that actually isn't supported very well (namely, parsing local number
>> strings). I claim that there is no meaningful application
>> of this feature.
>
> Still, if it's not detrimental and it's not difficult to support,
> then why do you care?

It is difficult to support.  A fix for issue10557 would be much
simpler if we did not support non-European digits.  I have now added a
patch that handles non-ASCII digits, so you can see what's involved.
Note that when the Unicode Consortium inevitably adds more Nd
characters to the non-BMP planes, we will have to add surrogate-pair
support to this code.
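
For readers who have not run into this behavior, here is a minimal
sketch of what "supporting non-European digits" means in practice.  It
assumes a Python 3 interpreter; the sample strings and the describe()
helper are mine, chosen for illustration, and are not taken from the
issue10557 patch.

```python
import unicodedata

# Illustrative samples only: each string spells "123" in a different script.
samples = {
    "ASCII": "123",
    "Arabic-Indic": "\u0661\u0662\u0663",
    "Devanagari": "\u0967\u0968\u0969",
    # Mathematical bold digits live outside the BMP.
    "Math bold (non-BMP)": "\U0001D7CF\U0001D7D0\U0001D7D1",
}

def describe(text):
    """Return (int(text), sorted set of general categories seen in text)."""
    # On a narrow build, iterating over a non-BMP string yields surrogate
    # halves (category 'Cs') rather than the digits themselves.
    categories = {unicodedata.category(ch) for ch in text}
    return int(text), sorted(categories)

for name, text in samples.items():
    value, categories = describe(text)
    print(name, value, categories)
```

Every character here is in category Nd, which is why int() accepts all
four spellings; the last sample is the kind of input that needs
surrogate handling on a narrow build.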

In any case, there is little we can do about it in 3.2 other than fix
bugs like issue10557 without breaking currently valid code, so I
created a separate issue to continue this debate in the context of 3.3.
[issue10581]

Now, I would like to bring this thread back to its subject.  Given
that the UCD now affects the language definition and the standard
library behavior, how should changes to the UCD be handled?

- Should the Python documentation refer to the specific version of
Unicode that it supports?

The current documentation refers to old versions.  Should the version
be updated, or removed to imply the latest?

- How should UCD updates be handled during the language moratorium?

During the PEP 3003 discussion, it was suggested that this be handled
on a case-by-case basis, but I don't see any discussion of the upgrade
to 6.0.0 in PEP 3003.  Should this upgrade be backported to 2.7?

- How specific should the library reference manual be in defining
methods affected by the UCD, such as str.upper()?

- What is an acceptable level of variation between Python
implementations?  For example, if '\UXXXXXXXX'.isalpha() returns true
in one implementation, can it return false in another?  Note that even
CPython narrow and wide builds are presently not consistent in this
respect.
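
To make the questions above concrete, here is a rough sketch of how one
could compare interpreters.  It assumes Python 3; ucd_report() and the
probe characters are hypothetical names and choices of mine, not an
existing tool.  Running it under two builds or implementations and
diffing the output shows where UCD-dependent behavior diverges.

```python
import sys
import unicodedata

def ucd_report(chars):
    """Collect the UCD version and UCD-dependent results for each character."""
    return {
        # Version of the Unicode database compiled into this interpreter.
        "unidata_version": unicodedata.unidata_version,
        # 0xffff on a narrow build, 0x10ffff on a wide one.
        "maxunicode": hex(sys.maxunicode),
        "chars": {
            "U+%04X" % ord(c): {
                "category": unicodedata.category(c),
                "isalpha": c.isalpha(),
                "upper": c.upper(),
            }
            for c in chars
        },
    }

# Probe characters picked for illustration: a Latin letter with a special
# uppercase mapping and a non-BMP letter (DESERET SMALL LETTER LONG I).
# Passing them as a list keeps each probe a single string, although a
# narrow build stores the second one as a surrogate pair and ord() on it
# may fail there.
report = ucd_report(["\u00df", "\U00010428"])
print(report["unidata_version"], report["maxunicode"])
for code, info in sorted(report["chars"].items()):
    print(code, info)
```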

 [issue10581] http://bugs.python.org/issue10581