[Python-Dev] Python and the Unicode Character Database

Tue Nov 30 16:05:42 CET 2010

On Mon, Nov 29, 2010 at 4:13 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> - Should Python documentation refer to the specific version of Unicode
>> that it supports?
>
> You mean, mention it somewhere? Sure (although it would be nice if the
> documentation generator would automatically extract it from the source,
> just as it extracts the Python version number).
>
> Of course, such mentioning should explain that this is specific to
> CPython, and not an aspect of Python-the-language.
>
>> Current documentation refers to old versions.  Should version be
>> updated or removed to imply the latest?
>
> What specific reference are you referring to?
>
I found two places: a reference to Unicode 3.0 (!) in the Data Model
section and a reference to 5.2.0 in the unicodedata docs.

See http://mail.python.org/pipermail/docs/2010-November/002074.html
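
For reference, the UCD version a given interpreter was built with can be
read at run time rather than hard-coded in the docs. A minimal sketch,
assuming Python 3; the name version_tuple is only illustrative:

```python
import unicodedata

# unidata_version is a dotted string such as "6.0.0"; the exact value
# depends on the interpreter that runs this.
version_string = unicodedata.unidata_version

# Illustrative helper value: the same version as a tuple of ints, which
# is easier to compare than the raw string.
version_tuple = tuple(int(part) for part in version_string.split("."))

print("UCD version:", version_string)
```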

>> - How UCD updates should be handled during the language moratorium?
>
> It's clearly not affected.
>

This is not what Guido said last year:
"""
> One question:
>
> There are currently number of patch waiting on the tracker for
> additional Unicode feature support and it's also likely that we'll
> want to upgrade to a more recent Unicode version within the
> next few years.
>
> How would such indirect changes be seen under the moratorium ?

That would fall under the Case-by-Case Exemptions section. "Within the
next few years" sounds like it might well wait until the moratorium is
ended though. :-)
"""

http://mail.python.org/pipermail/python-dev/2009-November/093666.html

I don't see it as a big deal, but technically speaking, since Unicode
6.0 changes the properties of two characters so that they become valid
in identifiers, the Python language definition is affected.  For
example, an alternative implementation based on 5.2.0 will not accept a
valid CPython program that uses one of these characters.
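
One way to see how identifier validity follows the UCD is to ask the
interpreter directly. This is only a sketch, assuming Python 3;
identifier_status is a hypothetical helper, and the sample characters
are arbitrary rather than the two affected by 6.0:

```python
def identifier_status(ch):
    """Return (can_start, can_continue) for a single character.

    The answers come from the identifier rules of the running
    interpreter, so they can differ between implementations built
    on different UCD versions.
    """
    can_start = ch.isidentifier()
    # Prefix a known-good start character to test the "continue" role.
    can_continue = ("a" + ch).isidentifier()
    return can_start, can_continue


for ch in ("x", "9", "\u00e9", "-"):
    print(repr(ch), identifier_status(ch))
```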

>> During PEP 3003 discussion, it was suggested to handle it on a case by
>> case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP
>> 3003.
>
> It's covered by "As the standard library is not directly tied to the
> language definition it is not covered by this moratorium."
>

See above.  Also, it has been suggested that the semantics of built-ins
cannot change.  (If that were so, it would put the int('١٢٣٤') debate to
rest, at least for the time being. :-)
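
For context, the behavior under debate looks like this in Python 3. A
small sketch; the variable names are illustrative:

```python
import unicodedata

# ARABIC-INDIC DIGIT ONE through FOUR, written as escapes so the
# example does not depend on the source file encoding.
arabic_indic = "\u0661\u0662\u0663\u0664"

# int() accepts any characters with the decimal-digit property,
# not only ASCII 0-9.
value = int(arabic_indic)

# The per-character decimal values come from the UCD.
digits = [unicodedata.decimal(c) for c in arabic_indic]

print(value, digits)
```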

>>  Should this upgrade be backported to 2.7?
>
> No, it's a new feature.
>
Given that 2.7 will be maintained for 5 years and that the Unicode
Consortium arguably takes backward compatibility very seriously,
wouldn't it make sense to consider a backport at some point?

I am sure we will soon see a bug report that the following does not
work in 2.7: :-)
>>> ord('\N{CAT FACE WITH WRY SMILE}')
128572
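
Code that has to run on interpreters with an older UCD could probe for
the name instead of using the \N escape, which fails at compile time
when the name is unknown. A sketch, assuming Python 3; lookup_or_none
is a hypothetical helper:

```python
import unicodedata


def lookup_or_none(name):
    """Return the character with this Unicode name, or None if the
    running interpreter's UCD does not know the name."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # unicodedata.lookup raises KeyError for unknown names.
        return None


cat = lookup_or_none("CAT FACE WITH WRY SMILE")
if cat is None:
    print("name not in UCD", unicodedata.unidata_version)
else:
    print(ord(cat))
```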

>> - How specific should library reference manual be in defining methods
>> affected by UCD such as str.upper()?
>
> It should specify what this actually does in Unicode terminology
> (probably in addition to a layman's rephrase of that)
>

I opened an issue for this:

http://bugs.python.org/issue10587

>> .. For example, if '\UXXXXXXXX'.isalpha() returns true
>> in one implementation, can it return false in another?
>
> Implementations are free to use any version of the UCD.

I was more concerned about wide and narrow Unicode CPython builds.  Is
it a bug that '\UXXXXXXXX'.isalpha() may disagree even when the two
implementations are based on the same version of the UCD?
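
To make the concern concrete: a narrow build stores a non-BMP character
as two UTF-16 code units and tests each unit separately. The sketch
below approximates that on a single interpreter, assuming Python 3.3 or
later; utf16_code_units is a hypothetical helper and U+10400 is just an
arbitrary non-BMP letter standing in for '\UXXXXXXXX':

```python
import struct


def utf16_code_units(s):
    """Split a string into its UTF-16 code units, each returned as a
    one-character string (surrogates stay as lone surrogates)."""
    data = s.encode("utf-16-le", "surrogatepass")
    count = len(data) // 2
    return [chr(unit) for unit in struct.unpack("<%dH" % count, data)]


ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I, a non-BMP letter

# What a build that sees the whole code point reports.
whole = ch.isalpha()

# Approximation of a per-code-unit check: each surrogate is tested
# on its own, and surrogates are not alphabetic.
units = utf16_code_units(ch)
per_unit = all(u.isalpha() for u in units)

print(whole, per_unit)
```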

Thanks for your answers.