[Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky
alexander.belopolsky at gmail.com
Sun Nov 28 21:24:37 CET 2010
Two recently reported issues brought to light the fact that the Python
language definition is closely tied to character properties maintained
by the Unicode Consortium. [1,2] For example, when Python switches to
Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two
additional characters that Python can use in identifiers. [3]
With Python 3.1:
>>> exec('\u0CF1 = 1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    ೱ = 1
    ^
SyntaxError: invalid character in identifier
but with Python 3.2a4:
>>> exec('\u0CF1 = 1')
>>> eval('\u0CF1')
1
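To see which side of this change a given interpreter is on, one can ask
it which UCD version it was built with and how it classifies the
character. The following is only a sketch; identifier_status is a
hypothetical helper name, not something from the standard library.

```python
import sys
import unicodedata


def identifier_status(ch):
    """Return (UCD version, general category, isidentifier() result) for ch.

    Hypothetical helper for illustration only.
    """
    return (
        unicodedata.unidata_version,   # UCD version this interpreter ships
        unicodedata.category(ch),      # general category, e.g. 'Lo' or 'So'
        ch.isidentifier(),             # True if ch alone is a valid identifier
    )


ucd_version, category, ok = identifier_status('\u0CF1')
print(sys.version_info[:2], ucd_version, category, ok)
```

The printed values depend on the interpreter, which is the point: the
same source text can be valid or invalid depending on the bundled UCD.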
Of course, the likelihood is low that this change will affect any
user, but the change in str.isspace() reported in [1] is likely to
cause some trouble:
Python 2.6.5:
>>> u'A\u200bB'.split()
[u'A', u'B']
Python 2.7:
>>> u'A\u200bB'.split()
[u'A\u200bB']
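A program that needs splitting behaviour that does not move with the
UCD can name its separators explicitly instead of relying on
str.split() with no arguments. A minimal sketch, assuming that ASCII
whitespace is the set the program cares about; split_ascii_whitespace
is a hypothetical name and the choice of characters is an assumption:

```python
import re
import unicodedata

# Fixed set of ASCII whitespace characters (an assumption made for this
# sketch); it does not change when the Unicode database is updated.
_ASCII_WS = re.compile(r'[ \t\n\r\f\v]+')


def split_ascii_whitespace(text):
    """Split text on runs of ASCII whitespace only, dropping empty parts."""
    return [part for part in _ASCII_WS.split(text) if part]


zwsp = '\u200b'  # ZERO WIDTH SPACE, the character from the example above
print(unicodedata.category(zwsp), zwsp.isspace())
print(split_ascii_whitespace('A\u200bB  C\tD'))
```

With this helper, U+200B is never treated as a separator, whatever the
interpreter's str.isspace() says about it.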
While we have little choice but to follow the UCD in defining
str.isidentifier(), I think Python can promise users more stability in
what it treats as a space or as a digit in its builtins. For example,
I don't think that supporting
>>> float('١٢٣٤.٥٦')
1234.56
is more important than assuring users that once their program has
accepted some text as a number, they can assume that the text is
ASCII.
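Until the builtins make such a promise, a program can enforce it
itself by rejecting non-ASCII input before conversion. This is only a
sketch of one way to do that; parse_ascii_float is a hypothetical
helper, not an existing or proposed API:

```python
def parse_ascii_float(text):
    """Convert text to float, but only if text is pure ASCII.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode('ascii')  # raises UnicodeEncodeError on non-ASCII input
    except UnicodeEncodeError:
        raise ValueError('non-ASCII text rejected: %r' % (text,))
    return float(text)


print(parse_ascii_float('1234.56'))
try:
    parse_ascii_float('\u0661\u0662\u0663\u0664.\u0665\u0666')
except ValueError as exc:
    print('rejected:', exc)
```

Note that float() still accepts ASCII spellings such as 'nan' and
'inf', so a stricter check would be needed if those must be excluded
as well.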
[1] http://bugs.python.org/issue10567
[2] http://bugs.python.org/issue10557
[3] http://www.unicode.org/versions/Unicode6.0.0/#Database_Changes