Python and the Unicode Character Database
Two recently reported issues brought to light the fact that the Python language definition is closely tied to character properties maintained by the Unicode Consortium. [1,2] For example, when Python switches to Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two additional characters that Python can use in identifiers. [3] With Python 3.1:
    >>> exec('\u0CF1 = 1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        ೱ = 1
        ^
    SyntaxError: invalid character in identifier
but with Python 3.2a4:
    >>> exec('\u0CF1 = 1')
    >>> eval('\u0CF1')
    1
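To check which side of this change a given interpreter is on, something like the following sketch may help. The helper name identifier_report is my own invention for illustration; it only uses the unicodedata module and str.isidentifier(), and it reports whatever the running interpreter's bundled UCD says rather than asserting a particular answer.

```python
import unicodedata


def identifier_report(ch):
    """Summarize how the running interpreter classifies a single character.

    Hypothetical helper for illustration; not part of the standard library.
    """
    return {
        # Version of the Unicode Character Database bundled with this Python.
        'unidata_version': unicodedata.unidata_version,
        # Character name, or a placeholder if this UCD version has none.
        'name': unicodedata.name(ch, '<unnamed>'),
        # General category, e.g. 'Lo' or 'So'.
        'category': unicodedata.category(ch),
        # Whether the character alone is accepted as an identifier.
        'is_identifier': ch.isidentifier(),
    }


if __name__ == '__main__':
    for key, value in identifier_report('\u0CF1').items():
        print(key, '=', value)
```

Running this under two interpreters built against different UCD versions shows directly whether the identifier rules differ between them.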
Of course, the likelihood is low that this change will affect any user, but the change in str.isspace() reported in [1] is likely to cause some trouble:

Python 2.6.5:

    >>> u'A\u200bB'.split()
    [u'A', u'B']
Python 2.7:
    >>> u'A\u200bB'.split()
    [u'A\u200bB']
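Code that depends on a particular character being treated as whitespace can probe the running interpreter instead of assuming one behaviour. The following is a minimal sketch in Python 3 syntax; the helper name splits_on is an assumption of mine, not an existing API.

```python
import unicodedata

ZWSP = '\u200b'  # ZERO WIDTH SPACE, the character from the report in [1]


def splits_on(ch):
    """Return True if str.split() with no arguments treats ch as a separator.

    Hypothetical helper for illustration only.
    """
    return len(('A' + ch + 'B').split()) == 2


if __name__ == '__main__':
    # Print what this interpreter's UCD says about the character, and
    # whether the builtins agree with each other.
    print('UCD version:', unicodedata.unidata_version)
    print('category   :', unicodedata.category(ZWSP))
    print('isspace()  :', ZWSP.isspace())
    print('split()    :', splits_on(ZWSP))
```

A program that must split on U+200B regardless of the UCD version could pass an explicit separator to split() or use a regular expression with the character listed explicitly, rather than relying on the default whitespace definition.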
While we have little choice but to follow the UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as whitespace or as a digit in its builtins. For example, I don't think that supporting
    >>> float('١٢٣٤.٥٦')
    1234.56

is more important than assuring users that once their program has accepted some text as a number, they can assume that the text is ASCII.

[1] http://bugs.python.org/issue10567
[2] http://bugs.python.org/issue10557
[3] http://www.unicode.org/versions/Unicode6.0.0/#Database_Changes
participants (24)
- "Martin v. Löwis"
- Alexander Belopolsky
- Antoine Pitrou
- Ben Finney
- Benjamin Peterson
- Eric Smith
- Georg Brandl
- Guido van Rossum
- Hagen Fürstenau
- haiyang kang
- James Y Knight
- Joao S. O. Bueno
- Lennart Regebro
- M.-A. Lemburg
- Mark Dickinson
- Michael Foord
- Neil Hodgson
- Nick Coghlan
- Stefan Krah
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Lesher
- Vlastimil Brom