[Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky
alexander.belopolsky at gmail.com
Tue Nov 30 04:46:33 CET 2010
On Mon, Nov 29, 2010 at 5:09 PM, Steven D'Aprano <steve at pearwood.info> wrote:
..
> But in any case, please don't conflate the question of whether Python should
> accept j and/or i for complex numbers with the question of supporting
> non-arabic numerals. The two issues are unrelated.
The two issues are related because they are both about how strict
numerical constructors should be. If we want to accept wide
variations in how numbers can be spelled, then surely using i for the
imaginary unit is much more common than using ७ for the digit 7.
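A minimal sketch of the asymmetry described above, assuming a recent CPython 3 interpreter. The helper name accepts() is hypothetical and exists only for this illustration:

```python
def accepts(constructor, text):
    """Return True if constructor(text) succeeds, False on ValueError."""
    try:
        constructor(text)
    except ValueError:
        return False
    return True

devanagari_seven = "\u096d"  # DEVANAGARI DIGIT SEVEN

# int() accepts a Unicode decimal digit outside ASCII ...
int_accepts_devanagari = accepts(int, devanagari_seven)
# ... while complex() accepts only 'j' (or 'J') for the imaginary unit.
complex_accepts_j = accepts(complex, "1+2j")
complex_accepts_i = accepts(complex, "1+2i")

print(int_accepts_devanagari, complex_accepts_j, complex_accepts_i)
```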
I see two problems with supporting non-ASCII spellings:
1. Support costs.
2. User confusion.
The two are related because when users are confused, they will report
invalid bugs when Python does not meet their expectations. For
example, why
>>> int('１２３', 10)
123
works, but
>>> int('１２３ＡＢＣ', 16)
Traceback (most recent call last):
..
UnicodeEncodeError: 'decimal' codec can't encode character '\uff21' in
position 3: invalid decimal Unicode string
does not? And if 'decimal' is a codec, why
>>> '123'.encode('decimal')
Traceback (most recent call last):
...
LookupError: unknown encoding: decimal
Before anyone suggests that int(.., 16) should consult the new
Hex_Digit property in the UCD, let me remind you that int() supports
bases from 2 through 36.
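The inconsistency can be reproduced with a short sketch, assuming a recent CPython 3 interpreter. The exact exception type and message differ between versions, so the code catches only ValueError, of which UnicodeEncodeError is a subclass. The helper name try_int() is hypothetical:

```python
def try_int(text, base):
    """Return int(text, base), or None if int() rejects the string."""
    try:
        return int(text, base)
    except ValueError:  # also covers UnicodeEncodeError
        return None

fullwidth_digits = "\uff11\uff12\uff13"                  # FULLWIDTH DIGIT ONE..THREE
fullwidth_hex = fullwidth_digits + "\uff21\uff22\uff23"  # plus FULLWIDTH A, B, C

results = {
    "fullwidth, base 10": try_int(fullwidth_digits, 10),  # digits are accepted
    "fullwidth, base 16": try_int(fullwidth_hex, 16),     # letters are rejected
    "ASCII, base 16": try_int("123ABC", 16),
    "ASCII, base 36": try_int("z", 36),                   # letters go up to z
}

for label, value in results.items():
    print(label, "->", value)
```

Fullwidth digits are accepted in any base, but letters used as digits must be ASCII, and that applies to every base above 10, not just 16.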
I thought Python design was primarily driven by practicality. Here
the only plausible argument that one can make is that if Unicode says
it is a digit, we should treat it as a digit. Purity over
practicality.
In practical terms, UCD comes at a price. The unicodedata module size
is over 700K on my machine. This is almost half the size of the
python executable and by far the largest extension module. (Only the CJK
encodings come close.) Making builtins depend on the largest
extension module for operation does not strike me as sound design.
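To check the size on another machine, something like the following sketch may help. It assumes unicodedata is built as a separate shared library; on builds where the module is compiled into the interpreter there is no file to measure, and the hypothetical helper module_size_bytes() returns None:

```python
import os
import unicodedata

def module_size_bytes(module):
    """Return the on-disk size of a module's file, or None if it has none."""
    path = getattr(module, "__file__", None)
    if path is None or not os.path.exists(path):
        return None  # statically linked or otherwise not a separate file
    return os.path.getsize(path)

size = module_size_bytes(unicodedata)
if size is None:
    print("unicodedata is not a separate file in this build")
else:
    print("unicodedata: %d KiB" % (size // 1024))
```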