[Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky
alexander.belopolsky at gmail.com
Mon Nov 29 03:32:15 CET 2010
On Sun, Nov 28, 2010 at 6:43 PM, Steven D'Aprano <steve at pearwood.info> wrote:
..
>> is more important than to assure users that once their program
>> accepted some text as a number, they can assume that the text is
>> ASCII.
>
> Seems like a pretty foolish assumption, if you ask me, pretty much akin to
> assuming that if string.isalpha() returns true that string is ASCII.
>
It is not a foolish assumption for the 99.9% of Python users whose code
is written for 2.x. Their strings are byte strings, and string.isdigit()
does imply ASCII even if string.isalpha() does not in many locales.
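
To illustrate why that assumption stops holding in 3.x, here is a minimal
sketch for Python 3, where str is Unicode. The helper is_ascii_number() is
a hypothetical name introduced only for this example; it is not part of
the standard library.

```python
# Python 3 sketch: str is Unicode, so isdigit() no longer implies ASCII.

def is_ascii_number(text):
    """Hypothetical helper: True only for a non-empty run of ASCII digits 0-9."""
    return text != "" and all(c in "0123456789" for c in text)

ascii_digits = "1234"
arabic_indic = "\u0661\u0662\u0663\u0664"  # ARABIC-INDIC DIGIT ONE..FOUR

print(ascii_digits.isdigit(), is_ascii_number(ascii_digits))  # True True
print(arabic_indic.isdigit(), is_ascii_number(arabic_indic))  # True False
print(int(arabic_indic))                                      # 1234
```

Code that needs ASCII-only input in 3.x has to check for it explicitly, as
the helper does, rather than rely on isdigit().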
..
> The fact that this is (apparently) only being raised now means that it isn't
> actually a problem in real life. I'd even say that it's a feature, and that
> if Python didn't support non-Arabic numerals, it should.
>
I raised this problem because I found a bug that is related to this
feature. The bug is also a regression from 2.x.
In 2.7:
>>> float(u'1234\xa1')
..
ValueError: invalid literal for float(): 1234?
The last character is lost, but the error message is still meaningful.
In 3.x, however:
>>> float('1234\xa1')
..
ValueError
See http://bugs.python.org/issue10557
While investigating this issue I found that by the time the string
gets to the number parser (_Py_dg_strtod), all non-ASCII characters
have been dropped by PyUnicode_EncodeDecimal(), so the parser cannot
produce a meaningful diagnostic.
Of course, PyUnicode_EncodeDecimal() can be fixed by making it pass
non-ASCII chars through as UTF-8 bytes, but I was wondering whether
preserving the ability to parse exotic numerals is worth the effort.
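
The pass-through idea can be sketched in pure Python. This is only an
illustration of the approach, not the C implementation: the names
to_ascii_decimal() and parse_float() are hypothetical, and the sketch
keeps unrecognized characters as they are instead of encoding them as
UTF-8 bytes.

```python
import unicodedata

def to_ascii_decimal(text):
    """Map Unicode decimal digits to ASCII 0-9; keep every other character."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if ch is not a decimal digit
        out.append(str(value) if value is not None else ch)
    return "".join(out)

def parse_float(text):
    """Parse text as a float, naming the original input if parsing fails."""
    try:
        return float(to_ascii_decimal(text))
    except ValueError:
        # Nothing was dropped, so the message can show the offending input.
        raise ValueError("invalid literal for float(): %r" % text) from None

print(parse_float("\u0661\u0662.\u0665"))  # Arabic-Indic "12.5"

try:
    parse_float("1234\xa1")
except ValueError as exc:
    print(exc)
```

Because the offending character survives the digit translation, the
error raised for '1234\xa1' can report the whole input.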