[Python-Dev] Python and the Unicode Character Database

Mon Nov 29 03:32:15 CET 2010

On Sun, Nov 28, 2010 at 6:43 PM, Steven D'Aprano <steve at pearwood.info> wrote:
..
>> is more important than to assure users that once their program
>> accepted some text as a number, they can assume that the text is
>> ASCII.
>
> Seems like a pretty foolish assumption, if you ask me, pretty much akin to
> assuming that if string.isalpha() returns true that string is ASCII.
>

It is not a foolish assumption for the 99.9% of Python users whose
code is written for 2.x.  Their strings are byte strings, and
string.isdigit() does imply ASCII, even though string.isalpha() does
not in many locales.
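
To make the difference concrete, here is a small sketch written for
3.x, where bytes plays the role of the 2.x byte string.  The helper
is_ascii() and the variable names are made up for this illustration.

```python
def is_ascii(text):
    """Illustrative helper: True if every code point is below 128."""
    return all(ord(ch) < 128 for ch in text)

byte_digits = b"1234"
# ARABIC-INDIC DIGIT ONE through FOUR
unicode_digits = "\u0661\u0662\u0663\u0664"

# For byte strings, only the ASCII characters 0-9 count as digits.
bytes_ok = byte_digits.isdigit()
high_byte_ok = b"\xb2".isdigit()

# For text strings, any Unicode digit passes, so isdigit() says
# nothing about the string being ASCII.
str_ok = unicode_digits.isdigit()
str_is_ascii = is_ascii(unicode_digits)

# int() accepts the non-ASCII decimal digits as well.
value = int(unicode_digits)

print(bytes_ok, high_byte_ok, str_ok, str_is_ascii, value)
```

So code that checks isdigit() and then assumes ASCII is correct for
byte strings but not for text strings.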

..
> The fact that this is (apparently) only being raised now means that it isn't
> actually a problem in real life. I'd even say that it's a feature, and that
> if Python didn't support non-Arabic numerals, it should.
>

I raised this problem because I found a bug that is related to this
feature.  The bug is also a regression from 2.x.

In 2.7:

>>> float(u'1234\xa1')
..
ValueError: invalid literal for float(): 1234?

The last character is lost, but the error message is still meaningful.

In 3.x, however:

>>> float('1234\xa1')
..
ValueError

See http://bugs.python.org/issue10557

While investigating this issue I found that by the time the string
gets to the number parser (_Py_dg_strtod), all non-ASCII characters
have been dropped by PyUnicode_EncodeDecimal(), so the parser cannot
produce a meaningful diagnostic.

Of course, PyUnicode_EncodeDecimal() can be fixed by making it pass
non-ASCII characters through as UTF-8 bytes, but I was wondering
whether preserving the ability to parse exotic numerals is worth the
effort.
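
For illustration only, here is a rough Python-level model of that
pass-through idea.  It is not the C implementation: the names
to_ascii_decimal() and parse_float() are hypothetical, and the sketch
works on str rather than on UTF-8 bytes.  Decimal digits from any
script are mapped to ASCII, and every other character is kept, so the
error message can still show the offending input.

```python
import unicodedata

def to_ascii_decimal(text):
    """Hypothetical helper: map Unicode decimal digits to ASCII digits
    and pass every other character through unchanged."""
    out = []
    for ch in text:
        # unicodedata.decimal() returns the default for non-decimal chars.
        digit = unicodedata.decimal(ch, None)
        out.append(str(digit) if digit is not None else ch)
    return "".join(out)

def parse_float(text):
    """Hypothetical wrapper: parse text as a float, reporting the
    original string if it is not a valid literal."""
    normalized = to_ascii_decimal(text)
    try:
        return float(normalized)
    except ValueError:
        # The unparsable character survived normalization, so the
        # diagnostic can include it instead of silently dropping it.
        raise ValueError("invalid literal for float(): %r" % text) from None

print(parse_float("\u0661\u0662.\u0665"))  # Arabic-Indic digits 1, 2 and 5

try:
    parse_float("1234\xa1")
except ValueError as exc:
    print(exc)
```

The point of the sketch is only that digit translation and error
reporting do not have to be at odds, as long as the characters that
are not digits are kept rather than dropped.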