[Python-Dev] Python and the Unicode Character Database

Mark Dickinson dickinsm at gmail.com
Thu Dec 2 22:57:45 CET 2010


On Thu, Dec 2, 2010 at 8:23 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> In the case of number parsing, I think Python would be better if
> float() rejected non-ASCII strings, and any support for such parsing
> should be redone correctly in a different place (preferably along with
> printing of numbers).

+1.  The set of strings currently accepted by the float constructor
just seems too ad hoc to be at all useful.  Apart from the decimal
separator issue, and the question of exactly which decimal digits are
accepted and which aren't, there are issues like this one:

>>> x = '\uff11\uff25\uff0b\uff11\uff10'
>>> x
'１Ｅ＋１０'
>>> float(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\uff25' in
position 1: invalid decimal Unicode string
>>> y = '\uff11E+\uff11\uff10'
>>> y
'１E+１０'
>>> float(y)
10000000000.0

That is, fullwidth *digits* are allowed, but none of the other
characters can be fullwidth variants.  Unfortunately, a float string
doesn't consist solely of digits, and it seems to me to make little
sense to allow variation in the digits without allowing corresponding
variations in the other characters that might appear ('.', 'e', 'E',
'+', '-').
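
To make the asymmetry easy to reproduce, here is a minimal sketch. It
assumes a recent CPython 3, where the failure surfaces as a plain
ValueError rather than the UnicodeEncodeError shown above. The helper
name try_float is hypothetical, and the NFKC folding is only one
possible workaround, not something float() does itself.

```python
import unicodedata

# Fullwidth "1E+10": digits, exponent marker and sign are all fullwidth.
x = '\uff11\uff25\uff0b\uff11\uff10'
# The same number, but only the digits are fullwidth.
y = '\uff11E+\uff11\uff10'


def try_float(s):
    """Return float(s), or None if the constructor rejects s."""
    try:
        return float(s)
    except ValueError:  # UnicodeEncodeError is a subclass of ValueError
        return None


# Fullwidth digits alone are translated by the constructor.
print(try_float(y))
# A fullwidth 'E' or '+' is not translated, so this is rejected.
print(try_float(x))
# Folding compatibility characters to ASCII first makes x parse.
print(float(unicodedata.normalize('NFKC', x)))
```

The third call succeeds only because the caller did the folding
explicitly, which is roughly the kind of separation being argued for.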

A couple of slightly trickier decisions: (1) the float constructor
currently does accept leading and trailing whitespace;  should it
allow any Unicode whitespace characters here? I'd say yes. (2) For
int() rather than float(), there's a bit more value in allowing the
variant digits, since it provides an easy way to interpret those
digits.  The decimal module currently makes use of this, for example
(the decimal spec requires that non-European digits be accepted).  I'd
be happier if this functionality were moved elsewhere, though.  The
int constructor is, if anything, currently worse off than float,
thanks to its attempts to support non-decimal bases.
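
As a rough illustration of what "moved elsewhere" could look like,
the sketch below pairs an ASCII-only parser with a separate
digit-folding step. Both function names (ascii_float and
fold_decimal_digits) are hypothetical and not part of any proposal in
this thread; the code assumes Python 3.7 or later for str.isascii().

```python
import unicodedata


def ascii_float(s):
    """Hypothetical strict variant of float(): reject non-ASCII input."""
    if not s.isascii():
        raise ValueError('non-ASCII float literal: %r' % (s,))
    # Everything else (whitespace, inf/nan, signs) is left to float().
    return float(s)


def fold_decimal_digits(s):
    """Hypothetical helper: map Unicode decimal digits to ASCII digits.

    Characters without a decimal value are passed through unchanged.
    """
    out = []
    for ch in s:
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return ''.join(out)


# Arabic-Indic digits one, two, three.
print(fold_decimal_digits('\u0661\u0662\u0663'))
# Fullwidth digits with ASCII punctuation, folded and then parsed strictly.
print(ascii_float(fold_decimal_digits('\uff11.\uff15')))
```

With this split, callers who want variant digits opt in explicitly,
and the strict parser stays easy to specify.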

There's value in having an easy-to-specify, easy-to-maintain API for
these basic builtin functions.  For one thing, it helps non-CPython
implementations.

[MAL]
>> The Python 3.x docs apparently
>> introduced a reference to the language spec which is clearly not
>> capturing the wealth of possible inputs.

That documentation update was my fault;  I was motivated to make the
update by issues unrelated to this one (mostly to do with Python 3's
more consistent handling of inf and nan, as a result of all the new
float<->string conversion code).  If I'd been thinking harder, I would
have remembered that float accepted the non-European digits and added
a note to that effect.  This (unintentional) omission does underline
the point that it's difficult right now to document and understand
exactly what the float constructor does or doesn't accept.

Mark

