[Python-Dev] Python and the Unicode Character Database

"Martin v. Löwis" martin at v.loewis.de
Thu Dec 2 21:23:41 CET 2010


>> Then these users should speak up and indicate their need, or somebody
>> should speak up and confirm that there are users who actually want
>> '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
>> system in which '١٢٣٤.٥٦e4' means 12345600.0.
> 
> I'm not sure what you're after here.

That the current float() constructor accepts tons of bogus character
strings as numbers, and that it should stop doing so.
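
To make the complaint concrete, here is a small sketch of the behaviour as
I understand it on a current CPython 3. The variable names are only
illustrative, and the digits are written as escapes so the code points are
unambiguous.

```python
# Arabic-Indic digits one to six (U+0661..U+0666) around an ASCII full stop,
# i.e. the string from the quoted example.
arabic_indic = "\u0661\u0662\u0663\u0664.\u0665\u0666"

# float() maps any Unicode decimal digit to its ASCII counterpart first.
value = float(arabic_indic)

# An ASCII exponent marker is accepted after non-ASCII digits.
with_exponent = float(arabic_indic + "e4")

# Digits from unrelated scripts can be mixed freely:
# ASCII "1", ARABIC-INDIC DIGIT TWO, THAI DIGIT THREE.
mixed_scripts = float("1\u0662\u0e53")

print(value, with_exponent, mixed_scripts)
```

None of these three strings corresponds to a notation that a real writing
system uses, yet all of them are accepted.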

> The decision to add this support was deliberate based on the desire
> to support as much of the nice features of Unicode in Python as
> we could. At least that was what was driving me at the time.

At the time, this may have been the right thing to do. With the
experience gained since, we should now decide to revert this particular
aspect.

> Some references you may want to read up on:
> 
> http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
> http://en.wikipedia.org/wiki/Vietnamese_numerals
> http://en.wikipedia.org/wiki/Korean_numerals
> http://en.wikipedia.org/wiki/Japanese_numerals

I don't question that people use non-ASCII characters to
denote numbers. I claim that the specific support in Python for that
has no connection to reality. I further claim that the use of non-ASCII
numbers is a local convention, and that if you provide a library to
parse numbers, users (of that library) will somehow have to specify
which notational conventions are reasonable for the input they have.
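
As a rough sketch of what "specifying the convention" could look like: the
function parse_number and its parameters zero and decimal_sep below are
hypothetical, not an existing or proposed API. The caller names the digit
zero of the script and the decimal separator, and anything outside that
convention is rejected.

```python
def parse_number(text, zero="0", decimal_sep="."):
    """Parse text under one explicit convention (hypothetical helper).

    zero is the digit zero of the expected script; the ten digits are
    assumed to be the ten consecutive code points starting there.
    """
    digits = {chr(ord(zero) + i): str(i) for i in range(10)}
    ascii_chars = []
    for ch in text:
        if ch in digits:
            ascii_chars.append(digits[ch])
        elif ch == decimal_sep:
            ascii_chars.append(".")
        elif ch in "+-":
            ascii_chars.append(ch)
        else:
            raise ValueError(f"{ch!r} is not part of the requested convention")
    # The string is pure ASCII by now; float() checks the overall shape.
    return float("".join(ascii_chars))


# Arabic-Indic digits with ARABIC DECIMAL SEPARATOR (U+066B).
arabic = parse_number("\u0661\u0662\u0663\u0664\u066b\u0665\u0666",
                      zero="\u0660", decimal_sep="\u066b")
print(arabic)
```

With such an interface, ASCII input is rejected when the caller asked for
Arabic-Indic input, and vice versa, so no mixture of scripts slips through.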

> Even MS Office supports them:
> 
> http://languages.siuc.edu/Chinese/Language_Settings.html

That's printing, though, not parsing.

Notice that Python does *not* currently support printing numbers in
other scripts - even though this may actually be more useful than
parsing.
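
For illustration only, printing in another script can be approximated today
by formatting in ASCII and translating the result. The helper
format_in_script and its parameters are hypothetical, and the sketch
deliberately ignores grouping, sign conventions and text direction, which a
real implementation would have to handle.

```python
def format_in_script(value, zero, decimal_sep=".", spec=".2f"):
    """Format value in ASCII, then map digits and separator (hypothetical helper)."""
    ascii_text = format(value, spec)
    # Map "0".."9" to the ten code points starting at zero.
    table = {ord("0") + i: chr(ord(zero) + i) for i in range(10)}
    table[ord(".")] = decimal_sep
    return ascii_text.translate(table)


# Arabic-Indic digits with ARABIC DECIMAL SEPARATOR (U+066B).
arabic_text = format_in_script(1234.56, zero="\u0660", decimal_sep="\u066b")

# Devanagari digits, integer formatting.
devanagari_text = format_in_script(42, zero="\u0966", spec="d")

print(arabic_text, devanagari_text)
```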

>>> Note that the support in float() (and the other numeric constructors)
>>> to work with Unicode code points was explicitly added when Unicode
>>> support was added to Python and has been available since Python 1.6.
>>
>> That doesn't necessarily make it useful. Alexander's complaint is that
>> it makes Python unstable (i.e. changing as the UCD changes).
> 
> If that were true, then all Unicode database (UCD) changes would make
> Python unstable.

That's indeed the case - they do (see the recent bug report on white
space processing). However, any change makes Python unstable (in the
sense that it can potentially break existing applications), and, in
many cases, the risk of breaking something is well worth it.

In the case of number parsing, I think Python would be better if
float() rejected non-ASCII strings, and any support for such parsing
should be redone correctly in a different place (preferably along with
printing of numbers).
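
Until then, code that wants the stricter behaviour can get it with a thin
wrapper. This is a minimal sketch, not a proposed patch: the name
ascii_float is made up, and str.isascii() requires Python 3.7 or later.

```python
def ascii_float(text):
    """Like float(), but reject any string containing non-ASCII characters."""
    if not text.isascii():
        raise ValueError(f"non-ASCII input not accepted: {text!r}")
    return float(text)


accepted = ascii_float("1234.56e4")
print(accepted)
```

The wrapper leaves everything else about float() unchanged, including
surrounding ASCII whitespace and special values such as "nan" and "inf".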

>> Most certainly it is: the documentation is either underspecified,
>> or deviates from the implementation (when taking the most plausible
>> interpretation). This is the very definition of "bug".
> 
> The implementation is not a bug and neither was this a bug in the
> 2.x series of the Python documentation.

Of course the 2.x documentation is wrong, in that it is severely
underspecified, and the most straightforward interpretation of the
specific wording gives an incorrect impression of the implementation.

> The Python 3.x docs apparently
> introduced a reference to the language spec which is clearly not
> capturing the wealth of possible inputs.

Right - but only because the 2.x documentation *already* suggested that
the supported syntax matches the literal syntax - as that's the most
natural thing to assume.

Regards,
Martin

