Devanagari int literals [was Re: Should non-security 2.7 bugs be fixed?]

Chris Angelico rosuav at gmail.com
Sun Jul 19 23:16:55 CEST 2015


On Mon, Jul 20, 2015 at 5:55 AM, Tim Chase
<python.list at tim.thechases.com> wrote:
> On 2015-07-20 04:07, Chris Angelico wrote:
>> The int() and float() functions accept, if I'm not mistaken,
>> anything with Unicode category "Nd" (Number, decimal digit). In
>> your examples, the fraction (U+215B) is No, and the Roman numerals
>> (U+2168, U+2182) are Nl, so they're not supported. Adding support
>> for these forms might be accepted as a feature request, but it's
>> not a bug.
>
> Ah, that makes sense.  Some simple testing (thanks, unicodedata
> module) supports your conjecture.
>
> It's not a particularly big deal so not really worth the brain-cycles
> to add support for them.  Just upon hearing "Python's int() does
> smart things with Unicode characters", those were some of my first
> characters to try.  The failure struck me as odd until you explained
> the simple difference.
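
For anyone who wants to repeat that test, here is a minimal sketch
using only the standard library. The describe() helper and the sample
list are my own illustration, not anything from the thread, and the
results depend on the Unicode database shipped with your Python 3.

```python
import unicodedata

# Sample characters: ASCII digit seven, Devanagari digit seven (U+096D),
# vulgar fraction one eighth (U+215B), Roman numeral nine (U+2168).
samples = ["7", "\u096d", "\u215b", "\u2168"]

def describe(ch):
    """Return (general category, int(ch) or None) for one character."""
    try:
        value = int(ch)
    except ValueError:
        # int() rejects characters that are not decimal digits.
        value = None
    return unicodedata.category(ch), value

for ch in samples:
    print("U+%04X" % ord(ch), describe(ch))
```

Only the two Nd characters should come back with an integer; the
fraction reports No and the Roman numeral reports Nl, both with None.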

The other part of the problem is: What should float("2⅛3") be? Should
it be equal to 21.0/83.0? Should the first part be parsed as a classic
mixed number (2 + 1/8), and then what should the 3 mean? While it's
easy to see what an individual character should represent (just check
unicodedata.numeric(ch) - for ⅛ it's 0.125), the true meaning of a
string of such characters is less than clear. Similarly, Roman
numerals aren't meant to be used after the decimal point, so "Ⅸ.Ⅴ"
does not normally mean nine and a half... not to mention the confusing
situation that "ⅠⅤ" would naively parse as 15 but "Ⅳ" is definitely 4.
Since these kinds of complexities exist, it's safest to reserve this
level of parsing for a special-purpose function. If someone can come
up with a really strong argument for the float() and int()
constructors interpreting these, I'd expect to see it deployed as a
third-party module first, so that people can be told "see, float()
handles all these, but if you want those as well, you should use
Float() instead". (Incidentally, I fully expect to see, some day, the
semantics of pytz's localize() method brought into the standard
library datetime.datetime class, for precisely this reason.)
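
To make the per-character point concrete, here is a small sketch. The
per_char_values() helper is a name I made up for illustration; it just
maps unicodedata.numeric() over a string, and it deliberately makes no
attempt to decide what the string as a whole should mean.

```python
import unicodedata

def per_char_values(text):
    """Return unicodedata.numeric() per character, None where undefined."""
    return [unicodedata.numeric(ch, None) for ch in text]

# "2", vulgar fraction one eighth, "3"
print(per_char_values("2\u215b3"))       # [2.0, 0.125, 3.0]

# Roman numeral one followed by Roman numeral five, as two characters
print(per_char_values("\u2160\u2164"))   # [1.0, 5.0]

# The single-character Roman numeral four
print(per_char_values("\u2163"))         # [4.0]
```

Each character has a well-defined value, but nothing here says whether
[1.0, 5.0] ought to combine to 15, 6 or 4. That combining step is the
part that belongs in a special-purpose function.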

Unicode is awesome, but it's not a panacea :)

ChrisA

