[issue6632] Include more fullwidth chars in the decimal codec

Mon Aug 3 18:34:45 CEST 2009

New submission from Ezio Melotti <ezio.melotti at gmail.com>:

The decimal codec only handles characters in the Nd (Number, decimal)
Unicode category and whitespaces [a]. It is used by int(), float(),
complex() and indirectly by Decimal(), Fraction() and possibly others.
This works well only for plain digits (e.g. int(u'１２３')) but it
doesn't work for all the other characters used to represent numbers, like:
1. plus or minus sign, e.g. int(u'＋１２３') or int(u'－１２３')
2. decimal point, e.g. float(u'１．２３')
   2.1 some languages/alphabets use other chars (e.g. a comma or other
       symbols) instead of the decimal point.
3. exponential notation, e.g. float(u'１ｅ５')
4. the 'j' in complex numbers, e.g. complex(u'３ｊ')
5. the 'x' and 'p' in hexadecimal floats, e.g.
float.fromhex(u'０ｘ１．７ｐ３')
   5.1 hex floats also uses hexadecimal digits, see 6.3
6. digits > 9 for numbers with a base > 10, e.g. int(u'７Ｆ', 16)
    6.1 not all the alphabets have the equivalent of the letters a-z
    6.2 afaik there are no standards that specify how to deal with
        digits >9
    6.3 in the Unicode FAQ [b] there's a link to a table [c] that says
        "Code points not listed in this file are not hexadecimal
        digits." This is not a standard though, and even if in the
        UCD [d] there's a file [e] where the numbers with the Hex_Digit
        property are defined, it doesn't say that *only* these numbers
        are valid hex digits. Also it doesn't say anything about
        different bases.
        Python currently accepts int(u'１０', 16), int(u'७', 16)
        (U+096D - DEVANAGARI DIGIT SEVEN) and even int(u'７F', 16)
        (with a normal F it works, with a fullwidth Ｆ it fails).
    6.4 UTS #18 [f] includes in the property 'xdigit' [g] (hexadecimal
        digit) all the chars defined in [c] and also all the chars with
        a Nd category. This also is not a standard, and it doesn't
        give indications about the valid hex digits and how int()
        should behave.
    6.5 if possible re and int() should agree. Any string that matches
        /^[[:xdigit:]]+$/ should work fine with int(s, 16) and vice 
        versa. See also #6561 [h] and #2636 [i].
7. possibly others

For all the chars listed in the points 1-5 there's no way, AFAIK, to
know their equivalents in other alphabets (if they exist at all) and
since (apparently) there's no standard that specifies how to handle
them, they should be kept out.
This will also avoid a number of problems, e.g. 2.1.

The fullwidth forms are an exception though: they seem to be the only
set of characters with a direct equivalent for all these chars, and they
are also the only non-ascii chars included in the list of chars with the
Unicode Hex_Digits property.

Including all the necessary chars from this range in the decimal codec
seems to me the best thing to do. The chars listed in the points 1-5
should all be implemented and they should work everywhere. The regex
used by Decimal/Fraction should be updated as well, since the decimal
codec is not accessible from Python (maybe it should be accessible, but
this is another issue).

Point 6 is a slightly different issue, even if it can be partially
solved if the fullwidth forms will be included. One of the possible
options is to limit the valid chars used by int() with bases > 10 only
to the characters listed in [c], but this won't be backward-compatible
with existing code and forward-compatible with [[:xdigit:]].
OTOH if we keep the current behavior it will be possible to express the
digits from 0 to 9 using several alphabets, but all the digits > 9 will
be limited to [a-fA-F] (and possibly [ａ－ｆＡ－Ｆ]).
For example, '7F' in the devanagari alphabet will result in a mix of
devanagari numbers and ascii letters, i.e. int(u'७F', 16) (this already
works in Python).

[a]:
http://svn.python.org/view/python/trunk/Objects/unicodeobject.c?view=markup
under 'Decimal Encoder'
[b]: http://unicode.org/faq/casemap_charprop.html#13
[c]: http://unicode.org/faq/hex-digit-values.txt - [0-9a-fA-
F０－９ａ－ｆＡ－Ｆ]
[d]: http://unicode.org/Public/UNIDATA/UCD.html#UCD_Files - PropList.txt
section
[e]: http://unicode.org/Public/UNIDATA/PropList.txt
[f]: http://unicode.org/reports/tr18/ - UTS #18: Unicode Regular Expressions
[g]: http://unicode.org/reports/tr18/#Compatibility_Properties - xdigit row
[h]: http://bugs.python.org/issue6561#msg90878 point (1) about int() and re
[i]: http://bugs.python.org/issue2636#msg65513 point 8) will introduce
[[:xdigit:]]

(Thanks to Mark Dickinson and Adam Olsen for pointing out some of these
issues.)

----------
components: Interpreter Core, Unicode
messages: 91225
nosy: ezio.melotti
priority: normal
severity: normal
status: open
title: Include more fullwidth chars in the decimal codec
type: feature request
versions: Python 2.7, Python 3.2

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6632>
_______________________________________