[docs] [issue26483] docs unclear on difference between str.isdigit() and str.isdecimal()

Julien report at bugs.python.org
Sat Mar 12 16:31:15 EST 2016


Julien added the comment:

To dig further: the DIGIT_MASK and DECIMAL_MASK used in `unicodeobject.c` are from `unicodectype.c`, and they match the values in `unicodetype_db.h`, which is generated by `Tools/unicode/makeunicodedata.py`. That script builds those masks this way:

    # decimal digit, integer digit
    decimal = 0
    if record[6]:
        flags |= DECIMAL_MASK
        decimal = int(record[6])
    digit = 0
    if record[7]:
        flags |= DIGIT_MASK
        digit = int(record[7])
    if record[8]:
        flags |= NUMERIC_MASK
        numeric.setdefault(record[8], []).append(char)
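
As a rough, standalone sketch of what that excerpt does for the three numeric fields: the bit values below are illustrative assumptions (the real constants live in `makeunicodedata.py`), and `make_record` and `number_flags` are hypothetical helpers that fill in only fields 6 to 8 of a record.

```python
# Illustrative bit values only; the real constants are defined in
# Tools/unicode/makeunicodedata.py and may differ.
DECIMAL_MASK = 0x02
DIGIT_MASK = 0x04
NUMERIC_MASK = 0x800


def make_record(decimal="", digit="", numeric=""):
    """Hypothetical helper: a 15-field record with only fields 6-8 set."""
    record = [""] * 15
    record[6], record[7], record[8] = decimal, digit, numeric
    return record


def number_flags(record):
    """Set one flag per non-empty numeric field, as in the excerpt above."""
    flags = 0
    if record[6]:
        flags |= DECIMAL_MASK
    if record[7]:
        flags |= DIGIT_MASK
    if record[8]:
        flags |= NUMERIC_MASK
    return flags


# U+0660 ARABIC-INDIC DIGIT ZERO: all three fields are filled in.
arabic_zero = number_flags(make_record("0", "0", "0"))
# U+00B2 SUPERSCRIPT TWO: digit and numeric values, but no decimal value.
superscript_two = number_flags(make_record(digit="2", numeric="2"))
# U+2155 VULGAR FRACTION ONE FIFTH: numeric value only.
one_fifth = number_flags(make_record(numeric="1/5"))
```

The point of the sketch is that the three flags nest: a character with a decimal value also gets the digit and numeric flags, but not the other way round.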

Those "record"s are documented in ftp://unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html in which fields 6, 7, and 8 are:

 - 6	Decimal digit value	N	This is a numeric field. If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field

 - 7	Digit value	N	This is a numeric field. If the character represents a digit, not necessarily a decimal digit, the value is here. This covers digits which do not form decimal radix forms, such as the compatibility superscript digits

 - 8	Numeric value	N	This is a numeric field. If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with an integer or rational number in this field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values for compatibility characters such as circled numbers.
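
The same three fields can be inspected from Python through the standard `unicodedata` module. This is only a small sketch; `field_values` is a hypothetical helper name, and the sample characters are chosen to match the ones mentioned in this message.

```python
import unicodedata


def field_values(ch):
    """Return (decimal, digit, numeric) for ch, with None for a missing field."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in ["5", "\u0660", "\u00b2", "\u2155"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), field_values(ch))
```

SUPERSCRIPT TWO has a digit value but no decimal value, and VULGAR FRACTION ONE FIFTH has only a numeric value, which is the distinction fields 6, 7, and 8 describe.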

These field descriptions are very close to the current documentation. Yet the "isdecimal" documentation is misleading when it says "This category includes digit characters".

Possible rewording:

isdecimal: Return true if all characters in the string are decimal characters and there is at least one character, false otherwise. Decimal characters are those that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO. Formally, a decimal character is a character in the Unicode General Category "Nd".

isdigit: Return true if all characters in the string are digits and there is at least one character, false otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which do not form decimal radix forms. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
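
A quick way to see the difference these two entries describe is to compare the methods on a few characters. This is a minimal sketch; `classify` and `parses_as_int` are hypothetical helper names.

```python
def classify(s):
    """Return (isdecimal, isdigit, isnumeric) for the string s."""
    return (s.isdecimal(), s.isdigit(), s.isnumeric())


def parses_as_int(s):
    """Report whether int() accepts s."""
    try:
        int(s)
    except ValueError:
        return False
    return True


# ASCII five, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO,
# VULGAR FRACTION ONE FIFTH
for s in ["5", "\u0663", "\u00b2", "\u2155"]:
    print(f"U+{ord(s):04X}", classify(s), parses_as_int(s))
```

ARABIC-INDIC DIGIT THREE passes all three tests and is accepted by `int()`. SUPERSCRIPT TWO passes `isdigit()` but not `isdecimal()`, and `int()` rejects it, which is consistent with decimal characters being the ones that form decimal-radix numbers.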

I don't think we can rework this further without also rewriting the documentation for isnumeric, which refers to the Unicode standard in the same way.

----------
nosy: +sizeof

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26483>
_______________________________________

