[Tutor] three numbers for one

Sat Jun 8 07:49:03 CEST 2013

On Fri, Jun 7, 2013 at 11:11 PM, Jim Mooney <cybervigilante at gmail.com> wrote:
> I'm puzzling out the difference between isdigit, isdecimal, and
> isnumeric. But at this point, for simple practice programs, which is
> the best to use for plain old 0123456789 , without special characters?

The isnumeric, isdigit, and isdecimal predicates use Unicode character
properties that are defined in UnicodeData.txt:

http://www.unicode.org/Public/6.1.0/ucd

The most restrictive of the 3 is isdecimal. If s.isdecimal() is true,
you can convert s with int(), even if the string mixes scripts:

    >>> import unicodedata
    >>> unicodedata.name('\u06f0')
    'EXTENDED ARABIC-INDIC DIGIT ZERO'
    >>> unicodedata.decimal('\u06f0')
    0
    >>> '1234\u06f0'.isdecimal()
    True
    >>> int('1234\u06f0')
    12340
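
For the "plain old 0-9" case in the question, here is a minimal sketch
of two checks. The helper names parse_decimal and is_ascii_digits are
made up for illustration; they are not part of the standard library.
The first accepts any Unicode decimal digits, the second only ASCII:

```python
import string

def parse_decimal(text):
    """Return int(text) if every character is a decimal digit, else None."""
    # str.isdecimal() is False for the empty string, so '' also gives None.
    return int(text) if text.isdecimal() else None

def is_ascii_digits(text):
    """True only for non-empty strings made of the ASCII characters 0-9."""
    return bool(text) and all(ch in string.digits for ch in text)

print(parse_decimal('1234\u06f0'))    # 12340
print(parse_decimal('\xb2'))          # None
print(is_ascii_digits('1234\u06f0'))  # False
print(is_ascii_digits('0123456789'))  # True
```

If mixed scripts are acceptable, isdecimal() alone is enough; the
ASCII-only check is for when you want to reject everything else.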

The relevant fields in the database are described in Unicode Standard Annex #44:

http://www.unicode.org/reports/tr44/tr44-6.html#UnicodeData.txt

    (6) If the character has the property value Numeric_Type=Decimal,
    then the Numeric_Value of that digit is represented with an
    integer value (limited to the range 0..9) in fields 6, 7, and 8.
    Characters with the property value Numeric_Type=Decimal are
    restricted to digits which can be used in a decimal radix
    positional numeral system and which are encoded in the standard
    in a contiguous ascending range 0..9. See the discussion of
    decimal digits in Chapter 4, Character Properties in [Unicode].

    (7) If the character has the property value Numeric_Type=Digit,
    then the Numeric_Value of that digit is represented with an
    integer value (limited to the range 0..9) in fields 7 and 8,
    and field 6 is null. This covers digits that need special
    handling, such as the compatibility superscript digits.

    (8) If the character has the property value Numeric_Type=Numeric,
    then the Numeric_Value of that character is represented with a
    positive or negative integer or rational number in this field,
    and fields 6 and 7 are null. This includes fractions such as,
    for example, "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.

    Some characters have these properties based on values from the
    Unihan data files. See Numeric_Type, Han.

Here are the records for ASCII 0-9:

    0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
    0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
    0032;DIGIT TWO;Nd;0;EN;;2;2;2;N;;;;;
    0033;DIGIT THREE;Nd;0;EN;;3;3;3;N;;;;;
    0034;DIGIT FOUR;Nd;0;EN;;4;4;4;N;;;;;
    0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
    0036;DIGIT SIX;Nd;0;EN;;6;6;6;N;;;;;
    0037;DIGIT SEVEN;Nd;0;EN;;7;7;7;N;;;;;
    0038;DIGIT EIGHT;Nd;0;EN;;8;8;8;N;;;;;
    0039;DIGIT NINE;Nd;0;EN;;9;9;9;N;;;;;

Notice the decimal value is repeated for fields 6-8. The category is
'Nd' (decimal number).
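
As a rough sketch of how to read one of these records yourself, you can
split the line on semicolons; the fields are then numbered from 0, as
in the annex. The variable names here are just for illustration:

```python
import unicodedata

record = '0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;'
fields = record.split(';')

# Field 0 is the code point in hex, field 1 the name, field 2 the category.
ch = chr(int(fields[0], 16))
print(fields[1], fields[2])   # DIGIT ZERO Nd

# Fields 6, 7 and 8 hold the decimal, digit and numeric values.
print(fields[6:9])            # ['0', '0', '0']

# The unicodedata module reports the same name and category.
print(unicodedata.name(ch) == fields[1])      # True
print(unicodedata.category(ch) == fields[2])  # True
```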

Here's the record for superscript two (U+00B2):

    00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;
        SUPERSCRIPT DIGIT TWO;;;;

Notice in this case that field 6 is null (empty), so this is not a
decimal number. The category is 'No' (other number). int('\xb2')
raises a ValueError, but you can use unicodedata.digit() to get the
value:

    >>> '\xb2'
    '²'
    >>> unicodedata.digit('\xb2')
    2

unicodedata.numeric() returns the value as a float:

    >>> unicodedata.numeric('\xb2')
    2.0
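
A small sketch of falling back from int() to unicodedata.digit() for a
single character. The function name digit_value is hypothetical, and
returning None for non-digits is just one possible choice:

```python
import unicodedata

def digit_value(ch):
    """Return the digit value of one character, or None if it has none."""
    try:
        # Works for Numeric_Type=Decimal characters such as '7'.
        return int(ch)
    except ValueError:
        # Covers Numeric_Type=Digit characters such as superscript two;
        # the second argument is the default returned for non-digits.
        return unicodedata.digit(ch, None)

print(digit_value('7'))       # 7
print(digit_value('\xb2'))    # 2
print(digit_value('x'))       # None
```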

Finally, here's the record for the character "1/5" (U+2155):

    2155;VULGAR FRACTION ONE FIFTH;No;0;ON;
        <fraction> 0031 2044 0035;;;1/5;N;
        FRACTION ONE FIFTH;;;;

In this case both field 6 and field 7 are null. The category is 'No',
which is the same as superscript two, but this character is *not*
flagged as a digit. That's why the predicate functions don't use the
General_Category, but instead use the more specific information
provided by the Numeric_Type.
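
To tie the three records to the three predicates, here is a sketch
that guesses the Numeric_Type from which of fields 6-8 are filled in.
The numeric_type function is my own illustration, not something the
unicodedata module provides, and it only looks at UnicodeData.txt
records (it knows nothing about the Unihan-derived values):

```python
RECORDS = [
    '0032;DIGIT TWO;Nd;0;EN;;2;2;2;N;;;;;',
    '00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;'
    'SUPERSCRIPT DIGIT TWO;;;;',
    '2155;VULGAR FRACTION ONE FIFTH;No;0;ON;'
    '<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;',
]

def numeric_type(record):
    """Classify a record by the first non-empty field among 6, 7 and 8."""
    decimal, digit, numeric = record.split(';')[6:9]
    if decimal:
        return 'Decimal'
    if digit:
        return 'Digit'
    if numeric:
        return 'Numeric'
    return None

for record in RECORDS:
    ch = chr(int(record.split(';')[0], 16))
    print(numeric_type(record), ch.isdecimal(), ch.isdigit(), ch.isnumeric())

# Decimal True True True
# Digit False True True
# Numeric False False True
```

Each predicate accepts everything the more restrictive ones accept,
which is why the True values form a staircase.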

Recall that unicodedata.numeric() outputs a float:

    >>> '\u2155'
    '⅕'
    >>> unicodedata.numeric('\u2155')
    0.2
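
If you would rather have the rational value back instead of a float,
one possible approach is to round-trip through fractions.Fraction. The
helper name and the denominator cap of 1000 are assumptions of mine;
the cap just needs to be at least as large as the denominators you
expect to see:

```python
import unicodedata
from fractions import Fraction

def numeric_fraction(ch, max_denominator=1000):
    """Approximate unicodedata.numeric(ch) by a small-denominator Fraction."""
    value = unicodedata.numeric(ch)
    # Fraction(float) is exact in binary, so snap it to the nearest
    # fraction whose denominator does not exceed the cap.
    return Fraction(value).limit_denominator(max_denominator)

print(numeric_fraction('\u2155'))   # 1/5
print(numeric_fraction('\xb2'))     # 2
```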

====

The following are just some random observations that you can feel free
to ignore:

In the Unicode 6.1 data, the award for the biggest numeric value of all
goes to CJK ideograph U+5146:

    >>> '\u5146'
    '兆'
    >>> unicodedata.numeric('兆')
    1000000000000.0

In the Unicode 6.1 data, the following Bengali/Oriya/North Indic
characters tie for the smallest nonzero magnitude (1/16):

    >>> '\u09f4', '\u0b75', '\ua833'
    ('৴', '୵', '꠳')
    >>> unicodedata.name('৴')
    'BENGALI CURRENCY NUMERATOR ONE'
    >>> unicodedata.name('୵')
    'ORIYA FRACTION ONE SIXTEENTH'
    >>> unicodedata.name('\ua833')
    'NORTH INDIC FRACTION ONE SIXTEENTH'

    >>> unicodedata.numeric('৴')
    0.0625

Tibet wins an award for having the only character with a negative value:

    >>> '\u0f33'
    '༳'
    >>> unicodedata.name('༳')
    'TIBETAN DIGIT HALF ZERO'
    >>> unicodedata.numeric('༳')
    -0.5
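
If you want to check these observations against your own Python, here
is a sketch that scans every code point. The winners depend on the
Unicode version your interpreter was built with, so the output may not
match what is shown above:

```python
import sys
import unicodedata

# Map each code point that has a numeric value to that value.
values = {}
for cp in range(sys.maxunicode + 1):
    value = unicodedata.numeric(chr(cp), None)
    if value is not None:
        values[cp] = value

largest = max(values.values())
smallest = min(abs(v) for v in values.values() if v != 0)
negatives = sorted(cp for cp, v in values.items() if v < 0)

def show(label, code_points):
    """Print the code point, value and name for each entry."""
    for cp in code_points:
        name = unicodedata.name(chr(cp), '<unnamed>')
        print('%s: U+%04X %r %s' % (label, cp, values[cp], name))

show('largest', [cp for cp, v in values.items() if v == largest])
show('smallest', [cp for cp, v in values.items() if abs(v) == smallest])
show('negative', negatives)
```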