[Tutor] three numbers for one
eryksun
eryksun at gmail.com
Sat Jun 8 07:49:03 CEST 2013
On Fri, Jun 7, 2013 at 11:11 PM, Jim Mooney <cybervigilante at gmail.com> wrote:
> I'm puzzling out the difference between isdigit, isdecimal, and
> isnumeric. But at this point, for simple practice programs, which is
> the best to use for plain old 0123456589 , without special characters?
The isnumeric, isdigit, and isdecimal predicates use Unicode character
properties that are defined in UnicodeData.txt:
http://www.unicode.org/Public/6.1.0/ucd
The most restrictive of the 3 is isdecimal. If a string isdecimal(),
you can convert it with int() -- even if you're mixing scripts:
>>> import unicodedata
>>> unicodedata.name('\u06f0')
'EXTENDED ARABIC-INDIC DIGIT ZERO'
>>> unicodedata.decimal('\u06f0')
0
>>> '1234\u06f0'.isdecimal()
True
>>> int('1234\u06f0')
12340
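For the original question (plain old 0-9 and nothing else), isdecimal() by itself is a bit too generous, as the example above shows. Here is a rough sketch of one way to narrow it down. The name is_plain_digits is made up for illustration, and it assumes Python 3.7 or later for str.isascii():

```python
def is_plain_digits(text):
    """Return True only if text is non-empty and every character is ASCII 0-9."""
    # isdecimal() alone also accepts decimal digits from other scripts,
    # so the isascii() check is what restricts the result to 0-9.
    return text.isascii() and text.isdecimal()

print(is_plain_digits('1234'))        # ASCII digits only
print(is_plain_digits('1234\u06f0'))  # contains EXTENDED ARABIC-INDIC DIGIT ZERO
print(is_plain_digits(''))            # isdecimal() is False for an empty string
```

On older versions without isascii(), checking each character against '0123456789' should do the same job.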
The relevant fields in the database are described in Unicode Standard Annex #44:
http://www.unicode.org/reports/tr44/tr44-6.html#UnicodeData.txt
    (6) If the character has the property value Numeric_Type=Decimal,
        then the Numeric_Value of that digit is represented with an
        integer value (limited to the range 0..9) in fields 6, 7, and 8.
        Characters with the property value Numeric_Type=Decimal are
        restricted to digits which can be used in a decimal radix
        positional numeral system and which are encoded in the standard
        in a contiguous ascending range 0..9. See the discussion of
        decimal digits in Chapter 4, Character Properties in [Unicode].
    (7) If the character has the property value Numeric_Type=Digit,
        then the Numeric_Value of that digit is represented with an
        integer value (limited to the range 0..9) in fields 7 and 8,
        and field 6 is null. This covers digits that need special
        handling, such as the compatibility superscript digits.
    (8) If the character has the property value Numeric_Type=Numeric,
        then the Numeric_Value of that character is represented with a
        positive or negative integer or rational number in this field,
        and fields 6 and 7 are null. This includes fractions such as,
        for example, "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.
        Some characters have these properties based on values from the
        Unihan data files. See Numeric_Type, Han.
Here are the records for ASCII 0-9:
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0032;DIGIT TWO;Nd;0;EN;;2;2;2;N;;;;;
0033;DIGIT THREE;Nd;0;EN;;3;3;3;N;;;;;
0034;DIGIT FOUR;Nd;0;EN;;4;4;4;N;;;;;
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
0036;DIGIT SIX;Nd;0;EN;;6;6;6;N;;;;;
0037;DIGIT SEVEN;Nd;0;EN;;7;7;7;N;;;;;
0038;DIGIT EIGHT;Nd;0;EN;;8;8;8;N;;;;;
0039;DIGIT NINE;Nd;0;EN;;9;9;9;N;;;;;
Notice the decimal value is repeated for fields 6-8. The category is
'Nd' (decimal number).
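If you want to poke at these records yourself, splitting on ';' gives the fields with the same zero-based numbering that the annex uses. This is only a small sketch: numeric_fields is a hypothetical helper, and the 'decimal'/'digit'/'numeric' keys are informal labels I picked for fields 6, 7, and 8.

```python
# The record for DIGIT ZERO, copied from the listing above.
record = '0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;'

def numeric_fields(line):
    """Split a UnicodeData.txt record and return fields 6, 7, and 8."""
    fields = line.split(';')
    # Field 0 is the code point, field 1 the name, field 2 the General_Category.
    return {'decimal': fields[6], 'digit': fields[7], 'numeric': fields[8]}

print(numeric_fields(record))
```

An empty string in the result means the field is null for that character.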
Here's the record for superscript two (U+00B2):
00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;
Notice in this case that field 6 is null (empty), so this is not a
decimal number. The category is 'No' (other number). int('\xb2')
raises a ValueError, but you can use unicodedata.digit() to get the
value:
>>> '\xb2'
'²'
>>> unicodedata.digit('\xb2')
2
unicodedata.numeric() returns the value as a float:
>>> unicodedata.numeric('\xb2')
2.0
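Since int() won't take these characters, here is a sketch of converting a run of them by hand with unicodedata.digit(). The helper name digits_to_int is invented for this example, and reading the characters as positional base-10 digits is my assumption, not something the Unicode data promises:

```python
import unicodedata

def digits_to_int(text):
    """Combine the digit value of each character into one base-10 integer."""
    value = 0
    for ch in text:
        # unicodedata.digit() raises ValueError if ch has no digit value.
        value = value * 10 + unicodedata.digit(ch)
    return value

print(digits_to_int('\xb2\xb3'))  # SUPERSCRIPT TWO, SUPERSCRIPT THREE
```

Note that this sketch returns 0 for an empty string, which may or may not be what you want.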
Finally, here's the record for the character "1/5" (U+2155):
2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;
In this case both field 6 and field 7 are null. The category is 'No',
which is the same as superscript two, but this character is *not*
flagged as a digit. That's why the predicate functions don't use the
General_Category, but instead use the more specific information
provided by the Numeric_Type.
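To see how the three predicates line up with the three records, a quick loop over one character of each kind is enough. This just calls the str methods on the characters already discussed:

```python
# One character for each Numeric_Type: Decimal, Digit, Numeric.
samples = ['0', '\xb2', '\u2155']

for ch in samples:
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(ch), ch.isdecimal(), ch.isdigit(), ch.isnumeric())
```

Each predicate accepts everything the more restrictive ones accept, so isdecimal() implies isdigit(), which implies isnumeric().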
Recall that unicodedata.numeric() outputs a float:
>>> '\u2155'
'⅕'
>>> unicodedata.numeric('\u2155')
0.2
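If the float bothers you, one possible workaround is to round-trip it through fractions.Fraction. This is only a sketch: exact_numeric is a made-up name, and limit_denominator() recovers the simplest nearby fraction rather than reading the "1/5" out of the database, so treat it as an approximation that happens to work for small denominators.

```python
import unicodedata
from fractions import Fraction

def exact_numeric(ch):
    """Return the numeric value of ch as a Fraction instead of a float."""
    # Fraction(float) is exact for the float, so limit_denominator()
    # is needed to get back to the simple rational it approximates.
    return Fraction(unicodedata.numeric(ch)).limit_denominator()

print(exact_numeric('\u2155'))  # VULGAR FRACTION ONE FIFTH
```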
====
The following are just some random observations that you can feel free
to ignore:
The award for the biggest numeric value of all goes to CJK ideograph U+5146:
>>> '\u5146'
'兆'
>>> unicodedata.numeric('兆')
1000000000000.0
The following Bengali/Oriya/North Indic characters tie for the
smallest magnitude (1/16):
>>> '\u09f4', '\u0b75', '\ua833'
('৴', '୵', '꠳')
>>> unicodedata.name('৴')
'BENGALI CURRENCY NUMERATOR ONE'
>>> unicodedata.name('୵')
'ORIYA FRACTION ONE SIXTEENTH'
>>> unicodedata.name('\ua833')
'NORTH INDIC FRACTION ONE SIXTEENTH'
>>> unicodedata.numeric('৴')
0.0625
Tibet wins an award for having the only character with a negative value:
>>> '\u0f33'
'༳'
>>> unicodedata.name('༳')
'TIBETAN DIGIT HALF ZERO'
>>> unicodedata.numeric('༳')
-0.5
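If you want to check observations like these yourself, a brute-force scan over every code point works. This is a sketch with a hypothetical helper name, numeric_extremes. The results depend on the Unicode version bundled with your Python, so they may not match what I got above:

```python
import sys
import unicodedata

def numeric_extremes():
    """Return (largest value, smallest nonzero magnitude, negative characters)."""
    values = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # The second argument is returned for characters with no numeric value.
        value = unicodedata.numeric(ch, None)
        if value is not None:
            values[ch] = value
    largest = max(values.values())
    smallest = min(abs(v) for v in values.values() if v != 0)
    negatives = [ch for ch, v in values.items() if v < 0]
    return largest, smallest, negatives

largest, smallest, negatives = numeric_extremes()
print(largest, smallest, [ascii(ch) for ch in negatives])
```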