[Python-Dev] Unicode character property methods

Guido van Rossum guido@python.org
Mon, 06 Mar 2000 18:12:33 -0500


[MAL]
> > > As you may have noticed, the Unicode objects provide
> > > new methods .islower(), .isupper() and .istitle(). Finn Bock
> > > mentioned that Java also provides .isdigit() and .isspace().
> > >
> > > Question: should Unicode also provide these character
> > > property methods: .isdigit(), .isnumeric(), .isdecimal()
> > > and .isspace() ? Plus maybe .digit(), .numeric() and
> > > .decimal() for the corresponding decoding ?

[Guido]
> > What would be the difference between isdigit, isnumeric, isdecimal?
> > I'd say don't do more than Java.  I don't understand what the
> > "corresponding decoding" refers to.  What would "3".decimal() return?

[MAL]
> These originate in the Unicode database; see
> 
> ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html
> 
> Here are the descriptions:
> 
> """
> 6  Decimal digit value (normative)
>       This is a numeric field. If the character has the decimal digit
>       property, as specified in Chapter 4 of the Unicode Standard, the
>       value of that digit is represented with an integer value in this
>       field.
>
> 7  Digit value (normative)
>       This is a numeric field. If the character represents a digit, not
>       necessarily a decimal digit, the value is here. This covers digits
>       which do not form decimal radix forms, such as the compatibility
>       superscript digits.
>
> 8  Numeric value (normative)
>       This is a numeric field. If the character has the numeric
>       property, as specified in Chapter 4 of the Unicode Standard, the
>       value of that character is represented with an integer or
>       rational number in this field. This includes fractions as, e.g.,
>       "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are
>       numerical values for compatibility characters such as circled
>       numbers.
> """
> 
> u"3".decimal() would return 3. u"\u2155".
> 
> Some more examples from the unicodedata module (which makes
> all fields of the database available in Python):
> 
> >>> unicodedata.decimal(u"3")
> 3
> >>> unicodedata.decimal(u"²")
> 2
> >>> unicodedata.digit(u"²")
> 2
> >>> unicodedata.numeric(u"²")
> 2.0
> >>> unicodedata.numeric(u"\u2155")
> 0.2
> >>> unicodedata.numeric(u'\u215b')
> 0.125

Hm, very Unicode-centric.  Probably best left out of the general
string methods.  isspace() seems useful, and an isdigit() that is only
true for ASCII '0' - '9' also makes sense.
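
To make the three properties concrete, here is a minimal sketch for a
current Python 3, where they are exposed as str.isdecimal(),
str.isdigit() and str.isnumeric().  The helper names classify() and
is_ascii_digit() are made up for illustration, and the results need not
match the 2000-era session quoted above.

```python
import unicodedata


def classify(ch):
    """Return (decimal, digit, numeric) values for one character.

    Each entry is None when the character lacks that property.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


def is_ascii_digit(ch):
    """Hypothetical ASCII-only test: true only for '0' through '9'."""
    return len(ch) == 1 and "0" <= ch <= "9"


# DIGIT THREE, SUPERSCRIPT TWO, VULGAR FRACTION ONE FIFTH
for ch in ("3", "\u00b2", "\u2155"):
    print(repr(ch), classify(ch),
          ch.isdecimal(), ch.isdigit(), ch.isnumeric(),
          is_ascii_digit(ch))
```

The ASCII-only helper is the narrow variant suggested here; the three
str methods follow the Unicode database fields instead.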

What about "123".isdigit()?  What does Java say?  Or do these only
apply to single chars there?  I think "123".isdigit() should be true
if "abc".islower() is true.
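
A small sketch of the whole-string reading, assuming a current
Python 3.  all_digits() is a hypothetical helper written only to spell
out the rule "non-empty and every character qualifies"; it is not how
the built-in is implemented.

```python
def all_digits(s):
    """Hypothetical whole-string test: non-empty and every char is a digit."""
    return len(s) > 0 and all(c.isdigit() for c in s)


for s in ("123", "12a", ""):
    # Compare the sketch with the built-in str.isdigit().
    print(repr(s), all_digits(s), s.isdigit())

# islower() ignores uncased characters but needs at least one cased one,
# so it is not a pure "every character qualifies" test.
print("abc".islower(), "abc1".islower(), "123".islower())
```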

> > > Similar APIs are already available through the unicodedata
> > > module, but could easily be moved to the Unicode object
> > > (they cause the builtin interpreter to grow a bit in size
> > > due to the new mapping tables).
> > >
> > > BTW, string.atoi et al. are currently not mapped to
> > > string methods... should they be ?
> > 
> > They are mapped to int() c.s.
> 
> Hmm, I just noticed that int() and friends don't like
> Unicode... shouldn't they use the "t" parser marker
> instead of requiring a string or tp_int compatible
> type ?

Good catch.  Go ahead.
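
For illustration only, this is how int() treats Unicode text on a
current Python 3; it says nothing about the "t" parser marker itself,
which is a C-level detail.  try_int() is a hypothetical helper.

```python
# ARABIC-INDIC DIGIT ONE, TWO, THREE (U+0661..U+0663), written as escapes.
arabic_indic = "\u0661\u0662\u0663"


def try_int(s):
    """Hypothetical helper: return int(s), or None if int() rejects it."""
    try:
        return int(s)
    except ValueError:
        return None


# ASCII digits, non-ASCII decimal digits, and SUPERSCRIPT TWO
# (a digit that is not a decimal digit).
print(try_int("123"), try_int(arabic_indic), try_int("\u00b2"))
```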

--Guido van Rossum (home page: http://www.python.org/~guido/)