Unicode character property methods
As you may have noticed, the Unicode objects provide new methods .islower(), .isupper() and .istitle(). Finn Bock mentioned that Java also provides .isdigit() and .isspace().
Question: should Unicode also provide these character property methods: .isdigit(), .isnumeric(), .isdecimal() and .isspace() ? Plus maybe .digit(), .numeric() and .decimal() for the corresponding decoding ?
Similar APIs are already available through the unicodedata module, but could easily be moved to the Unicode object (they cause the builtin interpreter to grow a bit in size due to the new mapping tables).
BTW, string.atoi et al. are currently not mapped to string methods... should they be ?
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
As you may have noticed, the Unicode objects provide new methods .islower(), .isupper() and .istitle(). Finn Bock mentioned that Java also provides .isdigit() and .isspace().
Question: should Unicode also provide these character property methods: .isdigit(), .isnumeric(), .isdecimal() and .isspace() ? Plus maybe .digit(), .numeric() and .decimal() for the corresponding decoding ?
What would be the difference between isdigit, isnumeric, isdecimal? I'd say don't do more than Java. I don't understand what the "corresponding decoding" refers to. What would "3".decimal() return?
Similar APIs are already available through the unicodedata module, but could easily be moved to the Unicode object (they cause the builtin interpreter to grow a bit in size due to the new mapping tables).
BTW, string.atoi et al. are currently not mapped to string methods... should they be ?
They are mapped to int() c.s.
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
As you may have noticed, the Unicode objects provide new methods .islower(), .isupper() and .istitle(). Finn Bock mentioned that Java also provides .isdigit() and .isspace().
Question: should Unicode also provide these character property methods: .isdigit(), .isnumeric(), .isdecimal() and .isspace() ? Plus maybe .digit(), .numeric() and .decimal() for the corresponding decoding ?
What would be the difference between isdigit, isnumeric, isdecimal? I'd say don't do more than Java. I don't understand what the "corresponding decoding" refers to. What would "3".decimal() return?
These originate in the Unicode database; see
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html
Here are the descriptions:
"""
6 Decimal digit value (normative): This is a numeric field. If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field.
7 Digit value (normative): This is a numeric field. If the character represents a digit, not necessarily a decimal digit, the value is here. This covers digits which do not form decimal radix forms, such as the compatibility superscript digits.
8 Numeric value (normative): This is a numeric field. If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with an integer or rational number in this field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values for compatibility characters such as circled numbers.
"""
u"3".decimal() would return 3. u"\u2155".
Some more examples from the unicodedata module (which makes all fields of the database available in Python):
>>> unicodedata.decimal(u"3")
3
>>> unicodedata.decimal(u"²")
2
>>> unicodedata.digit(u"²")
2
>>> unicodedata.numeric(u"²")
2.0
>>> unicodedata.numeric(u"\u2155")
0.2
>>> unicodedata.numeric(u'\u215b')
0.125
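For readers who want to try these lookups on a current interpreter, here is a minimal sketch, assuming Python 3 (where every str is Unicode). The helper name numeric_properties is hypothetical and is not something proposed in this thread. It relies on the optional default argument of the unicodedata functions, so a character that lacks a property yields None instead of raising ValueError. Results depend on the Unicode database version bundled with the interpreter, so they may not match the session quoted above for every character.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for one character.

    Hypothetical helper for illustration only. Each slot is None when the
    character does not carry that property in the bundled Unicode database.
    """
    return (
        unicodedata.decimal(ch, None),  # field 6: decimal digit value
        unicodedata.digit(ch, None),    # field 7: digit value
        unicodedata.numeric(ch, None),  # field 8: numeric value (a float)
    )


# DIGIT THREE, VULGAR FRACTION ONE FIFTH, VULGAR FRACTION ONE EIGHTH, a letter
for ch in ("3", "\u2155", "\u215b", "a"):
    print(repr(ch), numeric_properties(ch))
```

The fractions only carry a numeric value, which is why a separate .numeric() lookup is needed in addition to .decimal() and .digit().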
Similar APIs are already available through the unicodedata module, but could easily be moved to the Unicode object (they cause the builtin interpreter to grow a bit in size due to the new mapping tables).
BTW, string.atoi et al. are currently not mapped to string methods... should they be ?
They are mapped to int() c.s.
Hmm, I just noticed that int() et friends don't like Unicode... shouldn't they use the "t" parser marker instead of requiring a string or tp_int compatible type ?
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
[MAL]
As you may have noticed, the Unicode objects provide new methods .islower(), .isupper() and .istitle(). Finn Bock mentioned that Java also provides .isdigit() and .isspace().
Question: should Unicode also provide these character property methods: .isdigit(), .isnumeric(), .isdecimal() and .isspace() ? Plus maybe .digit(), .numeric() and .decimal() for the corresponding decoding ?
[Guido]
What would be the difference between isdigit, isnumeric, isdecimal? I'd say don't do more than Java. I don't understand what the "corresponding decoding" refers to. What would "3".decimal() return?
[MAL]
These originate in the Unicode database; see
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html
Here are the descriptions:
"""
6 Decimal digit value (normative): This is a numeric field. If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field.
7 Digit value (normative): This is a numeric field. If the character represents a digit, not necessarily a decimal digit, the value is here. This covers digits which do not form decimal radix forms, such as the compatibility superscript digits.
8 Numeric value (normative): This is a numeric field. If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with an integer or rational number in this field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values for compatibility characters such as circled numbers.
"""
u"3".decimal() would return 3. u"\u2155".
Some more examples from the unicodedata module (which makes all fields of the database available in Python):
>>> unicodedata.decimal(u"3")
3
>>> unicodedata.decimal(u"²")
2
>>> unicodedata.digit(u"²")
2
>>> unicodedata.numeric(u"²")
2.0
>>> unicodedata.numeric(u"\u2155")
0.2
>>> unicodedata.numeric(u'\u215b')
0.125
Hm, very Unicode centric. Probably best left out of the general string methods. Isspace() seems useful, and an isdigit() that is only true for ASCII '0' - '9' also makes sense. What about "123".isdigit()? What does Java say? Or do these only apply to single chars there? I think "123".isdigit() should be true if "abc".islower() is true.
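The ASCII-only variant suggested here can be sketched as follows. This is only an illustration under stated assumptions: the name ascii_isdigit is hypothetical, the code assumes Python 3, and rejecting the empty string is a choice made to mirror "".islower() being false, not something settled in this thread.

```python
def ascii_isdigit(s):
    """True if s is non-empty and every character is one of ASCII '0'-'9'.

    Hypothetical helper; the empty-string behaviour is an assumption
    chosen to match how "".islower() returns False.
    """
    return len(s) > 0 and all("0" <= c <= "9" for c in s)


# Whole-string semantics, analogous to "abc".islower()
print(ascii_isdigit("123"))     # every character is an ASCII digit
print(ascii_isdigit("12a"))     # contains a non-digit
print(ascii_isdigit("\u0663"))  # ARABIC-INDIC DIGIT THREE is not ASCII
```

Unlike this helper, the built-in str.isdigit() in Python 3 follows the Unicode database and also accepts non-ASCII digits such as U+0663.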
Similar APIs are already available through the unicodedata module, but could easily be moved to the Unicode object (they cause the builtin interpreter to grow a bit in size due to the new mapping tables).
BTW, string.atoi et al. are currently not mapped to string methods... should they be ?
They are mapped to int() c.s.
Hmm, I just noticed that int() et friends don't like Unicode... shouldn't they use the "t" parser marker instead of requiring a string or tp_int compatible type ?
Good catch. Go ahead.
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
[MAL about adding .isdecimal(), .isdigit() and .isnumeric()]
Some more examples from the unicodedata module (which makes all fields of the database available in Python):
>>> unicodedata.decimal(u"3")
3
>>> unicodedata.decimal(u"²")
2
>>> unicodedata.digit(u"²")
2
>>> unicodedata.numeric(u"²")
2.0
>>> unicodedata.numeric(u"\u2155")
0.2
>>> unicodedata.numeric(u'\u215b')
0.125
Hm, very Unicode centric. Probably best left out of the general string methods. Isspace() seems useful, and an isdigit() that is only true for ASCII '0' - '9' also makes sense.
Well, how about having all three on Unicode objects and only .isdigit() on string objects ?
What about "123".isdigit()? What does Java say? Or do these only apply to single chars there? I think "123".isdigit() should be true if "abc".islower() is true.
In the current uPython implementation u"123".isdigit() is true; same for the other two methods.
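To illustrate the whole-string behaviour described here, the following sketch applies all three predicates to multi-character strings. It assumes Python 3, where str offers .isdecimal(), .isdigit() and .isnumeric(); the helper name classify is hypothetical, and the implementation discussed in this thread may have differed in detail.

```python
def classify(s):
    """Return (isdecimal, isdigit, isnumeric) for the whole string s.

    Hypothetical helper; each predicate is true only if s is non-empty
    and every character in s has the corresponding property.
    """
    return (s.isdecimal(), s.isdigit(), s.isnumeric())


# ASCII digits, digits plus a vulgar fraction, Arabic-Indic digits, empty
for s in ("123", "12\u2155", "\u0663\u0664", ""):
    print(repr(s), classify(s))
```

A single character that only has a numeric value, such as U+2155, is enough to make .isdecimal() and .isdigit() false for the whole string while .isnumeric() stays true.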
Similar APIs are already available through the unicodedata module, but could easily be moved to the Unicode object (they cause the builtin interpreter to grow a bit in size due to the new mapping tables).
BTW, string.atoi et al. are currently not mapped to string methods... should they be ?
They are mapped to int() c.s.
Hmm, I just noticed that int() et friends don't like Unicode... shouldn't they use the "t" parser marker instead of requiring a string or tp_int compatible type ?
Good catch. Go ahead.
Done. float(), int() and long() now accept charbuf compatible objects as argument.
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
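As a present-day analogue of this change, the sketch below shows the numeric constructors accepting text and bytes. It assumes Python 3, which has no long() and where str is always Unicode; the helper name to_int is hypothetical, and the code says nothing about how the "t" parser marker change was implemented at the time.

```python
def to_int(value):
    """Convert a str or bytes object holding decimal digits to an int.

    Hypothetical wrapper; it simply delegates to the built-in int().
    """
    return int(value)


print(to_int("42"))            # plain text
print(to_int(b"42"))           # bytes are accepted as well
print(to_int("\u0664\u0662"))  # ARABIC-INDIC DIGIT FOUR, DIGIT TWO
print(float("3.5"))            # float() parses text the same way
```

The third call works because int() in Python 3 maps any character with a decimal digit value to its number, which is the property .decimal() exposes.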
participants (2)
- Guido van Rossum
- M.-A. Lemburg