[Python-Dev] String module
Martin v. Loewis
martin@v.loewis.de
31 May 2002 00:05:44 +0200
Guido van Rossum <guido@python.org> writes:
> Thanks! But now we have a diverging set of isxxx methods for 8-bit
> strings and Unicode. I really don't know what the equivalent of these
> (ispunct, iscntrl, isgraph, isprint) is in Unicode -- maybe MAL or MvL
> know?
I don't think there is an "official" mapping between these categories
and Unicode character categories. I believe an "intuitive"
relationship would be:
  ispunct: Punctuation (Pc, Pd, Ps, Pe, Pi, Pf, Po)
  iscntrl: Other, control (Cc); perhaps the other Other categories (C*)
  isprint: Letters (L*), Marks (M*), Numbers (N*), Separators (Z*),
           perhaps also the informative categories (Symbol, Punctuation)
  isgraph: everything in isprint, except Separators
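
A minimal sketch of that mapping in Python, using the general category
reported by unicodedata.category(). The function names are hypothetical
helpers, not existing string methods, and the sketch takes both
"perhaps" cases one particular way: iscntrl is limited to Cc, and
isprint includes Symbol and Punctuation. Each helper expects a single
character.

```python
import unicodedata


def ispunct(ch):
    # Punctuation: Pc, Pd, Ps, Pe, Pi, Pf, Po
    return unicodedata.category(ch).startswith("P")


def iscntrl(ch):
    # Other, control (Cc) only; the remaining C* categories are left out
    # (assumption)
    return unicodedata.category(ch) == "Cc"


def isprint(ch):
    # Letters, Marks, Numbers, Separators, plus Symbol and Punctuation
    # (including S* and P* is an assumption)
    return unicodedata.category(ch)[0] in "LMNZSP"


def isgraph(ch):
    # Everything isprint accepts, except Separators (Z*)
    return isprint(ch) and not unicodedata.category(ch).startswith("Z")
```

Under this mapping a space is isprint but not isgraph, and "$" (category
Sc) is not ispunct, which differs from ispunct() in the C locale.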
Another approach is to use the classification found in other
libraries, such as Qt, Perl, or Win32 (GetStringTypeW).
Marcin Kowalczyk presented his intuition in
http://mail.nl.linux.org/linux-utf8/2000-09/msg00076.html
but some of his classification was challenged later on; I guess glibc
would be just another library to draw classifications from.
> Unicode also has a wider definition of digits; do we want to
> extend isxdigit() for that? (Probably not, but I'm not sure.)
Certainly not. We have to remember the common use for these, which is
in processing computer notations. There, a hex digit is one of 0..9,
a..f, or A..F.
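
A small sketch of that narrow definition, assuming a hypothetical
helper named isxdigit() built on string.hexdigits. It accepts only the
ASCII hex digits, so characters that Unicode classifies as digits
elsewhere are rejected.

```python
import string


def isxdigit(s):
    # True only for a non-empty string made up of 0..9, a..f, A..F
    return len(s) > 0 and all(c in string.hexdigits for c in s)
```

For example, isxdigit("deadBEEF") holds, while isxdigit("\u0663")
(ARABIC-INDIC DIGIT THREE) does not, even though "\u0663".isdigit() is
true.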
Regards,
Martin