[Python-Dev] String module

Martin v. Loewis martin@v.loewis.de
31 May 2002 00:05:44 +0200


Guido van Rossum <guido@python.org> writes:

> Thanks!  But now we have a diverging set of isxxx methods for 8-bit
> strings and Unicode.  I really don't know what the equivalent of these
> (ispunct, iscntrl, isgraph, isprint) is in Unicode -- maybe MAL or MvL
> know?  

I don't think there is an "official" mapping between these categories
and Unicode character categories. I believe an "intuitive"
relationship would be:

ispunct: Punctuation (Pc, Pd, Ps, Pe, Pi, Pf, Po)
iscntrl: Other, control (Cc); perhaps the other Other categories (Cf, Cs, Co, Cn)
isprint: Letters (L*), Marks (M*), Numbers (N*), Separators (Z*),
         perhaps informative categories (Symbol, Punctuation)
isgraph: everything isprint, except Separators
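
As a rough sketch only, the mapping above could be written against
unicodedata.category(). The function names simply mirror the C-style
predicates; they are not an existing API. Two readings are my own
assumptions: that Symbol and Punctuation do count for isprint (the
"perhaps" case above), and that iscntrl covers only Cc.

```python
import unicodedata


def ispunct(ch):
    # Punctuation: Pc, Pd, Ps, Pe, Pi, Pf, Po all start with "P".
    return unicodedata.category(ch).startswith("P")


def iscntrl(ch):
    # Assumption: only "Other, control" (Cc), not the other Other categories.
    return unicodedata.category(ch) == "Cc"


def isprint(ch):
    # Letters, Marks, Numbers, Separators, plus (assumed) Symbols and
    # Punctuation; the major class is the first letter of the category.
    return unicodedata.category(ch)[0] in "LMNZSP"


def isgraph(ch):
    # Everything isprint accepts, except Separators (Z*).
    return isprint(ch) and unicodedata.category(ch)[0] != "Z"
```

Note that under this reading "$" (category Sc) is printable but not
punctuation, which differs from what C's ispunct reports in the "C"
locale. That is one of the places where the mapping is only intuitive.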

Another approach is to use the classification found in other
libraries, such as Qt, Perl, or Win32 (GetStringTypeW).

Marcin Kowalczyk presented his intuition in

http://mail.nl.linux.org/linux-utf8/2000-09/msg00076.html

but some of his classification was challenged later on; I guess glibc
would be just another library to draw classifications from.

> Unicode also has a wider definition of digits; do we want to
> extend isxdigit() for that?  (Probably not, but I'm not sure.)

Certainly not. We have to remember the common use for these, which is
in computer-related contexts. There, a hex digit is one of 0..9, a..f,
or A..F.
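
To illustrate, here is a minimal sketch of an ASCII-only check. The
isxdigit helper is hypothetical (it is not a string method), and
rejecting the empty string is my own choice:

```python
import string


def isxdigit(s):
    # string.hexdigits is "0123456789abcdefABCDEF", so only the ASCII
    # hex digits pass; the empty string is rejected (an assumption).
    return len(s) > 0 and all(ch in string.hexdigits for ch in s)
```

With this, isxdigit("1aF") is true, while a string such as
"\u0661" (ARABIC-INDIC DIGIT ONE) is rejected even though its
isdigit() method returns true.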

Regards,
Martin