Re: [Python-Dev] String module

Maybe you can do a patch for isxxx() methods instead?
Will do.
I need to un-volunteer on this one.
The string module already has a bunch of isxxx methods and I'm not clear which others are needed or how to make them vary by locale.
Raymond Hettinger

I need to un-volunteer on this one.
The string module already has a bunch of isxxx methods and I'm not clear which others are needed or how to make them vary by locale.
OK. Nothing needs to be done to vary them by locale as long as you use the isxxx macros from <ctype.h>. I think all that's needed is to add the missing ones.
--Guido van Rossum (home page: http://www.python.org/~guido/)

From: "Guido van Rossum" guido@python.org
I need to un-volunteer on this one.
The string module already has a bunch of isxxx methods and I'm not clear which others are needed or how to make them vary by locale.
OK. Nothing needs to be done to vary them by locale as long as you use the isxxx macros from <ctype.h>. I think all that's needed is to add the missing ones.
It's not that hard once you know what to do.
See www.python.org/sf/562501
Raymond Hettinger

OK. Nothing needs to be done to vary them by locale as long as you use the isxxx macros from <ctype.h>. I think all that's needed is to add the missing ones.
It's not that hard once you know what to do.
See www.python.org/sf/562501
Thanks! But now we have a diverging set of isxxx methods for 8-bit strings and Unicode. I really don't know what the equivalent of these (ispunct, iscntrl, isgraph, isprint) is in Unicode -- maybe MAL or MvL know? Unicode also has a wider definition of digits; do we want to extend isxdigit() for that? (Probably not, but I'm not sure.)
Someone commented that isxdigit is a poor name. OTOH it's what C uses. I'm not sure what to say.
--Guido van Rossum (home page: http://www.python.org/~guido/)

From: "Guido van Rossum" guido@python.org
Thanks! But now we have a diverging set of isxxx methods for 8-bit strings and Unicode. I really don't know what the equivalent of these (ispunct, iscntrl, isgraph, isprint) is in Unicode -- maybe MAL or MvL know? Unicode also has a wider definition of digits; do we want to extend isxdigit() for that? (Probably not, but I'm not sure.)
I'll spend some time with the big Unicode 3.0 book this evening and chat with some Unicode techno-weenies. When I've got an answer will add the unicodeobject.c methods to the patch.
Someone commented that isxdigit is a poor name. OTOH it's what C uses. I'm not sure what to say.
I concur. I had to look it up on google to make sure it meant what I surmised it meant. ishexdigit() is more explicit. Besides, C naming conventions aren't exactly role models for clarity ;)
While we're at it:
isgraph() --> isvisible()
iscntrl() --> iscontrol()
isprint() --> isprintable()
I'm sure everyone will have an opinion or two.
Raymond Hettinger

On Thu, 30 May 2002, Raymond Hettinger wrote:
Someone commented that isxdigit is a poor name. OTOH it's what C uses. I'm not sure what to say.
Obviously, it refers to the ex-digits, once honoured and respected along with the other ten Arabic numerals but now long since neglected. This includes, for example, I, V, and X, the first nine Greek letters (supported if the argument is a Unicode string), and so on.
-- ?!ng

Someone commented that isxdigit is a poor name. OTOH it's what C uses. I'm not sure what to say.
I concur. I had to look it up on google to make sure it meant what I surmised it meant. ishexdigit() is more explicit. Besides, C naming conventions aren't exactly role models for clarity ;)
While we're at it:
isgraph() --> isvisible()
iscntrl() --> iscontrol()
isprint() --> isprintable()
Sure. But I still can't guess the relationships between these three (graph excludes space, print includes it).
And then maybe also ispunctuation()? Or is that too long?
My ctype.h man page also has these:
isblank -- space or tab (GNU extension)
isascii -- 7-bit value (BSD/SVID extension)
--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Someone commented that isxdigit is a poor name. OTOH it's what C uses. I'm not sure what to say.
I concur. I had to look it up on google to make sure it meant what I surmised it meant. ishexdigit() is more explicit. Besides, C naming conventions aren't exactly role models for clarity ;)
While we're at it:
isgraph() --> isvisible()
iscntrl() --> iscontrol()
isprint() --> isprintable()
Sure. But I still can't guess the relationships between these three (graph excludes space, print includes it).
And then maybe also ispunctuation()? Or is that too long?
My ctype.h man page also has these:
isblank -- space or tab (GNU extension)
isascii -- 7-bit value (BSD/SVID extension)
Do you really think this proliferation of issomething() methods is a good idea?
The same can be had using re.match() with *much* more flexibility since you're not stuck with an "intuitive" definition relying on the intuition of some non-standard body.
FWIW, I've never used a single one of these single character based classification APIs. The only non-trivial issomething() method I can think of is .istitle() because of its complicated definition. All others can easily be had with re.match().
I'm -0 on adding more of these .issomething() APIs to strings and Unicode.
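
A minimal sketch of the re.match() argument above, written for present-day Python 3 rather than taken from the thread. The helper names is_hex_digits() and is_punctuation() are hypothetical, and the character ranges assume the C-locale (ASCII) meaning of isxdigit and ispunct:

```python
import re

# Hypothetical helpers; patterns assume the ASCII / C-locale definitions.
_HEX = re.compile(r"[0-9A-Fa-f]+")
# ASCII punctuation: '!'..'/', ':'..'@', '['..'`', '{'..'~'
_PUNCT = re.compile(r"[!-/:-@\[-`{-~]+")

def is_hex_digits(s):
    """True if s is non-empty and contains only hexadecimal digits."""
    return _HEX.fullmatch(s) is not None

def is_punctuation(s):
    """True if s is non-empty and contains only ASCII punctuation."""
    return _PUNCT.fullmatch(s) is not None
```

The sketch uses fullmatch() rather than match() so that trailing characters are not silently accepted. Changing the definition of a class means editing one pattern, which is the flexibility being argued for.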

mal wrote:
Do you really think this proliferation of issomething() methods is a good idea?
The same can be had using re.match() with *much* more flexibility since you're not stuck with an "intuitive" definition relying on the intuition of some non-standard body.
FWIW, I've never used a single one of these single character based classification APIs. The only non-trivial issomething() method I can think of is .istitle() because of its complicated definition. All others can easily be had with re.match().
I fully agree.
the SRE engine already supports character classes based on ASCII, the current locale (via ctype.h), and the Unicode charset.
we should probably add more classes -- at least the full list of POSIX [:name:] classes, and probably also unicode categories.
fwiw, I've played with adding "charset" objects to SRE, which would allow you to plug in custom [:spam:] sets as well (e.g. xml name chars).
</F>

the SRE engine already supports character classes based on ASCII, the current locale (via ctype.h), and the Unicode charset.
How do I ask for a locale letter? The re module only defines escapes for "word" and "digit".
we should probably add more classes -- at least the full list of POSIX [:name:] classes, and probably also unicode categories.
+1
fwiw, I've played with adding "charset" objects to SRE, which would allow you to plug in custom [:spam:] sets as well (e.g. xml name chars).
+1
--Guido van Rossum (home page: http://www.python.org/~guido/)

Do you really think this proliferation of issomething() methods is a good idea?
To tell you the truth, I'm not sure. Maybe we should stick to supplying replacements for the character set variables in the string module; this would mean adding replacements for hexdigits, octdigits, punctuation, and printable. Also, the string module defines ascii_lowercase, ascii_uppercase, and ascii_letters; maybe an isascii() replacement would be the best approach there. (I've never seen those used, but they exist, so if we want to slowly start discouraging people from using the string module, I feel we're obliged to provide an alternative.)
The same can be had using re.match() with *much* more flexibility since you're not stuck with an "intuitive" definition relying on the intuition of some non-standard body.
I'm not sure what you're trying to say. The re module's definition of "word" characters (which includes '_') is definitely non-standard. How do you spell "letter according to locale" in a regexp?
FWIW, I've never used a single one of these single character based classification APIs. The only non-trivial issomething() method I can think of is .istitle() because of its complicated definition. All others can easily be had with re.match().
I'm -0 on adding more of these .issomething() APIs to strings and Unicode.
Noted. But what do you propose we should do in 3.0 about the character set variables in the string module?
--Guido van Rossum (home page: http://www.python.org/~guido/)
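
For readers unfamiliar with the character set variables mentioned here, a rough sketch of how they are typically used for classification. The helper all_in() is a hypothetical name, not an API proposed in the thread; the constants themselves exist in the string module:

```python
import string

def all_in(s, charset):
    """True if s is non-empty and every character of s occurs in charset."""
    return bool(s) and all(ch in charset for ch in s)

# The string-module constants act as plain membership tables.
print(all_in("c0ffee", string.hexdigits))
print(all_in("755", string.octdigits))
print(all_in("?!", string.punctuation))
print(all_in("plain text\n", string.printable))
```

Any method-based replacement would need to cover the same four sets (hexdigits, octdigits, punctuation, printable) to let such code drop its dependency on the module.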

[Raymond Hettinger]
While we're at it:
isgraph() --> isvisible()
iscntrl() --> iscontrol()
isprint() --> isprintable()
I'm sure everyone will have an opinion or two.
+1 on the (only slightly) longer versions of these names.
--- Patrick K. O'Brien Orbtech

On Thu, 30 May 2002, Patrick K. O'Brien wrote:
[Raymond Hettinger]
While we're at it:
isgraph() --> isvisible()
iscntrl() --> iscontrol()
isprint() --> isprintable()
I'm sure everyone will have an opinion or two.
+1 on the (only slightly) longer versions of these names.
Why not
isgraph() --> is_visible()
iscntrl() --> is_control()
isprint() --> is_printable()
so "is" is more... visible?
Patrick K. O'Brien Orbtech
Sincerely yours, Roman Suzi

Why not
isgraph() --> is_visible()
iscntrl() --> is_control()
isprint() --> is_printable()
so "is" is more... visible?
Please no -- we already have islower() etc. without the _. (And _ is a pain to type.)
--Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido van Rossum]
[...] (And _ is a pain to type.)
By the way, I wonder why the `_' in `sys._getframe()'. Was it to mark that this is a function giving access to internals? If yes, this only increases my wondering, since there are other functions in `sys' which give access to internals, and which do not use such a `_' prefix. I presume this has been debated in the proper time, I was probably elsewhere and missed the discussion. Could the reason be summarised in one sentence? :-)

By the way, I wonder why the `_' in `sys._getframe()'. Was it to mark that this is a function giving access to internals? If yes, this only increases my wondering, since there are other functions in `sys' which give access to internals, and which do not use such a `_' prefix. I presume this has been debated in the proper time, I was probably elsewhere and missed the discussion. Could the reason be summarised in one sentence? :-)
To discourage people from using it. This function enables a programming idiom where a function digs in its caller's namespace. I find that highly undesirable so I wanted to emphasize that this function does not exist for that purpose. (Neither does inspect.currentframe(). :-)
--Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido van Rossum]
By the way, I wonder why the `_' in `sys._getframe()'.
To discourage people from using it.
Why was it introduced then, since there was already a way without it? Execution speed, maybe? Writing speed is probably not a consideration, as the previous way was a bit discouraging already, in that respect :-).
This function enables a programming idiom where a function digs in its caller's namespace. I find that highly undesirable so I wanted to emphasize that this function does not exist for that purpose.
Does it exist for another purpose? Getting the frame, you will say! :-)

By the way, I wonder why the `_' in `sys._getframe()'.
To discourage people from using it.
Why was it introduced then, since there was already a way without it?
Because there are legitimate uses -- mostly in the area of introspection or debugging -- and the existing way (catching an exception) was clumsy.
--Guido van Rossum (home page: http://www.python.org/~guido/)

On Fri, 31 May 2002, Guido van Rossum wrote:
By the way, I wonder why the `_' in `sys._getframe()'.
To discourage people from using it.
Why was it introduced then, since there was already a way without it?
Because there are legitimate uses -- mostly in the area of introspection or debugging -- and the existing way (catching an exception) was clumsy.
Clumsy and broken -- the old trick destroys the current exception context, so it is not safe to use in places where that side-effect is undesirable (which is all of the places I actually wanted to use it).
-Kevin
-- Kevin Jacobs The OPAL Group - Enterprise Systems Architect Voice: (216) 986-0710 x 19 E-mail: jacobs@theopalgroup.com Fax: (216) 986-0714 WWW: http://www.theopalgroup.com

"FP" == François Pinard pinard@iro.umontreal.ca writes:
>> By the way, I wonder why the `_' in `sys._getframe()'.
>> To discourage people from using it.
FP> Why was it introduced then, since there was already a way
FP> without it? Execution speed, maybe?
Because
    try:
        1/0
    except ZeroDivisionError:
        return sys.exc_info()[2].tb_frame.f_back
is slow, hard to remember, and too magical. Also, sys._getframe() gives you a better interface for getting frames farther back than currentframe().
-Barry
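
To make the comparison concrete, here is a small sketch putting the two approaches side by side. The function names caller_frame_old(), caller_frame_new() and demo() are made up for illustration, and sys._getframe() is a CPython implementation detail that other interpreters may not provide:

```python
import sys

def caller_frame_old():
    """Exception trick: the traceback's frame is this function, f_back is the caller."""
    try:
        1 / 0
    except ZeroDivisionError:
        return sys.exc_info()[2].tb_frame.f_back

def caller_frame_new():
    """sys._getframe(1) returns the caller's frame directly (CPython-specific)."""
    return sys._getframe(1)

def demo():
    # Both helpers should report this function as their caller.
    return (caller_frame_old().f_code.co_name,
            caller_frame_new().f_code.co_name)

print(demo())
```

Passing a larger depth to sys._getframe() walks further up the stack, which with the exception trick would mean chaining f_back by hand.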

[François]
>> By the way, I wonder why the `_' in `sys._getframe()'.
[Guido]
>> To discourage people from using it.
[Barry]
Because [the previous way] is slow, hard to remember, and too magical.
That is, in short, discouraging, exactly as per the wish of Guido! :-)
A bit more seriously, it is fun writing:
    try:
        frame = sys._getframe(1)
    except AttributeError:
        frame = sys.exc_info()[2].tb_frame.f_back
putting the error to good use whenever `_getframe' is not available!

"FP" == François Pinard pinard@iro.umontreal.ca writes:
FP> A bit more seriously, it is fun writing:
| try:
|     frame = sys._getframe(1)
| except AttributeError:
|     frame = sys.exc_info()[2].tb_frame.f_back
FP> putting the error to good use whenever `_getframe' is not
FP> available!
LOL! Very cute! -Barry

Guido van Rossum guido@python.org writes:
Thanks! But now we have a diverging set of isxxx methods for 8-bit strings and Unicode. I really don't know what the equivalent of these (ispunct, iscntrl, isgraph, isprint) is in Unicode -- maybe MAL or MvL know?
I don't think there is an "official" mapping between these categories and Unicode character categories. I believe an "intuitive" relationship would be:
ispunct: Punctuation (Pc, Pd, Ps, Pe, Pi, Pf, Po)
iscntrl: Other, control (Cc); perhaps other Other
isprint: Letters (L*), Marks (M*), Numbers (N*), Separators (Z*), perhaps informative categories (Symbol, Punctuation)
isgraph: everything isprint, except Separators
Another approach is to use the classification found in other libraries, such as Qt, Perl, or Win32 (GetStringTypeW).
Marcin Kowalczyk presented his intuition in
http://mail.nl.linux.org/linux-utf8/2000-09/msg00076.html
but some of his classification was challenged later on; I guess glibc would be just another library to draw classifications from.
Unicode also has a wider definition of digits; do we want to extend isxdigit() for that? (Probably not, but I'm not sure.)
Certainly not. We have to remember the common use for these, which is in computer stuff. There, hexdigit is 0..9{a..f|A..F}.
Regards, Martin
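
One possible reading of the "intuitive" mapping above, sketched with the standard unicodedata module. The u_* function names are hypothetical, and two choices are assumptions rather than anything settled in the thread: iscntrl is limited to Cc only, and isprint includes the "perhaps" categories (Symbol and Punctuation):

```python
import unicodedata

def u_ispunct(ch):
    """Punctuation: any P* general category."""
    return unicodedata.category(ch).startswith("P")

def u_iscntrl(ch):
    """Control: Cc only (assumption; other 'Other' categories are left out)."""
    return unicodedata.category(ch) == "Cc"

def u_isprint(ch):
    """Letters, Marks, Numbers, Separators, plus Symbols and Punctuation (assumption)."""
    return unicodedata.category(ch)[0] in "LMNZSP"

def u_isgraph(ch):
    """Everything printable except Separators."""
    return u_isprint(ch) and not unicodedata.category(ch).startswith("Z")

print(u_isprint(" "), u_isgraph(" "))
```

As with the C functions, the only difference between the print and graph tests in this sketch is the treatment of separators such as the space character.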