Re: [Python-ideas] This seems like a wart to me...

Ron Adam wrote:
Feel what you like, but I assumed that .split meant .splitonchars when I was learning Python in 2007 and was confused when my script didn't work. I was also confused about why it stopped getting rid of empty strings. And I still don't know how to write regexes, so now when I want to split on multiple chars, I end up .replace-ing a bunch first, which I recognize to be terribly inefficient, but the scripts are throwaways, so it's hardly worth the time to learn a whole other language first. -- Carl

Rereading your message along with your other ones, I see that I misinterpreted it. I thought you meant, "I can't imagine a new programmer wanting something as regex-like as string.splitonchars," but what you meant was "I can't imagine new programmers wanting to go into the re module to learn how to do something like string.splitonchars." To which I say: Yes! I heartily agree! :-D Embarrassedly-yours, Carl

Carl Johnson writes:
I don't understand this point of view at all. True, regexps are a complex subject, with an unfortunately large number of dialects. Is it the confusion of dialects problem, or do you really never use regexps in any language?

Anyway, for this purpose you only have to learn one idiom, that

    longstring.splitonchars(["x", "y", "z"])

is spelled

    import re
    re.split("[xyz]", longstring)

In fact, I personally would like to deprecate the with-argument implementation of string.split(), and have

    def split(self, delimiter=None):
        if delimiter is None:
            return self.usual_magic_splitting()
        else:
            import re
            return re.split(delimiter, self)

(of course, that's because that's precisely the way split-string works in Emacs). Then the idiom would be

    longstring.split("[xyz]")

Would that work for you?
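For readers following along, here is a minimal runnable sketch of the one idiom described above. The sample string and the name `words` are made up for illustration; only `re.split` and the character-class syntax come from the message.

```python
import re

# Hypothetical sample input; any string with mixed separators would do.
longstring = "one;two,three four"

# "[;, ]" is a character class: it matches any ONE of ';', ',' or ' '.
words = re.split("[;, ]", longstring)
print(words)  # ['one', 'two', 'three', 'four']

# Adjacent separators still produce empty strings, just like str.split(sep).
print(re.split("[;, ]", "a, b"))   # ['a', '', 'b']

# Adding '+' treats a run of separators as a single delimiter.
print(re.split("[;, ]+", "a, b"))  # ['a', 'b']
```

Note that the plain character class keeps empty strings between adjacent separators, so it behaves like `str.split(sep)` rather than like the no-argument `str.split()`.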

Stephen J. Turnbull wrote:
I have half-heartedly tried to learn regexps before, but always given up after reading about the basics. Obviously, this would be shameless behavior for a professional programmer, but I'm just a dilettante, and the famed saying of Jamie Zawinski ("Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.") is not highly motivating. :-D
Wouldn't that subtly break the code of everyone who has written something like:

    lines = bigtext.splitlines()
    delimiter = lines[0]
    del lines[0]
    splitlines = [line.split(delimiter) for line in lines]

? Since suddenly if your delimiter uses one of the reserved regexp characters, such as brackets and parentheses, the code would stop working. (That's one of the things I dislike about regexps -- too many magical characters.)

Here's a backward compatible idea instead:

    def split(self, delimiter=None):
        if delimiter is None:
            return self.usual_magic_splitting()
        elif isinstance(delimiter, str):
            return self.usual_delimiter_based_splitting(delimiter)
        elif isinstance(delimiter, Sequence):
            return self.treat_delimiters_given_by_sequence_as_interchangable(delimiter)
        else:
            raise TypeError("coercing to Unicode: need string or buffer or Sequence, "
                            + repr(type(delimiter)) + " found")

Since right now passing a list or tuple raises a TypeError, this would be backwards compatible. The idiom for doing re.split-like things would then be bigtext.split(list(" ;.,-!?")).

It might even be a good idea to add a keyword (only?) argument called "dropempty" to recreate the magical behavior of passing None as the delimiter where empty strings are dropped. That would also solve skip's original problem: just set it to text.split(None, dropempty=False). -- Carl
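The proposal above can be approximated today as a free function. This is only a sketch under stated assumptions: `split_any` is a hypothetical name, it is not a str method, and in this version the whitespace case (`delimiters=None`) always drops empty strings. It uses `re.escape` so that "magical" characters such as parentheses are treated literally.

```python
import re

def split_any(text, delimiters=None, dropempty=False):
    """Hypothetical helper: split on whitespace, one string, or any of several strings."""
    if delimiters is None:
        # Same as the built-in no-argument split: runs of whitespace, no empties.
        return text.split()
    if isinstance(delimiters, str):
        parts = text.split(delimiters)
    else:
        delimiters = list(delimiters)
        if not delimiters or "" in delimiters:
            raise ValueError("empty separator")
        # Longest first, so that "--" is preferred over "-" when both are given.
        alternatives = sorted(delimiters, key=len, reverse=True)
        # re.escape makes characters like "(" and "[" match literally.
        pattern = "|".join(re.escape(d) for d in alternatives)
        parts = re.split(pattern, text)
    if dropempty:
        parts = [p for p in parts if p]
    return parts

print(split_any("Hello, world; bye", list(" ;,")))
# ['Hello', '', 'world', '', 'bye']
print(split_any("Hello, world; bye", list(" ;,"), dropempty=True))
# ['Hello', 'world', 'bye']
print(split_any("a(b)c", ["(", ")"]))
# ['a', 'b', 'c']
```

By contrast, passing an unescaped "(" straight to `re.split` raises `re.error`, which is the breakage the message warns about.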

Carl Johnson writes:
Jamie was talking about the "to a man with a hammer, all problems look like thumbs" phenomenon. I've never heard anybody complain that shell globs are complex. But regexps will take you a lot farther with just character classes [] (which most modern shells implement), the wildcard character . (usually ? in shells), and the repetition operators * and/or + (available only as a variable-length wildcard * in shell globs).
Indeed it would. That was not a serious proposal. At this point, I'm trying to understand the resistance to regexps, not propose an improvement for .split().
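The correspondence between shell globs and the small regexp subset mentioned in this message can be checked directly. The sketch below uses the standard `fnmatch` module for globs; the helper name `agree` and the word list are made up for illustration.

```python
import fnmatch
import re

def agree(glob, regex, words):
    """Return True if the glob and the regex accept exactly the same words."""
    return all(
        fnmatch.fnmatchcase(word, glob) == bool(re.fullmatch(regex, word))
        for word in words
    )

words = ["cat", "bat", "cart", "cot"]

# Character classes are spelled the same way in both notations.
print(agree("[bc]at", "[bc]at", words))  # True
# The glob wildcard "?" corresponds to the regex ".".
print(agree("c?t", "c.t", words))        # True
# The glob "*" corresponds to the regex ".*" (wildcard plus repetition).
print(agree("c*t", "c.*t", words))       # True
```

`re.fullmatch` is used because a glob must match the whole name, whereas `re.search` would accept a match anywhere in the string.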

On Fri, Dec 12, 2008 at 1:43 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'd say the lack of diagnostics when they "fail" is the biggest issue. I could easily spend half an hour trying random permutations of a pattern before I figure out why the original didn't work... and I've had a moderate amount of experience. -- Adam Olsen, aka Rhamphoryncus

Adam Olsen writes:
It takes a moderate amount of experience to get that far, though. In particular, in this case, all you need to understand is "[abc]" matches any of the characters "a", "b", or "c", and *that* is familiar to anybody who has used a decent shell (any Unix shell, and I believe 4DOS and friends provided it too but I haven't used them for 20 years). So I don't think that lack of diagnostics explains widespread reluctance to even substitute ".*" for "*", but instead propose something as ugly as .split(list("abc")).

I think this discussion is drifting from the point. We all agree that regexps are great and powerful and no professional programmer should fail to learn them. But at the same time, it's worth noting that they are a different language from Python proper, and it's very easy to get weird results without knowing why. Anyway, apparently the proposal to allow splitting on a list is dead. What do people think of the proposal to add a dropempty keyword to allow the dropping (or retaining) of empty results? -- Carl

[i for i in s.split(x) if i] is simple enough if I don't know how to write "(" + re.escape(x) + ")+". I would like to be able to drop "i for" in cases like this and just write [i in s.split(x) if i].

--- Bruce

On Sat, Dec 13, 2008 at 12:05 PM, Carl Johnson <carl@carlsensei.com> wrote:
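The two spellings in this message can be compared side by side. This is a sketch with hypothetical function names. One detail worth knowing: with `re.split`, a capturing group such as "(" + ... + ")+" puts the captured separators into the result, so the sketch uses a non-capturing group "(?:...)" instead.

```python
import re

def split_drop_empty(s, sep):
    # The list-comprehension spelling: split, then filter out empty strings.
    return [part for part in s.split(sep) if part]

def split_on_runs(s, sep):
    # The regexp spelling: treat one or more consecutive separators as one.
    # (?:...) groups without capturing, so separators are not returned.
    return re.split("(?:" + re.escape(sep) + ")+", s)

s = "a,,b,c,"
print(split_drop_empty(s, ","))  # ['a', 'b', 'c']
print(split_on_runs(s, ","))     # ['a', 'b', 'c', '']
print(re.split("(,)+", s))       # ['a', ',', 'b', ',', 'c', ',', '']
```

The regexp version still leaves an empty string when the input starts or ends with a separator, so the two are not exact equivalents.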

Carl> I have half-heartedly tried to learn regexps before, but always
Carl> given up after reading about the basics.

Just out of curiosity, what editor do you use? Reading and doing are two different things.

Carl> ... the famed saying of Jamie Zawinski ("Some people, when
Carl> confronted with a problem, think 'I know, I'll use regular
Carl> expressions.' Now they have two problems.") is not highly
Carl> motivating. :-D

Sure, but that addresses the topic of some people's desire to use regular expressions to parse everything from LL(1) grammars to the tea leaves in the bottom of a cup. If you use them in an environment where there is almost no penalty for mistakes (incremental search) I think you will quickly gain an understanding of the syntax. Then your challenge will be not to fall into Jamie's re tar pit. ;-)

Skip

>> what you meant was "I can't imagine new programmers wanting to go
>> into the re module to learn how to do something like
>> string.splitonchars." To which I say: Yes! I heartily agree! :-D

Steve> I don't understand this point of view at all. True, regexps are
Steve> a complex subject, with an unfortunately large number of
Steve> dialects. Is it the confusion of dialects problem, or do you
Steve> really never use regexps in any language?

Getting more than a little bit off the original topic, but...

I think a person's affinity for regular expressions has a lot to do with their editing & programming environments. I work with some very experienced programmers (C++ & Python mostly, not much Perl, and generally very basic Emacs usage) who never (or almost never) use regular expressions.

* C/C++: My impression was always that the C regex(3) API presented a lot of barriers to casual use. Maybe that's changed over time.
* Python: You can go a long way without using regular expressions in Python because it has other easy-to-use string searching stuff (str.find, etc) as well as shell-style globbing for file name matching.
* Emacs: I think part of the reason that I find re's so easy-to-use is that I've been using some dialect of Emacs for about 20 years and it exposes re's in a way that is real easy to experiment with: incremental search. i-search+re's - what a fabulous combination.
* Perl: I suspect Perl mongers are as adept at re's as Emacs types because that's the primary (only?) way to search for patterns in strings.
* vi: Probably somewhere between Perl and Emacs. vim does support incremental search but it's not the default.

Are there other editors besides Emacs and vi for which regular expressions are so common?

Bringing this back on-topic, I can see that I'm going to lose this argument. I still view "".split(':') as a wart. I guess I'll have to live with it though.

Skip
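For readers who have not run into the behavior called a wart above, the following snippet shows how str.split treats empty strings with and without an explicit separator. The dictionary name `examples` is only for illustration.

```python
# Each key is the expression as text; each value is what Python returns.
examples = {
    '"".split()': "".split(),                # [] (no fields at all)
    '"".split(":")': "".split(":"),          # [''] (one empty field)
    '"a::b".split(":")': "a::b".split(":"),  # ['a', '', 'b']
    '"a  b".split()': "a  b".split(),        # ['a', 'b']
    '" a b ".split(" ")': " a b ".split(" "),  # ['', 'a', 'b', '']
}

for expression, result in examples.items():
    print(expression, "->", result)
```

With no argument, runs of whitespace count as one separator and empty strings are dropped; with an explicit separator, every occurrence delimits a field, so an empty input yields a single empty field.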

From: "Carl Johnson" <carl@carlsensei.com>
And I still don't know how to write regexs, ...
Maybe you should learn some of the fundamental tools provided by the language before you get in the business of demanding that the language be changed. Regexes occur in other languages and some command-line tools. Taking a little time to learn them will provide you with a lifelong skill that will serve you well in a number of contexts. This is doubly true in your case (since you've shown an interest in text processing). Raymond

participants (7)
- Adam Olsen
- Bruce Leban
- Carl Johnson
- Raymond Hettinger
- skip@pobox.com
- Stephen J. Turnbull
- Stephen J. Turnbull