str.split with multiple individual split characters
Here's another str.split() suggestion, this time an extension (Pythonic, I think) rather than a change of semantics.

There are cases where, especially in handling user input, I'd like to be able to treat any of a series of possible delimiters as acceptable. Let's say that I want commas, underscores, and hyphens to all be treated as delimiters (as I did in some code I was writing today). I guessed, based on some other Python std lib behaviours, that this might work:

usertokens = userstr.split([",", "_", "-"])

It doesn't work though, since the sep argument *has* to be a string. I think it would be nice for an extension like this to be supported, although I would guess a 90% probability of there being an insightful reason for why it's not such a great idea after all* ;-)

Unlike many extensions, I don't think that the general solution to this is *very* quick and idiomatic in current Python. As for a compelling use-case... well, I'm very sympathetic to not adding functions for which there is no demand (I forget the relevant acronym) but this is a case where I suddenly found that I did have that problem to solve and that Python didn't have the nice built-in answer that I semi-expected it to. Extension of single arguments to iterables of them is quite a common Python design feature: one of those things where you think "ooh, this really is a nice, consistent, powerful language" when you find it. So I hope that this suggestion finds some favour.

Best wishes,
Andy

[*] Such as "how do you distinguish between a string, which is iterable over its characters, and a list/tuple/blah of individual strings?" Well, that doesn't strike me as too big a technical issue, but maybe it is.
It's so easy to do this using re.split() that it's not worth the added complexity in str.split().

On Sun, Feb 27, 2011 at 4:14 PM, Andy Buckley <andy@insectnation.org> wrote:
Here's another str.split() suggestion, this time an extension (Pythonic, I think) rather than a change of semantics.
There are cases where, especially in handling user input, I'd like to be able to treat any of a series of possible delimiters as acceptable. Let's say that I want commas, underscores, and hyphens to all be treated as delimiters (as I did in some code I was writing today). I guessed, based on some other Python std lib behaviours, that this might work:
usertokens = userstr.split([",", "_", "-"])
It doesn't work though, since the sep argument *has* to be a string. I think it would be nice for an extension like this to be supported, although I would guess a 90% probability of there being an insightful reason for why it's not such a great idea after all* ;-)
Unlike many extensions, I don't think that the general solution to this is *very* quick and idiomatic in current Python. As for a compelling use-case... well, I'm very sympathetic to not adding functions for which there is no demand (I forget the relevant acronym) but this is a case where I suddenly found that I did have that problem to solve and that Python didn't have the nice built-in answer that I semi-expected it to. Extension of single arguments to iterables of them is quite a common Python design feature: one of those things where you think "ooh, this really is a nice, consistent, powerful language" when you find it. So I hope that this suggestion finds some favour.
Best wishes, Andy
[*] Such as "how do you distinguish between a string, which is iterable over its characters, and a list/tuple/blah of individual strings?" Well, that doesn't strike me as too big a technical issue, but maybe it is. _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
-- --Guido van Rossum (python.org/~guido)
Guido van Rossum wrote:
It's so easy to do this using re.split() that it's not worth the added complexity in str.split().
Also I'm not sure it would be all that useful in practice in the simple form proposed. Whenever I've wanted something like that I've also wanted to know *which* separator occurred at each split point. This is also fairly easy to do with re.split(). -- Greg
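For reference, re.split keeps the matched separators in the result when the pattern contains a capturing group, which gives exactly the "which separator occurred at each split point" behaviour described above. A minimal illustration, using the delimiters from the original post:

>>> import re
>>> re.split(r'([,_-])', 'a,b_c-d')
['a', ',', 'b', '_', 'c', '-', 'd']

The odd-indexed items are the separators that occurred at each split point.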
Guido van Rossum wrote:
It's so easy to do this using re.split() that it's not worth the added complexity in str.split().
Easy, but slow. If performance is important, it looks to me like re.split is the wrong solution. Using Python 3.1:
>>> from re import split
>>> def split_str(s, *args):  # quick, dirty and inefficient multi-split
...     for a in args[1:]:
...         s = s.replace(a, args[0])
...     return s.split(args[0])
...
>>> text = "abc.d-ef_g:h;ijklmn+opqrstu|vw-x_y.z"*1000
>>> assert split(r'[.\-_:;+|]', text) == split_str(text, *'.-_:;+|')

>>> from timeit import Timer
>>> t1 = Timer("split(r'[.\-_:;+|]', text)",
...            "from re import split; from __main__ import text")
>>> t2 = Timer("split_str(text, *'.-_:;+|')",
...            "from __main__ import split_str, text")

>>> min(t1.repeat(number=10000, repeat=5))
72.31230521202087
>>> min(t2.repeat(number=10000, repeat=5))
17.375113010406494
-- Steven
Steven D'Aprano, 28.02.2011 11:23:
Guido van Rossum wrote:
It's so easy to do this using re.split() that it's not worth the added complexity in str.split().
Easy, but slow. If performance is important, it looks to me like re.split is the wrong solution. Using Python 3.1:
>>> from re import split
>>> def split_str(s, *args):  # quick, dirty and inefficient multi-split
...     for a in args[1:]:
...         s = s.replace(a, args[0])
...     return s.split(args[0])
...
>>> text = "abc.d-ef_g:h;ijklmn+opqrstu|vw-x_y.z"*1000
>>> assert split(r'[.\-_:;+|]', text) == split_str(text, *'.-_:;+|')

>>> from timeit import Timer
>>> t1 = Timer("split(r'[.\-_:;+|]', text)",
...            "from re import split; from __main__ import text")
>>> t2 = Timer("split_str(text, *'.-_:;+|')",
...            "from __main__ import split_str, text")

>>> min(t1.repeat(number=10000, repeat=5))
72.31230521202087
>>> min(t2.repeat(number=10000, repeat=5))
17.375113010406494
You forgot to do the precompilation. Here's what I get:

>>> t1 = Timer("split(text)", "import re; from __main__ import text; \
... split=re.compile(r'[.\-_:;+|]').split")
>>> min(t1.repeat(number=1000, repeat=3))
3.9842870235443115
>>> min(t2.repeat(number=1000, repeat=3))
0.9261999130249023

Still a factor of 4, using Py3.2. Anyone want to try it with the alternative regex packages?

Stefan
Stefan Behnel wrote:
Steven D'Aprano, 28.02.2011 11:23:
Guido van Rossum wrote:
It's so easy to do this using re.split() that it's not worth the added complexity in str.split().
Easy, but slow. If performance is important, it looks to me like re.split is the wrong solution. Using Python 3.1: [...]

You forgot to do the precompilation. Here's what I get:
The re module caches the last 100(?) patterns used, so it only needs compiling once. The other 49,999 times it will be fetched from the cache. -- Steven
On 28/02/2011 00:14, Andy Buckley wrote:
Here's another str.split() suggestion, this time an extension (Pythonic, I think) rather than a change of semantics.
There are cases where, especially in handling user input, I'd like to be able to treat any of a series of possible delimiters as acceptable. Let's say that I want commas, underscores, and hyphens to all be treated as delimiters (as I did in some code I was writing today). I guessed, based on some other Python std lib behaviours, that this might work:
usertokens = userstr.split([",", "_", "-"])
It doesn't work though, since the sep argument *has* to be a string. I think it would be nice for an extension like this to be supported, although I would guess a 90% probability of there being an insightful reason for why it's not such a great idea after all* ;-)
Unlike many extensions, I don't think that the general solution to this is *very* quick and idiomatic in current Python. As for a compelling use-case... well, I'm very sympathetic to not adding functions for which there is no demand (I forget the relevant acronym) but this is a case where I suddenly found that I did have that problem to solve and that Python didn't have the nice built-in answer that I semi-expected it to. Extension of single arguments to iterables of them is quite a common Python design feature: one of those things where you think "ooh, this really is a nice, consistent, powerful language" when you find it. So I hope that this suggestion finds some favour.
Best wishes, Andy
[*] Such as "how do you distinguish between a string, which is iterable over its characters, and a list/tuple/blah of individual strings?" Well, that doesn't strike me as too big a technical issue, but maybe it is.
There are a number of additions which could be useful, such as splitting on multiple separators (compare with str.startswith and str.endswith) and stripping leading and/or trailing /strings/ (perhaps str.stripstr, str.lstripstr and str.rstripstr), but it does come down to use cases. As has been pointed out previously, it's easy to keep adding stuff, but once something is added we'll be stuck with it forever (virtually), so we need to be careful. The relevant acronym, by the way, is "YAGNI" ("You Aren't Going to Need It" or "You Ain't Gonna Need It").
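As a point of comparison for the precedent mentioned above, str.startswith and str.endswith already accept a tuple of alternatives; the stripping variants do not exist, so the helper below is purely a hypothetical sketch of what an lstripstr-style function might look like:

>>> "https://example.org".startswith(("http://", "https://"))
True

def lstripstr(s, prefix):
    # hypothetical: strip one leading *string*, not a set of characters
    return s[len(prefix):] if s.startswith(prefix) else s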
On Feb 27, 2011, at 4:36 PM, MRAB wrote:
As has been pointed out previously, it's easy to keep adding stuff, but once something is added we'll be stuck with it forever (virtually), so we need to be careful.
The real problem is that str.split() is already at its usability limits. The two separate algorithms are a perpetual source of confusion. It took years to get the documentation to be as accurate and helpful as it is now. Extending str.split() in any way would make the problem worse, so it shouldn't be touched again. It would be helpful to consider its API to be frozen. Any bright ideas for additional capabilities should be aimed at new methods, modules, or recipes but not at str.split() itself. Useful as it is, we're fortunate that str.splitlines() was implemented as a separate method rather than as an extension to str.split(). IMO, this should be the model for the future.

Raymond
FWIW, I'd like it if something like this functionality existed in the basic string methods. I'm aware of re.split, but in spite of learning regular expressions two or three times already, I use them so infrequently, I had already forgotten how to make it work and which characters are special characters (I find this the hardest thing to remember with regular expressions). So, I would appreciate it if something like s.multisplit(["-", "_", ","]) existed.

Still, there is a simple enough non-regular expressions way of doing such a split:

s = s.replace(",", "-").replace("_", "-")
items = s.split("-")

So, I don't think this is an urgent need. It's more of an "it would be nice if" but I don't know how to square that against the maintenance costs.
On 2/27/2011 9:06 PM, Carl M. Johnson wrote:
FWIW, I'd like it if something like this functionality existed in the basic string methods. I'm aware of re.split, but in spite of learning regular expressions two or three times already, I use them so infrequently, I had already forgotten how to make it work
I found it so easy to get your particular use case -- multiple individual chars -- right on my first attempt that I have trouble being sympathetic. In the IDLE shell, I just typed re.split( and the tool tip just popped up with (pattern, string, ...). The only thing I had to remember is that brackets [] defines such sets.
and which characters are special characters
It turns out that within a set pattern, special chars are generally not special. However, extra backslashes do not hurt even when not needed. Perhaps the str.split entry should have a cross-reference to re.split. -- Terry Jan Reedy
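A small illustration of the character-class behaviour described above; inside [...] most metacharacters are literal, with the hyphen being the main one that needs care (escape it or put it first or last in the set):

>>> import re
>>> re.split('[.+*]', 'a.b+c*d')    # '.', '+' and '*' are literal inside a set
['a', 'b', 'c', 'd']
>>> re.split('[-.+]', 'a-b.c+d')    # a leading '-' is safe
['a', 'b', 'c', 'd']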
On Sun, Feb 27, 2011 at 6:40 PM, Terry Reedy <tjreedy@udel.edu> wrote:
I found it so easy to get your particular use case -- multiple individual chars -- right on my first attempt that I have trouble being sympathetic. In the IDLE shell, I just typed re.split( and the tool tip just popped up with (pattern, string, ...). The only thing I had to remember is that brackets [] defines such sets.
Yes, but brackets defining such sets is the exact thing that I had forgotten! :-P
It turns out that within a set pattern, special chars are generally not special. However, extra backslashes do not hurt even when not needed.
Things like this are what make me think it is impossible for regular expressions, as useful as they are, to be really Pythonic. There are too many "convenient" special cases. Anyway, you'll get no argument from me: Regexes are easy once you know regexes. For whatever reason though, I've never been able to successfully, permanently learn regexes. I'm just trying to make the case that it's tough for some users to have to learn a whole separate language in order to do a certain kind of string split more simply. Then again that's not to say that there needs to be such functionality. After all, love them or hate them, there are a lot of tasks for which regexes are just the simplest way to get the job done. It's just that users like me (if there are any) who find regexes hard to get to stick would appreciate being able to avoid learning them for a little longer.
Carl M. Johnson writes:
Anyway, you'll get no argument from me: Regexes are easy once you know regexes. For whatever reason though, I've never been able to successfully, permanently learn regexes.
How about learning them long enough to write
>>> def multisplit(source, char1, char2):
...     return re.split("".join(["[", char1, char2, "]"]), source)
...
>>> multisplit("a-b_c", "_", "-")
['a', 'b', 'c']
or a generalization as needed? I'm not unsympathetic to the need, but there are just too many Zen or near-Zen principles violated by this proposal. I'm getting old and cranky enough myself that I have to explicitly remind myself to do this kind of thing, but arguing against the Zen doesn't work very well, even here on python-ideas. Life is easier for me when I remember to help myself!
On Sun, Feb 27, 2011 at 10:19 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
>>> def multisplit(source, char1, char2):
...     return re.split("".join(["[", char1, char2, "]"]), source)
actually you need re.escape there in case one of the characters is \ or ]. And if remembering [...] is hard, using | makes this a bit more general (accepting multi-character separators):

def multisplit(source, *separators):
    return re.split('|'.join([re.escape(t) for t in separators]), source)

multisplit(s, '\r\n', '\r', '\n')

Bonus points if you see the problem with the above. Correct code below spoiler space
.
.
.
.
.
.
.
.
.
.
.
The problem is that an |-separated regex matches in order, so if a longer separator appears after a shorter one, the shorter one will take precedence.

def multisplit(source, *separators):
    return re.split('|'.join([re.escape(t) for t in
                              sorted(separators, key=len, reverse=True)]),
                    source)
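A quick check of the two versions above shows the ordering problem in action (a sketch; it assumes re has been imported):

>>> s = 'a\r\nb'
>>> re.split('|'.join(re.escape(t) for t in ['\r', '\n', '\r\n']), s)
['a', '', 'b']
>>> re.split('|'.join(re.escape(t) for t in ['\r\n', '\r', '\n']), s)
['a', 'b']

With the shorter separator first, '\r' matches before '\r\n' ever gets a chance, leaving a spurious empty string; sorting longest-first fixes it.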
On Mon, Feb 28, 2011 at 3:03 PM, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
Anyway, you'll get no argument from me: Regexes are easy once you know regexes. For whatever reason though, I've never been able to successfully, permanently learn regexes.
Neither have I, I just remember where to find the (quite readable) reference to their syntax in the Python documentation (http://docs.python.org/library/re). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Carl M. Johnson wrote:
Anyway, you'll get no argument from me: Regexes are easy once you know regexes. For whatever reason though, I've never been able to successfully, permanently learn regexes. I'm just trying to make the case that it's tough for some users to have to learn a whole separate language in order to do a certain kind of string split more simply.
I would say, *easy* regexes are easy once you know regexes. But in general, not so much... even Larry Wall is rethinking a lot of regex culture and syntax: http://dev.perl.org/perl6/doc/design/apo/A05.html But this case is relatively easy, although there is at least one obvious trap for the unwary: forgetting to escape the split chars.
Then again that's not to say that there needs to be such functionality. After all, love them or hate them, there are a lot of tasks for which regexes are just the simplest way to get the job done. It's just that users like me (if there are any) who find regexes hard to get to stick would appreciate being able to avoid learning them for a little longer.
I can sympathise with that. Regexes are essentially another programming language (albeit not Turing complete), and they are the opposite of everything we love about Python. They're as far from executable pseudo-code as it's possible to get without becoming one of those esoteric languages that have three commands and one data type... *wink*

Anyway, for what it's worth, when I think about the times I've needed something like a multi-split, it has been for mini-parsers. I think a cross between split and partition would be more useful:

multisplit(source, seps, maxsplit=None) => [(substring, sep), ...]

Here's a pure-Python implementation, limited to single character separators:

def multisplit(source, seps, maxsplit=None):
    def find_first():
        for i, c in enumerate(source):
            if c in seps:
                return i
        return -1
    count = 0
    while True:
        if maxsplit is not None and count >= maxsplit:
            yield (source, '')
            break
        p = find_first()
        if p >= 0:
            yield (source[:p], source[p])
            count += 1
            source = source[p+1:]
        else:
            yield (source, '')
            break

-- Steven
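Called on the kind of input from the original post, the generator above yields (substring, separator) pairs, with an empty separator marking the final piece (a quick illustrative run, not an exhaustive test):

>>> list(multisplit("a,b_c-d", ",_-"))
[('a', ','), ('b', '_'), ('c', '-'), ('d', '')]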
Ok, with everyone at least noticing that regular expressions are hard, if not actively complaining about it (including apparently Larry Wall), maybe it's time to add a second pattern matching library - one that's more Pythonic?

There are any number of languages with readable pattern matching - Icon, Snobol and REXX all come to my mind. Searching pypi for "snobol" reveals two snobol string matching libraries, and I found one on the web based on Icon.

Possibly we should investigate adding one of those to the standard library, along with a cross-reference from the regexp documentation?

<mike
--
Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
On Mon, Feb 28, 2011 at 8:04 AM, Mike Meyer <mwm@mired.org> wrote:
Ok, with everyone at least noticing that regular expressions are hard, if not actively complaining about it (including apparently Larry wall), maybe it's time to add a second pattern matching library - one that's more pythonic?
There are any number of languages with readable pattern matching - Icon, Snobol and REXX all come to my mind. Searching pypi for "snobol" reveals two snobol string matching libraries, and I found one on the web based on icon.
Possibly we should investigate adding one of those to the standard library, along with a cross-reference from the regexp documentation?
It's been tried before without much success. I think it may have been a decade ago that Ka-Ping Yee created a pattern matching library that used function calls (and operator overloading? I can't recall) to generate patterns -- compiling to re patterns underneath. It didn't get much use. I fear that regular expressions have this market cornered, and there isn't anything possible that is so much better that it'll drive them out. That doesn't mean you shouldn't try -- I've been wrong before. But maybe instead of striving for stdlib inclusion (which these days is pretty much only accessible for proven successful 3rd party libraries), you should try to create a new 3rd party pattern matching library. While admittedly this gives it a disadvantage to the re module, I really don't think we should experiment in the stdlib, since the release cycle and backwards compatibility requirements make the necessary experimentation too cumbersome. On the third hand, I could see this as an area where a pure library-based approach will always be doomed, and where a proposal to add new syntax would actually make sense. Of course that still has the same problems due to release time and policy. -- --Guido van Rossum (python.org/~guido)
On Mon, 28 Feb 2011, Guido van Rossum wrote:
On Mon, Feb 28, 2011 at 8:04 AM, Mike Meyer <mwm@mired.org> wrote:
Possibly we should investigate adding one of those to the standard library, along with a cross-reference from the regexp documentation?
It's been tried before without much success. I think it may have been a decade ago that Ka-Ping Yee created a pattern matching library that used function calls (and operator overloading? I can't recall) to generate patterns -- compiling to re patterns underneath. It didn't get much use.
Yes, there was operator overloading. The expressions looked like this: letter + 3*digits + anyspace + either(some(digits), some(letters)) If anyone is curious, the module is available here: http://zesty.ca/python/rxb.py You're welcome to experiment with it, modify it, use it as a starting point for your own pattern matcher if you like. --Ping
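For anyone curious how an expression like that can work at all, here is a deliberately minimal sketch of the operator-overloading idea, compiling to re patterns underneath. It is not Ka-Ping's actual implementation, and the names used (Pattern, either, some, letter, letters, digits, anyspace) are just stand-ins chosen to mirror the example above:

import re

class Pattern:
    """Minimal pattern-combinator sketch (not rxb.py itself)."""
    def __init__(self, regex):
        self.regex = regex
    def __add__(self, other):          # sequencing:  a + b
        return Pattern(self.regex + other.regex)
    def __rmul__(self, n):             # repetition:  3 * digits
        return Pattern('(?:%s){%d}' % (self.regex, n))
    def match(self, text):
        return re.match(self.regex, text)

def either(*pats):                     # alternation
    return Pattern('(?:%s)' % '|'.join(p.regex for p in pats))

def some(pat):                         # one or more repetitions
    return Pattern('(?:%s)+' % pat.regex)

letter = letters = Pattern('[A-Za-z]')
digits = Pattern('[0-9]')
anyspace = Pattern(r'\s*')

pat = letter + 3*digits + anyspace + either(some(digits), some(letters))
print(pat.regex)    # [A-Za-z](?:[0-9]){3}\s*(?:(?:[0-9])+|(?:[A-Za-z])+)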
On Tue, Mar 1, 2011 at 3:15 AM, Guido van Rossum <guido@python.org> wrote:
On the third hand, I could see this as an area where a pure library-based approach will always be doomed, and where a proposal to add new syntax would actually make sense. Of course that still has the same problems due to release time and policy.
I suspect one of the core issues isn't so much that regex syntax is arcane, ugly and hard to remember (although those don't help), but the fact that fully general string pattern matching is inherently hard to remember due to the wide range of options. There's a reason glob-style matching is limited to a couple of simple wildcard characters.

As code-based alternatives to regexes go, the one I see come up most often as a suggested, working, alternative is pyparsing (although I've never tried it myself). For example: http://stackoverflow.com/questions/3673388/python-replacing-regex-with-bnf-o...

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Tue, 1 Mar 2011 08:18:43 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On Tue, Mar 1, 2011 at 3:15 AM, Guido van Rossum <guido@python.org> wrote:
On the third hand, I could see this as an area where a pure library-based approach will always be doomed, and where a proposal to add new syntax would actually make sense. Of course that still has the same problems due to release time and policy.

I suspect one of the core issues isn't so much that regex syntax is arcane, ugly and hard to remember (although those don't help), but the fact that fully general string pattern matching is inherently hard to remember due to the wide range of options. There's a reason glob-style matching is limited to a couple of simple wildcard characters.
I disagree. Fully general string pattern matching has a few fundamental operations: sequence, alternation, and repetition. Modern regexp libraries have lots of features that provide shorthands for special cases of those. The "options" tend to either be things that can be duplicated by proper use of the three fundamental features, or ways of changing the handling of newlines and string ends. Even things like greedy vs. non-greedy can be handled by defining those fundamental operations properly (e.g. - define {m,n} as trying the matches from m to n, rather than just matching from m to n, so {n,m} and {m,n} would be the same match with different greediness).

In other words, the problem isn't that fully general string pattern matching is hard, it's that our regular expression language started from an academic tool of formal language and automata theory, and has grown features ad-hoc since then. Worse yet, there are multiple implementations with slightly different behaviours, some with multiple modes that also change the syntax.
As as code based alternatives to regexes go, the one I see come up most often as a suggested, working, alternative is pyparsing (although I've never tried it myself). For example: http://stackoverflow.com/questions/3673388/python-replacing-regex-with-bnf-o...
I played with an early version of the snobol library now in pypi, and it worked well for what I tried. However, I don't think these will be generally successful, because 1) they aren't more powerful than regex, just more readable. Which winds up hurting them, because writing a book about using them is overkill, but the existence of such a book for regexps favors them.

One of the more interesting features of pattern matching is backtracking. I.e. - if a match fails, you start working backwards through the pattern until you find an element that has untried alternatives, go to the next alternative, and then start working forward again. Icon lifts that capability into the language proper - allowing for some interesting capabilities. I think the best alternative to replacing the regexp library would be new syntax to provide that facility, then building string matching on top of that facility.

<mike
--
Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
On Tue, Mar 1, 2011 at 9:19 AM, Mike Meyer <mwm@mired.org> wrote:
I disagree. Fully general string pattern matching has a few fundamental operations: sequence, alternation, and repetition.
I agree that the fundamental operations are simple in principle.

However, I still believe that the elaboration of those operations into fully general pattern matching is a complex combinatorial operation that is difficult to master. regex's certainly make it harder than it needs to be, but anything with similar expressive power is still going to be tricky to completely wrap your head around.

Cheers, Nick.

P.S. I'm guessing this is the Icon based library you mentioned in the original message: http://www.wilmott.ca/python/patternmatching.html Certainly an interesting read.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Tue, 1 Mar 2011 19:50:44 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On Tue, Mar 1, 2011 at 9:19 AM, Mike Meyer <mwm@mired.org> wrote:
I disagree. Fully general string pattern matching has a few fundamental operations: sequence, alternation, and repetition.
I agree that the fundamental operations are simple in principle.
However, I still believe that the elaboration of those operations into fully general pattern matching is a complex combinatorial operation that is difficult to master. regex's certainly make it harder than it needs to be, but anything with similar expressive power is still going to be tricky to completely wrap your head around.
True. But I think that the problem - if properly expressed - is like the game of Go: a few simple rules that combine to produce a complex system that is difficult to master. With regexp notation, what we've got is more like 3d chess: multiple complex (just slightly different) sets of operations that do more to obscure the underlying simple rules than to help master the system.

<mike
--
Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
On Tue, Mar 1, 2011 at 9:05 AM, Mike Meyer <mwm@mired.org> wrote:
On Tue, 1 Mar 2011 19:50:44 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On Tue, Mar 1, 2011 at 9:19 AM, Mike Meyer <mwm@mired.org> wrote:
I disagree. Fully general string pattern matching has a few fundamental operations: sequence, alternation, and repetition.
I agree that the fundamental operations are simple in principle.
However, I still believe that the elaboration of those operations into fully general pattern matching is a complex combinatorial operation that is difficult to master. regex's certainly make it harder than it needs to be, but anything with similar expressive power is still going to be tricky to completely wrap your head around.
True. But I think that the problem - if properly expressed - is like the game of Go: a few simple rules that combine to produce a complex system that is difficult to master. With regexp notation, what we've got is more like 3d chess: multiple complex (just slightly different) sets of operations that do more to obscure the underlying simple rules than to help master the system.
I'm not sure those are the right analogies (though they may not be all that wrong either). If you ask me there are two problems with regexps:

(a) The notation is cryptic and error-prone, its use of \ conflicts with Python strings (using r'...' helps but is yet another gotcha), and the parser is primitive. Until your brain has learned to parse regexps, it will have a hard time understanding examples, which are often the key to solving programming problems. Somehow the regexp syntax is not "natural" for the text parsers we have in our brain -- contrast this with Python's syntax, which was explicitly designed to go with the flow. Perhaps another problem is with composability -- if you know how to solve two simple problems using regexps, that doesn't mean your solutions can be combined to solve a combination of those problems.

(b) There often isn't all that great of a match between the high-level goals of the user (e.g. "extract a list of email addresses from a file") and the available primitive operations. It's like writing an operating system for a Turing machine -- we have mathematical proof that it's possible, but that doesn't make it easy. The additional operations provided by modern, Perl-derived (which includes Python's re module) regexp notation are meant to help, but they just extend the basic premises of regexp notation, rather than providing a new, higher-level abstraction layer that is better matched to the way the typical user thinks about the problem.

All in all I think it would be a good use of somebody's time to try and come up with something better. But it won't be easy.

-- --Guido van Rossum (python.org/~guido)
On Tue, Mar 1, 2011 at 10:30 AM, Guido van Rossum <guido@python.org> wrote:
On Tue, Mar 1, 2011 at 9:05 AM, Mike Meyer <mwm@mired.org> wrote:
On Tue, 1 Mar 2011 19:50:44 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On Tue, Mar 1, 2011 at 9:19 AM, Mike Meyer <mwm@mired.org> wrote:
I disagree. Fully general string pattern matching has a few fundamental operations: sequence, alternation, and repetition.
I agree that the fundamental operations are simple in principle.
However, I still believe that the elaboration of those operations into fully general pattern matching is a complex combinatorial operation that is difficult to master. regex's certainly make it harder than it needs to be, but anything with similar expressive power is still going to be tricky to completely wrap your head around.
True. But I think that the problem - if properly expressed - is like the game of Go: a few simple rules that combine to produce a complex system that is difficult to master. With regexp notation, what we've got is more like 3d chess: multiple complex (just slightly different) sets of operations that do more to obscure the underlying simple rules than to help master the system.
I'm not sure those are the right analogies (though they may not be all that wrong either). If you ask me there are two problems with regexps:
(a) The notation is cryptic and error-prone, its use of \ conflicts with Python strings (using r'...' helps but is yet another gotcha), and the parser is primitive. Until your brain has learned to parse regexps, it will have a hard time understanding examples, which are often the key to solving programming problems. Somehow the regexp syntax is not "natural" for the text parsers we have in our brain -- contrast this with Python's syntax, which was explicitly designed to go with the flow. Perhaps another problem is with composability -- if you know how to solve two simple problems using regexps, that doesn't mean your solutions can be combined to solve a combination of those problems.
(b) There often isn't all that great of a match between the high-level goals of the user (e.g. "extract a list of email addresses from a file") and the available primitive operations. It's like writing an operating system for a Turing machine -- we have mathematical proof that it's possible, but that doesn't make it easy. The additional operations provided by modern, Perl-derived (which includes Python's re module) regexp notation are meant to help, but they just extend the basic premises of regexp notation, rather than providing a new, higher-level abstraction layer that is better matched to the way the typical user thinks about the problem.
All in all I think it would be a good use of somebody's time to try and come up with something better. But it won't be easy.
-- --Guido van Rossum (python.org/~guido)
It's unfortunate that there isn't a good way to do this kind of long-range work within the auspices of Python. I can imagine a number of projects like this that fail to attract interest due to low perceived chances of success and a dearth of community feedback. Geremy Condra
On Tue, Mar 1, 2011 at 9:53 PM, geremy condra <debatem1@gmail.com> wrote:
It's unfortunate that there isn't a good way to do this kind of long-range work within the auspices of Python. I can imagine a number of projects like this that fail to attract interest due to low perceived chances of success and a dearth of community feedback.
Once a good library had a solid foundation, it could plug itself into some widely used Python programs and gain publicity and support from there, before pushing for inclusion in the stdlib.

A good example is Django's URL mapping, which currently uses regexps. I think it would be possible to get Django to support an alternate pattern matching method, in addition to regexps, since this would make learning Django easier for developers who don't grok regexps.

- Tal Einat
On Tue, Mar 1, 2011 at 12:25 PM, Tal Einat <taleinat@gmail.com> wrote:
On Tue, Mar 1, 2011 at 9:53 PM, geremy condra <debatem1@gmail.com> wrote:
It's unfortunate that there isn't a good way to do this kind of long-range work within the auspices of Python. I can imagine a number of projects like this that fail to attract interest due to low perceived chances of success and a dearth of community feedback.
Once a good library had a solid foundation, it could plug itself into some widely used Python programs and gain publicity and support from there, before pushing for inclusion in the stdlib.
A good example is Django's URL mapping, which currently uses regexps. I think it would be possible to get Django to support an alternate pattern matching method, in addition to regexps, since this would make learning Django easier for developers who don't grok regexps.
Ah, but geremy is complaining about work that cannot be done as a library, e.g. syntax changes. This is because I suggested a better approach to matching would probably require syntax changes. I don't have an answer -- it may be easier to create a whole new language and experiment with matching syntax than it is to get a PEP approved for a matching syntax extension to Python... That's just how it goes for mature languages. Try getting new syntax added to C++, Java or JavaScript... :-) -- --Guido van Rossum (python.org/~guido)
On Tue, Mar 1, 2011 at 1:23 PM, Guido van Rossum <guido@python.org> wrote:
On Tue, Mar 1, 2011 at 12:25 PM, Tal Einat <taleinat@gmail.com> wrote:
On Tue, Mar 1, 2011 at 9:53 PM, geremy condra <debatem1@gmail.com> wrote:
It's unfortunate that there isn't a good way to do this kind of long-range work within the auspices of Python. I can imagine a number of projects like this that fail to attract interest due to low perceived chances of success and a dearth of community feedback.
Once a good library had a solid foundation, it could plug itself into some widely used Python programs and gain publicity and support from there, before pushing for inclusion in the stdlib.
A good example is Django's URL mapping, which currently uses regexps. I think it would be possible to get Django to support an alternate pattern matching method, in addition to regexps, since this would make learning Django easier for developers who don't grok regexps.
Ah, but geremy is complaining about work that cannot be done as a library, e.g. syntax changes. This is because I suggested a better approach to matching would probably require syntax changes. I don't have an answer -- it may be easier to create a whole new language and experiment with matching syntax than it is to get a PEP approved for a matching syntax extension to Python... That's just how it goes for mature languages. Try getting new syntax added to C++, Java or JavaScript... :-)
Erm... this actually isn't what I was talking about at all. I was basically just saying that I think it would be good if Python had better tools to bring attention to issues that might be considered for inclusion if a better way could be found. Geremy Condra
On Tue, Mar 1, 2011 at 2:50 PM, geremy condra <debatem1@gmail.com> wrote:
On Tue, Mar 1, 2011 at 1:23 PM, Guido van Rossum <guido@python.org> wrote:
On Tue, Mar 1, 2011 at 12:25 PM, Tal Einat <taleinat@gmail.com> wrote:
On Tue, Mar 1, 2011 at 9:53 PM, geremy condra <debatem1@gmail.com> wrote:
It's unfortunate that there isn't a good way to do this kind of long-range work within the auspices of Python. I can imagine a number of projects like this that fail to attract interest due to low perceived chances of success and a dearth of community feedback.
Once a good library had a solid foundation, it could plug itself into some widely used Python programs and gain publicity and support from there, before pushing for inclusion in the stdlib.
A good example is Django's URL mapping, which currently uses regexps. I think it would be possible to get Django to support an alternate pattern matching method, in addition to regexps, since this would make learning Django easier for developers who don't grok regexps.
Ah, but geremy is complaining about work that cannot be done as a library, e.g. syntax changes. This is because I suggested a better approach to matching would probably require syntax changes. I don't have an answer -- it may be easier to create a whole new language and experiment with matching syntax than it is to get a PEP approved for a matching syntax extension to Python... That's just how it goes for mature languages. Try getting new syntax added to C++, Java or JavaScript... :-)
Erm... this actually isn't what I was talking about at all. I was basically just saying that I think it would be good if Python had better tools to bring attention to issues that might be considered for inclusion if a better way could be found.
Ok, sorry. But that sounds so general as to be devoid of meaning. Can you clarify your wish with a few examples? -- --Guido van Rossum (python.org/~guido)
On Tue, Mar 1, 2011 at 3:22 PM, Guido van Rossum <guido@python.org> wrote:
On Tue, Mar 1, 2011 at 2:50 PM, geremy condra <debatem1@gmail.com> wrote:
On Tue, Mar 1, 2011 at 1:23 PM, Guido van Rossum <guido@python.org> wrote:
On Tue, Mar 1, 2011 at 12:25 PM, Tal Einat <taleinat@gmail.com> wrote:
On Tue, Mar 1, 2011 at 9:53 PM, geremy condra <debatem1@gmail.com> wrote:
It's unfortunate that there isn't a good way to do this kind of long-range work within the auspices of Python. I can imagine a number of projects like this that fail to attract interest due to low perceived chances of success and a dearth of community feedback.
Once a good library had a solid foundation, it could plug itself into some widely used Python programs and gain publicity and support from there, before pushing for inclusion in the stdlib.
A good example is Django's URL mapping, which currently uses regexps. I think it would be possible to get Django to support an alternate pattern matching method, in addition to regexps, since this would make learning Django easier for developers who don't grok regexps.
Ah, but geremy is complaining about work that cannot be done as a library, e.g. syntax changes. This is because I suggested a better approach to matching would probably require syntax changes. I don't have an answer -- it may be easier to create a whole new language and experiment with matching syntax than it is to get a PEP approved for a matching syntax extension to Python... That's just how it goes for mature languages. Try getting new syntax added to C++, Java or JavaScript... :-)
Erm... this actually isn't what I was talking about at all. I was basically just saying that I think it would be good if Python had better tools to bring attention to issues that might be considered for inclusion if a better way could be found.
Ok, sorry. But that sounds so general as to be devoid of meaning. Can you clarify your wish with a few examples?
Well, you've noticed yourself how many times the same ideas and questions show up on python-ideas, and how often people think they're the first ones to come up with it. You've also noted that there are more productive problems that people interested in contributing could solve. ISTM that there may be an opportunity to kill two birds with one stone in that.

Specifically, I'd suggest starting by putting together a wishlist and a do-not-want-list from some of the core devs and putting it in a prominent place on python.org. That should be fairly easy, and if it doesn't seem to be getting the amount of traffic that it would need to succeed there are a number of good ways to tie it in to other venues- adding tickets to the bug tracker, putting it in a newsletter, having this list spit back an email mentioning it whenever someone starts a new thread, mentioning it on slashdot, etc. It might also be a good way to take advantage of the sprints board, by specifically asking groups that have done successful sprints in the past to look at these ideas and see if they can come up with good ways to solve them. None of that requires a huge outlay of cash or resources.

If this was successful, it might be a good idea to look at providing some in-Python support for those working on the wishlist items. With the hg transition already underway it seems like this should be fairly easy- just create an hg repo for the project in question and link it to a page on PyPI. Depending on the size of the project, amount of interest, timescale, and stage of maturity, development discussion could take place either on the wiki, here, stdlib-sig, in their own google group, etc. Again, nothing requiring substantial outlay or time. The only investment required would be the effort of marketing the list as a whole.
From there, it would just be a question of what direction to take. I can envision a lot of projects like this or Raymond Hettinger's idea for a stats module eventually seeing inclusion, but there are also a lot of possible tools where maintaining a relationship similar to the Apache Foundation and its projects might be for the best.
I suspect it goes without saying, but I'd be happy to help out with this, and especially with PyCon coming up it's a good time to put many eyes on problems like these.

Geremy Condra
On Tue, Mar 1, 2011 at 4:23 PM, geremy condra <debatem1@gmail.com> wrote:
Well, you've noticed yourself how many times the same ideas and questions show up on python-ideas, and how often people think they're the first ones to come up with it. You've also noted that there are more productive problems that people interested in contributing could solve. ISTM that there may be an opportunity to kill two birds with one stone in that.
Specifically, I'd suggest starting by putting together a wishlist and a do-not-want-list from some of the core devs and putting it in a prominent place on python.org. That should be fairly easy, and if it doesn't seem to be getting the amount of traffic that it would need to succeed there are a number of good ways to tie it in to other venues- adding tickets to the bug tracker, putting it in a newsletter, having this list spit back an email mentioning it whenever someone starts a new thread, mentioning it on slashdot, etc. It might also be a good way to take advantage of the sprints board, by specifically asking groups that have done successful sprints in the past to look at these ideas and see if they can come up with good ways to solve them. None of that requires a huge outlay of cash or resources.
If this was successful, it might be a good idea to look at providing some in-Python support for those working on the wishlist items. With the hg transition already underway it seems like this should be fairly easy- just create an hg repo for the project in question and link it to a page on PyPI. Depending on the size of the project, amount of interest, timescale, and stage of maturity development discussion could take place either on the wiki, here, stdlib-sig, in their own google group, etc. Again, nothing requiring substantial outlay or time. The only investment required would be the effort of marketing the list as a whole.
From there, it would just be a question of what direction to take. I can envision a lot of projects like this or Raymond Hettinger's idea for a stats module eventually seeing inclusion, but there are also a lot of possible tools where maintaining a relationship similar to the Apache Foundation and its projects might be for the best.
I suspect it goes without saying, but I'd be happy to help out with this, and especially with PyCon coming up its a good time to put many eyes on problems like these.
Okay, I get it now. I don't know how many core developers are actually following python-ideas. If you are serious about putting time into this yourself, maybe the best thing you could do would be to start a draft for such a document, put it in the Wiki (with some kind of "draft" or "tentative" disclaimer) and post it to python-dev (as well as here) to get the core devs' attention. -- --Guido van Rossum (python.org/~guido)
On Tue, Mar 1, 2011 at 7:47 PM, Guido van Rossum <guido@python.org> wrote:
On Tue, Mar 1, 2011 at 4:23 PM, geremy condra <debatem1@gmail.com> wrote:
Well, you've noticed yourself how many times the same ideas and questions show up on python-ideas, and how often people think they're the first ones to come up with it. You've also noted that there are more productive problems that people interested in contributing could solve. ISTM that there may be an opportunity to kill two birds with one stone in that.
Specifically, I'd suggest starting by putting together a wishlist and a do-not-want-list from some of the core devs and putting it in a prominent place on python.org. That should be fairly easy, and if it doesn't seem to be getting the amount of traffic that it would need to succeed there are a number of good ways to tie it in to other venues- adding tickets to the bug tracker, putting it in a newsletter, having this list spit back an email mentioning it whenever someone starts a new thread, mentioning it on slashdot, etc. It might also be a good way to take advantage of the sprints board, by specifically asking groups that have done successful sprints in the past to look at these ideas and see if they can come up with good ways to solve them. None of that requires a huge outlay of cash or resources.
If this was successful, it might be a good idea to look at providing some in-Python support for those working on the wishlist items. With the hg transition already underway it seems like this should be fairly easy- just create an hg repo for the project in question and link it to a page on PyPI. Depending on the size of the project, amount of interest, timescale, and stage of maturity development discussion could take place either on the wiki, here, stdlib-sig, in their own google group, etc. Again, nothing requiring substantial outlay or time. The only investment required would be the effort of marketing the list as a whole.
From there, it would just be a question of what direction to take. I can envision a lot of projects like this or Raymond Hettinger's idea for a stats module eventually seeing inclusion, but there are also a lot of possible tools where maintaining a relationship similar to the Apache Foundation and its projects might be for the best.
I suspect it goes without saying, but I'd be happy to help out with this, and especially with PyCon coming up its a good time to put many eyes on problems like these.
Okay, I get it now. I don't know how many core developers are actually following python-ideas. If you are serious about putting time into this yourself, maybe the best thing you could do would be to start a draft for such a document, put it in the Wiki (with some kind of "draft" or "tentative" disclaimer) and post it to python-dev (as well as here) to get the core devs' attention.
It also might work as an appendix to the dev guide, though that's Brett's call.
On Wed, Mar 2, 2011 at 10:47 AM, Guido van Rossum <guido@python.org> wrote:
Okay, I get it now. I don't know how many core developers are actually following python-ideas. If you are serious about putting time into this yourself, maybe the best thing you could do would be to start a draft for such a document, put it in the Wiki (with some kind of "draft" or "tentative" disclaimer) and post it to python-dev (as well as here) to get the core devs' attention.
One specific idea I was considering along these lines when I get back to my PEP 0 fiddling was to separate the big pile of Deferred/Rejected/Withdrawn/Finished PEPs a bit more. In particular, the Deferred PEPs are generally things where the idea being proposed is seen as having some merit, but there are fundamental issues with the proposal which prevent it from being accepted. We could separate those out and expand them to cover "wish list" PEPs which spec out something we would like to do, but don't really have any idea as to how yet. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Guido van Rossum wrote:
It's been tried before without much success. I think it may have been a decade ago that Ka-Ping Yee created a pattern matching library that used function calls ... It didn't get much use.
That may largely be due to marketing issues. A potential user would have to know that Ka-Ping's module existed, or be sufficiently dissatisfied with the status quo to go looking for something like it. Probably it has never even occurred to many people familiar with REs from other contexts that there might be another way.

Whereas if there were a set of constructor functions available right at hand in the re module, prominently featured in the examples and reference docs, I suspect they would be used quite a lot. I know that *I* would use them all the time, whereas I've never been motivated enough to pull in another module to get this functionality.

Perhaps the best way to think of this is not as a complete replacement for traditional RE syntax, but as a set of convenience functions for building up REs out of smaller REs. It's not entirely straightforward to do that correctly, taking into account escaping, operator precedence, etc., so having some functions available for it makes a lot of sense. They would make it much easier to write readable code involving complicated REs. Since we're a community of people who believe that "readability counts", there shouldn't be any argument that this is a desirable goal.
On the third hand, I could see this as an area where a pure library-based approach will always be doomed, and where a proposal to add new syntax would actually make sense.
I don't think new syntax is necessary -- functions are quite adequate for the task. But they need to be available right at your fingertips when you're working with REs. Having to seek out and obtain a third party library is too high a barrier to entry. -- Greg
Greg Ewing wrote:
Guido van Rossum wrote:
It's been tried before without much success. I think it may have been a decade ago that Ka-Ping Yee created a pattern matching library that used function calls ... It didn't get much use.
That may largely be due to marketing issues. A potential user would have to know that Ka-Ping's module existed, or be sufficiently dissatisfied with the status quo to go looking for something like it. Probably it has never even occurred to many people familiar with REs from other contexts that there might be another way.
If someone wants to experiment with these things, I suggest you use mxTextTools' tagging engine as basis: http://www.egenix.com/products/python/mxBase/mxTextTools/ It provides a really fast matching machine which can be programmed from Python using simple tuples. There are already a few libraries that use it as basis for e.g. grammar-based parsing. It's flexible enough for many different kinds of parsing approaches, can parse a lot more than what you can do with REs and would also allow creating toy-language implementations that implement parsing in ways different than REs. We've used it to parse HTML (including broken HTML), XML, custom macro languages similar to the Excel VBA macros, RTF, various templating languages, etc. The BioPython project uses it to parse genome data. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 02 2011)
Mike Meyer wrote:
There are any number of languages with readable pattern matching - Icon, Snobol and REXX all come to my mind. Searching pypi for "snobol" reveals two snobol string matching libraries, and I found one on the web based on icon.
Possibly we should investigate adding one of those to the standard library, along with a cross-reference from the regexp documentation?
I've only checked out snopy: http://snopy.sourceforge.net/user-guide.html

As far as I can tell, that's far from ready for production, and it looks like it hasn't been updated since 2002.

I am interested in string-rewriting rules, Markov algorithms and the like, so speaking in the abstract, +1 on the concept. But concretely, I don't think the standard library is the place for such experiments. I think that somebody would need to develop a good quality pattern matcher which gets good real-world testing before it could be considered for the standard library.

-- Steven
On 2/27/2011 7:14 PM, Andy Buckley wrote:
usertokens = userstr.split([",", "_", "-"])
re beginner here; I let IDLE tell me the arg order:
>>> import re; re.split('[,_-]', 'a_b,c-d')
['a', 'b', 'c', 'd']
Python-list is good for such questions. -- Terry Jan Reedy
participants (18)
- Andy Buckley
- Bruce Leban
- Carl M. Johnson
- geremy condra
- Greg Ewing
- Guido van Rossum
- Jesse Noller
- Ka-Ping Yee
- M.-A. Lemburg
- Mike Meyer
- MRAB
- Nick Coghlan
- Raymond Hettinger
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Tal Einat
- Terry Reedy