
On Tue, 1 Mar 2011 08:18:43 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
> On Tue, Mar 1, 2011 at 3:15 AM, Guido van Rossum <guido@python.org> wrote:
> > On the third hand, I could see this as an area where a pure library-based approach will always be doomed, and where a proposal to add new syntax would actually make sense. Of course that still has the same problems due to release time and policy.
>
> I suspect one of the core issues isn't so much that regex syntax is arcane, ugly and hard to remember (although those don't help), but the fact that fully general string pattern matching is inherently hard to remember due to the wide range of options. There's a reason glob-style matching is limited to a couple of simple wildcard characters.
I disagree. Fully general string pattern matching has a few fundamental operations: sequence, alternation, and repetition. Modern regexp libraries have lots of features that provide shorthands for special cases of those. The "options" tend either to be things that can be duplicated by proper use of the three fundamental features, or to change the handling of newlines and string ends. Even things like greedy vs. non-greedy can be handled by defining those fundamental operations properly (e.g., define {m,n} as trying the repetition counts from m to n in that order, rather than just matching anywhere from m to n repetitions, so {n,m} and {m,n} would be the same match with different greediness).

In other words, the problem isn't that fully general string pattern matching is hard; it's that our regular expression language started from an academic tool of formal language and automata theory, and has grown features ad hoc since then. Worse yet, there are multiple implementations that differ slightly, some with multiple modes of behavior that also change the syntax.
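
To make the "shorthands" point concrete, here is a small sketch using Python's standard `re` module. It checks, on a handful of sample strings only, that a few common shorthands accept the same inputs as patterns spelled with nothing but sequence, alternation, and `*` repetition. The names `EQUIVALENTS`, `SAMPLES`, and `same_language` are made up for this illustration. The reversed `{n,m}` form above is a proposal, not existing `re` syntax, so the greediness part is modelled with ordered alternation instead.

```python
import re

# (shorthand, the same pattern using only sequence, alternation, and "*")
EQUIVALENTS = [
    (r"a+", r"aa*"),            # one or more = one, then zero or more
    (r"a?", r"(?:a|)"),         # optional = "a" or the empty string
    (r"a{2,3}", r"aa(?:a|)"),   # bounded repeat = sequence plus an optional
    (r"[abc]", r"(?:a|b|c)"),   # character class = alternation
]

# A spot check, not a proof: agreement is only tested on these strings.
SAMPLES = ["", "a", "aa", "aaa", "aaaa", "b", "ab", "ca"]


def same_language(p, q, samples):
    """True if p and q accept exactly the same strings among the samples."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )


for shorthand, expanded in EQUIVALENTS:
    print(shorthand, expanded, same_language(shorthand, expanded, SAMPLES))

# Greedy vs. non-greedy is just the order in which the counts are tried.
greedy = re.match(r"a{1,3}", "aaaa").group()
lazy = re.match(r"a{1,3}?", "aaaa").group()

# Ordered alternation shows the same effect: try 3, 2, 1 or try 1, 2, 3.
longest_first = re.match(r"(?:aaa|aa|a)", "aaaa").group()
shortest_first = re.match(r"(?:a|aa|aaa)", "aaaa").group()

print(greedy, lazy, longest_first, shortest_first)
```

Both the greedy and the lazy pattern accept the same set of strings; only the order in which candidate lengths are attempted differs, which is the behavior the reversed-bounds notation would express directly.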
> As far as code-based alternatives to regexes go, the one I see come up most often as a suggested, working alternative is pyparsing (although I've never tried it myself). For example:
> http://stackoverflow.com/questions/3673388/python-replacing-regex-with-bnf-o...
I played with an early version of the snobol library now in PyPI, and it worked well for what I tried. However, I don't think these will be generally successful, because they aren't more powerful than regex, just more readable. That winds up hurting them: writing a book about using them would be overkill, but the existence of such books for regexps favors regexps.

One of the more interesting features of pattern matching is backtracking. I.e., if a match fails, you start working backwards through the pattern until you find an element that has untried alternatives, go to the next alternative, and then start working forward again. Icon lifts that capability into the language proper, allowing for some interesting capabilities. I think the best alternative to replacing the regexp library would be new syntax to provide that facility, then building string matching on top of that facility.

<mike
-- 
Mike Meyer <mwm@mired.org>		http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
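
The backtracking described above can be approximated in today's Python with generators, which is roughly how one might imitate Icon's approach without new syntax. The following is only a minimal sketch: the helper names `lit`, `alt`, `seq`, and `full_match` are invented for this example and do not come from Icon, the snobol package, or any other library. Each pattern is a function that yields every end position it can reach, so asking a generator for its next value is what "going to the next untried alternative" amounts to.

```python
def lit(text):
    """Match a literal string; yields at most one end position."""
    def match(s, pos):
        if s.startswith(text, pos):
            yield pos + len(text)
    return match


def alt(*patterns):
    """Alternation: yield the results of each alternative, in order."""
    def match(s, pos):
        for p in patterns:
            yield from p(s, pos)
    return match


def seq(*patterns):
    """Sequence: match the first pattern, then the rest from where it ended.

    If the rest fails, the for loop resumes the first pattern's generator,
    which moves on to its next untried alternative. That is the backtracking.
    """
    def match(s, pos):
        if not patterns:
            yield pos
            return
        first, rest = patterns[0], seq(*patterns[1:])
        for mid in first(s, pos):
            yield from rest(s, mid)
    return match


def full_match(pattern, s):
    """True if some way of matching the pattern consumes all of s."""
    return any(end == len(s) for end in pattern(s, 0))


# ("ab" or "a") followed by "bc"
pattern = seq(alt(lit("ab"), lit("a")), lit("bc"))

# On "abc" the first alternative "ab" succeeds, "bc" then fails at "c",
# so the matcher backs up, tries "a" instead, and "bc" now succeeds.
print(full_match(pattern, "abc"))
print(full_match(pattern, "abbc"))
print(full_match(pattern, "ac"))
```

Nothing here is specific to strings: the same generator protocol works for any search where a later failure should send control back to an earlier choice point, which is the facility the paragraph above suggests building string matching on top of.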