[Python-ideas] New pattern-matching library (was: str.split with multiple individual split characters)
Mike Meyer
mwm at mired.org
Tue Mar 1 00:19:20 CET 2011
On Tue, 1 Mar 2011 08:18:43 +1000
Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Tue, Mar 1, 2011 at 3:15 AM, Guido van Rossum <guido at python.org> wrote:
> > On the third hand, I could see this as an area where a pure
> > library-based approach will always be doomed, and where a proposal to
> > add new syntax would actually make sense. Of course that still has the
> > same problems due to release time and policy.
> I suspect one of the core issues isn't so much that regex syntax is
> arcane, ugly and hard to remember (although those don't help), but the
> fact that fully general string pattern matching is inherently hard to
> remember due to the wide range of options. There's a reason glob-style
> matching is limited to a couple of simple wildcard characters.
I disagree. Fully general string pattern matching has a few
fundamental operations: sequence, alternation, and repetition. Modern
regexp libraries have lots of features that provide shorthands for
special cases of those. The "options" tend to either be things that
can be duplicated by proper use of the three fundamental features, or
for changing the handling of newlines and string ends. Even things
like greedy vs. non-greedy can be handled by defining those
fundamental operations properly (e.g., define {m,n} as trying the
repetition counts in order from m to n, rather than just matching
anywhere from m to n repetitions, so {n,m} and {m,n} would match the
same strings with different greediness).
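
To make that concrete, here is a minimal sketch in Python, not a
proposed API. The names lit, seq, alt, rep and first_match are made up
for illustration. Each pattern is a generator function that yields
every position where a match starting at pos could end, in the order
the alternatives are tried. The only "greediness" knob is the order in
which rep tries its counts.

```python
def lit(s):
    """Match the literal string s."""
    def match(text, pos):
        if text.startswith(s, pos):
            yield pos + len(s)
    return match

def seq(*patterns):
    """Sequence: match each pattern in turn."""
    def match(text, pos):
        if not patterns:
            yield pos
            return
        head, rest = patterns[0], seq(*patterns[1:])
        for mid in head(text, pos):
            yield from rest(text, mid)
    return match

def alt(*patterns):
    """Alternation: try each pattern in the order given."""
    def match(text, pos):
        for p in patterns:
            yield from p(text, pos)
    return match

def rep(pattern, first, last):
    """Repetition: try the counts in order from first to last.

    rep(p, 1, 3) tries 1, 2, 3 copies (shortest first, non-greedy);
    rep(p, 3, 1) tries 3, 2, 1 copies (longest first, greedy).
    """
    step = 1 if first <= last else -1
    def match(text, pos):
        for count in range(first, last + step, step):
            yield from seq(*[pattern] * count)(text, pos)
    return match

def first_match(pattern, text):
    """Return the end of the first match starting at 0, or None."""
    return next(pattern(text, 0), None)

lazy = rep(lit("a"), 1, 3)    # plays the role of {1,3}
greedy = rep(lit("a"), 3, 1)  # plays the role of {3,1}
print(list(lazy("aaab", 0)))    # [1, 2, 3]
print(list(greedy("aaab", 0)))  # [3, 2, 1]
```

Both patterns accept the same strings; they differ only in which match
is found first, which is all that greedy vs. non-greedy amounts to
here.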
In other words, the problem isn't that fully general string pattern
matching is hard, it's that our regular expression language started
from an academic tool of formal language and automata theory, and has
grown features ad hoc since then. Worse yet, there are multiple
implementations with slightly different syntaxes, some with multiple
behaviors that also change the syntax.
> As far as code-based alternatives to regexes go, the one I see come up
> most often as a suggested, working, alternative is pyparsing (although
> I've never tried it myself). For example:
> http://stackoverflow.com/questions/3673388/python-replacing-regex-with-bnf-or-pyparsing
I played with an early version of the snobol library now in pypi, and
it worked well for what I tried. However, I don't think these will be
generally successful, because they aren't more powerful than regexps,
just more readable. That winds up hurting them: writing a book about
using them would be overkill, but the existence of such books for
regexps favors regexps.
One of the more interesting features of pattern matching is
backtracking: if a match fails, you start working backwards
through the pattern until you find an element that has untried
alternatives, go to the next alternative, and then start working
forward again. Icon lifts that capability into the language proper -
allowing for some interesting capabilities. I think the best
alternative to replacing the regexp library would be new syntax to
provide that facility, then building string matching on top of that
facility.
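
As a rough illustration of what such a facility does, here is a sketch
in Python. It is not Icon's semantics and not a syntax proposal; the
names backtrack, literal, pick_x and pick_y are hypothetical. Each step
is a function from a state to an iterable of possible next states, and
the driver keeps a stack of untried alternatives so it can work
backwards on failure and then forward again.

```python
def backtrack(steps, start):
    """Yield every final state reachable by running steps in order.

    Each step takes a state and returns an iterable of possible next
    states (its alternatives). An empty iterable means failure.
    """
    if not steps:
        yield start
        return
    # stack[i] iterates over the untried alternatives of steps[i].
    stack = [iter(steps[0](start))]
    while stack:
        try:
            state = next(stack[-1])   # take the next untried alternative
        except StopIteration:
            stack.pop()               # none left: work backwards one step
            continue
        if len(stack) == len(steps):
            yield state               # every step succeeded
        else:
            # work forward again from the new state
            stack.append(iter(steps[len(stack)](state)))

# String matching built on top: the state is a position in the text.
def literal(text, *choices):
    """Step matching any one of the given literals at the position."""
    def step(pos):
        return (pos + len(c) for c in choices if text.startswith(c, pos))
    return step

text = "abcd"
pattern = [literal(text, "a", "ab"), literal(text, "cd")]
# "a" matches first, "cd" then fails, so the driver backs up to "ab".
print(next(backtrack(pattern, 0), None))  # 4

# The same driver on a non-string problem: x + y == 5 with x, y in 1..3.
def pick_x(state):
    return ((x,) for x in range(1, 4))

def pick_y(state):
    return (state + (y,) for y in range(1, 4) if state[0] + y == 5)

print(list(backtrack([pick_x, pick_y], ())))  # [(2, 3), (3, 2)]
```

Nothing in backtrack knows about strings; the matching part is just
one set of steps layered on the general mechanism.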
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org