[Python-3000] Alternative to standard regular expressions

Sat Sep 27 22:16:32 CEST 2008

Unsure if this is the right place to voice my thoughts on such a thing,
but given the idealism of python (particularly as an antithesis to many of
the ideas of perl), after trying to fix a broken perl script late at night,
it occurred to me that regular expressions are somewhat unpythonic.  I
actually find the python 're' module, although more versatile than regular
expressions in perl, something that I always have to refer to the manual
for, in spite of the number of times I've used it.  In other words, I'm
tempted to stretch our beloved term "unpythonic" to regular expressions.
This is rare for a small python module.

So I thought it's time to start something new, perhaps as a python module.

I've googled around to see if there are any attempts at an alternative out
there, and found nothing, although some people have written very good
articles about how regular expressions are a problem in a number of ways:

1) They look horrible.  Like line noise.  Each character is a functional
unit, meaning something that would take a paragraph to describe is reduced
to a small number of characters.  Given that programmers tend to spend more
time thinking than typing, I don't see any advantage to this.

2) They can fail in subtle ways.  Exceptional cases can emerge where an
expression which works in 99% of cases starts losing characters whose
possibility was missed by the author.

3) They can very quickly become rather long (check the expression for an
email address in the back of the O'Reilly book 'Mastering Regular
Expressions').

4) The use of multi-line switches and other trailing-end characters
complicates things further.
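
Point 2 can be illustrated with a short sketch.  The pattern and the
addresses below are made up for illustration; the pattern is deliberately
naive and is not taken from any real program.

```python
import re

# A naive, hypothetical pattern for "something@something.com".
naive_email = re.compile(r"\w+@\w+\.com")

# Works for the simple case the author had in mind.
simple = naive_email.search("alice@example.com").group()

# Silently drops "first." because '.' is not matched by \w,
# so the search just starts matching later in the string.
dotted = naive_email.search("first.last@example.com").group()

print(simple)  # alice@example.com
print(dotted)  # last@example.com
```

No error is raised in the second case; the match simply comes back shorter
than the input.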

One of the great things about python is that its string, slice, and
split/join functions mean that I rarely use regular expressions in python.
In fact, I try to avoid them.  But a more pythonic matching and substitution
system could be a great thing.
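
As a small illustration of doing without regular expressions, here is a
sketch that pulls a timestamp apart using only split, join, and a slice.
The input string and variable names are made up for the example.

```python
# Hypothetical input line, loosely modelled on a mail header date.
stamp = "2008-09-27 22:16:32 CEST"

# split() with no arguments breaks on runs of whitespace.
date, time, zone = stamp.split()

# split("-") and join() reorder the date fields without a pattern.
year, month, day = date.split("-")
european = "/".join([day, month, year])

# A slice keeps just the hours and minutes.
short_time = time[:5]

print(european)    # 27/09/2008
print(short_time)  # 22:16
```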

The first thing that occurred to me in trying to imagine what an easier to
use alternative would look like is that they're the wrong way round: the
functional characters - the things that actually do things - are escaped,
while the match strings written in text are the default.  Unless you're
trying to write a '/' or '\', that is, which you have to escape (carefully,
if you're writing something exposed to the internet and you don't want your
server hosed by a hacker).

In other words, it is the match string which should be treated as special,
and the special functions which should be the norm.  So, for an example
first foray into this idea (I'm making this up as I go along, I should point
out!):

Instead of:
  /\d+hello/

How about (explanation of syntax to follow):

 boolean = match(input, "oneormore(digit).one('hello')")

I'm using a '.' to separate lexical units here.  The specifying functions
indicate how many times or under what circumstances the unit is matched, and
within the brackets are classes representing what needs to be matched.
'digit' represents '\d' in this case, and a string is just that.
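
A minimal sketch of how this could be prototyped on top of the existing 're'
module follows.  It is only one possible reading of the idea: it uses plain
python function calls rather than parsing the string syntax above, and every
name in it (DIGIT, one, oneormore, seq, match) is hypothetical.

```python
import re

# Hypothetical stand-in for the proposal's 'digit' class.
DIGIT = r"\d"

def one(text):
    """Match the literal text once; re.escape handles special characters."""
    return re.escape(text)

def oneormore(unit):
    """Match the given unit one or more times."""
    return "(?:%s)+" % unit

def seq(*units):
    """Join units in order; plays the role of '.' in the proposal."""
    return "".join(units)

def match(text, pattern):
    """Return True if the pattern occurs anywhere in text, like /.../ in perl."""
    return re.search(pattern, text) is not None

pattern = seq(oneormore(DIGIT), one("hello"))
print(match("abc123hello", pattern))  # True
print(match("hello", pattern))        # False
```

Because one() escapes its argument, the literal text never needs manual
backslashes, which is the inversion described above.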

Taking it a bit further:

  /\d{1,3}hello/

is replaced by

  boolean = match(input, "range(digit, (1,3)).one('hello')")
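
A bounded repeat could be sketched the same way, again as an assumption
rather than a finished design.  The helper is called between() here only to
avoid shadowing python's built-in range(); all names are hypothetical.

```python
import re

# Hypothetical stand-in for the proposal's 'digit' class.
DIGIT = r"\d"

def one(text):
    """Match the literal text once, escaped so it is taken verbatim."""
    return re.escape(text)

def between(unit, bounds):
    """Match the unit at least bounds[0] and at most bounds[1] times."""
    low, high = bounds
    return "(?:%s){%d,%d}" % (unit, low, high)

def match(text, pattern):
    """Return True if the pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None

pattern = between(DIGIT, (1, 3)) + one("hello")
print(pattern)                      # (?:\d){1,3}hello
print(match("42hello", pattern))    # True
print(match("hello", pattern))      # False
```

Note that, like the unanchored perl version, this still reports a match for
"12345hello", since the last three digits followed by 'hello' satisfy it.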

Ok, so what about substitution..

  s/.*(hello).*/$1/

  result = substitute(input, "many(char)|one('hello')|many(char)", "match(0)")

Instead of dots, matches which should be captured are contained between pipe
symbols.  I'm still having an argument with myself as to whether some sort of
function/keyword should be used instead.  I dunno.  That's why I emailed you
guys :-)
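
For what it's worth, here is a rough sketch of the substitution case using a
capture() function in place of the pipe symbols.  It assumes that "match(0)"
means the first captured group (perl's $1); that reading, and every name in
the block, is an assumption made for illustration.

```python
import re

# Hypothetical stand-in for the proposal's 'char' class (any character).
ANY_CHAR = "."

def one(text):
    """Match the literal text once, escaped so it is taken verbatim."""
    return re.escape(text)

def many(unit):
    """Match the unit zero or more times."""
    return "(?:%s)*" % unit

def capture(unit):
    """Mark a unit as captured; stands in for the |...| delimiters."""
    return "(%s)" % unit

def substitute(text, pattern, group_index):
    """Replace the first match with a captured group, counted from 0."""
    return re.sub(pattern, lambda m: m.group(group_index + 1), text, count=1)

pattern = many(ANY_CHAR) + capture(one("hello")) + many(ANY_CHAR)
print(substitute("say hello world", pattern, 0))  # hello
```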

I'm going to have a bigger think about this tomorrow, but I think it could be
a great feature.

Cheers!  (and thanks for a great language),

Giles