thoughts on regular expression improvements

I've been doing a lot of RE hacking lately, and some possible improvements suggest themselves.

1. Multiple occurrences of a named group

Right now, you can compose REs with

    x = re.compile("...")
    y = re.compile("..." + x.pattern + "...")

But if x contains named groups, you run into trouble if you have something like

    z = re.compile("..." + x.pattern + "..." + x.pattern + "...")

which can easily happen if x could occur at various places in z. The issue is that a named group is only allowed once, which isn't a bad error-prevention mechanism, but it would be nice if it could occur more than once (in alternative subexpressions), perhaps enabled by another RE flag.

2. Easier composition

Writing

    y = re.compile("..." + x.pattern + "...")

seems a tad grotty, to use a term from my childhood, and affords the RE engine no purchase on the composition, which can be an issue if the flags for x are different from the flags for y. If the first argument to re.compile could be a tuple or list, you could write

    y = re.compile(["...", x, "..."])

and the engine could see that "..." is a string, and that x is an RE, and could inspect x as necessary.

3. Edit distances

The RE engine TRE (http://laurikari.net/tre/about/) supports fuzzy matching of strings, using edit distances. One can write an expression like

    "(total){~2}"

which would match any string that's "total" with no more than two edit errors. You can also specify insertion, deletion, and substitution limits separately with "+", "-", and "#".

That would be nice to have...

Bill
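Points 1 and 2 above can be sketched with the current stdlib re module. This is only an illustration under stated assumptions: compose() and _wrap() are hypothetical helpers, not part of re, and the example assumes str patterns and carries over only the IGNORECASE, MULTILINE, and DOTALL flags, using scoped inline flags (Python 3.6 or later).

```python
import re

# Point 1: reusing a pattern that contains named groups fails today.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
duplicate_error = None
try:
    re.compile(date.pattern + " to " + date.pattern)
except re.error as exc:
    # The group names "year" and "month" would be defined twice.
    duplicate_error = str(exc)

# Point 2: a hypothetical helper that accepts a list of strings and
# compiled patterns, roughly what re.compile([...]) might do.
_PATTERN_TYPE = type(re.compile(""))
_SCOPED_FLAGS = ((re.IGNORECASE, "i"), (re.MULTILINE, "m"), (re.DOTALL, "s"))


def _wrap(pattern):
    """Wrap a compiled pattern in a group that carries its own flags."""
    letters = "".join(
        letter for flag, letter in _SCOPED_FLAGS if pattern.flags & flag
    )
    # With no flags set this is a plain non-capturing group "(?:...)".
    return "(?" + letters + ":" + pattern.pattern + ")"


def compose(parts, flags=0):
    """Compile a sequence of strings and compiled patterns as one pattern."""
    pieces = [
        _wrap(part) if isinstance(part, _PATTERN_TYPE) else part
        for part in parts
    ]
    return re.compile("".join(pieces), flags)


word = re.compile(r"[a-z]+", re.IGNORECASE)
greeting = compose(["^", word, r", (?P<name>", word, r")!$"])
m = greeting.match("Hello, World!")
```

Here the outer pattern is compiled without IGNORECASE, yet each embedded copy of word still matches case-insensitively because its flag travels with it. Named groups inside a reused sub-pattern would still collide, so this does nothing for point 1.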

Dirkjan Ochtman <dirkjan@ochtman.nl> wrote:
> On Fri, May 6, 2011 at 21:11, Bill Janssen <janssen@parc.com> wrote:
>> I've been doing a lot of RE hacking lately, and some possible improvements suggest themselves.
> Have you looked at the regex module?

From Python 1.4? Not in a long time...

Bill

Dirkjan Ochtman <dirkjan@ochtman.nl> wrote:
> On Fri, May 6, 2011 at 21:11, Bill Janssen <janssen@parc.com> wrote:
>> I've been doing a lot of RE hacking lately, and some possible improvements suggest themselves.
> Have you looked at the regex module?

Ah, you mean the PyPI "regex". Looks like it has "branch reset", which might support my #1? Using the same group name multiple times? I don't see fuzzy matches, or support for composition, though.

Bill
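For reference, the edit-distance idea behind an expression like "(total){~2}" can be illustrated in plain Python. This is a minimal sketch of Levenshtein distance, not TRE's algorithm: it compares two whole strings rather than searching inside a larger text, and edit_distance() and fuzzy_equal() are hypothetical names.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca from a
                cur[j - 1] + 1,            # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def fuzzy_equal(candidate, target, max_errors):
    """True if candidate is within max_errors edits of target."""
    return edit_distance(candidate, target) <= max_errors


close = fuzzy_equal("totla", "total", 2)  # two substitutions apart
far = fuzzy_equal("tl", "total", 2)       # three deletions apart
```

Separate limits for insertions, deletions, and substitutions, as TRE allows, would need the three costs tracked individually instead of being folded into one total.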

On Fri, May 6, 2011 at 22:32, Bill Janssen <janssen@parc.com> wrote:
> Ah, you mean the PyPI "regex". Looks like it has "branch reset", which might support my #1? Using the same group name multiple times?
> I don't see fuzzy matches, or support for composition, though.

I might've been more specific: I think MRAB is working on regex as a playground for new regex-module things (and potentially a replacement for stdlib re), so it might be a good place to implement these kinds of things or discuss them.

Cheers,

Dirkjan

participants (2): Bill Janssen, Dirkjan Ochtman