matching multiple regexs to a single line...

Alex Martelli aleax at aleax.it
Wed Nov 20 04:59:47 EST 2002


Alexander Sendzimir wrote:

> 
> # Alex, the reason I used sre is because I feel it is the right module to
> # use with Python 2.2.2. Re is a wrapper. However, my knowledge in this
> # area is limited and could stand to be corrected.

sre is an implementation detail, not even LISTED among the modules at:

http://www.python.org/doc/current/lib/modindex.html

Depending on an implementation detail, not documented except for a
mention in an "implementation note" in the official Python docs, seems
a weird choice to me -- and I'm not sure what other knowledge except
what can easily be obtained by a cursory scan of Python's docs is
needed to show that.


> # I should have stated in my last post, that if speed is an issue,
> # then I might not be coding in Python but rather a language that
> # compiles to a real processor and not a virtual machine.

If you had mentioned that, I would have noticed that this is anything but 
obvious when regular expression matching is a substantial part of the
computational load: when the process is spending most of its time in the re 
engine, the quality of said engine may well dominate other performance
issues.  That's quite an obvious thing, of course.


> # I will say that what I don't like about the approach you propose is that
> # it throws information away. Functionally, it leads to brittle code which
> # is hard to maintain and can lead to errors which are very hard to track
> # down if one is not familiar with the implementation.

I suspect it would be possible to construct utterances with which I
could disagree more thoroughly than I disagree with this one, but
it would take some doing.  I think that, since re patterns support
the | operator, making use of that operator "throws away" no 
information whatsoever and hence introduces no brittleness.
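A minimal sketch of the or-joining under discussion, not the code from the earlier post (which is not quoted in this message). The pattern list and the name which_pattern are made up for illustration, and the patterns are assumed to define no capturing groups of their own:

```python
import re

# Hypothetical patterns; none of them defines a capturing group of its own
# ((?:...) is a non-capturing group).
patterns = [r"abcd(?:cd){1,4}", r"ab", r"zh192"]

# Wrap each pattern in exactly one capturing group and join with "|".
joined = re.compile("|".join("(%s)" % p for p in patterns))

def which_pattern(line):
    """Return the index in `patterns` of the alternative that matched, or None."""
    mo = joined.match(line)
    if mo is None:
        return None
    # One group per alternative, so lastindex (1-based) identifies the pattern.
    return mo.lastindex - 1
```

Every original pattern is still present, unchanged, inside the joined expression, and the group number says which one matched.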


> # This now said, yours is definitely a fast approach and under the right
> # circumstances would be very useful. In my experience, clear writing
> # takes precedence over clever implementation (most of the time). There
> # are exceptions.

I do agree with this, and I think my coding is quite clear for
anybody with a decent grasp of regular expressions -- and people
WITHOUT such a grasp should simply stay away from RE's (for
production use) until they've acquired said grasp.


> # As for the code itself, the lastindex method counts groups. If
> # there are groups within any of the regular expressions defined, then

Really?! Oh, maybe THAT was why I had this in my post:

> This doesn't work if the patterns define groups of their own

which you didn't even have the decency to *QUOTE*?!  Cheez... are
you so taken up with the "cleverness" of reducing your posts'
readability by making most of them into comments, that ordinary
conventions of Usenet, such as reading what you're responding to,
and quoting relevant snippets, are forgotten...?!

> # lastindex is non-linear with respect to the intended expressions.
> # So, it becomes very difficult to determine which outermost expression
> # lastindex refers to. Perhaps you can see the maintenance problems
> # arising here?

No, I always say "this doesn't work" when referring to something
I can't see *ANY* problems with -- doesn't everybody?  [Not even
worth an emoticon, and I can't find one for "sneer", anyway...]

> # I've labelled the groups below for reference.

If instead of "labeling" in COMMENTS, you took the tiny trouble
of studying the FUNDAMENTALS of Python's regular expressions, and
named groups in particular, you might be able to see how to "label"
*IN THE EXPRESSION ITSELF*.  As I continued in my post which you
didn't bother quoting,

> a slightly more sophisticated approach can help -- use named
> groups for the join...

I may as well complete this (even though, on past performance,
you may simply ignore this, not quote anything, and post a
"huge comment" re-saying what I already said -- there MIGHT be
some more normal people following this thread...:-): if the
re-patterns you're joining may in turn contain named groups,
you need to make the outer grouping naming unique, by any of
the usual "naming without a namespace" idioms such as unique
prefixing.  Usual generality/performance tradeoffs also apply.

Most typically, documenting that the identifiers the user
passes must not be ALSO used as group names in the patterns
the user also passes will be sufficient -- doing things by
sensible convention rather than by mandated fiat is Pythonic.
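A sketch of the named-group variant with unique prefixing. The rule names, the "outer_" prefix, and the function identify are all assumptions chosen for illustration; the only convention relied on is that user-supplied patterns never use that prefix for their own group names:

```python
import re

# Hypothetical (identifier, pattern) pairs; patterns may contain their own groups.
rules = [
    ("long_form", r"abcd(cd){1,4}"),
    ("optional", r"abcd(cd)?"),
    ("short", r"ab"),
]

# Assumed convention: user patterns never name a group with this prefix.
PREFIX = "outer_"

joined = re.compile(
    "|".join("(?P<%s%s>%s)" % (PREFIX, name, pat) for name, pat in rules)
)

def identify(line):
    """Return the identifier of the alternative that matched, or None."""
    mo = joined.match(line)
    if mo is None:
        return None
    # Only the alternative that matched has a non-None outer group.
    for key, value in mo.groupdict().items():
        if key.startswith(PREFIX) and value is not None:
            return key[len(PREFIX):]
    return None
```

Here the label lives in the expression itself, so inner groups in the user's patterns no longer disturb the lookup.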

In many cases it may not be a problem for client code to avoid
using groups in the RE patterns, and then the trivially simple
solution I gave last time works -- if you want, you can check
whether that's the case, via len(mo.groups()) where mo is the
match object you get, and fall back to the more sophisticated
approach, or raise a suitable exception, etc, otherwise.
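One way that check might look, as a sketch only. The helper names join_simple and match_index are hypothetical, and raising ValueError is just one of the possible reactions mentioned above:

```python
import re

def join_simple(patterns):
    """Or-join patterns, wrapping each one in a single capturing group."""
    return re.compile("|".join("(%s)" % p for p in patterns))

def match_index(joined, n_patterns, line):
    """Return the 0-based index of the matching pattern, or None.

    Raises ValueError if the patterns turn out to define groups of their own,
    since the simple lastindex arithmetic is then unreliable.
    """
    mo = joined.match(line)
    if mo is None:
        return None
    # len(mo.groups()) counts every group in the joined pattern, so it
    # exceeds n_patterns as soon as any supplied pattern has its own group.
    if len(mo.groups()) != n_patterns:
        raise ValueError("patterns must not define groups of their own")
    return mo.lastindex - 1
```

The same count is available before any matching takes place as the groups attribute of the compiled pattern, which allows the check to be done once at construction time instead.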

In some other cases client code may not only need to have no
constraints against using groups in the RE patterns, but also
want "the match-object" as an argument to the action code.  In
this case, an interesting approach is to synthesize a polymorphic
equivalent of the match object that would result without the
or-joining of the original patterns -- however, the tradeoffs
in terms of complication, performance, and generality, are a bit 
different in this case.


> # (abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
> # 1    2          3      4    5      6    7
> 
> # As you can see some expressions are 'identified' by more than one
> # group. This is not desirable and is difficult to maintain. If you

Nope, that's not a problem at all -- the re module deals just
fine with nested groups.  The issue is, rather, that since group
_numbering_ is flat, i.e. ignores nesting, the group numbers
corresponding to the outer-grouping depend on what other groups
are used in the caller-supplied re patterns.  Identifying which
is the widest (thus from the outermost set of groups) group that
participated in the match is easy (go backwards from lastindex
0 or more steps to the last group that participated in the match)
but unless you've associated a name with that group it's then
non-trivial to recover the associated outer-identifier.
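To make the flat numbering concrete, here is a sketch that uses a different technique from walking back from lastindex: it records, while building the joined pattern, which group number each outer group will receive. The function names are hypothetical, and the patterns are the ones from the quoted example:

```python
import re

# The five patterns from the quoted example, before or-joining.
patterns = [r"abcd(cd){1,4}", r"abcd", r"abcd(cd)?", r"ab", r"zh192"]

def outer_group_numbers(patterns):
    """Return the flat group number that each pattern's outer group receives."""
    numbers = []
    next_group = 1
    for p in patterns:
        numbers.append(next_group)
        # Skip the outer group itself plus any groups the pattern defines.
        next_group += 1 + re.compile(p).groups
    return numbers

outer = outer_group_numbers(patterns)
joined = re.compile("|".join("(%s)" % p for p in patterns))

def which_pattern(line):
    """Return the index of the pattern whose outer group took part in the match."""
    mo = joined.match(line)
    if mo is None:
        return None
    for i, g in enumerate(outer):
        if mo.group(g) is not None:
            return i
    return None
```

For these patterns the outer groups land on numbers 1, 3, 4, 6 and 7, with 2 and 5 being the nested (cd) groups. This bookkeeping assumes the supplied patterns do not refer to groups by number, since or-joining shifts those numbers.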


Alex



