Python and regexp efficiency.. again.. :)

Mon Dec 13 05:05:24 EST 1999

[Markus Stenberg, with a series of too-vague descriptions of a
 difficult problem]

Try posting the actual pattern and some actual data -- when you're getting
into regexp nano-tuning, no amount of talking around the data is going to
yield a good enough approach.

> ...
> One order of magnitude optimization gain was received by writing
> a specialized regexp optimization tool - as the regexps are mostly
> of type
>
> 		^commonstringuniquetail
> 		^commonstringanothertail
>
> optimizing that to ^commonstring(?:uniquetail|anothertail) seemed
> to make things much snappier ...

Python's engine does a backtracking search, trying each alternative one at a
time; lots of alternatives in a rarely-matching pattern means lots of
backtracking.  Factoring out a common prefix is a fine idea that saves many
redundant comparisions.  A "traditional DFA" engine is enormously better
suited to this task speedwise, because it does no backtracking.  I don't
know of a TDFA engine available from Python (maybe pregex (search for it) --
can't remember).

> ...
> Hmm.. actually, I'm not sure if re.match('^.*(foo|bar|baz)') is as
> fast as re.search('(foo|bar|baz)').

It's impossible to say without seeing the pattern and typical data.  A
leading .* initially matches the entire string, then backs off one character
at a time trying the rest of the pattern all over again; .search without .*
moves *forward* one character at a time instead.  So it depends on (at
least) which end of your strings is more likely to contain a hit.

Best general hint:  try to find a *simple* regexp R such that if R doesn't
match, neither can your full-blown regexp.  If R is much cheaper to run, the
"extra" time to weed out false positives can be an unboundedly huge bargain.

Also, if the (conceptual) fields in the lines are separated by some
character, string.split() the lines and throw the uniniteresting fields away
rather than waste time matching them away in the regexp.

the-fastest-regexp-is-the-empty-regexp-ly y'rs  - tim