regexp speed concerns

Thu Jan 9 19:11:44 EST 2003

Greetings,

First off, I'm not looking to start a general perl vs. python
discussion here, I'm just someone who has used perl in the past in my
normal work and decided to try python out for fun.  Needless to say,
I'm new to python and don't really have any in-depth experience with
either language.  I have a slight feeling that my questions below may
be misguided or irrelevant.

I had an original perl script that, in addition to several other
functions, would parse simulation logfiles for specific errors that
indicate failures.

Just for fun, I converted the script from perl to python as a learning
experience.  One of the first things I noticed was about a 3x increase
in the average time for a parse to take place between my original perl
script and my first attempt at a python replacement.  I managed to get
that down to about a 2x multiplier when I replaced "re" calls with the
"regex" equivalent (got that idea from a post somewhere, can't
remember).  I parse files that are anywhere from a few hundred to a
few million lines of text, and up to 300 differet files.  As I'm
parsing them one line at a time via a for/in loop, I doubt file size
really matters.  If you think I should look at improving the way I
handle file I/O instead of regular expressions, let me know.

I have a few questions:  
1).  Should such a slowdown be expected?  The expression that I'm
searching for is pretty simple (whitespace or start of line, followed
by 1 of 3 possible strings), but I was suprised to see my python code
slow down so visibly.  I've been unable to find a clear answer
anywhere on which lang had faster regexp capabilities, which leads me
to believe that the result varies on the situation.
2).  Given my assumption that either language can be faster than the
other given the application, are there any 'rules of thumb' that I can
follow to make sure I'm making the python expressions behave as
quickly as possible?
3).  Should I be using "match" instead of "search"?  I know that the
algorithms used under the hood differ for each, but I'm not sure which
one is the best to use under these conditions.

The code below is paraphrased...the actual code much bulkier, but the
key lines are captured below.  The "First Python Attempt" may not be
syntatically correct, as it was replaced by later attempts and I don't
have it handy anymore.

Original Perl:
--------------
for $line in (<fh>) {
  if (!$errFound) {
    if ($line =~ m|[\s^](ERR|err|FAIL)|) {
      $errFound = 1;
    }
  }
}

First Python Attempt:
---------------------
import re
errSO = re.compile('[\s^](ERR|err|FAIL)')
for line in fh:
  if not finishFound:
    match = errSO.search(line)
    if match != None: finishFound = 1

Latest Python Attempt:
----------------------
import regex
errSO = regex.compile('\( \|^\)\(ERR\|err\|FAIL\)')
for line in fh:
  if not finishFound:
    if doneSO.search(line) >= 0: finishFound = 1
  if not errFound:
    if errSO.search(line) >= 0: errFound = 1

Many thanks,
Jon