Speeding up a regular expression

Tue Oct 23 19:09:21 EDT 2001

Michael Lerner <mlerner at umich.DELETEME.edu> wrote:

>Hi,
>
>I'm a relative newbie to Python, and I'm certainly no regular expression
>wizard.  I have a text file with a bunch of lines of the form
>
> 1-1.1 2.2 -3.3  4.4     5.5 -6.6
>
>That is, an integer, followed by six floats, with an arbitrary number of
>spaces in between the numbers.  Note that that arbitrary number can be
>zero, as is the case between the 1 and -1.1 above.
>
>There are also a bunch of other lines in the file.  I only want the ones
>that are like the line above.
>
>So, here's what I did:
>
>---- begin my schlocky code ----
>
>import re
>import string
>
>def gimmeWhatIWant(inputString):
>    myRe = re.compile(r"""
>        ^                    # start at the beginning of the line
>        (\s*)                # our leading spaces
>        (\d+\s*)             # the integer, which may or may not
>                             # have a trailing space!
>        (-?\d+\.\d+\s*){6,6} # all six floats MAY have spaces
>                             # after them
>        $                    # end at the end of the line
>        """, re.VERBOSE)
>
>    lines = string.split(inputString,"\n")
>    returnString = ""
>    for line in lines:
>        if myRe.match(line):
>            returnString = returnString + line + "\n"
>
>    return returnString
>
>---- end my schlocky code ----
>
>The thing is, this is slow when I run it on input strings with 6 or 7
>thousand lines.
>
>Any hints on how I could speed it up?
>
>One thing:  I think that replacing the string.split(...) call with
>inputString.split("\n") might speed things up a little. But, that's not
>where most of the time is spent and I'd like this to work with Python
>1.5.2 if possible.
>
>thanks,
>
>-michael

It really depends on how loosely you can identify the difference
between the lines you want and those you don't.

For example, would a line starting with a space and containing six
dots be an accurate enough test?

if line.startswith(' ') and line.count('.') == 6:
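
A minimal sketch of that looser test wrapped in a helper, in case it is
easier to try out that way. The name looks_like_data_line is made up
for illustration, and string methods such as startswith need Python 2.0
or later, so this form will not run on 1.5.2.

```python
def looks_like_data_line(line):
    # Hypothetical helper: a cheap test that never touches the regex engine.
    # One leading space plus exactly six dots (one per float) is assumed
    # to be enough to tell the data lines from everything else.
    return line.startswith(" ") and line.count(".") == 6


sample = " 1-1.1 2.2 -3.3  4.4     5.5 -6.6"
print(looks_like_data_line(sample))
print(looks_like_data_line("There are also other lines."))
```

Note that a test this loose also accepts something like " a.b.c.d.e.f.g",
so it only helps if no such lines turn up in the file.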

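However, it doesn't look like you need to capture any of the groups, so
removing the parentheses may improve the speed. (The repeated float
group still has to be grouped for the {6,6} to apply, but it can be
made non-capturing with (?:...).) Also, if you use 'match' you don't
need to start the pattern with '^', since match only ever matches at
the start of the string.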
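
Here is a rough sketch of the pattern with those two changes applied:
no capturing groups and no leading '^'. The names LINE_RE and
keep_matching_lines are invented for this example. Compiling the
pattern once at module level and building the result from a list are
extra assumptions of mine, not part of the suggestion above, and I have
not timed any of it against the original.

```python
import re

# Same pattern as the original, minus the capturing groups and the
# leading '^' (match() already starts at the beginning of the string).
LINE_RE = re.compile(r"""
    \s*                      # leading spaces
    \d+\s*                   # the integer, with optional trailing spaces
    (?:-?\d+\.\d+\s*){6}     # six floats, each with optional trailing spaces
    $                        # end at the end of the line
    """, re.VERBOSE)


def keep_matching_lines(input_string):
    # Collect the lines that match, then join them once at the end
    # rather than growing a string inside the loop.
    kept = [line for line in input_string.split("\n") if LINE_RE.match(line)]
    return "".join([line + "\n" for line in kept])


text = " 1-1.1 2.2 -3.3  4.4     5.5 -6.6\nsome other line\n"
print(keep_matching_lines(text))
```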

It would help to have a bigger sample of what you want to match and
also an example of what you do NOT want to match.

--
Dale Strickland-Clark
Riverhall Systems Ltd