Speeding up a regular expression
Dale Strickland-Clark
dale at riverhall.NOTHANKS.co.uk
Tue Oct 23 19:09:21 EDT 2001
Michael Lerner <mlerner at umich.DELETEME.edu> wrote:
>Hi,
>
>I'm a relative newbie to Python, and I'm certainly no regular expression
>wizard. I have a text file with a bunch of lines of the form
>
> 1-1.1 2.2 -3.3 4.4 5.5 -6.6
>
>That is, an integer, followed by six floats, with an arbitrary number of
>spaces in between the numbers. Note that that arbitrary number can be
>zero, as is the case between the 1 and -1.1 above.
>
>There are also a bunch of other lines in the file. I only want the ones
>that are like the line above.
>
>So, here's what I did:
>
>---- begin my schlocky code ----
>
>import re
>import string
>
>def gimmeWhatIWant(inputString):
>    myRe = re.compile(r"""
>        ^                     # start at the beginning of the line
>        (\s*)                 # our leading spaces
>        (\d+\s*)              # the integer, which may or may not
>                              # have a trailing space!
>        (-?\d+\.\d+\s*){6,6}  # all six floats MAY have spaces
>                              # after them
>        $                     # end at the end of the line
>        """, re.VERBOSE)
>
>    lines = string.split(inputString,"\n")
>    returnString = ""
>    for line in lines:
>        if myRe.match(line):
>            returnString = returnString + line + "\n"
>
>    return returnString
>
>---- end my schlocky code ----
>
>The thing is, this is slow when I run it on input strings with 6 or 7
>thousand lines.
>
>Any hints on how I could speed it up?
>
>One thing: I think that replacing the string.split(...) call with
>inputString.split("\n") might speed things up a little. But, that's not
>where most of the time is spent and I'd like this to work with Python
>1.5.2 if possible.
>
>thanks,
>
>-michael
It really depends on how loosely you can identify the difference
between the lines you want and those you don't.
For example, would a line starting with a space and containing six
dots be an accurate enough test?
if line.startswith(' ') and line.count('.') == 6:
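
As a rough sketch of that idea, the test can be wrapped in a small helper
and tried against the sample line. The name looks_like_data_line is made
up for illustration, and string methods such as startswith need Python 2.0
or later, so this would not run on 1.5.2 as written.

```python
def looks_like_data_line(line):
    # Cheap heuristic: a leading space and exactly six dots.
    # It does not check that the fields are actually numbers.
    return line.startswith(' ') and line.count('.') == 6


sample = " 1-1.1 2.2 -3.3 4.4 5.5 -6.6"
false_positive = " a.b.c.d.e.f.g"  # hypothetical non-data line, also six dots

print(looks_like_data_line(sample))
print(looks_like_data_line(false_positive))
print(looks_like_data_line("some other line"))
```

The second call shows the trade-off: the check is cheap, but any line
with a leading space and six dots gets through, so it is only good enough
if the unwanted lines never look like that.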
However, it doesn't look like you need to group all the matches, so
removing the parentheses may improve the speed. Also, if you use
'match' you don't need to start the pattern with '^'.
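
A minimal sketch of those two suggestions follows. The names DATA_LINE and
filter_data_lines are invented for the example. Two things here go beyond
what is suggested above and are assumptions on my part: the pattern is
compiled once at module level instead of on every call, and the float
group is kept but made non-capturing with (?:...), since it still has to
be repeated six times. The list comprehension and the split method mean
this is not Python 1.5.2 compatible as written.

```python
import re

# Compiled once at import time rather than inside the function.
# No leading '^' because match() already anchors at the start.
# (?:...) repeats the float pattern without capturing it.
DATA_LINE = re.compile(r"\s*\d+\s*(?:-?\d+\.\d+\s*){6}$")


def filter_data_lines(input_string):
    # Keep only lines made of an integer followed by six floats.
    kept = [line for line in input_string.split("\n")
            if DATA_LINE.match(line)]
    # Build the result in one pass instead of repeated concatenation.
    return "".join([line + "\n" for line in kept])


text = (" 1-1.1 2.2 -3.3 4.4 5.5 -6.6\n"
        "some other line\n"
        " 2 1.0 2.0 3.0 4.0 5.0\n")  # only five floats, so not kept
print(filter_data_lines(text))
```

Whether this is noticeably faster would need timing on the real input.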
It would help to have a bigger sample of what you want to match and
also an example of what you want to NOT match.
--
Dale Strickland-Clark
Riverhall Systems Ltd