Questions about regex

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat May 30 16:35:38 CEST 2009


On Fri, 29 May 2009 11:26:07 -0700, Jared.S.Bauer wrote:

> Hello,
> 
> I'm new to python and I'm having problems with a regular expression. I
> use textmate as my editor and when I run the regex in textmate it works
> fine, but when I run it as part of the script it freezes. Could anyone
> help me figure out why this is happening and how to fix it.


Sure. To figure out why it is happening, the first thing you must do is 
figure out *what* is happening. So first you have to isolate the fault: 
what part of your script is freezing?

I'm going to assume that it is the regex:

> #The two following lines are the ones giving me the problems
> 	text = re.sub("w:(.|\s)*?\n", "", text) 
>       text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)

What happens when you call those two lines in isolation, away from the 
rest of your script? (Obviously you need to initialise a value for text.)
Do they still freeze?

For example, I can do this:

>>> import re
>>> text = "Nobodyw: \n expects the Spanish Inquisition!"
>>> text = re.sub("w:(.|\s)*?\n", "", text)
>>> text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)
>>> text
'Nobody expects the Spanish Inquisition!'

and it doesn't freeze. It works fine.

I suspect that your problem is that the regex hasn't actually *frozen*, 
it's just taking a very, very long time to complete. My guess is that it 
probably has something to do with:

(.|\s)*?

This says, "Match as few as possible of: any character except a newline, 
or any whitespace character". Since \s matches newlines as well, the group 
can run across line breaks, and a space or tab can be matched by either 
alternative, so the regular expression engine may have to do a great deal 
of backtracking. That means it will be slow for large amounts of data. 
You want to reduce the amount of backtracking that's needed!
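
Here is a rough sketch of how one might check that guess. It assumes 
(my assumption, not something confirmed from the original script) that 
the slowdown comes from a space being matchable by both "." and "\s", 
and it uses a made-up worst-case input: "w:" followed by spaces with no 
newline at all. The names ORIGINAL, SIMPLER and time_sub are invented 
for the illustration.

```python
import re
import time

# Both alternatives accept a plain space, so the engine has two ways
# to consume every space it meets.
assert re.fullmatch(r".", " ") is not None
assert re.fullmatch(r"\s", " ") is not None
# Only \s accepts a newline; "." does not (without re.DOTALL).
assert re.fullmatch(r".", "\n") is None

ORIGINAL = re.compile(r"w:(.|\s)*?\n")
SIMPLER = re.compile(r"w:.*?\n")


def time_sub(pattern, text):
    """Return the seconds taken by one pattern.sub("", text) call."""
    start = time.perf_counter()
    pattern.sub("", text)
    return time.perf_counter() - start


# Made-up worst case: no newline follows "w:", so every attempt to
# find the closing "\n" fails and the engine has to back up.
for n in (8, 12, 16):
    text = "w:" + " " * n
    print(n, time_sub(ORIGINAL, text), time_sub(SIMPLER, text))
```

If the first timing column grows much faster than the second as n 
increases, backtracking is the likely culprit. The sizes are kept small 
on purpose; larger values may take a very long time with the original 
pattern.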

I *guess* that what you probably want is:

w:.*?\n

which will match the letter 'w' followed by ':' followed by the smallest 
possible number of arbitrary characters, including spaces *but not 
newlines*, followed by a newline.

The second regex will probably need a similar change made.
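
As a minimal sketch, both substitutions with that change applied might 
look like the following. The sample text and the function name 
strip_lines are invented for the illustration, since the real input was 
not posted. Raw strings (r"...") are used so that backslashes reach the 
regex engine unchanged.

```python
import re


def strip_lines(text):
    """Remove everything from 'w:' or 'UnhideWhenUsed=' up to and
    including the next newline."""
    text = re.sub(r"w:.*?\n", "", text)
    text = re.sub(r"UnhideWhenUsed=.*?\n", "", text)
    return text


# Hypothetical sample input; the real data was not shown.
sample = 'keep this\nw:foo bar\nUnhideWhenUsed="false" more\nkeep that\n'
print(repr(strip_lines(sample)))
```

Because the original group was non-greedy and stopped at the first 
newline anyway, the simpler pattern should remove the same text while 
giving the engine only one way to consume each character.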

Don't take my word for it, though: I'm not a regex expert. Isolate the 
fault, identify when it is happening (for all input data, or only for 
large amounts of data?), and then you have a shot at fixing it.
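
One way to answer the "all input, or only large input?" question is to 
time the substitution on growing prefixes of the data. This is only a 
sketch: the helper name time_on_prefixes and the stand-in data are 
assumptions, and in practice you would feed it the text your script 
actually reads.

```python
import re
import time


def time_on_prefixes(pattern, data, sizes):
    """Time re.sub on data[:size] for each size.

    Returns a list of (size, seconds) pairs.
    """
    compiled = re.compile(pattern)
    results = []
    for size in sizes:
        chunk = data[:size]
        start = time.perf_counter()
        compiled.sub("", chunk)
        results.append((size, time.perf_counter() - start))
    return results


# Stand-in data; replace with the text the real script processes.
data = "some text w:abc def\n" * 1000
for size, seconds in time_on_prefixes(r"w:.*?\n", data, [1000, 5000, 20000]):
    print(size, round(seconds, 6))
```

If the time grows roughly in step with the size, the pattern is 
probably fine; if it explodes at some point, the prefix where that 
happens tells you which part of the input to look at.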



-- 
Steven


