Is the regular expression module written in C or Python?

Tue Oct 8 09:21:35 EDT 2002

Ulli Stein wrote:
> Richie Hindle wrote:
> 
> > Hi Ulli,
> > 
> >> >>> import re
> >> >>> re.findall("\[(.*?)\]", "["+"x"*10000+"]")
> >> Traceback (most recent call last):
> >> 
> >> If the part which .*? will match exceeds 9996 bytes python throws the
> >> above exception. Having this bug, re renders itself unusable.
> > 
> > 'Unusable' is putting it a bit strong:
> > 
> > >>> import re
> > >>> re.findall(r"\[([^\]]*)\]", "["+"x"*10000+"]")
> > ['xxxxxxxxxx...
> > 
> > I could be wrong, but I believe the latter is more efficient - I've a
> > feeling that the lookahead construct makes the RE potentially very slow
> > (it may be an implementation issue).  Hopefully a passing RE expert
> > will be along to support/correct me...?
> > 
> 
> This way of replacing the lookahead works only in cases where you have only
> one char to look ahead for.
> 
> I tried for a long time, without success, to replace the (.*?) part of a RE
> in which I am looking for "[- ... -]", "[+ ... +]", "[$ ... $]", and
> "[# ... #]". How would you replace the (.*?) for this RE?
> 
> Ulli
> -- 

The answers to your questions are covered in depth in Jeffrey Friedl's excellent "Mastering Regular Expressions", 2nd edition, published by O'Reilly. In Python REs you need to avoid .*? and friends where the string being matched is very long. Mr Friedl gives an example of efficiently matching a string bounded by /x and x/. The RE is:

"(/x[^x]*x+(?:[^/x][^x]*x+)*/)"

I've tried it where the string found is >20K, with no problems.
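
As a quick check, here is a minimal sketch that runs the pattern above through Python's standard re module. The input strings and variable names are made up for illustration; only the pattern itself comes from this message.

```python
import re

# Pattern quoted above for text bounded by /x and x/.
pattern = re.compile(r"(/x[^x]*x+(?:[^/x][^x]*x+)*/)")

# Illustrative input: a body well over 20K characters between the delimiters.
long_body = "a" * 30000
text = "before /x" + long_body + "x/ after"

match = pattern.search(text)
print(len(match.group(1)))  # length of the whole delimited match

# An 'x' inside the body is allowed as long as it is not followed by '/'.
inner = pattern.search("/x one x two x/ rest")
print(inner.group(1))
```

Each piece of the pattern consumes a run of characters in one go ([^x]* for non-x text, x+ for runs of x), so the match does not have to advance one character at a time the way .*? does.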

This is clearer than using one of your [...] examples as it doesn't need to escape any characters.
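
For the bracketed delimiters in the question, the same shape can probably be adapted as sketched below. This is an adaptation by analogy, not something taken from the book, and the helper name delimited and the sample strings are hypothetical. It assumes each delimiter is '[' plus one marker character, closed by the same marker plus ']'.

```python
import re

def delimited(ch):
    """Hypothetical helper: pattern for '[' + ch ... ch + ']' without .*?"""
    c = re.escape(ch)  # escape the marker so it is safe inside the pattern
    # Same structure as the /x ... x/ pattern, with '[' + ch and ch + ']'
    # standing in for '/x' and 'x/'.
    return r"\[{c}[^{c}]*{c}+(?:[^\]{c}][^{c}]*{c}+)*\]".format(c=c)

# One alternative per marker character: -, +, $ and #.
combined = re.compile("|".join(delimited(ch) for ch in "-+$#"))

sample = "a [- one -] b [+ two + three +] c [$ x $] d [# y #]"
found = combined.findall(sample)
print(found)

# A long body between the delimiters, as in the original report.
big = "[-" + "x" * 50000 + "-]"
print(combined.search(big).group(0) == big)
```

Note that a marker character may appear inside the body as long as it is not immediately followed by ']'.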

HTH

Harvey
