Regex error in python (weird?)

Wed Aug 30 18:59:30 EDT 2000

Andrew Dalke wrote:

> Aleksandar Alimpijevic asked about regular expressions:
> >My regular expression is:
> >   CHKformat_rex = re.compile('(?P<IP>\d{1,3}(\.\d{1,3}){3,3}) \S+ \S+ '
> >'(?P<date>\[\d{2,2}/[a-zA-z]{3,3}/\d{4,4}(:\d{2,2}){3,3} [+-]\d{4,4}\])'
> >                    '"(?P<request>GET|HEAD|POST) '
> >                    '(?P<req_fname>/(\S+/?)*) HTTP/\d\.\d{1,2}"
> >(?P<reply_code>\d{3,3}) (?P<reply_size>\d+|-)')
>
> (That probably came out wrong what with line wrapping.)
>
> Here's some things to try:
>
> Replace {2,2} with {2} and {3,3} with {3} and ... That makes your
> regex easier to understand, though it won't change anything.
>
> To get the filename you are using '/(\S+/?)*'.  Instead, try '/\S*'.
> '/' is a \S character, so \S+ will greedily read up to the space and
> then backtrack on errors; there would be n**2 backtracks where n is the
> number of '/'s in the filename.  But your example only has one, so
> that's not the problem.  Still, change the code anyway.
>
> Um, actually it's worse than that.  The '/?' is optional, so for any
> error the number of backtracks grows like 2**n, where n is the number
> of characters in the filename: roughly 2**10 for 'index.html'.  It
> basically allows
>   '/index.html'
>   '/index.htm' + no '/' + 'l' + no '/'
>   '/index.ht' + no '/' + 'm' + no '/' + 'l' + no '/'
>   '/index.ht' + no '/' + 'ml' + no '/'
>    ...

Yes, that was the problem (I guess): it was backtracking too much.
Thanks, that was very helpful :)
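
For anyone who finds this thread later, here is a rough sketch of the
suggested changes, not the exact code from my script. It assumes a single
space between the date and the request and between the request and the
reply code (the line wrapping above hides what was really there), and it
reads [a-zA-z] as a typo for [a-zA-Z]. The sample log line and the names
LOG_RE, SAMPLE, NESTED, SIMPLE and time_failed_match are made up for
illustration.

```python
import re
import time

# Simplified form of the pattern from the thread. Assumptions: a single
# space separates the date from the request and the request from the
# status code, and [a-zA-z] was meant to be [a-zA-Z].
LOG_RE = re.compile(
    r'(?P<IP>\d{1,3}(\.\d{1,3}){3}) \S+ \S+ '
    r'(?P<date>\[\d{2}/[a-zA-Z]{3}/\d{4}(:\d{2}){3} [+-]\d{4}\]) '
    r'"(?P<request>GET|HEAD|POST) '
    r'(?P<req_fname>/\S*) HTTP/\d\.\d{1,2}" '
    r'(?P<reply_code>\d{3}) (?P<reply_size>\d+|-)'
)

# Hypothetical log line, made up for this example.
SAMPLE = ('127.0.0.1 - - [30/Aug/2000:18:59:30 -0400] '
          '"GET /index.html HTTP/1.0" 200 1234')

# The two filename fragments discussed above, isolated from the rest.
NESTED = re.compile(r'/(\S+/?)* HTTP')
SIMPLE = re.compile(r'/\S* HTTP')


def time_failed_match(pattern, n):
    """Time one match attempt that must fail after an n-character name."""
    text = '/' + 'a' * n + ' XTTP'  # 'XTTP' forces a failure after the name
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    m = LOG_RE.match(SAMPLE)
    print(m.group('IP'), m.group('request'), m.group('req_fname'))
    for n in (8, 12, 16):
        _, nested = time_failed_match(NESTED, n)
        _, simple = time_failed_match(SIMPLE, n)
        print(n, 'nested: %.6fs' % nested, 'simple: %.6fs' % simple)
```

If you try larger values of n, raise it a few steps at a time: each
extra character should roughly double the time for the nested pattern,
while the simple one stays about flat.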