Regex doesn't support MULTILINE?

irstas at gmail.com irstas at gmail.com
Sun Jul 22 09:06:50 CEST 2007


On Jul 22, 7:56 am, Gilles Ganault <nos... at nospam.com> wrote:
> On Sat, 21 Jul 2007 22:18:56 -0400, Carsten Haese
>
> <cars... at uniqsys.com> wrote:
> >That's your problem right there. RE is not the right tool for that job.
> >Use an actual HTML parser such as BeautifulSoup
>
> Thanks a lot for the tip. I tried it, and it does look interesting,
> although I've been unsuccessful using a regex with BS to find all
> occurrences of the pattern.
>
> Incidentally, as far as using re alone is concerned, it appears that
> re.MULTILINE isn't enough to get re to include newlines: re.DOTALL
> must be added.
>
> Problem is, when I add re.DOTALL, the search takes less than a second
> for a 500KB file... and about 1 min 30 s for a file that's 1MB, with
> both files holding similar contents.
>
> Why such a huge difference in performance?
>
> pattern = "<span class=.?defaut.?>(\d+:\d+).*?</span>"

That .*? can really slow things down when the rest of the pattern
can't be found. With DOTALL, the matcher may scan all the way to the
end of the file looking for </span>, fail, and then start over at the
next candidate. Without DOTALL, . stops at a newline, so each attempt
only scans to the end of the line and performance stays bearable.
Your 1MB file might contain, for example,

'<span class=defaut>13:34< /span>'*10000

Because < /span> doesn't match </span>, each attempt would scan to
the end of the file for </span> without finding it, and then move on
to the next occurrence of '<span class=...' to see whether it has
better luck there. In a case like that the total work grows roughly
with the square of the file size, which would explain why doubling
the file makes the search so much slower. I'd have to see the 1MB
file's contents to make a better guess at what goes wrong.
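
To make that concrete, here is a rough timing sketch. The input is
made up: the same malformed span repeated, one per line (the line
breaks are my assumption, added so the no-flag case has a line end to
stop at). make_text and time_findall are just helper names for this
example, not part of the re module, and the timings will vary by
machine.

```python
import re
import time

# The pattern from the original question, written as a raw string.
PATTERN = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"


def make_text(n):
    """Hypothetical worst case: n spans whose closing tag never matches."""
    return "<span class=defaut>13:34< /span>\n" * n


def time_findall(flags, text):
    """Return (matches, elapsed seconds) for one re.findall pass."""
    start = time.perf_counter()
    matches = re.findall(PATTERN, text, flags)
    return matches, time.perf_counter() - start


for n in (250, 500, 1000):
    text = make_text(n)
    _, plain = time_findall(0, text)
    _, dotall = time_findall(re.DOTALL, text)
    # Without DOTALL each attempt stops at the end of its line; with
    # DOTALL each attempt scans to the end of the text before failing.
    print(f"{n:5d} spans  no flag: {plain:.4f}s  DOTALL: {dotall:.4f}s")
```

Neither run finds a match here. The point is only how the DOTALL time
grows when the input doubles, compared to the no-flag time.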

If the span's contents don't have nested elements (like <i></i>),
you could use a negated character class instead:

"<span class=.?defaut.?>(\d+:\d+)[^<]*</span>"

This pattern should be fast for all inputs because the [^<]* can't
keep matching until the end of the file, only until the next tag
comes around. Or, if you don't care about anything but those numbers,
you could just match this:

"<span class=.?defaut.?>(\d+:\d+)"
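
Here is a small comparison of the three variants, as a sketch only.
The sample text is invented for illustration, and I'm using the class
name "defaut" as in the original question. LAZY, NEGATED and
PREFIX_ONLY are just names chosen for this example.

```python
import re

# Original pattern (needs DOTALL to cross line breaks).
LAZY = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>", re.DOTALL)
# Negated character class: can never run past the next "<".
NEGATED = re.compile(r"<span class=.?defaut.?>(\d+:\d+)[^<]*</span>")
# Only the part we actually want to capture.
PREFIX_ONLY = re.compile(r"<span class=.?defaut.?>(\d+:\d+)")

# Made-up sample: one clean span, one with a malformed closing tag,
# and one whose closing tag sits on the next line.
sample = (
    '<span class="defaut">09:15 departure</span>\n'
    "<span class=defaut>13:34< /span>\n"
    "<span class='defaut'>18:02\n</span>\n"
)

print(LAZY.findall(sample))         # ['09:15', '13:34']
print(NEGATED.findall(sample))      # ['09:15', '18:02']
print(PREFIX_ONLY.findall(sample))  # ['09:15', '13:34', '18:02']
```

Note how the lazy version runs past the malformed tag and swallows
the third span, while [^<]* skips the malformed one and still matches
across the line break without needing DOTALL.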



