[2.5] Regex doesn't support MULTILINE?

Gilles Ganault nospam at nospam.com
Sun Jul 22 06:56:32 CEST 2007


On Sat, 21 Jul 2007 22:18:56 -0400, Carsten Haese
<carsten at uniqsys.com> wrote:
>That's your problem right there. RE is not the right tool for that job.
>Use an actual HTML parser such as BeautifulSoup

Thanks a lot for the tip. I tried it, and it does look interesting,
although I've been unsuccessful using a regex with BeautifulSoup to
find all occurrences of the pattern.

Incidentally, as far as using re alone is concerned, it appears that
re.MULTILINE isn't enough to get "." to match newlines: re.DOTALL
must be added.
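
To illustrate the difference, here is a minimal sketch. The sample
string is made up for the example (it is not taken from my real pages);
the pattern is the one used in the script further down.

```python
import re

# Hypothetical sample: the closing tag is on a different line than the opening tag.
html = '<span class="defaut">12:34 some\ntext</span>'
pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"

# re.MULTILINE only changes where ^ and $ match; "." still stops at "\n".
multiline_only = re.findall(pattern, html, re.IGNORECASE | re.MULTILINE)

# re.DOTALL lets "." match "\n", so ".*?" can cross the line break.
with_dotall = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)

print(multiline_only)
print(with_dotall)
```

With re.MULTILINE alone the list comes back empty; with re.DOTALL the
time is captured.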

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with
both files holding similar contents.

Why such a huge difference in performance?

========= Using Re =============
import re
import time

# Raw string so that \d reaches the regex engine unchanged
pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"

pages = ["500KB.html","1MB.html"]

# Very slow when parsing the 1MB file!
p = re.compile(pattern,re.IGNORECASE|re.MULTILINE|re.DOTALL)
#p = re.compile(pattern,re.IGNORECASE|re.MULTILINE)

for page in pages:
	# Read the whole page into memory
	f = open(page, "r")
	response = f.read()
	f.close()

	# Print the wall-clock time just before the search starts
	start = time.strftime("%H:%M:%S", time.localtime())
	print "before findall @ " + start
	packed = p.findall(response)
	for item in packed:
		print item
===========================
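
The script above only prints the time before findall starts. A minimal
sketch of measuring the elapsed time instead could look like this; the
sample text is a synthetic stand-in for the real pages (an assumption
for illustration), so its timing says nothing about my actual files.

```python
import re
import time

p = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
               re.IGNORECASE | re.DOTALL)

# Synthetic stand-in for a real page (hypothetical content).
sample = '<span class="defaut">12:34 some\ntext</span>\n' * 1000

# Take a timestamp before and after the search and report the difference.
started = time.time()
matches = p.findall(sample)
elapsed = time.time() - started

print("findall returned %d matches in %.3f seconds" % (len(matches), elapsed))
```

Replacing the sample string with the contents of each file would give
one figure per page to compare.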

Thank you.


